Preface |
|
xi | |
1 Introduction |
|
1 | (20) |
|
1.1 DNA, RNA, protein and cells |
|
|
1 | (2) |
|
1.2 Sequencing technologies |
|
|
3 | (1) |
|
1.3 First-generation sequencing |
|
|
4 | (2) |
|
1.4 Second-generation sequencing |
|
|
6 | (6) |
|
1.4.1 Template preparation |
|
|
6 | (1) |
|
|
7 | (1) |
|
1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides |
|
|
7 | (3) |
|
1.4.4 Polymerase-mediated methods based on unmodified nucleotides |
|
|
10 | (1) |
|
1.4.5 Ligase-mediated method |
|
|
11 | (1) |
|
1.5 Third-generation sequencing |
|
|
12 | (4) |
|
1.5.1 Single-molecule real-time sequencing |
|
|
12 | (1) |
|
1.5.2 Nanopore sequencing method |
|
|
13 | (2) |
|
1.5.3 Direct imaging of DNA using electron microscopy |
|
|
15 | (1) |
|
1.6 Comparison of the three generations of sequencing |
|
|
16 | (1) |
|
1.7 Applications of sequencing |
|
|
17 | (2) |
|
1.8 Summary and further reading |
|
|
19 | (1) |
|
|
19 | (2) |
2 NGS file formats |
|
21 | (14) |
|
|
21 | (1) |
|
2.2 Raw data files: fasta and fastq |
|
|
22 | (2) |
|
2.3 Alignment files: SAM and BAM |
|
|
24 | (3) |
|
|
26 | (1) |
|
|
26 | (1) |
|
|
27 | (2) |
|
2.5 Variant Call Format (VCF) |
|
|
29 | (2) |
|
2.6 Format for representing density data |
|
|
31 | (2) |
|
|
33 | (2) |
3 Related algorithms and data structures |
|
35 | (34) |
|
|
35 | (1) |
|
3.2 Recursion and dynamic programming |
|
|
35 | (3) |
|
3.2.1 Key searching problem |
|
|
36 | (1) |
|
3.2.2 Edit-distance problem |
|
|
37 | (1) |
|
|
38 | (5) |
|
|
39 | (1) |
|
3.3.2 Unobserved variable and EM algorithm |
|
|
40 | (3) |
|
|
43 | (6) |
|
3.4.1 Maintain an associative array by simple hashing |
|
|
43 | (2) |
|
3.4.2 Maintain a set using a Bloom filter |
|
|
45 | (1) |
|
3.4.3 Maintain a multiset using a counting Bloom filter |
|
|
46 | (1) |
|
3.4.4 Estimating the similarity of two sets using minHash |
|
|
47 | (2) |
|
|
49 | (9) |
|
3.5.1 Suffix trie and suffix tree |
|
|
49 | (1) |
|
|
50 | (1) |
|
|
51 | (4) |
|
3.5.3.1 Inverting the BWT B to the original text T |
|
|
53 | (1) |
|
3.5.3.2 Simulate a suffix array using the FM-index |
|
|
54 | (1) |
|
|
55 | (1) |
|
3.5.4 Simulate a suffix trie using the FM-index |
|
|
55 | (1) |
|
|
56 | (2) |
|
3.6 Data compression techniques |
|
|
58 | (8) |
|
3.6.1 Data compression and entropy |
|
|
58 | (1) |
|
3.6.2 Unary, gamma, and delta coding |
|
|
59 | (1) |
|
|
60 | (1) |
|
|
60 | (2) |
|
|
62 | (2) |
|
3.6.6 Order-k Markov Chain |
|
|
64 | (1) |
|
3.6.7 Run-length encoding |
|
|
65 | (1) |
|
|
66 | (3) |
4 NGS read mapping |
|
69 | (54) |
|
|
69 | (1) |
|
4.2 Overview of the read mapping problem |
|
|
70 | (6) |
|
4.2.1 Mapping reads with no quality score |
|
|
70 | (1) |
|
4.2.2 Mapping reads with a quality score |
|
|
71 | (1) |
|
4.2.3 Brute-force solution |
|
|
72 | (2) |
|
|
74 | (1) |
|
|
75 | (1) |
|
4.3 Align reads allowing a small number of mismatches |
|
|
76 | (21) |
|
4.3.1 Mismatch seed hashing approach |
|
|
77 | (1) |
|
4.3.2 Read hashing with a spaced seed |
|
|
78 | (4) |
|
4.3.3 Reference hashing approach |
|
|
82 | (2) |
|
4.3.4 Suffix trie-based approaches |
|
|
84 | (13) |
|
4.3.4.1 Estimating the lower bound of the number of mismatches |
|
|
87 | (2) |
|
4.3.4.2 Divide and conquer with the enhanced pigeon- hole principle |
|
|
89 | (3) |
|
4.3.4.3 Aligning a set of reads together |
|
|
92 | (2) |
|
4.3.4.4 Speed up utilizing the quality score |
|
|
94 | (3) |
|
4.4 Aligning reads allowing a small number of mismatches and indels |
|
|
97 | (8) |
|
|
97 | (2) |
|
4.4.2 Computing alignment using a suffix trie |
|
|
99 | (6) |
|
4.4.2.1 Computing the edit distance using a suffix trie |
|
|
100 | (3) |
|
4.4.2.2 Local alignment using a suffix trie |
|
|
103 | (2) |
|
4.5 Aligning reads in general |
|
|
105 | (11) |
|
4.5.1 Seed-and-extension approach |
|
|
107 | (7) |
|
|
108 | (1) |
|
|
109 | (1) |
|
|
110 | (1) |
|
|
111 | (1) |
|
|
112 | (1) |
|
|
113 | (1) |
|
4.5.2 Filter-based approach |
|
|
114 | (2) |
|
|
116 | (1) |
|
|
117 | (1) |
|
|
118 | (5) |
5 Genome assembly |
|
123 | (52) |
|
|
123 | (1) |
|
5.2 Whole genome shotgun sequencing |
|
|
124 | (2) |
|
5.2.1 Whole genome sequencing |
|
|
124 | (2) |
|
5.2.2 Mate-pair sequencing |
|
|
126 | (1) |
|
5.3 De novo genome assembly for short reads |
|
|
126 | (28) |
|
5.3.1 Read error correction |
|
|
128 | (10) |
|
5.3.1.1 Spectral alignment problem (SAP) |
|
|
129 | (4) |
|
|
133 | (5) |
|
5.3.2 Base-by-base extension approach |
|
|
138 | (3) |
|
5.3.3 De Bruijn graph approach |
|
|
141 | (9) |
|
5.3.3.1 De Bruijn assembler (no sequencing error) |
|
|
143 | (1) |
|
5.3.3.2 De Bruijn assembler (with sequencing errors) |
|
|
144 | (2) |
|
|
146 | (1) |
|
5.3.3.4 Additional issues of the de Bruijn graph approach |
|
|
147 | (3) |
|
|
150 | (3) |
|
|
153 | (1) |
|
5.4 Genome assembly for long reads |
|
|
154 | (14) |
|
5.4.1 Assemble long reads assuming long reads have a low sequencing error rate |
|
|
155 | (2) |
|
|
157 | (4) |
|
5.4.2.1 Use mate-pair reads and long reads to improve the assembly from short reads |
|
|
160 | (1) |
|
5.4.2.2 Use short reads to correct errors in long reads |
|
|
160 | (1) |
|
|
161 | (14) |
|
5.4.3.1 MinHash for all-versus-all pairwise alignment |
|
|
162 | (1) |
|
5.4.3.2 Computing consensus using Falcon Sense |
|
|
163 | (2) |
|
5.4.3.3 Quiver consensus algorithm |
|
|
165 | (3) |
|
5.5 How to evaluate the goodness of an assembly |
|
|
168 | (1) |
|
5.6 Discussion and further reading |
|
|
168 | (2) |
|
|
170 | (5) |
6 Single nucleotide variation (SNV) calling |
|
175 | (34) |
|
|
175 | (3) |
|
6.1.1 What are SNVs and small indels? |
|
|
175 | (3) |
|
6.1.2 Somatic and germline mutations |
|
|
178 | (1) |
|
6.2 Determine variations by resequencing |
|
|
178 | (2) |
|
6.2.1 Exome/Targeted sequencing |
|
|
179 | (1) |
|
6.2.2 Detection of somatic and germline variations |
|
|
180 | (1) |
|
6.3 Single locus SNV calling |
|
|
180 | (7) |
|
6.3.1 Identifying SNVs by counting alleles |
|
|
181 | (1) |
|
6.3.2 Identify SNVs by binomial distribution |
|
|
182 | (2) |
|
6.3.3 Identify SNVs by Poisson-binomial distribution |
|
|
184 | (1) |
|
6.3.4 Identifying SNVs by the Bayesian approach |
|
|
185 | (2) |
|
6.4 Single locus somatic SNV calling |
|
|
187 | (5) |
|
6.4.1 Identify somatic SNVs by the Fisher exact test |
|
|
187 | (1) |
|
6.4.2 Identify somatic SNVs by verifying that the SNVs appear in the tumor only |
|
|
188 | (11) |
|
6.4.2.1 Identify SNVs in the tumor sample by posterior odds ratio |
|
|
188 | (3) |
|
6.4.2.2 Verify if an SNV is somatic by the posterior odds ratio |
|
|
191 | (1) |
|
6.5 General pipeline for calling SNVs |
|
|
192 | (1) |
|
|
193 | (2) |
|
6.7 Duplicate read marking |
|
|
195 | (1) |
|
6.8 Base quality score recalibration |
|
|
195 | (3) |
|
|
198 | (1) |
|
6.10 Computational methods to identify small indels |
|
|
199 | (5) |
|
6.10.1 Split-read approach |
|
|
199 | (1) |
|
6.10.2 Span distribution-based clustering approach |
|
|
200 | (3) |
|
6.10.3 Local assembly approach |
|
|
203 | (1) |
|
6.11 Correctness of existing SNV and indel callers |
|
|
204 | (1) |
|
|
205 | (1) |
|
|
206 | (3) |
7 Structural variation calling |
|
209 | (36) |
|
|
209 | (2) |
|
|
211 | (3) |
|
7.3 Clinical effects of structural variations |
|
|
214 | (1) |
|
7.4 Methods for determining structural variations |
|
|
215 | (2) |
|
|
217 | (5) |
|
7.5.1 Computing the raw read count |
|
|
218 | (1) |
|
7.5.2 Normalize the read counts |
|
|
219 | (1) |
|
|
219 | (3) |
|
|
222 | (1) |
|
7.6.1 Insert size estimation |
|
|
222 | (1) |
|
7.7 Classifying the paired-end read alignments |
|
|
223 | (3) |
|
7.8 Identifying candidate SVs from paired-end reads |
|
|
226 | (13) |
|
7.8.1 Clustering approach |
|
|
227 | (9) |
|
7.8.1.1 Clique-finding approach |
|
|
228 | (1) |
|
7.8.1.2 Confidence interval overlapping approach |
|
|
229 | (4) |
|
7.8.1.3 Set cover approach |
|
|
233 | (3) |
|
7.8.1.4 Performance of the clustering approach |
|
|
236 | (1) |
|
7.8.2 Split-mapping approach |
|
|
236 | (1) |
|
|
237 | (1) |
|
|
238 | (1) |
|
|
239 | (3) |
|
|
242 | (1) |
|
|
242 | (3) |
8 RNA-seq |
|
245 | (26) |
|
|
245 | (2) |
|
8.2 High-throughput methods to study the transcriptome |
|
|
247 | (1) |
|
8.3 Application of RNA-seq |
|
|
248 | (2) |
|
8.4 Computational Problems of RNA-seq |
|
|
250 | (1) |
|
|
250 | (10) |
|
8.5.1 Features used in RNA-seq read mapping |
|
|
250 | (3) |
|
|
250 | (2) |
|
8.5.1.2 Splice junction signals |
|
|
252 | (1) |
|
8.5.2 Exon-first approach |
|
|
253 | (3) |
|
8.5.3 Seed-and-extend approach |
|
|
256 | (4) |
|
8.6 Construction of isoforms |
|
|
260 | (1) |
|
8.7 Estimating expression level of each transcript |
|
|
261 | (7) |
|
8.7.1 Estimating transcript abundances when every read maps to exactly one transcript |
|
|
261 | (3) |
|
8.7.2 Estimating transcript abundances when a read maps to multiple isoforms |
|
|
264 | (2) |
|
8.7.3 Estimating gene abundance |
|
|
266 | (2) |
|
8.8 Summary and further reading |
|
|
268 | (1) |
|
|
268 | (3) |
9 Peak calling methods |
|
271 | (18) |
|
|
271 | (1) |
|
9.2 Techniques that generate density-based datasets |
|
|
271 | (3) |
|
9.2.1 Protein DNA interaction |
|
|
271 | (2) |
|
9.2.2 Epigenetics of our genome |
|
|
273 | (1) |
|
|
274 | (1) |
|
|
274 | (11) |
|
9.3.1 Model fragment length |
|
|
276 | (3) |
|
9.3.2 Modeling noise using a control library |
|
|
279 | (1) |
|
9.3.3 Noise in the sample library |
|
|
280 | (1) |
|
9.3.4 Determination if a peak is significant |
|
|
281 | (2) |
|
9.3.5 Unannotated high copy number regions |
|
|
283 | (1) |
|
9.3.6 Constructing a signal profile by Kernel methods |
|
|
284 | (1) |
|
9.4 Sequencing depth of the ChIP-seq libraries |
|
|
285 | (1) |
|
|
286 | (1) |
|
|
287 | (2) |
10 Data compression techniques used in NGS files |
|
289 | (18) |
|
|
289 | (1) |
|
10.2 Strategies for compressing fasta/fastq files |
|
|
290 | (1) |
|
10.3 Techniques to compress identifiers |
|
|
290 | (1) |
|
10.4 Techniques to compress DNA bases |
|
|
291 | (8) |
|
10.4.1 Statistical-based approach |
|
|
291 | (1) |
|
10.4.2 BWT-based approach |
|
|
292 | (3) |
|
10.4.3 Reference-based approach |
|
|
295 | (2) |
|
10.4.4 Assembly-based approach |
|
|
297 | (2) |
|
10.5 Quality score compression methods |
|
|
299 | (3) |
|
10.5.1 Lossless compression |
|
|
300 | (1) |
|
|
301 | (1) |
|
10.6 Compression of other NGS data |
|
|
302 | (2) |
|
|
304 | (3) |
References |
|
307 | (32) |
Index |
|
339 | |