Preface |
|
xv | |
About the Authors |
|
xxi | |
1 Introduction to Genomics |
|
1 | (22) |
|
1.1 Genes, DNA and central dogma |
|
|
1 | (5) |
|
|
1 | (1) |
|
|
2 | (2) |
|
1.1.3 How are genes controlled? Transcriptional and posttranscriptional regulation |
|
|
4 | (1) |
|
1.1.4 What does a gene look like? |
|
|
5 | (1) |
|
1.2 Elements of gene regulation |
|
|
6 | (7) |
|
1.2.1 Transcriptional regulation |
|
|
6 | (5) |
|
1.2.2 Post-transcriptional regulation |
|
|
11 | (2) |
|
1.3 Shaping the genome: DNA mutation |
|
|
13 | (2) |
|
1.4 High-throughput experimental methods in genomics |
|
|
15 | (4) |
|
1.4.1 The general idea behind high-throughput techniques |
|
|
15 | (1) |
|
1.4.2 High-throughput sequencing |
|
|
16 | (3) |
|
1.5 Visualization and data repositories for genomics |
|
|
19 | (4) |
2 Introduction to R for Genomic Data Analysis |
|
23 | (44) |
|
2.1 Steps of (genomic) data analysis |
|
|
23 | (3) |
|
|
24 | (1) |
|
2.1.2 Data quality check and cleaning |
|
|
24 | (1) |
|
|
24 | (1) |
|
2.1.4 Exploratory data analysis and modeling |
|
|
24 | (1) |
|
2.1.5 Visualization and reporting |
|
|
25 | (1) |
|
2.1.6 Why use R for genomics? |
|
|
25 | (1) |
|
2.2 Getting started with R |
|
|
26 | (3) |
|
2.2.1 Installing packages |
|
|
27 | (1) |
|
2.2.2 Installing packages in custom locations |
|
|
27 | (1) |
|
2.2.3 Getting help on functions and packages |
|
|
28 | (1) |
|
|
29 | (1) |
|
|
30 | (6) |
|
|
30 | (1) |
|
|
31 | (2) |
|
|
33 | (1) |
|
|
34 | (1) |
|
|
35 | (1) |
|
|
36 | (1) |
|
2.6 Reading and writing data |
|
|
37 | (2) |
|
2.6.1 Reading large files |
|
|
38 | (1) |
|
2.7 Plotting in R with base graphics |
|
|
39 | (5) |
|
2.7.1 Combining multiple plots |
|
|
42 | (1) |
|
|
43 | (1) |
|
2.8 Plotting in R with ggplot2 |
|
|
44 | (5) |
|
2.8.1 Combining multiple plots |
|
|
46 | (2) |
|
2.8.2 ggplot2 and tidyverse |
|
|
48 | (1) |
|
2.9 Functions and control structures (for, if/else, etc.) |
|
|
49 | (7) |
|
2.9.1 User-defined functions |
|
|
49 | (1) |
|
2.9.2 Loops and looping structures in R |
|
|
50 | (6) |
|
|
56 | (11) |
|
|
56 | (1) |
|
2.10.2 Data structures in R |
|
|
56 | (3) |
|
2.10.3 Reading in and writing data out in R |
|
|
59 | (1) |
|
|
60 | (3) |
|
2.10.5 Functions and control structures (for, if/else, etc.) |
|
|
63 | (4) |
3 Statistics for Genomics |
|
67 | (44) |
|
3.1 How to summarize collection of data points: The idea behind statistical distributions |
|
|
67 | (12) |
|
3.1.1 Describing the central tendency: Mean and median |
|
|
67 | (2) |
|
3.1.2 Describing the spread: Measurements of variation |
|
|
69 | (5) |
|
3.1.3 Precision of estimates: Confidence intervals |
|
|
74 | (5) |
|
3.2 How to test for differences between samples |
|
|
79 | (10) |
|
3.2.1 Randomization-based testing for difference of the means |
|
|
79 | (1) |
|
3.2.2 Using t-test for difference of the means between two samples |
|
|
80 | (3) |
|
3.2.3 Multiple testing correction |
|
|
83 | (3) |
|
3.2.4 Moderated t-tests: Using information from multiple comparisons |
|
|
86 | (3) |
|
3.3 Relationship between variables: Linear models and correlation |
|
|
89 | (17) |
|
|
92 | (4) |
|
3.3.2 How to estimate the error of the coefficients |
|
|
96 | (3) |
|
3.3.3 Accuracy of the model |
|
|
99 | (3) |
|
3.3.4 Regression with categorical variables |
|
|
102 | (2) |
|
3.3.5 Regression pitfalls |
|
|
104 | (2) |
|
|
106 | (5) |
|
3.4.1 How to summarize collection of data points: The idea behind statistical distributions |
|
|
106 | (1) |
|
3.4.2 How to test for differences in samples |
|
|
107 | (1) |
|
3.4.3 Relationship between variables: Linear models and correlation |
|
|
108 | (3) |
4 Exploratory Data Analysis with Unsupervised Machine Learning |
|
111 | (36) |
|
4.1 Clustering: Grouping samples based on their similarity |
|
|
111 | (16) |
|
|
111 | (6) |
|
4.1.2 Hierarchical clustering |
|
|
117 | (2) |
|
|
119 | (1) |
|
4.1.4 How to choose "k", the number of clusters |
|
|
120 | (7) |
|
4.2 Dimensionality reduction techniques: Visualizing complex data sets in 2D |
|
|
127 | (17) |
|
4.2.1 Principal component analysis |
|
|
127 | (7) |
|
4.2.2 Other matrix factorization methods for dimensionality reduction |
|
|
134 | (5) |
|
4.2.3 Multi-dimensional scaling |
|
|
139 | (1) |
|
4.2.4 t-Distributed Stochastic Neighbor Embedding (t-SNE) |
|
|
140 | (4) |
|
|
144 | (3) |
|
|
144 | (1) |
|
4.3.2 Dimension reduction |
|
|
145 | (2) |
5 Predictive Modeling with Supervised Machine Learning |
|
147 | (56) |
|
5.1 How are machine learning models fit? |
|
|
148 | (1) |
|
5.1.1 Machine learning vs. statistics |
|
|
149 | (1) |
|
5.2 Steps in supervised machine learning |
|
|
149 | (1) |
|
5.3 Use case: Disease subtype from genomics data |
|
|
150 | (1) |
|
|
151 | (5) |
|
5.4.1 Data transformation |
|
|
152 | (2) |
|
5.4.2 Filtering data and scaling |
|
|
154 | (1) |
|
5.4.3 Dealing with missing values |
|
|
155 | (1) |
|
|
156 | (2) |
|
5.5.1 Holdout test dataset |
|
|
156 | (1) |
|
|
157 | (1) |
|
5.5.3 Bootstrap resampling |
|
|
158 | (1) |
|
5.6 Predicting the subtype with k-nearest neighbors |
|
|
158 | (1) |
|
5.7 Assessing the performance of our model |
|
|
159 | (5) |
|
5.7.1 Receiver Operating Characteristic (ROC) curves |
|
|
162 | (2) |
|
5.8 Model tuning and avoiding overfitting |
|
|
164 | (8) |
|
5.8.1 Model complexity and bias variance trade-off |
|
|
167 | (3) |
|
5.8.2 Data split strategies for model tuning and testing |
|
|
170 | (2) |
|
|
172 | (2) |
|
5.10 How to deal with class imbalance |
|
|
174 | (1) |
|
5.10.1 Sampling for class balance |
|
|
174 | (1) |
|
5.10.2 Altering case weights |
|
|
175 | (1) |
|
5.10.3 Selecting different classification score cutoffs |
|
|
175 | (1) |
|
5.11 Dealing with correlated predictors |
|
|
175 | (1) |
|
5.12 Trees and forests: Random forests in action |
|
|
176 | (4) |
|
|
176 | (1) |
|
|
177 | (2) |
|
5.12.3 Variable importance |
|
|
179 | (1) |
|
5.13 Logistic regression and regularization |
|
|
180 | (8) |
|
5.13.1 Regularization in order to avoid overfitting |
|
|
184 | (2) |
|
5.13.2 Variable importance |
|
|
186 | (2) |
|
5.14 Other supervised algorithms |
|
|
188 | (8) |
|
|
188 | (2) |
|
5.14.2 Support Vector Machines (SVM) |
|
|
190 | (3) |
|
5.14.3 Neural networks and deep versions of it |
|
|
193 | (2) |
|
|
195 | (1) |
|
5.15 Predicting continuous variables: Regression with machine learning |
|
|
196 | (4) |
|
5.15.1 Use case: Predicting age from DNA methylation |
|
|
196 | (1) |
|
5.15.2 Reading and processing the data |
|
|
197 | (1) |
|
5.15.3 Running random forest regression |
|
|
198 | (2) |
|
|
200 | (3) |
|
|
200 | (1) |
|
|
201 | (2) |
6 Operations on Genomic Intervals and Genome Arithmetic |
|
203 | (34) |
|
6.1 Operations on genomic intervals with GenomicRanges package |
|
|
204 | (10) |
|
6.1.1 How to create and manipulate a GRanges object |
|
|
204 | (3) |
|
6.1.2 Getting genomic regions into R as GRanges objects |
|
|
207 | (3) |
|
6.1.3 Finding regions that do/do not overlap with another set of regions |
|
|
210 | (4) |
|
6.2 Dealing with mapped high-throughput sequencing reads |
|
|
214 | (1) |
|
6.2.1 Counting mapped reads for a set of regions |
|
|
214 | (1) |
|
6.3 Dealing with continuous scores over the genome |
|
|
215 | (6) |
|
6.3.1 Extracting subsections of Rle and RleList objects |
|
|
218 | (3) |
|
6.4 Genomic intervals with more information: SummarizedExperiment class |
|
|
221 | (4) |
|
6.4.1 Create a SummarizedExperiment object |
|
|
221 | (1) |
|
6.4.2 Subset and manipulate the SummarizedExperiment object |
|
|
222 | (3) |
|
6.5 Visualizing and summarizing genomic intervals |
|
|
225 | (9) |
|
6.5.1 Visualizing intervals on a locus of interest |
|
|
225 | (1) |
|
6.5.2 Summaries of genomic intervals on multiple loci |
|
|
226 | (4) |
|
6.5.3 Making karyograms and circos plots |
|
|
230 | (4) |
|
|
234 | (3) |
|
6.6.1 Operations on genomic intervals with the GenomicRanges package |
|
|
234 | (1) |
|
6.6.2 Dealing with mapped high-throughput sequencing reads |
|
|
235 | (1) |
|
6.6.3 Dealing with contiguous scores over the genome |
|
|
235 | (1) |
|
6.6.4 Visualizing and summarizing genomic intervals |
|
|
235 | (2) |
7 Quality Check, Processing and Alignment of High-throughput Sequencing Reads |
|
237 | (14) |
|
7.1 FASTA and FASTQ formats |
|
|
237 | (2) |
|
7.2 Quality check on sequencing reads |
|
|
239 | (4) |
|
7.2.1 Sequence quality per base/cycle |
|
|
239 | (1) |
|
7.2.2 Sequence content per base/cycle |
|
|
240 | (1) |
|
7.2.3 Read frequency plot |
|
|
241 | (1) |
|
7.2.4 Other quality metrics and QC tools |
|
|
241 | (2) |
|
7.3 Filtering and trimming reads |
|
|
243 | (3) |
|
7.4 Mapping/aligning reads to the genome |
|
|
246 | (2) |
|
7.5 Further processing of aligned reads |
|
|
248 | (1) |
|
|
248 | (3) |
8 RNA-seq Analysis |
|
251 | (44) |
|
8.1 What is gene expression? |
|
|
251 | (1) |
|
8.2 Methods to detect gene expression |
|
|
252 | (1) |
|
8.3 Gene expression analysis using high-throughput sequencing technologies |
|
|
252 | (38) |
|
8.3.1 Processing raw data |
|
|
253 | (1) |
|
|
253 | (1) |
|
|
254 | (1) |
|
8.3.4 Within sample normalization of the read counts |
|
|
255 | (1) |
|
8.3.5 Computing different normalization schemes in R |
|
|
256 | (3) |
|
8.3.6 Exploratory analysis of the read count table |
|
|
259 | (5) |
|
8.3.7 Differential expression analysis |
|
|
264 | (9) |
|
8.3.8 Functional enrichment analysis |
|
|
273 | (4) |
|
8.3.9 Accounting for additional sources of variation |
|
|
277 | (13) |
|
8.4 Other applications of RNA-seq |
|
|
290 | (1) |
|
|
291 | (4) |
|
8.5.1 Exploring the count tables |
|
|
291 | (1) |
|
8.5.2 Differential expression analysis |
|
|
292 | (1) |
|
8.5.3 Functional enrichment analysis |
|
|
292 | (1) |
|
8.5.4 Removing unwanted variation from the expression data |
|
|
293 | (2) |
9 ChIP-seq analysis |
|
295 | (72) |
|
9.1 Regulatory protein-DNA interactions |
|
|
295 | (1) |
|
9.2 Measuring protein-DNA interactions with ChIP-seq |
|
|
296 | (2) |
|
9.3 Factors that affect ChIP-seq experiment and analysis quality |
|
|
298 | (3) |
|
9.3.1 Antibody specificity |
|
|
298 | (1) |
|
|
298 | (1) |
|
|
299 | (1) |
|
9.3.4 Biological replicates |
|
|
299 | (1) |
|
9.3.5 Control experiments |
|
|
299 | (2) |
|
9.3.6 Using tagged proteins |
|
|
301 | (1) |
|
9.4 Pre-processing ChIP data |
|
|
301 | (2) |
|
9.4.1 Mapping of ChIP-seq data |
|
|
301 | (2) |
|
|
303 | (24) |
|
|
303 | (1) |
|
|
304 | (4) |
|
9.5.3 Visualization in the genome browser |
|
|
308 | (4) |
|
9.5.4 Plus and minus strand cross-correlation |
|
|
312 | (4) |
|
9.5.5 GC bias quantification |
|
|
316 | (4) |
|
9.5.6 Sequence read genomic distribution |
|
|
320 | (7) |
|
|
327 | (32) |
|
9.6.1 Types of ChIP-seq experiments |
|
|
327 | (5) |
|
9.6.2 Peak calling: Sharp peaks |
|
|
332 | (9) |
|
9.6.3 Peak calling: Broad regions |
|
|
341 | (3) |
|
9.6.4 Peak quality control |
|
|
344 | (12) |
|
|
356 | (3) |
|
|
359 | (4) |
|
|
361 | (2) |
|
|
363 | (1) |
|
|
364 | (3) |
|
|
364 | (3) |
10 DNA methylation analysis using bisulfite sequencing data |
|
367 | (26) |
|
10.1 What is DNA methylation? |
|
|
367 | (1) |
|
10.1.1 How is DNA methylation set? |
|
|
368 | (1) |
|
10.1.2 How to measure DNA methylation with bisulfite sequencing |
|
|
368 | (1) |
|
10.2 Analyzing DNA methylation data |
|
|
368 | (1) |
|
10.3 Processing raw data and getting data into R |
|
|
369 | (1) |
|
10.4 Data filtering and exploratory analysis |
|
|
370 | (9) |
|
10.4.1 Reading methylation call files |
|
|
370 | (2) |
|
10.4.2 Further quality check |
|
|
372 | (1) |
|
10.4.3 Merging samples into a single table |
|
|
373 | (1) |
|
|
374 | (2) |
|
10.4.5 Clustering samples |
|
|
376 | (2) |
|
10.4.6 Principal component analysis |
|
|
378 | (1) |
|
10.5 Extracting interesting regions: Differential methylation and segmentation |
|
|
379 | (9) |
|
10.5.1 Differential methylation |
|
|
379 | (5) |
|
10.5.2 Methylation segmentation |
|
|
384 | (3) |
|
10.5.3 Working with large files |
|
|
387 | (1) |
|
10.6 Annotation of DMRs/DMCs and segments |
|
|
388 | (2) |
|
10.6.1 Further annotation with genes or gene sets |
|
|
390 | (1) |
|
10.7 Other R packages that can be used for methylation analysis |
|
|
390 | (1) |
|
|
390 | (3) |
|
10.8.1 Differential methylation |
|
|
390 | (1) |
|
10.8.2 Methylome segmentation |
|
|
391 | (2) |
11 Multi-omics Analysis |
|
393 | (32) |
|
11.1 Use case: Multi-omics data from colorectal cancer |
|
|
393 | (6) |
|
11.2 Latent variable models for multi-omics integration |
|
|
399 | (1) |
|
11.3 Matrix factorization methods for unsupervised multi-omics data integration |
|
|
400 | (13) |
|
11.3.1 Multiple factor analysis |
|
|
401 | (3) |
|
11.3.2 Joint non-negative matrix factorization |
|
|
404 | (5) |
|
|
409 | (4) |
|
11.4 Clustering using latent factors |
|
|
413 | (3) |
|
11.4.1 One-hot clustering |
|
|
414 | (1) |
|
11.4.2 K-means clustering |
|
|
415 | (1) |
|
11.5 Biological interpretation of latent factors |
|
|
416 | (6) |
|
11.5.1 Inspection of feature weights in loading vectors |
|
|
416 | (2) |
|
11.5.2 Making sense of factors using enrichment analysis |
|
|
418 | (2) |
|
11.5.3 Interpretation using additional covariates |
|
|
420 | (2) |
|
|
422 | (3) |
|
11.6.1 Matrix factorization methods |
|
|
422 | (1) |
|
11.6.2 Clustering using latent factors |
|
|
423 | (1) |
|
11.6.3 Biological interpretation of latent factors |
|
|
423 | (2) |
Bibliography |
|
425 | (12) |
Index |
|
437 | |