E-book: Computational Genomics with R

  • Format: EPUB+DRM
  • Price: 57,19 €*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer)

    This e-book cannot be read on an Amazon Kindle.

Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. It also contains practical and well-documented examples in R, so readers can analyze their own data by reusing the code presented. Because the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.

After reading:

  • You will have the basics of R and be able to dive right into specialized uses of R for computational genomics such as using Bioconductor packages.
  • You will be familiar with statistics, supervised and unsupervised learning techniques that are important in data modeling, and exploratory analysis of high-dimensional data.
  • You will understand genomic intervals and operations on them that are used for tasks such as aligned read counting and genomic feature annotation.
  • You will know the basics of processing and quality checking high-throughput sequencing data.
  • You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites.
  • You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
  • You will be familiar with analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq.
  • You will know basic techniques for integrating and interpreting multi-omics datasets.

Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.

Reviews

'This book provides a basic overview of computational tools developed in R for carrying out data analyses in genomics. It can be a valuable companion for anyone who wants to utilise the computational tools developed within the Bioconductor and R environments for education and research. This book's main target audience are students of computational biology to get a first look at the diversity of machine learning methods. The book will also serve well biomedical researchers needing a guide to packages that can help them with the analysis of data that they encounter in their work.'

- Krzysztof Podgórski, International Statistical Review (2021) doi: 10.1111/insr.12453

Preface xv
About the Authors xxi
1 Introduction to Genomics 1(22)
1.1 Genes, DNA and central dogma
1(5)
1.1.1 What is a genome?
1(1)
1.1.2 What is a gene?
2(2)
1.1.3 How are genes controlled? Transcriptional and posttranscriptional regulation
4(1)
1.1.4 What does a gene look like?
5(1)
1.2 Elements of gene regulation
6(7)
1.2.1 Transcriptional regulation
6(5)
1.2.2 Post-transcriptional regulation
11(2)
1.3 Shaping the genome: DNA mutation
13(2)
1.4 High-throughput experimental methods in genomics
15(4)
1.4.1 The general idea behind high-throughput techniques
15(1)
1.4.2 High-throughput sequencing
16(3)
1.5 Visualization and data repositories for genomics
19(4)
2 Introduction to R for Genomic Data Analysis 23(44)
2.1 Steps of (genomic) data analysis
23(3)
2.1.1 Data collection
24(1)
2.1.2 Data quality check and cleaning
24(1)
2.1.3 Data processing
24(1)
2.1.4 Exploratory data analysis and modeling
24(1)
2.1.5 Visualization and reporting
25(1)
2.1.6 Why use R for genomics?
25(1)
2.2 Getting started with R
26(3)
2.2.1 Installing packages
27(1)
2.2.2 Installing packages in custom locations
27(1)
2.2.3 Getting help on functions and packages
28(1)
2.3 Computations in R
29(1)
2.4 Data structures
30(6)
2.4.1 Vectors
30(1)
2.4.2 Matrices
31(2)
2.4.3 Data frames
33(1)
2.4.4 Lists
34(1)
2.4.5 Factors
35(1)
2.5 Data types
36(1)
2.6 Reading and writing data
37(2)
2.6.1 Reading large files
38(1)
2.7 Plotting in R with base graphics
39(5)
2.7.1 Combining multiple plots
42(1)
2.7.2 Saving plots
43(1)
2.8 Plotting in R with ggplot2
44(5)
2.8.1 Combining multiple plots
46(2)
2.8.2 ggplot2 and tidyverse
48(1)
2.9 Functions and control structures (for, if/else, etc.)
49(7)
2.9.1 User-defined functions
49(1)
2.9.2 Loops and looping structures in R
50(6)
2.10 Exercises
56(11)
2.10.1 Computations in R
56(1)
2.10.2 Data structures in R
56(3)
2.10.3 Reading in and writing data out in R
59(1)
2.10.4 Plotting in R
60(3)
2.10.5 Functions and control structures (for, if/else, etc.)
63(4)
3 Statistics for Genomics 67(44)
3.1 How to summarize collection of data points: The idea behind statistical distributions
67(12)
3.1.1 Describing the central tendency: Mean and median
67(2)
3.1.2 Describing the spread: Measurements of variation
69(5)
3.1.3 Precision of estimates: Confidence intervals
74(5)
3.2 How to test for differences between samples
79(10)
3.2.1 Randomization-based testing for difference of the means
79(1)
3.2.2 Using t-test for difference of the means between two samples
80(3)
3.2.3 Multiple testing correction
83(3)
3.2.4 Moderated t-tests: Using information from multiple comparisons
86(3)
3.3 Relationship between variables: Linear models and correlation
89(17)
3.3.1 How to fit a line
92(4)
3.3.2 How to estimate the error of the coefficients
96(3)
3.3.3 Accuracy of the model
99(3)
3.3.4 Regression with categorical variables
102(2)
3.3.5 Regression pitfalls
104(2)
3.4 Exercises
106(5)
3.4.1 How to summarize collection of data points: The idea behind statistical distributions
106(1)
3.4.2 How to test for differences in samples
107(1)
3.4.3 Relationship between variables: Linear models and correlation
108(3)
4 Exploratory Data Analysis with Unsupervised Machine Learning 111(36)
4.1 Clustering: Grouping samples based on their similarity
111(16)
4.1.1 Distance metrics
111(6)
4.1.2 Hierarchical clustering
117(2)
4.1.3 K-means clustering
119(1)
4.1.4 How to choose "k", the number of clusters
120(7)
4.2 Dimensionality reduction techniques: Visualizing complex data sets in 2D
127(17)
4.2.1 Principal component analysis
127(7)
4.2.2 Other matrix factorization methods for dimensionality reduction
134(5)
4.2.3 Multi-dimensional scaling
139(1)
4.2.4 t-Distributed Stochastic Neighbor Embedding (t-SNE)
140(4)
4.3 Exercises
144(3)
4.3.1 Clustering
144(1)
4.3.2 Dimension reduction
145(2)
5 Predictive Modeling with Supervised Machine Learning 147(56)
5.1 How are machine learning models fit?
148(1)
5.1.1 Machine learning vs. statistics
149(1)
5.2 Steps in supervised machine learning
149(1)
5.3 Use case: Disease subtype from genomics data
150(1)
5.4 Data preprocessing
151(5)
5.4.1 Data transformation
152(2)
5.4.2 Filtering data and scaling
154(1)
5.4.3 Dealing with missing values
155(1)
5.5 Splitting the data
156(2)
5.5.1 Holdout test dataset
156(1)
5.5.2 Cross-validation
157(1)
5.5.3 Bootstrap resampling
158(1)
5.6 Predicting the subtype with k-nearest neighbors
158(1)
5.7 Assessing the performance of our model
159(5)
5.7.1 Receiver Operating Characteristic (ROC) curves
162(2)
5.8 Model tuning and avoiding overfitting
164(8)
5.8.1 Model complexity and bias variance trade-off
167(3)
5.8.2 Data split strategies for model tuning and testing
170(2)
5.9 Variable importance
172(2)
5.10 How to deal with class imbalance
174(1)
5.10.1 Sampling for class balance
174(1)
5.10.2 Altering case weights
175(1)
5.10.3 Selecting different classification score cutoffs
175(1)
5.11 Dealing with correlated predictors
175(1)
5.12 Trees and forests: Random forests in action
176(4)
5.12.1 Decision trees
176(1)
5.12.2 Trees to forests
177(2)
5.12.3 Variable importance
179(1)
5.13 Logistic regression and regularization
180(8)
5.13.1 Regularization in order to avoid overfitting
184(2)
5.13.2 Variable importance
186(2)
5.14 Other supervised algorithms
188(8)
5.14.1 Gradient boosting
188(2)
5.14.2 Support Vector Machines (SVM)
190(3)
5.14.3 Neural networks and deep versions of it
193(2)
5.14.4 Ensemble learning
195(1)
5.15 Predicting continuous variables: Regression with machine learning
196(4)
5.15.1 Use case: Predicting age from DNA methylation
196(1)
5.15.2 Reading and processing the data
197(1)
5.15.3 Running random forest regression
198(2)
5.16 Exercises
200(3)
5.16.1 Classification
200(1)
5.16.2 Regression
201(2)
6 Operations on Genomic Intervals and Genome Arithmetic 203(34)
6.1 Operations on genomic intervals with GenomicRanges package
204(10)
6.1.1 How to create and manipulate a GRanges object
204(3)
6.1.2 Getting genomic regions into R as GRanges objects
207(3)
6.1.3 Finding regions that do/do not overlap with another set of regions
210(4)
6.2 Dealing with mapped high-throughput sequencing reads
214(1)
6.2.1 Counting mapped reads for a set of regions
214(1)
6.3 Dealing with continuous scores over the genome
215(6)
6.3.1 Extracting subsections of Rle and RleList objects
218(3)
6.4 Genomic intervals with more information: SummarizedExperiment class
221(4)
6.4.1 Create a SummarizedExperiment object
221(1)
6.4.2 Subset and manipulate the SummarizedExperiment object
222(3)
6.5 Visualizing and summarizing genomic intervals
225(9)
6.5.1 Visualizing intervals on a locus of interest
225(1)
6.5.2 Summaries of genomic intervals on multiple loci
226(4)
6.5.3 Making karyograms and circos plots
230(4)
6.6 Exercises
234(3)
6.6.1 Operations on genomic intervals with the GenomicRanges package
234(1)
6.6.2 Dealing with mapped high-throughput sequencing reads
235(1)
6.6.3 Dealing with contiguous scores over the genome
235(1)
6.6.4 Visualizing and summarizing genomic intervals
235(2)
7 Quality Check, Processing and Alignment of High-throughput Sequencing Reads 237(14)
7.1 FASTA and FASTQ formats
237(2)
7.2 Quality check on sequencing reads
239(4)
7.2.1 Sequence quality per base/cycle
239(1)
7.2.2 Sequence content per base/cycle
240(1)
7.2.3 Read frequency plot
241(1)
7.2.4 Other quality metrics and QC tools
241(2)
7.3 Filtering and trimming reads
243(3)
7.4 Mapping/aligning reads to the genome
246(2)
7.5 Further processing of aligned reads
248(1)
7.6 Exercises
248(3)
8 RNA-seq Analysis 251(44)
8.1 What is gene expression?
251(1)
8.2 Methods to detect gene expression
252(1)
8.3 Gene expression analysis using high-throughput sequencing technologies
252(38)
8.3.1 Processing raw data
253(1)
8.3.2 Alignment
253(1)
8.3.3 Quantification
254(1)
8.3.4 Within sample normalization of the read counts
255(1)
8.3.5 Computing different normalization schemes in R
256(3)
8.3.6 Exploratory analysis of the read count table
259(5)
8.3.7 Differential expression analysis
264(9)
8.3.8 Functional enrichment analysis
273(4)
8.3.9 Accounting for additional sources of variation
277(13)
8.4 Other applications of RNA-seq
290(1)
8.5 Exercises
291(4)
8.5.1 Exploring the count tables
291(1)
8.5.2 Differential expression analysis
292(1)
8.5.3 Functional enrichment analysis
292(1)
8.5.4 Removing unwanted variation from the expression data
293(2)
9 ChIP-seq analysis 295(72)
9.1 Regulatory protein-DNA interactions
295(1)
9.2 Measuring protein-DNA interactions with ChIP-seq
296(2)
9.3 Factors that affect ChIP-seq experiment and analysis quality
298(3)
9.3.1 Antibody specificity
298(1)
9.3.2 Sequencing depth
298(1)
9.3.3 PCR duplication
299(1)
9.3.4 Biological replicates
299(1)
9.3.5 Control experiments
299(2)
9.3.6 Using tagged proteins
301(1)
9.4 Pre-processing ChIP data
301(2)
9.4.1 Mapping of ChIP-seq data
301(2)
9.5 ChIP quality control
303(24)
9.5.1 The data
303(1)
9.5.2 Sample clustering
304(4)
9.5.3 Visualization in the genome browser
308(4)
9.5.4 Plus and minus strand cross-correlation
312(4)
9.5.5 GC bias quantification
316(4)
9.5.6 Sequence read genomic distribution
320(7)
9.6 Peak calling
327(32)
9.6.1 Types of ChIP-seq experiments
327(5)
9.6.2 Peak calling: Sharp peaks
332(9)
9.6.3 Peak calling: Broad regions
341(3)
9.6.4 Peak quality control
344(12)
9.6.5 Peak annotation
356(3)
9.7 Motif discovery
359(4)
9.7.1 Motif comparison
361(2)
9.8 What to do next?
363(1)
9.9 Exercises
364(3)
9.9.1 Quality control
364(3)
10 DNA methylation analysis using bisulfite sequencing data 367(26)
10.1 What is DNA methylation?
367(1)
10.1.1 How DNA methylation is set?
368(1)
10.1.2 How to measure DNA methylation with bisulfite sequencing
368(1)
10.2 Analyzing DNA methylation data
368(1)
10.3 Processing raw data and getting data into R
369(1)
10.4 Data filtering and exploratory analysis
370(9)
10.4.1 Reading methylation call files
370(2)
10.4.2 Further quality check
372(1)
10.4.3 Merging samples into a single table
373(1)
10.4.4 Filtering CpGs
374(2)
10.4.5 Clustering samples
376(2)
10.4.6 Principal component analysis
378(1)
10.5 Extracting interesting regions: Differential methylation and segmentation
379(9)
10.5.1 Differential methylation
379(5)
10.5.2 Methylation segmentation
384(3)
10.5.3 Working with large files
387(1)
10.6 Annotation of DMRs/DMCs and segments
388(2)
10.6.1 Further annotation with genes or gene sets
390(1)
10.7 Other R packages that can be used for methylation analysis
390(1)
10.8 Exercises
390(3)
10.8.1 Differential methylation
390(1)
10.8.2 Methylome segmentation
391(2)
11 Multi-omics Analysis 393(32)
11.1 Use case: Multi-omics data from colorectal cancer
393(6)
11.2 Latent variable models for multi-omics integration
399(1)
11.3 Matrix factorization methods for unsupervised multi-omics data integration
400(13)
11.3.1 Multiple factor analysis
401(3)
11.3.2 Joint non-negative matrix factorization
404(5)
11.3.3 iCluster
409(4)
11.4 Clustering using latent factors
413(3)
11.4.1 One-hot clustering
414(1)
11.4.2 K-means clustering
415(1)
11.5 Biological interpretation of latent factors
416(6)
11.5.1 Inspection of feature weights in loading vectors
416(2)
11.5.2 Making sense of factors using enrichment analysis
418(2)
11.5.3 Interpretation using additional covariates
420(2)
11.6 Exercises
422(3)
11.6.1 Matrix factorization methods
422(1)
11.6.2 Clustering using latent factors
423(1)
11.6.3 Biological interpretation of latent factors
423(2)
Bibliography 425(12)
Index 437
Dr. Altuna Akalin is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. His interest is in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He has lived in the USA, Norway, Turkey, Japan, and Switzerland in order to pursue research work and education related to computational genomics.