
Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package [Hardback]

  • Format: Hardback, 298 pages, height x width: 254x178 mm, weight: 825 g, 16 Tables, black and white; 121 Line drawings, color; 121 Illustrations, color
  • Series: Chapman & Hall/CRC Computational Biology Series
  • Publication date: 09-Nov-2021
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 0367460947
  • ISBN-13: 9780367460945
Large biological data, which are often noisy and high-dimensional, have become increasingly prevalent in biology and medicine. There is a real need for good training in statistics, from data exploration through to analysis and interpretation. This book provides an overview of statistical and dimension reduction methods for high-throughput biological data, with a specific focus on data integration. It starts with some biological background and the key concepts underlying the multivariate methods, then covers an array of methods implemented using the mixOmics package in R.

Features:

  • Provides a broad and accessible overview of methods for multi-omics data integration
  • Covers a wide range of multivariate methods, each designed to answer specific biological questions
  • Includes comprehensive visualisation techniques to aid in data interpretation
  • Includes many worked examples and case studies using real data
  • Includes reproducible R code for each multivariate method, using the mixOmics package

The book is suitable for researchers from a wide range of scientific disciplines wishing to apply these methods to obtain new and deeper insights into biological mechanisms and biomedical problems. The suite of tools introduced in this book will enable students and scientists to work at the interface between, and provide critical collaborative expertise to, biologists, bioinformaticians, statisticians and clinicians.

Reviews

"The value of the book is at least two-fold. First, it provides a compact but well-balanced introduction to the methodology of multivariate analysis in the context of omics data. Second, it instructs with the hands-on approach how the mixOmics R-package can be effectively used to perform suitable statistical analyses involving data in which several variables of different types (e.g. genes, proteins and metabolites) must be integrated into one analytic workflow [...] The authors not only lead through mixOmics but also provide very accurate and valuable references whenever a less well-known method or technique is discussed. Because the book is essentially a presentation of the methodology and its applications for the mixOmics project, the project's webpage http://www.mixOmics.org can be considered complementary to the book with its rich additional material [...] a well-written book, a properly balanced and designed mix of methodology and applications, meeting all the standards of exposition on modern computationally assisted inference methods [...] should have a broad appeal to those wanting to learn dimension reduction methodology, to practitioners in omics research area who want to use them, and even to general experts in the field of high-dimensional multivariate analysis [...] I highly recommend Multivariate Data Integration Using R to these audiences." - Krzysztof Podgórski, Lund University, Sweden; International Statistical Review, Oct 2024

"This book was eagerly awaited both to bring together numerous research works published in recent years and to support the use of the Mixomics software which has become an essential tool for data integration and exploration when dealing with multiple types of high-dimensional biological data. It is the result of many years of research on cutting-edge developments in this domain as for sparsity. The book is very pleasant to read and well-structured around the different multivariate approaches. It is well documented with many recent references on the statistical methods and is very didactic through numerous examples accompanied by R codes and illustrations. It can be used by a large audience of statisticians and biologists to process, analyze, visualize, and interpret their multivariate microbiome and multi-omics data, but also as a basis for a course. I highly recommend this book." - Philippe Bastien, Senior Research Associate - L'Oréal R&I

"The book belongs to the Computational Biology Series and presents a wide spectrum of modern methods of multivariate statistical analysis, integration and high-dimension reduction for biological data evaluated via the specialized R package. The neologism Omic is used as a root related to constellations of objects with biological information, for instance, in genomes and proteins - genomics and proteomics (in studying proteins expressed by cells and tissues), metabolic and transcription products - metabolomics and transcriptomics (in studying messenger RNA molecules expressed from the genes of an organism), or also in economics - Reaganomics, etc.

[...] Numerous links to the internet websites related to the considered methods of multi-omics data integration are suggested, particularly, the mixOmics project is described at the link http://www.mixOmics.org, and the package is available at Install |mixOmics. The developed methods and software are suitable not only for biologists and bioinformaticians students and researchers, but can be useful for solving computational and content problems in many other fields as well." - Technometrics

"This is an excellent book for computational biologists, bioinformaticians, statisticians, data scientists, and graduate students who work with high-throughput omics data. The book covers most fundamental concepts of multi-omics data integration, while focusing on their implementations through hands-on examples implemented in the mixOmics R package." - Yuehua Cui, Michigan State University, Biometrics, September 2022

Preface
Authors
I Modern biology and multivariate analysis
1 Multi-omics and biological systems
1.1 Statistical approaches for reductionist or holistic analyses
1.2 Multi-omics and multivariate analyses
1.2.1 More than a 'scale up' of univariate analyses
1.2.2 More than a fishing expedition
1.3 Shifting the analysis paradigm
1.4 Challenges with high-throughput data
1.4.1 Overfitting
1.4.2 Multi-collinearity and ill-posed problems
1.4.3 Zero values and missing values
1.5 Challenges with multi-omics integration
1.5.1 Data heterogeneity
1.5.2 Data size
1.5.3 Platforms
1.5.4 Expectations for analysis
1.5.5 Variety of analytical frameworks
1.6 Summary
2 The cycle of analysis
2.1 The Problem guides the analysis
2.2 Plan in advance
2.2.1 What affects statistical power?
2.2.2 Sample size
2.2.3 Identify covariates and confounders
2.2.4 Identify batch effects
2.3 Data cleaning and pre-processing
2.3.1 Normalisation
2.3.2 Filtering
2.3.3 Missing values
2.4 Analysis: Choose the right approach
2.4.1 Descriptive statistics
2.4.2 Exploratory statistics
2.4.3 Inferential statistics
2.4.4 Univariate or multivariate modelling?
2.4.5 Prediction
2.5 Conclusion and start the cycle again
2.6 Summary
3 Key multivariate concepts and dimension reduction in mixOmics
3.1 Measures of dispersion and association
3.1.1 Random variables and biological variation
3.1.2 Variance
3.1.3 Covariance
3.1.4 Correlation
3.1.5 Covariance and correlation in mixOmics context
3.1.6 R examples
3.2 Dimension reduction
3.2.1 Matrix factorisation
3.2.2 Factorisation with components and loading vectors
3.2.3 Data visualisation using components
3.3 Variable selection
3.3.1 Ridge penalty
3.3.2 Lasso penalty
3.3.3 Elastic net
3.3.4 Visualisation of the selected variables
3.4 Summary
4 Choose the right method for the right question in mixOmics
4.1 Types of analyses and methods
4.1.1 Single or multiple omics analysis?
4.1.2 N- or P-integration?
4.1.3 Unsupervised or supervised analyses?
4.1.4 Repeated measures analyses
4.1.5 Compositional data
4.2 Types of data
4.2.1 Classical omics
4.2.2 Microbiome data: A special case
4.2.3 Genotype data: A special case
4.2.4 Clinical variables that are categorical: A special case
4.3 Types of biological questions
4.3.1 A PCA type of question (one data set, unsupervised)
4.3.2 A PLS type of question (two data sets, regression or unsupervised)
4.3.3 A CCA type of question (two data sets, unsupervised)
4.3.4 A PLS-DA type of question (one data set, classification)
4.3.5 A multiblock PLS type of question (more than two data sets, supervised or unsupervised)
4.3.6 An N-integration type of question (several data sets, supervised)
4.3.7 A P-integration type of question (several studies of the same omics type, supervised or unsupervised)
4.4 Exemplar data sets in mixOmics
4.5 Summary
4.A Appendix: Data transformations in mixOmics
4.A.1 Multilevel decomposition
4.A.2 Mixed-effect model context
4.A.3 Split-up variation
4.A.4 Example of multilevel decomposition in mixOmics
4.B Centered log ratio transformation
4.C Creating dummy variables
II mixOmics under the hood
5 Projection to latent structures
5.1 PCA as a projection algorithm
5.1.1 Overview
5.1.2 Calculating the components
5.1.3 Meaning of the loading vectors
5.1.4 Example using the linnerud data in mixOmics
5.2 Singular Value Decomposition (SVD)
5.2.1 SVD algorithm
5.2.2 Example in R
5.2.3 Matrix approximation
5.3 Non-linear Iterative Partial Least Squares (NIPALS)
5.3.1 NIPALS pseudo algorithm
5.3.2 Local regressions
5.3.3 Deflation
5.3.4 Missing values
5.4 Other matrix factorisation methods in mixOmics
5.5 Summary
6 Visualisation for data integration
6.1 Sample plots using components
6.1.1 Example with PCA and plotIndiv
6.1.2 Sample plot for the integration of two or more data sets
6.1.3 Representing paired coordinates using plotArrow
6.2 Variable plots using components and loading vectors
6.2.1 Loading plots
6.2.2 Correlation circle plots
6.2.3 Biplots
6.2.4 Relevance networks
6.2.5 Clustered Image Maps (CIM)
6.2.6 Circos plots
6.3 Summary
6.A Appendix: Similarity matrix in relevance networks and CIM
6.A.1 Pairwise variable associations for CCA
6.A.2 Pairwise variable associations for PLS
6.A.3 Constructing relevance networks and displaying CIM
7 Performance assessment in multivariate analyses
7.1 Main parameters to choose
7.2 Performance assessment
7.2.1 Training and testing: If we were rich
7.2.2 Cross-validation: When we are poor
7.3 Performance measures
7.3.1 Evaluation measures for regression
7.3.2 Evaluation measures for classification
7.3.3 Details of the tuning process
7.4 Final model assessment
7.4.1 Assessment of the performance
7.4.2 Assessment of the signature
7.5 Prediction
7.5.1 Prediction of a continuous response
7.5.2 Prediction of a categorical response
7.5.3 Prediction is related to the number of components
7.6 Summary and roadmap of analysis
III mixOmics in action
8 mixOmics: Get started
8.1 Prepare the data
8.1.1 Normalisation
8.1.2 Filtering variables
8.1.3 Centering and scaling the data
8.1.4 Managing missing values
8.1.5 Managing batch effects
8.1.6 Data format
8.2 Get ready with the software
8.2.1 R installation
8.2.2 Pre-requisites
8.2.3 mixOmics download
8.2.4 Load the package
8.3 Coding practices
8.3.1 Set the working directory
8.3.2 Good coding practices
8.4 Upload data
8.4.1 Data sets
8.4.2 Dependent variables
8.4.3 Set up the outcome for supervised classification analyses
8.4.4 Check data upload
8.5 Structure of the following chapters
9 Principal Component Analysis (PCA)
9.1 Why use PCA?
9.1.1 Biological questions
9.1.2 Statistical point of view
9.2 Principle
9.2.1 PCA
9.2.2 Sparse PCA
9.3 Input arguments
9.3.1 Center or scale the data?
9.3.2 Number of components (choice of dimensions)
9.3.3 Number of variables to select in sPCA
9.4 Key outputs
9.5 Case study: Multidrug
9.5.1 Load the data
9.5.2 Quick start
9.5.3 Example: PCA
9.5.4 Example: Sparse PCA
9.5.5 Example: Missing values imputation
9.6 To go further
9.6.1 Additional processing steps
9.6.2 Independent component analysis
9.6.3 Incorporating biological information
9.7 FAQ
9.8 Summary
9.A Appendix: Non-linear Iterative Partial Least Squares
9.A.1 Solving PCA with NIPALS
9.A.2 Estimating missing values with NIPALS
9.B Appendix: sparse PCA
9.B.1 sparse PCA-SVD
9.B.2 sPCA pseudo algorithm
9.B.3 Other sPCA methods
10 Projection to Latent Structure (PLS)
10.1 Why use PLS?
10.1.1 Biological questions
10.1.2 Statistical point of view
10.2 Principle
10.2.1 Univariate PLS1 and multivariate PLS2
10.2.2 PLS deflation modes
10.2.3 sparse PLS
10.3 Input arguments and tuning
10.3.1 The deflation mode
10.3.2 The number of dimensions
10.3.3 Number of variables to select
10.4 Key outputs
10.4.1 Graphical outputs
10.4.2 Numerical outputs
10.5 Case study: Liver toxicity
10.5.1 Load the data
10.5.2 Quick start
10.5.3 Example: PLS1 regression
10.5.4 Example: PLS2 regression
10.6 Take a detour: PLS2 regression for prediction
10.7 To go further
10.7.1 Orthogonal projections to latent structures
10.7.2 Redundancy analysis
10.7.3 Group PLS
10.7.4 PLS path modelling
10.7.5 Other sPLS variants
10.8 FAQ
10.9 Summary
10.A Appendix: PLS algorithm
10.A.1 PLS Pseudo algorithm
10.A.2 Convergence of the PLS iterative algorithm
10.A.3 PLS-SVD method
10.B Appendix: sparse PLS
10.B.1 sparse PLS-SVD
10.B.2 sparse PLS pseudo algorithm
10.C Appendix: Tuning the number of components
10.C.1 In PLS1
10.C.2 In PLS2
11 Canonical Correlation Analysis (CCA)
11.1 Why use CCA?
11.1.1 Biological question
11.1.2 Statistical point of view
11.2 Principle
11.2.1 CCA
11.2.2 rCCA
11.3 Input arguments and tuning
11.3.1 CCA
11.3.2 rCCA
11.4 Key outputs
11.4.1 Graphical outputs
11.4.2 Numerical outputs
11.5 Case study: Nutrimouse
11.5.1 Load the data
11.5.2 Quick start
11.5.3 Example: CCA
11.5.4 Example: rCCA
11.6 To go further
11.7 FAQ
11.8 Summary
11.A Appendix: CCA and variants
11.A.1 Solving classical CCA
11.A.2 Regularised CCA
12 PLS-Discriminant Analysis (PLS-DA)
12.1 Why use PLS-DA?
12.1.1 Biological question
12.1.2 Statistical point of view
12.2 Principle
12.2.1 PLS-DA
12.2.2 sparse PLS-DA
12.3 Input arguments and tuning
12.3.1 PLS-DA
12.3.2 sPLS-DA
12.3.3 Framework to manage overfitting
12.4 Key outputs
12.4.1 Numerical outputs
12.4.2 Graphical outputs
12.5 Case study: SRBCT
12.5.1 Load the data
12.5.2 Quick start
12.5.3 Example: PLS-DA
12.5.4 Example: sPLS-DA
12.5.5 Take a detour: Prediction
12.5.6 AUROC outputs complement performance evaluation
12.6 To go further
12.6.1 Microbiome
12.6.2 Multilevel
12.6.3 Other related methods and packages
12.7 FAQ
12.8 Summary
12.A Appendix: Prediction in PLS-DA
12.A.1 Prediction distances
12.A.2 Background area
13 N-data integration
13.1 Why use N-integration methods?
13.1.1 Biological question
13.1.2 Statistical point of view and analytical challenges
13.2 Principle
13.2.1 Multiblock sPLS-DA
13.2.2 Prediction in multiblock sPLS-DA
13.3 Input arguments and tuning
13.4 Key outputs
13.4.1 Graphical outputs
13.4.2 Numerical outputs
13.5 Case Study: breast.TCGA
13.5.1 Load the data
13.5.2 Quick start
13.5.3 Parameter choice
13.5.4 Final model
13.5.5 Sample plots
13.5.6 Variable plots
13.5.7 Model performance and prediction
13.6 To go further
13.6.1 Additional data transformation for special cases
13.6.2 Other N-integration frameworks in mixOmics
13.6.3 Supervised classification analyses: concatenation and ensemble methods
13.6.4 Unsupervised analyses: JIVE and MOFA
13.7 FAQ
13.8 Additional resources
13.9 Summary
13.A Appendix: Generalised CCA and variants
13.A.1 regularised GCCA
13.A.2 sparse GCCA
13.A.3 sparse multiblock sPLS-DA
14 P-data integration
14.1 Why use P-integration methods?
14.1.1 Biological question
14.1.2 Statistical point of view
14.2 Principle
14.2.1 Motivation
14.2.2 Multi-group sPLS-DA
14.3 Input arguments and tuning
14.3.1 Data input checks
14.3.2 Number of components
14.3.3 Number of variables to select per component
14.4 Key outputs
14.4.1 Graphical outputs
14.4.2 Numerical outputs
14.5 Case Study: stemcells
14.5.1 Load the data
14.5.2 Quick start
14.5.3 Example: MINT PLS-DA
14.5.4 Example: MINT sPLS-DA
14.5.5 Take a detour
14.6 Examples of application
14.6.1 16S rRNA gene data
14.6.2 Single cell transcriptomics
14.7 To go further
14.8 Summary
Glossary of terms
Key publications
Bibliography
Index
Dr Kim-Anh Lê Cao develops novel methods, software and tools to interpret big biological data and answer research questions efficiently. She is committed to statistical education to instill best analytical practice, has taught numerous statistical workshops for biologists, and leads collaborative projects in medicine, fundamental biology or microbiology disciplines. Dr Kim-Anh Lê Cao has a mathematical engineering background and graduated with a PhD in Statistics from the Université de Toulouse, France. She then moved to Australia, first as a biostatistician consultant at QFAB Bioinformatics, then as a research group leader at the biomedical University of Queensland Diamantina Institute. She is currently Associate Professor in Statistical Genomics at the University of Melbourne. In 2019, Kim-Anh received the Australian Academy of Science Moran Medal for her contributions to Applied Statistics in multidisciplinary collaborations. She has been part of leadership programs for women in STEMM, including the international Homeward Bound, which culminated in a trip to Antarctica, and Superstars of STEM from Science & Technology Australia.

Zoe Welham completed a BSc in molecular biology and during this time developed a keen interest in the analysis of big data. She completed a Masters of Bioinformatics with a focus on the statistical integration of different omics data in bowel cancer. She is currently a PhD candidate at the Kolling Institute in Sydney where she is furthering her research into bowel cancer with a focus on integrating microbiome data with other omics to characterise early bowel polyps. Her research interests include bioinformatics and biostatistics for many areas of biology and disseminating that information to the general public through reader-friendly writing.