
Introduction to Statistical Genetic Data Analysis [Paperback]

(École Nationale de la Statistique et de L'administration Économique (ENSAE)), (University of Essex), (University of Oxford)
  • Format: Paperback / softback, 432 pages, height x width x thickness: 229x178x25 mm, 72 b&w illus.; 144 Illustrations
  • Series: The MIT Press
  • Publication date: 18-Feb-2020
  • Publisher: MIT Press
  • ISBN-10: 0262538385
  • ISBN-13: 9780262538381

A comprehensive introduction to modern applied statistical genetic data analysis, accessible to those without a background in molecular biology or genetics.

Human genetic research is now relevant beyond biology, epidemiology, and the medical sciences, with applications in such fields as psychology, psychiatry, statistics, demography, sociology, and economics. With advances in computing power, the availability of data, and new techniques, it is now possible to integrate large-scale molecular genetic information into research across a broad range of topics. This book offers the first comprehensive introduction to modern applied statistical genetic data analysis that covers theory, data preparation, and analysis of molecular genetic data, with hands-on computer exercises. It is accessible to students and researchers in any empirically oriented medical, biological, or social science discipline; a background in molecular biology or genetics is not required.

The book first provides foundations for statistical genetic data analysis, including a survey of fundamental concepts, primers on statistics and human evolution, and an introduction to polygenic scores. It then covers the practicalities of working with genetic data, discussing such topics as analytical challenges and data management. Finally, the book presents applications and advanced topics, including polygenic score and gene-environment interaction applications, Mendelian Randomization and instrumental variables, and ethical issues. The software and data used in the book are freely available and can be found on the book's website.



Preface
I Foundations
1 Introduction: Fundamental Concepts and the Human Genome
Objectives
1.1 Introduction
1.1.1 Motivation and aim of this book
1.1.2 Overview of topics covered in this book
1.1.3 What are DNA, the genome, a gene, and a chromosome?
1.2 Mendel's laws, sexual reproduction, and genetic recombination
1.3 Genetic polymorphisms
1.3.1 Alleles, single-nucleotide polymorphisms (SNPs), and minor allele frequency (MAF)
1.3.2 Monogenic, polygenic, and omnigenic effects
1.4 From genes to protein and the central dogma of molecular biology
1.4.1 From genes to protein: Genes, amino acids, nucleotides, and proteins
1.4.2 The central dogma of molecular biology: Transcription and translation
1.5 Homozygous and heterozygous alleles, dominant and recessive traits
1.6 Heritability
1.6.1 Defining heritability: Broad- and narrow-sense heritability
1.6.2 Common misconceptions about heritability
1.6.3 Twin, SNP, and GWAS heritability
1.6.4 Missing and hidden heritability
1.7 Conclusion
Exercises
Further reading and resources
References
2 A Statistical Primer for Genetic Data Analysis
Objectives
2.1 Introduction
2.2 Basic statistical concepts
2.2.1 Mean, standard deviation, and variance
2.2.2 Covariance and the variance-covariance matrix
2.3 Statistical models
2.3.1 Regression models
2.3.2 The null and alternative hypothesis and significance thresholds
2.4 Correlation, causation, and multivariate causal models
2.4.1 Correlation versus causation
2.4.2 Multivariate causal models
2.5 Fixed-effects models, random-effects models, and mixed models
2.6 Replication of results and overfitting
2.7 Conclusion
Exercises
Further reading
Software for mixed-model analyses
Appendix
References
3 A Primer in Human Evolution
Objectives
3.1 Introduction
3.2 Human dispersal out of Africa
3.3 Population structure and stratification
3.3.1 Population structure, genetic admixture, and Principal Component Analysis (PCA)
3.3.2 Common misnomers of population structure: Ancestry is not race
3.3.3 Genetic scores cannot be transferred across ancestry groups
3.3.4 How genes mirror geography
3.4 Human evolution, selection, and adaptation
3.4.1 Evolution, fitness, and natural selection
3.4.2 Genetic drift
3.5 The Hardy-Weinberg equilibrium
3.5.1 Assumptions of the HWE
3.5.2 Understanding the notation of the HWE
3.6 Linkage disequilibrium and haplotype blocks
3.7 Conclusion
Exercises
Further reading and resources
References
4 Genome-Wide Association Studies
Objectives
4.1 Introduction and background
4.2 GWAS research design and meta-analysis
4.2.1 GWAS research design
4.2.2 Data analysis plan
4.2.3 Meta-analysis
4.3 Statistical inference, methods, and heterogeneity
4.3.1 Nature of the phenotype
4.3.2 P-values and Z-scores
4.3.3 Correcting for multiple testing in a GWAS
4.3.4 Manhattan plots
4.3.5 Evaluating dichotomous versus quantitative traits
4.3.6 Fixed-effects versus random-effects models
4.3.7 Weighting, false discovery rate (FDR), and imputation
4.3.8 Sources of heterogeneity
4.4 Quality control (QC) of genetic data
4.5 The NHGRI-EBI GWAS Catalog
4.5.1 What is the NHGRI-EBI GWAS Catalog?
4.5.2 A brief history of the GWAS
4.5.3 Lack of diversity in GWASs
4.6 Conclusion and future directions
Exercises
Further reading
References
5 Introduction to Polygenic Scores and Genetic Architecture
Objectives
5.1 Introduction
5.1.1 What is a polygenic score?
5.1.2 The origins of polygenic scores
5.2 Construction of polygenic scores
5.2.1 Large sample sizes required in GWAS discovery
5.2.2 Selection of SNPs to include
5.3 Validation and prediction of polygenic scores
5.3.1 Independent target sample
5.3.2 Similar ancestry in target sample
5.3.3 Relatedness, population stratification, and differential bias
5.3.4 Variance explained only by common genetic markers missing rare variants
5.3.5 Missing and hidden heritability in prediction of phenotypes from genetic markers (SNPs)
5.3.6 Trade-off between prediction and understanding biological mechanisms
5.4 Shared genetic architecture of phenotypes
5.4.1 Predicting other phenotypes
5.4.2 Phenotypic and genetic correlation
5.4.3 Pleiotropy
5.4.4 Multitrait analysis
5.5 Causal modeling with polygenic scores
5.5.1 Genetic confounding
5.5.2 Mendelian Randomization
5.5.3 Controlling for confounders
5.5.4 Gene-environment interaction and heterogeneity
5.6 Conclusion
Exercises
Further reading
References
6 Gene-Environment Interplay
Objectives
6.1 Introduction: What is gene-environment (GxE) interplay?
6.2 Defining the environment in GxE research
6.2.1 Nature and scope of E: Multilevel, multidomain, and multitemporal
6.2.2 Interdependence of environmental risk factors
6.3 A brief history of GxE research
6.3.1 Classic approaches
6.3.2 Candidate gene cGxE approaches
6.3.3 Genome-wide polygenic score GxE approaches
6.4 Conceptual GxE models
6.4.1 Diathesis-stress, vulnerability, or contextual triggering model
6.4.2 Bioecological or social compensation model
6.4.3 Differential susceptibility model
6.4.4 Social control or social push model
6.4.5 Research designs to study GxE
6.5 Gene-environment correlation (rGE)
6.5.1 Passive gene-environment correlation (rGE)
6.5.2 Evocative (or reactive) rGE
6.5.3 Active rGE
6.5.4 Why are models of rGE important?
6.5.5 Research designs to study rGE
6.6 Conclusion and future directions
6.6.1 Why haven't many GxEs been identified?
Exercises
Further reading
References
II Working with Genetic Data
7 Genetic Data and Analytical Challenges
Objectives
7.1 Introduction
7.2 Genotyping and sequencing array
7.2.1 Genotyping and sequencing technologies
7.2.2 Linkage disequilibrium and imputation
7.2.3 Limitations of genotyping arrays and next-generation sequencing
7.2.4 Drop in costs per genome
7.3 Overview of human genetic data for analysis
7.3.1 Prominently used genetic data
7.3.2 Sources that archive and distribute data
7.3.3 Obtaining GWAS summary statistics
7.4 Different formats in genomics data
7.4.1 Genomics data is big data
7.4.2 PLINK software and genotype formats
7.4.3 PLINK binary files
7.5 Genetic formats for imputed data
7.5.1 PLINK 2.0
7.5.2 Oxford file formats
7.5.3 The variant call format (VCF)
7.6 Data used in this book
7.7 Data transfer, storage, size, and computing power
7.7.1 Data storage
7.7.2 Data sharing, transfer across borders, and cloud storage
7.7.3 Size of data and computational power
7.8 Conclusion
Exercises
Further reading and resources
References
8 Working with Genetic Data, Part I: Data Management, Descriptive Statistics, and Quality Control
Objectives
8.1 Introduction: Working with genetic data
8.2 Getting started with PLINK
8.2.1 The command line
8.2.2 Calling PLINK and the PLINK command line
8.2.3 Running scripts in terminal
8.2.4 Opening PLINK files
8.2.5 Recode binary files to create new readable dataset with .ped and .map files
8.2.6 Import data from other formats
8.3 Data management
8.3.1 Select individuals and markers
8.3.2 Merge different genetic files and attaching a phenotype
8.4 Descriptive statistics
8.4.1 Allele frequency
8.4.2 Missing values
8.5 Quality control of genetic data
8.5.1 Per-individual QC
8.5.2 Per-marker QC
8.5.3 Genome-wide association meta-analysis QC
8.6 Conclusion
Exercises
Further reading and resources
References
9 Working with Genetic Data, Part II: Association Analysis, Population Stratification, and Genetic Relatedness
Objectives
9.1 Introduction
9.1.1 Aim of this chapter
9.1.2 Data and computer programs used in this chapter
9.2 Association analysis
9.3 Linkage disequilibrium
9.4 Population stratification
9.5 Genetic relatedness
9.6 Relatedness matrix and heritability with GCTA
9.7 Conclusion
Exercises
Further reading and resources
References
10 An Applied Guide to Creating and Validating Polygenic Scores
Objectives
10.1 Introduction
10.1.1 Creating a polygenic score
10.1.2 Data used in this chapter
10.2 How to construct a score with selected variants (monogenic)
10.3 Pruning and thresholding method
10.4 How to calculate a polygenic score using PRSice 2.0
10.5 Validating the PGS
10.6 LDpred: Accounting for LD in polygenic score calculations
10.6.1 Introduction and three steps
10.7 Conclusion
Exercises
Further reading and resources
References
III Applications and Advanced Topics
11 Polygenic Score and Gene-Environment Interaction (GxE) Applications
Objectives
11.1 Introduction
11.2 Polygenic score applications: (Cross-trait) prediction and confounding
11.2.1 Out-of-sample prediction
11.2.2 Cross-trait prediction and genetic covariation
11.2.3 Genetic confounding
11.3 Gene-environment interaction
11.3.1 Application: BMI x birth cohort
11.4 Challenges in gene-environment interaction research
11.5 Conclusion and future directions
Exercises
Further reading
References
12 Applying Genome-Wide Association Results
Objectives
12.1 Introduction
12.2 Plotting association results
12.2.1 Manhattan plots
12.2.2 Regional association plots
12.2.3 Quantile-Quantile plots and the λ statistic
12.3 Estimating heritability from summary statistics
12.4 Estimating genetic correlations from summary statistics
12.5 MTAG: Multi-Trait Analysis of Genome-wide association summary statistics
12.6 Conclusion
Exercises
Further reading and resources
References
13 Mendelian Randomization and Instrumental Variables
Objectives
13.1 Introduction
13.2 Randomized control trials and causality
13.3 Mendelian Randomization
13.4 Instrumental variables and Mendelian Randomization
13.4.1 The IV model in an MR framework
13.4.2 Violation of statistical assumptions of the IV approach
13.5 Extensions of standard MR
13.5.1 Using multiple markers as independent instruments
13.5.2 Using polygenic scores as IVs
13.5.3 Bidirectional MR analyses
13.6 Applications of MR
13.6.1 Consequences of alcohol consumption
13.6.2 Body mass index and mortality
13.6.3 Causes of dementia and Alzheimer's disease
13.7 Conclusion
Exercises
Further reading
References
14 Ethical Issues in Genomics Research
Objectives
14.1 Introduction
14.2 Genetics is not destiny: Genetic determinism
14.2.1 Variation in traits and ability to use individual PGSs as predictors
14.2.2 Heritability and missing heritability
14.3 Clinical use of PGSs
14.3.1 Genetics and family history
14.3.2 Genetic scores for screening, intervention, and life planning
14.3.3 Pharmacogenetics
14.3.4 Public understanding of genetic information and information risks
14.4 Lack of diversity in genomics
14.4.1 Lack of diversity in GWASs
14.4.2 European ancestry bias related to PGS construction
14.5 Privacy, consent, legal issues, insurance, and General Data Protection Regulation
14.5.1 Privacy in the age of public genetics: Solving crimes and finding people
14.5.2 The changing nature of informed consent in genomic research
14.5.3 Insurance and genetics
14.5.4 GDPR and genetics
14.6 Conclusion and future directions
Further reading and resources
References
15 Conclusions and Future Directions
15.1 Summary and reflection
15.2 Future directions
References
Appendix 1 Software Used in This Book
A1.1 Introduction
A1.2 RStudio and R
A1.3 PLINK
A1.4 GCTA
A1.5 PRSice
A1.6 Python
A1.6.1 How to switch from Python 3 to Python 2
A1.6.2 Installing packages in Python
A1.7 Git
A1.8 LDpred
A1.9 LDSC
A1.10 MTAG
A1.11 Using Windows for this book
References
Appendix 2 Data Used in This Book
A2.1 Introduction
A2.2 Description of simulated data
A2.3 Health and Retirement Study
A2.4 Data used by chapter
References
Glossary
Notes
Index