Foreword | xiii
Preface | xvii
| 1 | (12)
The Promises and Challenges of Big Data in Biology and Life Sciences | 3 | (1)
Infrastructure Challenges | 3 | (1)
Toward a Cloud-Based Ecosystem for Data Sharing and Analysis | 4 | (6)
Cloud-Hosted Data and Compute | 5 | (1)
Platforms for Research in the Life Sciences | 6 | (2)
Standardization and Reuse of Infrastructure | 8 | (2)
| 10 | (1)
| 11 | (2)
2 Genomics in a Nutshell: A Primer for Newcomers to the Field | 13 | (40)
| 13 | (8)
The Gene as a Discrete Unit of Inheritance (Sort Of) | 14 | (2)
The Central Dogma of Biology: DNA to RNA to Protein | 16 | (2)
The Origins and Consequences of DNA Mutations | 18 | (1)
Genomics as an Inventory of Variation in and Among Genomes | 19 | (1)
The Challenge of Genomic Scale, by the Numbers | 20 | (1)
| 21 | (11)
The Reference Genome as Common Framework | 21 | (3)
Physical Classification of Variants | 24 | (5)
Germline Variants Versus Somatic Alterations | 29 | (3)
High-Throughput Sequencing Data Generation | 32 | (9)
From Biological Sample to Huge Pile of Read Data | 33 | (4)
Types of DNA Libraries: Choosing the Right Experimental Design | 37 | (4)
Data Processing and Analysis | 41 | (11)
Mapping Reads to the Reference Genome | 41 | (2)
| 43 | (5)
Data Quality and Sources of Error | 48 | (3)
Functional Equivalence Pipeline Specification | 51 | (1)
| 52 | (1)
3 Computing Technology Basics for Life Scientists | 53 | (26)
Basic Infrastructure Components and Performance Bottlenecks | 54 | (6)
Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG | 54 | (1)
Levels of Compute Organization: Core, Node, Cluster, and Cloud | 55 | (1)
Addressing Performance Bottlenecks | 56 | (4)
| 60 | (4)
Parallelizing a Simple Analysis | 60 | (1)
From Cores to Clusters and Clouds: Many Levels of Parallelism | 61 | (2)
Trade-Offs of Parallelism: Speed, Efficiency, and Cost | 63 | (1)
Pipelining for Parallelization and Automation | 64 | (4)
| 66 | (1)
Popular Pipelining Languages for Genomics | 66 | (1)
Workflow Management Systems | 67 | (1)
Virtualization and the Cloud | 68 | (9)
| 69 | (3)
| 72 | (2)
Categories of Research Use Cases for Cloud Services | 74 | (3)
| 77 | (2)
4 First Steps in the Cloud | 79 | (36)
Setting Up Your Google Cloud Account and First Project | 79 | (5)
| 80 | (1)
Checking Your Billing Account and Activating Free Credits | 81 | (3)
Running Basic Commands in Google Cloud Shell | 84 | (10)
Logging in to the Cloud Shell VM | 84 | (1)
Using gsutil to Access and Manage Files | 85 | (4)
Pulling a Docker Image and Spinning Up the Container | 89 | (3)
Mounting a Volume to Access the Filesystem from Within the Container | 92 | (2)
Setting Up Your Own Custom VM | 94 | (15)
Creating and Configuring Your VM Instance | 94 | (6)
Logging into Your VM by Using SSH | 100 | (2)
Checking Your Authentication | 102 | (1)
Copying the Book Materials to Your VM | 103 | (2)
Installing Docker on Your VM | 105 | (1)
Setting Up the GATK Container Image | 106 | (2)
Stopping Your VM...to Stop It from Costing You Money | 108 | (1)
Configuring IGV to Read Data from GCS Buckets | 109 | (4)
| 113 | (2)
| 115 | (32)
Getting Started with GATK | 115 | (10)
| 116 | (1)
| 117 | (1)
Multithreading with Spark | 118 | (3)
| 121 | (4)
Getting Started with Variant Discovery | 125 | (18)
Calling Germline SNPs and Indels with HaplotypeCaller | 125 | (10)
Filtering Based on Variant Context Annotations | 135 | (8)
Introducing the GATK Best Practices | 143 | (2)
Best Practices Workflows Covered in This Book | 145 | (1)
| 145 | (1)
| 145 | (2)
6 GATK Best Practices for Germline Short Variant Discovery | 147 | (36)
| 147 | (8)
Mapping Reads to the Genome Reference | 149 | (2)
| 151 | (2)
Recalibrating Base Quality Scores | 153 | (2)
| 155 | (18)
Overview of the Joint Calling Workflow | 156 | (4)
Calling Variants per Sample to Generate GVCFs | 160 | (2)
| 162 | (2)
Applying Joint Genotyping to Multiple Samples | 164 | (2)
Filtering the Joint Callset with Variant Quality Score Recalibration | 166 | (5)
Refining Genotype Assignments and Adjusting Genotype Confidence | 171 | (1)
Next Steps and Further Reading | 172 | (1)
Single-Sample Calling with CNN Filtering | 173 | (7)
Overview of the CNN Single-Sample Workflow | 175 | (1)
Applying 1D CNN to Filter a Single-Sample WGS Callset | 176 | (2)
Applying 2D CNN to Include Read Data in the Modeling | 178 | (2)
| 180 | (3)
7 GATK Best Practices for Somatic Variant Discovery | 183 | (26)
Challenges in Cancer Genomics | 183 | (2)
Somatic Short Variants (SNVs and Indels) | 185 | (12)
Overview of the Tumor-Normal Pair Analysis Workflow | 186 | (2)
| 188 | (2)
Running Mutect2 on the Tumor-Normal Pair | 190 | (1)
Estimating Cross-Sample Contamination | 191 | (2)
| 193 | (2)
Annotating Predicted Functional Effects with Funcotator | 195 | (2)
Somatic Copy-Number Alterations | 197 | (11)
Overview of the Tumor-Only Analysis Workflow | 198 | (3)
Creating a Somatic CNA PoN | 201 | (1)
| 202 | (2)
Performing Segmentation and Call CNAs | 204 | (3)
Additional Analysis Options | 207 | (1)
| 208 | (1)
8 Automating Analysis Execution with Workflows | 209 | (36)
Introducing WDL and Cromwell | 210 | (2)
Installing and Setting Up Cromwell | 212 | (4)
Your First WDL: Hello World | 216 | (10)
Learning Basic WDL Syntax Through a Minimalist Example | 216 | (2)
Running a Simple WDL with Cromwell on Your Google VM | 218 | (1)
Interpreting the Important Parts of Cromwell's Logging Output | 219 | (3)
Adding a Variable and Providing Inputs via JSON | 222 | (1)
Adding Another Task to Make It a Proper Workflow | 223 | (3)
Your First GATK Workflow: Hello HaplotypeCaller | 226 | (10)
| 226 | (3)
Generating the Inputs JSON | 229 | (2)
| 231 | (2)
Breaking the Workflow to Test Syntax Validation and Error Messaging | 233 | (3)
Introducing Scatter-Gather Parallelism | 236 | (8)
| 237 | (5)
Generating a Graph Diagram for Visualization | 242 | (2)
| 244 | (1)
9 Deciphering Real Genomics Workflows | 245 | (24)
Mystery Workflow #1: Flexibility Through Conditionals | 245 | (12)
| 246 | (5)
Reverse Engineering the Conditional Switch | 251 | (6)
Mystery Workflow #2: Modularity and Code Reuse | 257 | (11)
| 257 | (5)
Unpacking the Nesting Dolls | 262 | (6)
| 268 | (1)
10 Running Single Workflows at Scale with Pipelines API | 269 | (26)
Introducing the GCP Genomics Pipelines API Service | 269 | (2)
Enabling Genomics API and Related APIs in Your Google Cloud Project | 270 | (1)
Directly Dispatching Cromwell Jobs to PAPI | 271 | (9)
Configuring Cromwell to Communicate with PAPI | 272 | (3)
Running Scattered HaplotypeCaller via PAPI | 275 | (2)
Monitoring Workflow Execution on Google Compute Engine | 277 | (3)
Understanding and Optimizing Workflow Efficiency | 280 | (7)
Granularity of Operations | 281 | (1)
Balance of Time Versus Money | 282 | (2)
Suggested Cost-Saving Optimizations | 284 | (2)
Platform-Specific Optimization Versus Portability | 286 | (1)
Wrapping Cromwell and PAPI Execution with WDL Runner | 287 | (6)
| 288 | (1)
Running the Scattered HaplotypeCaller Workflow with WDL Runner | 288 | (2)
Monitoring WDL Runner Execution | 290 | (3)
| 293 | (2)
11 Running Many Workflows Conveniently in Terra | 295 | (36)
Getting Started with Terra | 295 | (7)
| 296 | (2)
Creating a Billing Project | 298 | (3)
Cloning the Preconfigured Workspace | 301 | (1)
Running Workflows with the Cromwell Server in Terra | 302 | (18)
Running a Workflow on a Single Sample | 302 | (3)
Running a Workflow on Multiple Samples in a Data Table | 305 | (6)
Monitoring Workflow Execution | 311 | (5)
Locating Workflow Outputs in the Data Table | 316 | (2)
Running the Same Workflow Again to Demonstrate Call Caching | 318 | (2)
Running a Real GATK Best Practices Pipeline at Full Scale | 320 | (8)
Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery | 320 | (1)
Examining the Preloaded Data | 321 | (2)
Selecting Data and Configuring the Full-Scale Workflow | 323 | (1)
Launching the Full-Scale Workflow and Monitoring Execution | 324 | (3)
Options for Downloading Output Data---or Not | 327 | (1)
| 328 | (3)
12 Interactive Analysis in Jupyter Notebook | 331 | (42)
Introduction to Jupyter in Terra | 332 | (8)
Jupyter Notebooks in General | 332 | (2)
How Jupyter Notebooks Work in Terra | 334 | (6)
Getting Started with Jupyter in Terra | 340 | (13)
Inspecting and Customizing the Notebook Runtime Configuration | 341 | (5)
Opening Notebook in Edit Mode and Checking the Kernel | 346 | (1)
Running the Hello World Cells | 347 | (3)
Using gsutil to Interact with Google Cloud Storage Buckets | 350 | (1)
Setting Up a Variable Pointing to the Germline Data in the Book Bucket | 351 | (1)
Setting Up a Sandbox and Saving Output Files to the Workspace Bucket | 352 | (1)
Visualizing Genomic Data in an Embedded IGV Window | 353 | (5)
Setting Up the Embedded IGV Browser | 354 | (1)
Adding Data to the IGV Browser | 355 | (2)
Setting Up an Access Token to View Private Data | 357 | (1)
Running GATK Commands to Learn, Test, or Troubleshoot | 358 | (7)
Running a Basic GATK Command: HaplotypeCaller | 359 | (1)
Loading the Data (BAM and VCF) into IGV | 360 | (3)
Troubleshooting a Questionable Variant Call in the Embedded IGV Browser | 363 | (2)
Visualizing Variant Context Annotation Data | 365 | (7)
Exporting Annotations of Interest with VariantsToTable | 365 | (1)
Loading R Script to Make Plotting Functions Available | 366 | (1)
Making Density Plots for QUAL by Using makeDensityPlot | 367 | (3)
Making a Scatter Plot of QUAL Versus DP | 370 | (1)
Making a Scatter Plot Flanked by Marginal Density Plots | 371 | (1)
| 372 | (1)
13 Assembling Your Own Workspace in Terra | 373 | (40)
Managing Data Inside and Outside of Workspaces | 373 | (5)
The Workspace Bucket as Data Repository | 374 | (1)
Accessing Private Data That You Manage Outside of Terra | 374 | (3)
Accessing Data in the Terra Data Library | 377 | (1)
Re-Creating the Tutorial Workspace from Base Components | 378 | (12)
| 378 | (2)
Adding the Workflow to the Methods Repository and Importing It into the Workspace | 380 | (2)
Creating a Configuration Quickly with a JSON File | 382 | (2)
| 384 | (2)
Filling in the Workspace Resource Data Table | 386 | (1)
Creating a Workflow Configuration That Uses the Data Tables | 387 | (2)
Adding the Notebook and Checking the Runtime Environment | 389 | (1)
Documenting Your Workspace and Sharing It | 390 | (1)
Starting from a GATK Best Practices Workspace | 390 | (17)
Cloning a GATK Best Practices Workspace | 391 | (1)
Examining GATK Workspace Data Tables to Understand How the Data Is Structured | 391 | (3)
Getting to Know the 1000 Genomes High Coverage Dataset | 394 | (2)
Copying Data Tables from the 1000 Genomes Workspace | 396 | (2)
Using TSV Load Files to Import Data from the 1000 Genomes Workspace | 398 | (2)
Running a Joint-Calling Analysis on the Federated Dataset | 400 | (7)
Building a Workspace Around a Dataset | 407 | (5)
Cloning the 1000 Genomes Data Workspace | 407 | (1)
Importing a Workflow from Dockstore | 408 | (3)
Configuring the Workflow to Use the Data Tables | 411 | (1)
| 412 | (1)
14 Making a Fully Reproducible Paper | 413 | (28)
Overview of the Case Study | 413 | (8)
Computational Reproducibility and the FAIR Framework | 414 | (2)
Original Research Study and History of the Case Study | 416 | (1)
Assessing the Available Information and Key Challenges | 417 | (2)
Designing a Reproducible Implementation | 419 | (2)
Generating a Synthetic Dataset as a Stand-in for the Private Data | 421 | (11)
| 422 | (2)
Retrieving the Variant Data from 1000 Genomes Participants | 424 | (1)
Creating Fake Exomes Based on Real People | 425 | (5)
| 430 | (2)
Generating the Definitive Dataset | 432 | (1)
Re-Creating the Data Processing and Analysis Methodology | 432 | (6)
Mapping and Variant Discovery | 433 | (2)
Variant Effect Prediction, Prioritization, and Variant Load Analysis | 435 | (1)
Analytical Performance of the New Implementation | 436 | (2)
The Long, Winding Road to FAIRness | 438 | (2)
| 440 | (1)
Glossary | 441 | (4)
Index | 445