
E-book: Genomics in the Cloud: Using Docker, GATK, and WDL in Terra

  • Format: 496 pages
  • Publication date: 02-Apr-2020
  • Publisher: O'Reilly Media
  • Language: eng
  • ISBN-13: 9781491975145
  • Format: EPUB+DRM
  • Price: 63,77 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You also need to create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions (this is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host more than 50 petabytes (over 50 million gigabytes) of genomic data, and they're turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that volume of data in the cloud?

With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O'Connor of the UC Santa Cruz Genomics Institute, guide you through the process. You'll learn by working with real data and genomics algorithms from the field.

This book covers:

  • Essential genomics and computing technology background
  • Basic cloud computing operations
  • Getting started with GATK, plus three major GATK Best Practices pipelines
  • Automating analysis with scripted workflows using WDL and Cromwell
  • Scaling up workflow execution in the cloud, including parallelization and cost optimization
  • Interactive analysis in the cloud using Jupyter notebooks
  • Secure collaboration and computational reproducibility using Terra
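To give a sense of the scripted-workflow material, a minimal WDL "Hello World" of the kind the book builds on might look like the sketch below. This is a generic illustration, not reproduced from the book; the workflow and task names are hypothetical:

```wdl
version 1.0

# A one-task workflow: Cromwell runs the task's command in a
# container and captures its stdout as a File output.
workflow HelloWorld {
  call WriteGreeting
}

task WriteGreeting {
  command {
    echo "Hello World"
  }
  output {
    File greeting = stdout()
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```

A workflow like this would typically be executed with Cromwell (e.g. `java -jar cromwell.jar run hello.wdl`), which is the execution path the book's WDL chapters walk through.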
Foreword xiii
Preface xvii
1 Introduction
1(12)
The Promises and Challenges of Big Data in Biology and Life Sciences
3(1)
Infrastructure Challenges
3(1)
Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
4(6)
Cloud-Hosted Data and Compute
5(1)
Platforms for Research in the Life Sciences
6(2)
Standardization and Reuse of Infrastructure
8(2)
Being FAIR
10(1)
Wrap-Up and Next Steps
11(2)
2 Genomics in a Nutshell: A Primer for Newcomers to the Field
13(40)
Introduction to Genomics
13(8)
The Gene as a Discrete Unit of Inheritance (Sort Of)
14(2)
The Central Dogma of Biology: DNA to RNA to Protein
16(2)
The Origins and Consequences of DNA Mutations
18(1)
Genomics as an Inventory of Variation in and Among Genomes
19(1)
The Challenge of Genomic Scale, by the Numbers
20(1)
Genomic Variation
21(11)
The Reference Genome as Common Framework
21(3)
Physical Classification of Variants
24(5)
Germline Variants Versus Somatic Alterations
29(3)
High-Throughput Sequencing Data Generation
32(9)
From Biological Sample to Huge Pile of Read Data
33(4)
Types of DNA Libraries: Choosing the Right Experimental Design
37(4)
Data Processing and Analysis
41(11)
Mapping Reads to the Reference Genome
41(2)
Variant Calling
43(5)
Data Quality and Sources of Error
48(3)
Functional Equivalence Pipeline Specification
51(1)
Wrap-Up and Next Steps
52(1)
3 Computing Technology Basics for Life Scientists
53(26)
Basic Infrastructure Components and Performance Bottlenecks
54(6)
Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
54(1)
Levels of Compute Organization: Core, Node, Cluster, and Cloud
55(1)
Addressing Performance Bottlenecks
56(4)
Parallel Computing
60(4)
Parallelizing a Simple Analysis
60(1)
From Cores to Clusters and Clouds: Many Levels of Parallelism
61(2)
Trade-Offs of Parallelism: Speed, Efficiency, and Cost
63(1)
Pipelining for Parallelization and Automation
64(4)
Workflow Languages
66(1)
Popular Pipelining Languages for Genomics
66(1)
Workflow Management Systems
67(1)
Virtualization and the Cloud
68(9)
VMs and Containers
69(3)
Introducing the Cloud
72(2)
Categories of Research Use Cases for Cloud Services
74(3)
Wrap-Up and Next Steps
77(2)
4 First Steps in the Cloud
79(36)
Setting Up Your Google Cloud Account and First Project
79(5)
Creating a Project
80(1)
Checking Your Billing Account and Activating Free Credits
81(3)
Running Basic Commands in Google Cloud Shell
84(10)
Logging in to the Cloud Shell VM
84(1)
Using gsutil to Access and Manage Files
85(4)
Pulling a Docker Image and Spinning Up the Container
89(3)
Mounting a Volume to Access the Filesystem from Within the Container
92(2)
Setting Up Your Own Custom VM
94(15)
Creating and Configuring Your VM Instance
94(6)
Logging into Your VM by Using SSH
100(2)
Checking Your Authentication
102(1)
Copying the Book Materials to Your VM
103(2)
Installing Docker on Your VM
105(1)
Setting Up the GATK Container Image
106(2)
Stopping Your VM...to Stop It from Costing You Money
108(1)
Configuring IGV to Read Data from GCS Buckets
109(4)
Wrap-Up and Next Steps
113(2)
5 First Steps with GATK
115(32)
Getting Started with GATK
115(10)
Operating Requirements
116(1)
Command-Line Syntax
117(1)
Multithreading with Spark
118(3)
Running GATK in Practice
121(4)
Getting Started with Variant Discovery
125(18)
Calling Germline SNPs and Indels with HaplotypeCaller
125(10)
Filtering Based on Variant Context Annotations
135(8)
Introducing the GATK Best Practices
143(2)
Best Practices Workflows Covered in This Book
145(1)
Other Major Use Cases
145(1)
Wrap-Up and Next Steps
145(2)
6 GATK Best Practices for Germline Short Variant Discovery
147(36)
Data Preprocessing
147(8)
Mapping Reads to the Genome Reference
149(2)
Marking Duplicates
151(2)
Recalibrating Base Quality Scores
153(2)
Joint Discovery Analysis
155(18)
Overview of the Joint Calling Workflow
156(4)
Calling Variants per Sample to Generate GVCFs
160(2)
Consolidating GVCFs
162(2)
Applying Joint Genotyping to Multiple Samples
164(2)
Filtering the Joint Callset with Variant Quality Score Recalibration
166(5)
Refining Genotype Assignments and Adjusting Genotype Confidence
171(1)
Next Steps and Further Reading
172(1)
Single-Sample Calling with CNN Filtering
173(7)
Overview of the CNN Single-Sample Workflow
175(1)
Applying 1D CNN to Filter a Single-Sample WGS Callset
176(2)
Applying 2D CNN to Include Read Data in the Modeling
178(2)
Wrap-Up and Next Steps
180(3)
7 GATK Best Practices for Somatic Variant Discovery
183(26)
Challenges in Cancer Genomics
183(2)
Somatic Short Variants (SNVs and Indels)
185(12)
Overview of the Tumor-Normal Pair Analysis Workflow
186(2)
Creating a Mutect2 PoN
188(2)
Running Mutect2 on the Tumor-Normal Pair
190(1)
Estimating Cross-Sample Contamination
191(2)
Filtering Mutect2 Calls
193(2)
Annotating Predicted Functional Effects with Funcotator
195(2)
Somatic Copy-Number Alterations
197(11)
Overview of the Tumor-Only Analysis Workflow
198(3)
Creating a Somatic CNA PoN
201(1)
Applying Denoising
202(2)
Performing Segmentation and Calling CNAs
204(3)
Additional Analysis Options
207(1)
Wrap-Up and Next Steps
208(1)
8 Automating Analysis Execution with Workflows
209(36)
Introducing WDL and Cromwell
210(2)
Installing and Setting Up Cromwell
212(4)
Your First WDL: Hello World
216(10)
Learning Basic WDL Syntax Through a Minimalist Example
216(2)
Running a Simple WDL with Cromwell on Your Google VM
218(1)
Interpreting the Important Parts of Cromwell's Logging Output
219(3)
Adding a Variable and Providing Inputs via JSON
222(1)
Adding Another Task to Make It a Proper Workflow
223(3)
Your First GATK Workflow: Hello HaplotypeCaller
226(10)
Exploring the WDL
226(3)
Generating the Inputs JSON
229(2)
Running the Workflow
231(2)
Breaking the Workflow to Test Syntax Validation and Error Messaging
233(3)
Introducing Scatter-Gather Parallelism
236(8)
Exploring the WDL
237(5)
Generating a Graph Diagram for Visualization
242(2)
Wrap-Up and Next Steps
244(1)
9 Deciphering Real Genomics Workflows
245(24)
Mystery Workflow #1: Flexibility Through Conditionals
245(12)
Mapping Out the Workflow
246(5)
Reverse Engineering the Conditional Switch
251(6)
Mystery Workflow #2: Modularity and Code Reuse
257(11)
Mapping Out the Workflow
257(5)
Unpacking the Nesting Dolls
262(6)
Wrap-Up and Next Steps
268(1)
10 Running Single Workflows at Scale with Pipelines API
269(26)
Introducing the GCP Genomics Pipelines API Service
269(2)
Enabling Genomics API and Related APIs in Your Google Cloud Project
270(1)
Directly Dispatching Cromwell Jobs to PAPI
271(9)
Configuring Cromwell to Communicate with PAPI
272(3)
Running Scattered HaplotypeCaller via PAPI
275(2)
Monitoring Workflow Execution on Google Compute Engine
277(3)
Understanding and Optimizing Workflow Efficiency
280(7)
Granularity of Operations
281(1)
Balance of Time Versus Money
282(2)
Suggested Cost-Saving Optimizations
284(2)
Platform-Specific Optimization Versus Portability
286(1)
Wrapping Cromwell and PAPI Execution with WDL Runner
287(6)
Setting Up WDL Runner
288(1)
Running the Scattered HaplotypeCaller Workflow with WDL Runner
288(2)
Monitoring WDL Runner Execution
290(3)
Wrap-Up and Next Steps
293(2)
11 Running Many Workflows Conveniently in Terra
295(36)
Getting Started with Terra
295(7)
Creating an Account
296(2)
Creating a Billing Project
298(3)
Cloning the Preconfigured Workspace
301(1)
Running Workflows with the Cromwell Server in Terra
302(18)
Running a Workflow on a Single Sample
302(3)
Running a Workflow on Multiple Samples in a Data Table
305(6)
Monitoring Workflow Execution
311(5)
Locating Workflow Outputs in the Data Table
316(2)
Running the Same Workflow Again to Demonstrate Call Caching
318(2)
Running a Real GATK Best Practices Pipeline at Full Scale
320(8)
Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
320(1)
Examining the Preloaded Data
321(2)
Selecting Data and Configuring the Full-Scale Workflow
323(1)
Launching the Full-Scale Workflow and Monitoring Execution
324(3)
Options for Downloading Output Data---or Not
327(1)
Wrap-Up and Next Steps
328(3)
12 Interactive Analysis in Jupyter Notebook
331(42)
Introduction to Jupyter in Terra
332(8)
Jupyter Notebooks in General
332(2)
How Jupyter Notebooks Work in Terra
334(6)
Getting Started with Jupyter in Terra
340(13)
Inspecting and Customizing the Notebook Runtime Configuration
341(5)
Opening Notebook in Edit Mode and Checking the Kernel
346(1)
Running the Hello World Cells
347(3)
Using gsutil to Interact with Google Cloud Storage Buckets
350(1)
Setting Up a Variable Pointing to the Germline Data in the Book Bucket
351(1)
Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
352(1)
Visualizing Genomic Data in an Embedded IGV Window
353(5)
Setting Up the Embedded IGV Browser
354(1)
Adding Data to the IGV Browser
355(2)
Setting Up an Access Token to View Private Data
357(1)
Running GATK Commands to Learn, Test, or Troubleshoot
358(7)
Running a Basic GATK Command: HaplotypeCaller
359(1)
Loading the Data (BAM and VCF) into IGV
360(3)
Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
363(2)
Visualizing Variant Context Annotation Data
365(7)
Exporting Annotations of Interest with VariantsToTable
365(1)
Loading R Script to Make Plotting Functions Available
366(1)
Making Density Plots for QUAL by Using makeDensityPlot
367(3)
Making a Scatter Plot of QUAL Versus DP
370(1)
Making a Scatter Plot Flanked by Marginal Density Plots
371(1)
Wrap-Up and Next Steps
372(1)
13 Assembling Your Own Workspace in Terra
373(40)
Managing Data Inside and Outside of Workspaces
373(5)
The Workspace Bucket as Data Repository
374(1)
Accessing Private Data That You Manage Outside of Terra
374(3)
Accessing Data in the Terra Data Library
377(1)
Re-Creating the Tutorial Workspace from Base Components
378(12)
Creating a New Workspace
378(2)
Adding the Workflow to the Methods Repository and Importing It into the Workspace
380(2)
Creating a Configuration Quickly with a JSON File
382(2)
Adding the Data Table
384(2)
Filling in the Workspace Resource Data Table
386(1)
Creating a Workflow Configuration That Uses the Data Tables
387(2)
Adding the Notebook and Checking the Runtime Environment
389(1)
Documenting Your Workspace and Sharing It
390(1)
Starting from a GATK Best Practices Workspace
390(17)
Cloning a GATK Best Practices Workspace
391(1)
Examining GATK Workspace Data Tables to Understand How the Data Is Structured
391(3)
Getting to Know the 1000 Genomes High Coverage Dataset
394(2)
Copying Data Tables from the 1000 Genomes Workspace
396(2)
Using TSV Load Files to Import Data from the 1000 Genomes Workspace
398(2)
Running a Joint-Calling Analysis on the Federated Dataset
400(7)
Building a Workspace Around a Dataset
407(5)
Cloning the 1000 Genomes Data Workspace
407(1)
Importing a Workflow from Dockstore
408(3)
Configuring the Workflow to Use the Data Tables
411(1)
Wrap-Up and Next Steps
412(1)
14 Making a Fully Reproducible Paper
413(28)
Overview of the Case Study
413(8)
Computational Reproducibility and the FAIR Framework
414(2)
Original Research Study and History of the Case Study
416(1)
Assessing the Available Information and Key Challenges
417(2)
Designing a Reproducible Implementation
419(2)
Generating a Synthetic Dataset as a Stand-in for the Private Data
421(11)
Overall Methodology
422(2)
Retrieving the Variant Data from 1000 Genomes Participants
424(1)
Creating Fake Exomes Based on Real People
425(5)
Mutating the Fake Exomes
430(2)
Generating the Definitive Dataset
432(1)
Re-Creating the Data Processing and Analysis Methodology
432(6)
Mapping and Variant Discovery
433(2)
Variant Effect Prediction, Prioritization, and Variant Load Analysis
435(1)
Analytical Performance of the New Implementation
436(2)
The Long, Winding Road to FAIRness
438(2)
Final Conclusions
440(1)
Glossary 441(4)
Index 445
Dr. Geraldine A. Van der Auwera is the Director of Outreach and Communication for the Data Sciences Platform (DSP) at the Broad Institute of MIT and Harvard. As part of her outreach role, she serves as an educator and advocate for researchers who use DSP software and services including GATK, the Broad's industry-leading toolkit for variant discovery analysis; the Cromwell/WDL workflow management system; and Terra.bio, a cloud-based analysis platform that integrates computational resources, a methods repository, and data management in a user-friendly environment. Van der Auwera was originally trained as a microbiologist, earning her Ph.D. in Biological Engineering from the Université catholique de Louvain (UCL) in Belgium in 2007, then surviving a 4-year postdoctoral stint at Harvard Medical School. She joined the Broad Institute in 2012 to become Benevolent Dictator For Life of the GATK user community, leaving behind the bench and pipette work forever.

Dr. Brian O'Connor is the Director of the Computational Genomics Platform at the University of California Santa Cruz (UCSC) Genomics Institute. There, he focuses on the development and deployment of large-scale, cloud-based systems for analyzing genomic data. These include the NHGRI AnVIL and NHLBI BioData Catalyst platforms as well as the Dockstore site for workflow and tool sharing. Brian is active in standards efforts and is the cochair of the Global Alliance for Genomics and Health Cloud Work Stream, where he works on API standards for cloud interoperability. Brian joined UCSC from the Ontario Institute for Cancer Research, where his previous projects included leading the technical implementation of worldwide, cloud-based analysis systems for the PanCancer Analysis of Whole Genomes project, creating the Dockstore, and managing a successful rebuild of the International Cancer Genome Consortium's Data Portal.