
E-book: Genomics in the Cloud: Using Docker, GATK, and WDL in Terra

  • Format: 496 pages
  • Publication date: 02-Apr-2020
  • Publisher: O'Reilly Media
  • Language: eng
  • ISBN-13: 9781491975145
  • Format: EPUB+DRM
  • Price: 63,77 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You also need to create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions (this is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host more than 50 petabytes (over 50 million gigabytes) of genomic data, and they're turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that volume of data in the cloud?

With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O'Connor of the UC Santa Cruz Genomics Institute, guide you through the process. You'll learn by working with real data and genomics algorithms from the field.

This book covers:

  • Essential genomics and computing technology background
  • Basic cloud computing operations
  • Getting started with GATK, plus three major GATK Best Practices pipelines
  • Automating analysis with scripted workflows using WDL and Cromwell
  • Scaling up workflow execution in the cloud, including parallelization and cost optimization
  • Interactive analysis in the cloud using Jupyter notebooks
  • Secure collaboration and computational reproducibility using Terra
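To give a sense of the scripted-workflow material, a minimal WDL "Hello World" of the kind the book builds on might look like the sketch below. This is a generic illustration, not reproduced from the book; the workflow and task names are hypothetical:

```wdl
version 1.0

# A one-task workflow: Cromwell runs the task's command in a
# container and captures its stdout as a File output.
workflow HelloWorld {
  call WriteGreeting
}

task WriteGreeting {
  command {
    echo "Hello World"
  }
  output {
    File greeting = stdout()
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```

A workflow like this would typically be executed with Cromwell (e.g. `java -jar cromwell.jar run hello.wdl`), which is the execution path the book's WDL chapters walk through.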
Foreword xiii
Preface xvii
1 Introduction
1(12)
The Promises and Challenges of Big Data in Biology and Life Sciences
3(1)
Infrastructure Challenges
3(1)
Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
4(6)
Cloud-Hosted Data and Compute
5(1)
Platforms for Research in the Life Sciences
6(2)
Standardization and Reuse of Infrastructure
8(2)
Being FAIR
10(1)
Wrap-Up and Next Steps
11(2)
2 Genomics in a Nutshell: A Primer for Newcomers to the Field
13(40)
Introduction to Genomics
13(8)
The Gene as a Discrete Unit of Inheritance (Sort Of)
14(2)
The Central Dogma of Biology: DNA to RNA to Protein
16(2)
The Origins and Consequences of DNA Mutations
18(1)
Genomics as an Inventory of Variation in and Among Genomes
19(1)
The Challenge of Genomic Scale, by the Numbers
20(1)
Genomic Variation
21(11)
The Reference Genome as Common Framework
21(3)
Physical Classification of Variants
24(5)
Germline Variants Versus Somatic Alterations
29(3)
High-Throughput Sequencing Data Generation
32(9)
From Biological Sample to Huge Pile of Read Data
33(4)
Types of DNA Libraries: Choosing the Right Experimental Design
37(4)
Data Processing and Analysis
41(11)
Mapping Reads to the Reference Genome
41(2)
Variant Calling
43(5)
Data Quality and Sources of Error
48(3)
Functional Equivalence Pipeline Specification
51(1)
Wrap-Up and Next Steps
52(1)
3 Computing Technology Basics for Life Scientists
53(26)
Basic Infrastructure Components and Performance Bottlenecks
54(6)
Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
54(1)
Levels of Compute Organization: Core, Node, Cluster, and Cloud
55(1)
Addressing Performance Bottlenecks
56(4)
Parallel Computing
60(4)
Parallelizing a Simple Analysis
60(1)
From Cores to Clusters and Clouds: Many Levels of Parallelism
61(2)
Trade-Offs of Parallelism: Speed, Efficiency, and Cost
63(1)
Pipelining for Parallelization and Automation
64(4)
Workflow Languages
66(1)
Popular Pipelining Languages for Genomics
66(1)
Workflow Management Systems
67(1)
Virtualization and the Cloud
68(9)
VMs and Containers
69(3)
Introducing the Cloud
72(2)
Categories of Research Use Cases for Cloud Services
74(3)
Wrap-Up and Next Steps
77(2)
4 First Steps in the Cloud
79(36)
Setting Up Your Google Cloud Account and First Project
79(5)
Creating a Project
80(1)
Checking Your Billing Account and Activating Free Credits
81(3)
Running Basic Commands in Google Cloud Shell
84(10)
Logging in to the Cloud Shell VM
84(1)
Using gsutil to Access and Manage Files
85(4)
Pulling a Docker Image and Spinning Up the Container
89(3)
Mounting a Volume to Access the Filesystem from Within the Container
92(2)
Setting Up Your Own Custom VM
94(15)
Creating and Configuring Your VM Instance
94(6)
Logging into Your VM by Using SSH
100(2)
Checking Your Authentication
102(1)
Copying the Book Materials to Your VM
103(2)
Installing Docker on Your VM
105(1)
Setting Up the GATK Container Image
106(2)
Stopping Your VM...to Stop It from Costing You Money
108(1)
Configuring IGV to Read Data from GCS Buckets
109(4)
Wrap-Up and Next Steps
113(2)
5 First Steps with GATK
115(32)
Getting Started with GATK
115(10)
Operating Requirements
116(1)
Command-Line Syntax
117(1)
Multithreading with Spark
118(3)
Running GATK in Practice
121(4)
Getting Started with Variant Discovery
125(18)
Calling Germline SNPs and Indels with HaplotypeCaller
125(10)
Filtering Based on Variant Context Annotations
135(8)
Introducing the GATK Best Practices
143(2)
Best Practices Workflows Covered in This Book
145(1)
Other Major Use Cases
145(1)
Wrap-Up and Next Steps
145(2)
6 GATK Best Practices for Germline Short Variant Discovery
147(36)
Data Preprocessing
147(8)
Mapping Reads to the Genome Reference
149(2)
Marking Duplicates
151(2)
Recalibrating Base Quality Scores
153(2)
Joint Discovery Analysis
155(18)
Overview of the Joint Calling Workflow
156(4)
Calling Variants per Sample to Generate GVCFs
160(2)
Consolidating GVCFs
162(2)
Applying Joint Genotyping to Multiple Samples
164(2)
Filtering the Joint Callset with Variant Quality Score Recalibration
166(5)
Refining Genotype Assignments and Adjusting Genotype Confidence
171(1)
Next Steps and Further Reading
172(1)
Single-Sample Calling with CNN Filtering
173(7)
Overview of the CNN Single-Sample Workflow
175(1)
Applying 1D CNN to Filter a Single-Sample WGS Callset
176(2)
Applying 2D CNN to Include Read Data in the Modeling
178(2)
Wrap-Up and Next Steps
180(3)
7 GATK Best Practices for Somatic Variant Discovery
183(26)
Challenges in Cancer Genomics
183(2)
Somatic Short Variants (SNVs and Indels)
185(12)
Overview of the Tumor-Normal Pair Analysis Workflow
186(2)
Creating a Mutect2 PoN
188(2)
Running Mutect2 on the Tumor-Normal Pair
190(1)
Estimating Cross-Sample Contamination
191(2)
Filtering Mutect2 Calls
193(2)
Annotating Predicted Functional Effects with Funcotator
195(2)
Somatic Copy-Number Alterations
197(11)
Overview of the Tumor-Only Analysis Workflow
198(3)
Creating a Somatic CNA PoN
201(1)
Applying Denoising
202(2)
Performing Segmentation and Calling CNAs
204(3)
Additional Analysis Options
207(1)
Wrap-Up and Next Steps
208(1)
8 Automating Analysis Execution with Workflows
209(36)
Introducing WDL and Cromwell
210(2)
Installing and Setting Up Cromwell
212(4)
Your First WDL: Hello World
216(10)
Learning Basic WDL Syntax Through a Minimalist Example
216(2)
Running a Simple WDL with Cromwell on Your Google VM
218(1)
Interpreting the Important Parts of Cromwell's Logging Output
219(3)
Adding a Variable and Providing Inputs via JSON
222(1)
Adding Another Task to Make It a Proper Workflow
223(3)
Your First GATK Workflow: Hello HaplotypeCaller
226(10)
Exploring the WDL
226(3)
Generating the Inputs JSON
229(2)
Running the Workflow
231(2)
Breaking the Workflow to Test Syntax Validation and Error Messaging
233(3)
Introducing Scatter-Gather Parallelism
236(8)
Exploring the WDL
237(5)
Generating a Graph Diagram for Visualization
242(2)
Wrap-Up and Next Steps
244(1)
9 Deciphering Real Genomics Workflows
245(24)
Mystery Workflow #1: Flexibility Through Conditionals
245(12)
Mapping Out the Workflow
246(5)
Reverse Engineering the Conditional Switch
251(6)
Mystery Workflow #2: Modularity and Code Reuse
257(11)
Mapping Out the Workflow
257(5)
Unpacking the Nesting Dolls
262(6)
Wrap-Up and Next Steps
268(1)
10 Running Single Workflows at Scale with Pipelines API
269(26)
Introducing the GCP Genomics Pipelines API Service
269(2)
Enabling Genomics API and Related APIs in Your Google Cloud Project
270(1)
Directly Dispatching Cromwell Jobs to PAPI
271(9)
Configuring Cromwell to Communicate with PAPI
272(3)
Running Scattered HaplotypeCaller via PAPI
275(2)
Monitoring Workflow Execution on Google Compute Engine
277(3)
Understanding and Optimizing Workflow Efficiency
280(7)
Granularity of Operations
281(1)
Balance of Time Versus Money
282(2)
Suggested Cost-Saving Optimizations
284(2)
Platform-Specific Optimization Versus Portability
286(1)
Wrapping Cromwell and PAPI Execution with WDL Runner
287(6)
Setting Up WDL Runner
288(1)
Running the Scattered HaplotypeCaller Workflow with WDL Runner
288(2)
Monitoring WDL Runner Execution
290(3)
Wrap-Up and Next Steps
293(2)
11 Running Many Workflows Conveniently in Terra
295(36)
Getting Started with Terra
295(7)
Creating an Account
296(2)
Creating a Billing Project
298(3)
Cloning the Preconfigured Workspace
301(1)
Running Workflows with the Cromwell Server in Terra
302(18)
Running a Workflow on a Single Sample
302(3)
Running a Workflow on Multiple Samples in a Data Table
305(6)
Monitoring Workflow Execution
311(5)
Locating Workflow Outputs in the Data Table
316(2)
Running the Same Workflow Again to Demonstrate Call Caching
318(2)
Running a Real GATK Best Practices Pipeline at Full Scale
320(8)
Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
320(1)
Examining the Preloaded Data
321(2)
Selecting Data and Configuring the Full-Scale Workflow
323(1)
Launching the Full-Scale Workflow and Monitoring Execution
324(3)
Options for Downloading Output Data---or Not
327(1)
Wrap-Up and Next Steps
328(3)
12 Interactive Analysis in Jupyter Notebook
331(42)
Introduction to Jupyter in Terra
332(8)
Jupyter Notebooks in General
332(2)
How Jupyter Notebooks Work in Terra
334(6)
Getting Started with Jupyter in Terra
340(13)
Inspecting and Customizing the Notebook Runtime Configuration
341(5)
Opening Notebook in Edit Mode and Checking the Kernel
346(1)
Running the Hello World Cells
347(3)
Using gsutil to Interact with Google Cloud Storage Buckets
350(1)
Setting Up a Variable Pointing to the Germline Data in the Book Bucket
351(1)
Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
352(1)
Visualizing Genomic Data in an Embedded IGV Window
353(5)
Setting Up the Embedded IGV Browser
354(1)
Adding Data to the IGV Browser
355(2)
Setting Up an Access Token to View Private Data
357(1)
Running GATK Commands to Learn, Test, or Troubleshoot
358(7)
Running a Basic GATK Command: HaplotypeCaller
359(1)
Loading the Data (BAM and VCF) into IGV
360(3)
Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
363(2)
Visualizing Variant Context Annotation Data
365(7)
Exporting Annotations of Interest with VariantsToTable
365(1)
Loading R Script to Make Plotting Functions Available
366(1)
Making Density Plots for QUAL by Using makeDensityPlot
367(3)
Making a Scatter Plot of QUAL Versus DP
370(1)
Making a Scatter Plot Flanked by Marginal Density Plots
371(1)
Wrap-Up and Next Steps
372(1)
13 Assembling Your Own Workspace in Terra
373(40)
Managing Data Inside and Outside of Workspaces
373(5)
The Workspace Bucket as Data Repository
374(1)
Accessing Private Data That You Manage Outside of Terra
374(3)
Accessing Data in the Terra Data Library
377(1)
Re-Creating the Tutorial Workspace from Base Components
378(12)
Creating a New Workspace
378(2)
Adding the Workflow to the Methods Repository and Importing It into the Workspace
380(2)
Creating a Configuration Quickly with a JSON File
382(2)
Adding the Data Table
384(2)
Filling in the Workspace Resource Data Table
386(1)
Creating a Workflow Configuration That Uses the Data Tables
387(2)
Adding the Notebook and Checking the Runtime Environment
389(1)
Documenting Your Workspace and Sharing It
390(1)
Starting from a GATK Best Practices Workspace
390(17)
Cloning a GATK Best Practices Workspace
391(1)
Examining GATK Workspace Data Tables to Understand How the Data Is Structured
391(3)
Getting to Know the 1000 Genomes High Coverage Dataset
394(2)
Copying Data Tables from the 1000 Genomes Workspace
396(2)
Using TSV Load Files to Import Data from the 1000 Genomes Workspace
398(2)
Running a Joint-Calling Analysis on the Federated Dataset
400(7)
Building a Workspace Around a Dataset
407(5)
Cloning the 1000 Genomes Data Workspace
407(1)
Importing a Workflow from Dockstore
408(3)
Configuring the Workflow to Use the Data Tables
411(1)
Wrap-Up and Next Steps
412(1)
14 Making a Fully Reproducible Paper
413(28)
Overview of the Case Study
413(8)
Computational Reproducibility and the FAIR Framework
414(2)
Original Research Study and History of the Case Study
416(1)
Assessing the Available Information and Key Challenges
417(2)
Designing a Reproducible Implementation
419(2)
Generating a Synthetic Dataset as a Stand-in for the Private Data
421(11)
Overall Methodology
422(2)
Retrieving the Variant Data from 1000 Genomes Participants
424(1)
Creating Fake Exomes Based on Real People
425(5)
Mutating the Fake Exomes
430(2)
Generating the Definitive Dataset
432(1)
Re-Creating the Data Processing and Analysis Methodology
432(6)
Mapping and Variant Discovery
433(2)
Variant Effect Prediction, Prioritization, and Variant Load Analysis
435(1)
Analytical Performance of the New Implementation
436(2)
The Long, Winding Road to FAIRness
438(2)
Final Conclusions
440(1)
Glossary 441(4)
Index 445
Dr. Geraldine A. Van der Auwera is the Director of Outreach and Communication for the Data Sciences Platform (DSP) at the Broad Institute of MIT and Harvard. As part of her outreach role, she serves as an educator and advocate for researchers who use DSP software and services including GATK, the Broad's industry-leading toolkit for variant discovery analysis; the Cromwell/WDL workflow management system; and Terra.bio, a cloud-based analysis platform that integrates computational resources, a methods repository, and data management in a user-friendly environment. Van der Auwera was originally trained as a microbiologist, earning her Ph.D. in Biological Engineering from the Université catholique de Louvain (UCL) in Belgium in 2007, then surviving a 4-year postdoctoral stint at Harvard Medical School. She joined the Broad Institute in 2012 to become Benevolent Dictator For Life of the GATK user community, leaving behind the bench and pipette work forever.

Dr. Brian O'Connor is the Director of the Computational Genomics Platform at the University of California Santa Cruz (UCSC) Genomics Institute. There, he focuses on the development and deployment of large-scale, cloud-based systems for analyzing genomic data. These include the NHGRI AnVIL and NHLBI BioData Catalyst platforms as well as the Dockstore site for workflow and tool sharing. Brian is active in standards efforts and is the cochair of the Global Alliance for Genomics and Health Cloud Work Stream, where he works on API standards for cloud interoperability. Brian joined UCSC from the Ontario Institute for Cancer Research, where his previous projects included leading the technical implementation of worldwide, cloud-based analysis systems for the PanCancer Analysis of Whole Genomes project, creating the Dockstore, and managing a successful rebuild of the International Cancer Genome Consortium's Data Portal.