
Data Pipelines with Apache Airflow [Paperback]

  • Format: Paperback / softback, 425 pages, height x width x thickness: 234x186x28 mm, weight: 860 g
  • Publication date: 07-Jun-2021
  • Publisher: Manning Publications
  • ISBN-10: 1617296902
  • ISBN-13: 9781617296901
Pipelines can be challenging to manage, especially when your data has to flow through a collection of application components, servers, and cloud services. Airflow lets you schedule, restart, and backfill pipelines, and its easy-to-use UI and Python-scripted workflows have users praising its incredible flexibility. Data Pipelines with Apache Airflow takes you through best practices for creating pipelines for multiple tasks, including data lakes, cloud deployments, and data science.

Data Pipelines with Apache Airflow teaches you the ins and outs of the Directed Acyclic Graphs (DAGs) that power Airflow, and how to write your own DAGs to meet the needs of your projects. With complete coverage of both foundational and lesser-known features, when you're done you'll be set to start using Airflow for seamless data pipeline development and management.
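
As a rough illustration of what such a DAG looks like in code, here is a minimal sketch of a daily two-task pipeline written against the Airflow 2 Python API; the DAG id, task names, and callables are illustrative placeholders, not examples taken from the book.

import pendulum

from airflow import DAG
from airflow.operators.python import PythonOperator


def _fetch_events(**context):
    # Placeholder: download the raw events for the run's logical date.
    print(f"Fetching events for {context['ds']}")


def _clean_events(**context):
    # Placeholder: transform the fetched events into a clean dataset.
    print(f"Cleaning events for {context['ds']}")


# A DAG groups tasks and tells the scheduler when to run them.
with DAG(
    dag_id="example_event_pipeline",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
):
    fetch_events = PythonOperator(task_id="fetch_events", python_callable=_fetch_events)
    clean_events = PythonOperator(task_id="clean_events", python_callable=_clean_events)

    # The >> operator declares the dependency: fetch before clean.
    fetch_events >> clean_events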

Key Features

  • Framework foundation and best practices
  • Airflow's execution and dependency system
  • Testing Airflow DAGs
  • Running Airflow in production

For data-savvy developers, DevOps and data engineers, and system administrators with intermediate Python skills.

About the technology

Data pipelines are used to extract, transform, and load data to and from multiple sources, routing it wherever it's needed, whether that's visualisation tools, business intelligence dashboards, or machine learning models. Airflow streamlines the whole process, giving you one tool for programmatically developing and monitoring batch data pipelines, and integrating all the pieces you use in your data stack.
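
For instance, each scheduled run receives a logical date that tasks can reference through Airflow's Jinja templating, so every run, including backfilled runs for past dates, processes only its own slice of data. A minimal sketch, assuming a hypothetical load_events.py script and output path:

import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_incremental_load",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule_interval="@daily",
):
    # {{ ds }} is rendered per run as YYYY-MM-DD, so each run loads
    # exactly one day's partition; /scripts/load_events.py is made up.
    load_partition = BashOperator(
        task_id="load_partition",
        bash_command=(
            "python /scripts/load_events.py "
            "--date {{ ds }} --output /data/events/{{ ds }}.json"
        ),
    )

Backfilling a past date range then simply re-executes the same templated task once for each missed day.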

Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies including Heineken, Unilever, and Booking.com. Bas is a committer, and both Bas and Julian are active contributors to Apache Airflow.
Preface xv
Acknowledgments xvii
About this book xix
About the authors xxiii
About the cover illustration xxiv
PART 1 Getting started
1(112)
1 Meet Apache Airflow
3(10)
1.1 Introducing data pipelines
4(6)
Data pipelines as graphs
4(2)
Executing a pipeline graph
6(1)
Pipeline graphs vs. sequential scripts
6(3)
Running pipelines using workflow managers
9(1)
1.2 Introducing Airflow
10(3)
Defining pipelines flexibly in (Python) code
10(1)
Scheduling and executing pipelines
11(2)
Monitoring and handling failures
13(2)
Incremental loading and backfilling
15(2)
1.3 When to use Airflow
17(1)
Reasons to choose Airflow
17(1)
Reasons not to choose Airflow
17(1)
1.4 The rest of this book
18(3)
2 Anatomy of an Airflow DAG
20(20)
2.1 Collecting data from numerous sources
21(1)
Exploring the data
21(1)
2.2 Writing your first Airflow DAG
22(7)
Tasks vs. Operators
26(1)
Running arbitrary Python code
27(2)
2.3 Running a DAG in Airflow
29(4)
Running Airflow in a Python environment
29(1)
Running Airflow in Docker containers
30(1)
Inspecting the Airflow UI
31(2)
2.4 Running at regular intervals
33(3)
2.5 Handling failing tasks
36(4)
3 Scheduling in Airflow
40(20)
3.1 An example: Processing user events
41(1)
3.2 Running at regular intervals
42(4)
Defining scheduling intervals
42(2)
Cron-based intervals
44(2)
Frequency-based intervals
46(1)
3.3 Processing data incrementally
46(6)
Fetching events incrementally
46(2)
Dynamic time references using execution dates
48(2)
Partitioning your data
50(2)
3.4 Understanding Airflow's execution dates
52(2)
Executing work in fixed-length intervals
52(2)
3.5 Using backfilling to fill in past gaps
54(1)
Executing work back in time
54(1)
3.6 Best practices for designing tasks
55(5)
Atomicity
55(2)
Idempotency
57(3)
4 Templating tasks using the Airflow context
60(25)
4.1 Inspecting data for processing with Airflow
61(2)
Determining how to load incremental data
61(2)
4.2 Task context and Jinja templating
63(14)
Templating operator arguments
64(2)
What is available for templating?
66(2)
Templating the PythonOperator
68(5)
Providing variables to the PythonOperator
73(2)
Inspecting templated arguments
75(2)
4.3 Hooking up other systems
77(8)
5 Defining dependencies between tasks
85(28)
5.1 Basic dependencies
86(4)
Linear dependencies
86(1)
Fan-in/-out dependencies
87(3)
5.2 Branching
90(7)
Branching within tasks
90(2)
Branching within the DAG
92(5)
5.3 Conditional tasks
97(3)
Conditions within tasks
97(1)
Making tasks conditional
98(2)
Using built-in operators
100(1)
5.4 More about trigger rules
100(4)
What is a trigger rule?
101(1)
The effect of failures
102(1)
Other trigger rules
103(1)
5.5 Sharing data between tasks
104(4)
Sharing data using XComs
104(3)
When (not) to use XComs
107(1)
Using custom XCom backends
108(1)
5.6 Chaining Python tasks with the Taskflow API
108(5)
Simplifying Python tasks with the Taskflow API
109(2)
When (not) to use the Taskflow API
111(2)
PART 2 Beyond the basics
113(140)
6 Triggering workflows
115(20)
6.1 Polling conditions with sensors
116(6)
Polling custom conditions
119(1)
Sensors outside the happy flow
120(2)
6.2 Triggering other DAGs
122(9)
Backfilling with the TriggerDagRunOperator
126(1)
Polling the state of other DAGs
127(4)
6.3 Starting workflows with REST/CLI
131(4)
7 Communicating with external systems
135(22)
7.1 Connecting to cloud services
136(14)
Installing extra dependencies
137(1)
Developing a machine learning model
137(6)
Developing locally with external systems
143(7)
7.2 Moving data between systems
150(7)
Implementing a PostgresToS3Operator
151(4)
Outsourcing the heavy work
155(2)
8 Building custom components
157(29)
8.1 Starting with a PythonOperator
158(8)
Simulating a movie rating API
158(3)
Fetching ratings from the API
161(3)
Building the actual DAG
164(2)
8.2 Building a custom hook
166(7)
Designing a custom hook
166(6)
Building our DAG with the MovielensHook
172(1)
8.3 Building a custom operator
173(5)
Defining a custom operator
174(1)
Building an operator for fetching ratings
175(3)
8.4 Building custom sensors
178(3)
8.5 Packaging your components
181(5)
Bootstrapping a Python package
182(2)
Installing your package
184(2)
9 Testing
186(34)
9.1 Getting started with testing
187(16)
Integrity testing all DAGs
187(6)
Setting up a CI/CD pipeline
193(2)
Writing unit tests
195(1)
Pytest project structure
196(5)
Testing with files on disk
201(2)
9.2 Working with DAGs and task context in tests
203(12)
Working with external systems
208(7)
9.3 Using tests for development
215(3)
Testing complete DAGs
217(1)
9.4 Emulate production environments with Whirl
218(1)
9.5 Create DTAP environments
219(1)
10 Running tasks in containers
220(33)
10.1 Challenges of many different operators
221(2)
Operator interfaces and implementations
221(1)
Complex and conflicting dependencies
222(1)
Moving toward a generic operator
223(1)
10.2 Introducing containers
223(7)
What are containers?
223(1)
Running our first Docker container
224(1)
Creating a Docker image
225(2)
Persisting data using volumes
227(3)
10.3 Containers and Airflow
230(2)
Tasks in containers
230(1)
Why use containers?
231(1)
10.4 Running tasks in Docker
232(8)
Introducing the DockerOperator
232(1)
Creating container images for tasks
233(3)
Building a DAG with Docker tasks
236(3)
Docker-based workflow
239(1)
10.5 Running tasks in Kubernetes
240(13)
Introducing Kubernetes
240(2)
Setting up Kubernetes
242(3)
Using the KubernetesPodOperator
245(3)
Diagnosing Kubernetes-related issues
248(2)
Differences with Docker-based workflows
250(3)
PART 3 Airflow in practice
253(112)
11 Best practices
255(26)
11.1 Writing clean DAGs
256(14)
Use style conventions
256(4)
Manage credentials centrally
260(1)
Specify configuration details consistently
261(2)
Avoid doing any computation in your DAG definition
263(2)
Use factories to generate common patterns
265(4)
Group related tasks using task groups
269(1)
Create new DAGs for big changes
270(1)
11.2 Designing reproducible tasks
270(2)
Always require tasks to be idempotent
271(1)
Task results should be deterministic
271(1)
Design tasks using functional paradigms
272(1)
11.3 Handling data efficiently
272(4)
Limit the amount of data being processed
272(2)
Incremental loading/processing
274(1)
Cache intermediate data
275(1)
Don't store data on local file systems
275(1)
Offload work to external/source systems
276(1)
11.4 Managing your resources
276(5)
Managing concurrency using pools
276(2)
Detecting long-running tasks using SLAs and alerts
278(3)
12 Operating Airflow in production
281(41)
12.1 Airflow architectures
282(8)
Which executor is right for me?
284(1)
Configuring a metastore for Airflow
284(2)
A closer look at the scheduler
286(4)
12.2 Installing each executor
290(12)
Setting up the SequentialExecutor
291(1)
Setting up the LocalExecutor
292(1)
Setting up the CeleryExecutor
293(3)
Setting up the KubernetesExecutor
296(6)
12.3 Capturing logs of all Airflow processes
302(3)
Capturing the webserver output
303(1)
Capturing the scheduler output
303(1)
Capturing task logs
304(1)
Sending logs to remote storage
305(1)
12.4 Visualizing and monitoring Airflow metrics
305(9)
Collecting metrics from Airflow
306(1)
Configuring Airflow to send metrics
307(1)
Configuring Prometheus to collect metrics
308(2)
Creating dashboards with Grafana
310(2)
What should you monitor?
312(2)
12.5 How to get notified of a failing task
314(8)
Alerting within DAGs and operators
314(2)
Defining service-level agreements
316(2)
12.6 Scalability and performance
318(1)
Controlling the maximum number of running tasks
318(1)
System performance configurations
319(1)
Running multiple schedulers
320(2)
13 Securing Airflow
322(22)
13.1 Securing the Airflow web interface
323(4)
Adding users to the RBAC interface
324(3)
Configuring the RBAC interface
327(1)
13.2 Encrypting data at rest
327(3)
Creating a Fernet key
328(2)
13.3 Connecting with an LDAP service
330(3)
Understanding LDAP
330(3)
Fetching users from an LDAP service
333(1)
13.4 Encrypting traffic to the webserver
333(6)
Understanding HTTPS
334(2)
Configuring a certificate for HTTPS
336(3)
13.5 Fetching credentials from secret management systems
339(5)
14 Project: Finding the fastest way to get around NYC
344(21)
14.1 Understanding the data
347(3)
Yellow Cab file share
348(1)
Citi Bike REST API
348(2)
Deciding on a plan of approach
350(1)
14.2 Extracting the data
350(6)
Downloading Citi Bike data
351(2)
Downloading Yellow Cab data
353(3)
14.3 Applying similar transformations to data
356(4)
14.4 Structuring a data pipeline
360(1)
14.5 Developing idempotent data pipelines
361(4)
PART 4 In the clouds
365(71)
15 Airflow in the clouds
367(8)
15.1 Designing (cloud) deployment strategies
368(1)
15.2 Cloud-specific operators and hooks
369(1)
15.3 Managed services
370(2)
Astronomer.io
371(1)
Google Cloud Composer
371(1)
Amazon Managed Workflows for Apache Airflow
372(1)
15.4 Choosing a deployment strategy
372(3)
16 Airflow on AWS
375(19)
16.1 Deploying Airflow in AWS
375(6)
Picking cloud services
376(1)
Designing the network
377(1)
Adding DAG syncing
378(1)
Scaling with the CeleryExecutor
378(2)
Further steps
380(1)
16.2 AWS-specific hooks and operators
381(2)
16.3 Use case: Serverless movie ranking with AWS Athena
383(11)
Overview
383(1)
Setting up resources
384(3)
Building the DAG
387(6)
Cleaning up
393(1)
17 Airflow on Azure
394(18)
17.1 Deploying Airflow in Azure
394(4)
Picking services
395(1)
Designing the network
395(2)
Scaling with the CeleryExecutor
397(1)
Further steps
398(1)
17.2 Azure-specific hooks/operators
398(2)
17.3 Example: Serverless movie ranking with Azure Synapse
400(12)
Overview
400(1)
Setting up resources
401(3)
Building the DAG
404(6)
Cleaning up
410(2)
18 Airflow in GCP
412(24)
18.1 Deploying Airflow in GCP
413(9)
Picking services
413(2)
Deploying on GKE with Helm
415(2)
Integrating with Google services
417(2)
Designing the network
419(1)
Scaling with the CeleryExecutor
419(3)
18.2 GCP-specific hooks and operators
422(5)
18.3 Use case: Serverless movie ranking on GCP
427(9)
Uploading to GCS
428(1)
Getting data into BigQuery
429(3)
Extracting top ratings
432(4)
Appendix A Running code samples 436(3)
Appendix B Package structures Airflow 1 and 2 439(4)
Appendix C Prometheus metric mapping 443(2)
Index 445