Preface |
|
xv | |
Acknowledgments |
|
xvii | |
About this book |
|
xix | |
About the authors |
|
xxiii | |
About the cover illustration |
|
xxiv | |
|
|
1 | (112) |
|
|
3 | (10) |
|
1.1 Introducing data pipelines |
|
|
4 | (6) |
|
|
4 | (2) |
|
Executing a pipeline graph |
|
|
6 | (1) |
|
Pipeline graphs vs. sequential scripts |
|
|
6 | (3) |
|
Running pipeline using workflow managers |
|
|
9 | (1) |
|
|
10 | (3) |
|
Defining pipelines flexibly in (Python) code 10* Scheduling and executing pipelines |
|
|
11 | (2) |
|
2 Monitoring and handling failures |
|
|
13 | (27) |
|
Incremental loading and backfilling |
|
|
15 | (2) |
|
|
17 | (1) |
|
Reasons to choose Airflow |
|
|
17 | (1) |
|
Reasons not to choose Airflow |
|
|
17 | (1) |
|
1.4 The rest of this book |
|
|
18 | (3) |
|
Anatomy of an Airflow DAG |
|
|
20 | (1) |
|
2.1 Collecting data from numerous sources |
|
|
21 | (1) |
|
|
21 | (1) |
|
2.2 Writing your first Airflow DAG |
|
|
22 | (7) |
|
|
26 | (1) |
|
Running arbitrary Python code |
|
|
27 | (2) |
|
2.3 Running a DAG in Airflow |
|
|
29 | (4) |
|
Running Airflow in a Python environment |
|
|
29 | (1) |
|
Running Airflow in Docker containers |
|
|
30 | (1) |
|
Inspecting the Airflow UI |
|
|
31 | (2) |
|
2.4 Running at regular intervals |
|
|
33 | (3) |
|
2.5 Handling failing tasks |
|
|
36 | (4) |
|
|
40 | (20) |
|
3.1 An example: Processing user events |
|
|
41 | (1) |
|
3.2 Running at regular intervals |
|
|
42 | (4) |
|
Defining scheduling intervals |
|
|
42 | (2) |
|
|
44 | (2) |
|
Frequency-based intervals |
|
|
46 | (1) |
|
3.3 Processing data incrementally |
|
|
46 | (6) |
|
Fetching events incrementally |
|
|
46 | (2) |
|
Dynamic time references using execution dates |
|
|
48 | (2) |
|
|
50 | (2) |
|
3.4 Understanding Airflow's execution dates |
|
|
52 | (2) |
|
Executing work in fixed-length intervals |
|
|
52 | (2) |
|
3.5 Using backfilling to fill in past gaps |
|
|
54 | (1) |
|
Executing work back in time |
|
|
54 | (1) |
|
3.6 Best practices for designing tasks |
|
|
55 | (5) |
|
|
55 | (2) |
|
|
57 | (3) |
|
4 Templating tasks using the Airflow context |
|
|
60 | (25) |
|
4.1 Inspecting data for processing with Airflow |
|
|
61 | (2) |
|
Determining how to load incremental data |
|
|
61 | (2) |
|
4.2 Task context and Jinja templating |
|
|
63 | (14) |
|
Templating operator arguments |
|
|
64 | (2) |
|
What is available for templating? |
|
|
66 | (2) |
|
Templating the PythonOperator |
|
|
68 | (5) |
|
Providing variables to the PythonOperator |
|
|
73 | (2) |
|
Inspecting templated arguments |
|
|
75 | (2) |
|
4.3 Hooking up other systems |
|
|
77 | (8) |
|
5 Defining dependencies between tasks |
|
|
85 | (28) |
|
|
86 | (4) |
|
|
86 | (1) |
|
|
87 | (3) |
|
|
90 | (7) |
|
|
90 | (2) |
|
|
92 | (5) |
|
|
97 | (3) |
|
|
97 | (1) |
|
|
98 | (2) |
|
|
100 | (1) |
|
5.4 More about trigger rules |
|
|
100 | (4) |
|
|
101 | (1) |
|
|
102 | (1) |
|
|
103 | (1) |
|
5.5 Sharing data between tasks |
|
|
104 | (4) |
|
|
104 | (3) |
|
|
107 | (1) |
|
Using custom XCom backends |
|
|
108 | (1) |
|
5.6 Chaining Python tasks with the Taskflow API |
|
|
108 | (5) |
|
Simplifying Python tasks with the Taskflow API |
|
|
109 | (2) |
|
When (not) to use the Taskfloiv API |
|
|
111 | (2) |
|
|
113 | (140) |
|
|
115 | (20) |
|
6.1 Polling conditions with sensors |
|
|
116 | (6) |
|
Polling custom conditions |
|
|
119 | (1) |
|
Sensors outside the happy flow |
|
|
120 | (2) |
|
6.2 Triggering other DAGs |
|
|
122 | (9) |
|
Backfilling with the TriggerDagRunOperator |
|
|
126 | (1) |
|
Polling the stale of other DA Gs |
|
|
127 | (4) |
|
6.3 Starting workflows with REST/CLI |
|
|
131 | (4) |
|
7 Communicating with external systems |
|
|
135 | (22) |
|
7.1 Connecting to cloud services |
|
|
136 | (14) |
|
Installing extra dependencies |
|
|
137 | (1) |
|
Developing a machine learning model |
|
|
137 | (6) |
|
Developing locally with external systems |
|
|
143 | (7) |
|
7.2 Moving data from between systems |
|
|
150 | (7) |
|
Implementing a PostgresToS3Operator |
|
|
151 | (4) |
|
Outsourcing the heavy work |
|
|
155 | (2) |
|
8 Building custom components |
|
|
157 | (29) |
|
8.1 Starting with a PythonOperator |
|
|
158 | (8) |
|
Simulating a movie rating API |
|
|
158 | (3) |
|
Fetching ratings from the API |
|
|
161 | (3) |
|
|
164 | (2) |
|
8.2 Building a custom hook |
|
|
166 | (7) |
|
|
166 | (6) |
|
Building our DAG with the MovielensHook |
|
|
172 | (1) |
|
8.3 Building a custom operator |
|
|
173 | (5) |
|
Defining a custom operator |
|
|
174 | (1) |
|
Building an operator for fetching ratings |
|
|
175 | (3) |
|
8.4 Building custom sensors |
|
|
178 | (3) |
|
8.5 Packaging your components |
|
|
181 | (5) |
|
Bootstrapping a Python package |
|
|
182 | (2) |
|
|
184 | (2) |
|
|
186 | (34) |
|
9.1 Getting started with testing |
|
|
187 | (16) |
|
Integrity testing all DAGs |
|
|
187 | (6) |
|
Setting up a CI/CD pipeline |
|
|
193 | (2) |
|
|
195 | (1) |
|
|
196 | (5) |
|
Testing with files on disk |
|
|
201 | (2) |
|
9.2 Working with DAGs and task context in tests |
|
|
203 | (12) |
|
Working with external systems |
|
|
208 | (7) |
|
9.3 Using tests for development |
|
|
215 | (3) |
|
|
217 | (1) |
|
9.4 Emulate production environments with Whirl |
|
|
218 | (1) |
|
9.5 Create DTAP environments |
|
|
219 | (1) |
|
10 Running tasks in containers |
|
|
220 | (33) |
|
10.1 Challenges of many different operators |
|
|
221 | (2) |
|
Operator interfaces and implementations |
|
|
221 | (1) |
|
Complex and conflicting dependencies |
|
|
222 | (1) |
|
Moving toward a generic operator |
|
|
223 | (1) |
|
10.2 Introducing containers |
|
|
223 | (7) |
|
|
223 | (1) |
|
Running our first Docker container |
|
|
224 | (1) |
|
|
225 | (2) |
|
Persisting data using volumes |
|
|
227 | (3) |
|
10.3 Containers and Airflow |
|
|
230 | (2) |
|
|
230 | (1) |
|
|
231 | (1) |
|
10.4 Running tasks in Docker |
|
|
232 | (8) |
|
Introducing the DockerOperator |
|
|
232 | (1) |
|
Creating container images for tasks |
|
|
233 | (3) |
|
Building a DAG with Docker tasks |
|
|
236 | (3) |
|
|
239 | (1) |
|
10.5 Running tasks in Kubernetes |
|
|
240 | (13) |
|
|
240 | (2) |
|
|
242 | (3) |
|
Using the KubernetesPodOperator |
|
|
245 | (3) |
|
Diagnosing Kubernetes-related issues |
|
|
248 | (2) |
|
Differences with Docker-based workflows |
|
|
250 | (3) |
|
Part 3 Airflow in practice |
|
|
253 | (112) |
|
|
255 | (26) |
|
|
256 | (14) |
|
|
256 | (4) |
|
Manage credentials centrally |
|
|
260 | (1) |
|
Specify configuration details consistently |
|
|
261 | (2) |
|
Avoid doing any computation in your DAG definition |
|
|
263 | (2) |
|
Use factories to generate common patterns |
|
|
265 | (4) |
|
Group related tasks using task groups |
|
|
269 | (1) |
|
Create new DAGs for big changes |
|
|
270 | (1) |
|
11.2 Designing reproducible tasks |
|
|
270 | (2) |
|
Always require tasks to be idempotent |
|
|
271 | (1) |
|
Task results should be deterministic |
|
|
271 | (1) |
|
Design tasks using functional paradigms |
|
|
272 | (1) |
|
11.3 Handling data efficiently |
|
|
272 | (4) |
|
Limit the amount of data being processed |
|
|
272 | (2) |
|
Incremental loading/processing |
|
|
274 | (1) |
|
|
275 | (1) |
|
Don't store data on local file systems |
|
|
275 | (1) |
|
Offload work to external/source systems |
|
|
276 | (1) |
|
11.4 Managing your resources |
|
|
276 | (5) |
|
Managing concurrency using pools |
|
|
276 | (2) |
|
Detecting long-running tasks using SLAs and alerts |
|
|
278 | (3) |
|
12 Operating Airflow in production |
|
|
281 | (41) |
|
12.1 Airflow architectures |
|
|
282 | (8) |
|
Which executor is right for me? |
|
|
284 | (1) |
|
Configuring a metastore for Airflow |
|
|
284 | (2) |
|
A closer look at the scheduler |
|
|
286 | (4) |
|
12.2 Installing each executor |
|
|
290 | (12) |
|
Setting up the SequentialExecutor |
|
|
291 | (1) |
|
Setting up the LocalExecutor |
|
|
292 | (1) |
|
Setting up the CeleryExecutor |
|
|
293 | (3) |
|
Setting up the KubernetesExecutor |
|
|
296 | (6) |
|
12.3 Capturing logs of all Airflow processes |
|
|
302 | (3) |
|
Capturing the webserver output |
|
|
303 | (1) |
|
Capturing the scheduler output |
|
|
303 | (1) |
|
|
304 | (1) |
|
Sending logs to remote storage |
|
|
305 | (1) |
|
12.4 Visualizing and monitoring Airflow metrics |
|
|
305 | (9) |
|
Collecting metrics from Airflow |
|
|
306 | (1) |
|
Configuring Airflow to send metrics |
|
|
307 | (1) |
|
Configuring Prometheus to collect metrics |
|
|
308 | (2) |
|
Creating dashboards with Grafana |
|
|
310 | (2) |
|
|
312 | (2) |
|
12.5 How to get notified of a failing task |
|
|
314 | (8) |
|
Alerting within DAGs and operators |
|
|
314 | (2) |
|
Defining service-level agreements |
|
|
316 | (2) |
|
Scalability and performance |
|
|
318 | (1) |
|
Controlling the maximum number of running tasks |
|
|
318 | (1) |
|
System performance configurations |
|
|
319 | (1) |
|
Running multiple schedulers |
|
|
320 | (2) |
|
|
322 | (22) |
|
13.1 Securing the Airflow web interface |
|
|
323 | (4) |
|
Adding users to the RBAC interface |
|
|
324 | (3) |
|
Configuring the RBAC interface |
|
|
327 | (1) |
|
13.2 Encrypting data at rest |
|
|
327 | (3) |
|
|
328 | (2) |
|
13.3 Connecting with an LDAP service |
|
|
330 | (3) |
|
|
330 | (3) |
|
Fetching users from an LDAP service |
|
|
333 | (1) |
|
13.4 Encrypting traffic to the webserver |
|
|
333 | (6) |
|
|
334 | (2) |
|
Configuring a certificate for HTTPS |
|
|
336 | (3) |
|
13.5 Fetching credentials from secret management systems |
|
|
339 | (5) |
|
14 Project: Finding the fastest way to get around NYC |
|
|
344 | (21) |
|
14.1 Understanding the data |
|
|
347 | (3) |
|
|
348 | (1) |
|
|
348 | (2) |
|
Deciding on a plan of approach |
|
|
350 | (1) |
|
|
350 | (6) |
|
Downloading Citi Bike data |
|
|
351 | (2) |
|
Downloading Yellow Cab data |
|
|
353 | (3) |
|
14.3 Applying similar transformations to data |
|
|
356 | (4) |
|
14.4 Structuring a data pipeline |
|
|
360 | (1) |
|
14.5 Developing idempotent data pipelines |
|
|
361 | (4) |
|
|
365 | (71) |
|
|
367 | (8) |
|
15.1 Designing (cloud) deployment strategies |
|
|
368 | (1) |
|
15.2 Cloud-specific operators and hooks |
|
|
369 | (1) |
|
|
370 | (2) |
|
|
371 | (1) |
|
|
371 | (1) |
|
Amazon Managed Workflows for Apache Airflow |
|
|
372 | (1) |
|
15.4 Choosing a deployment strategy |
|
|
372 | (3) |
|
|
375 | (19) |
|
16.1 Deploying Airflow in AWS |
|
|
375 | (6) |
|
|
376 | (1) |
|
|
377 | (1) |
|
|
378 | (1) |
|
Scaling with the CeleryExecutor |
|
|
378 | (2) |
|
|
380 | (1) |
|
16.2 AWS-specific hooks and operators |
|
|
381 | (2) |
|
16.3 Use case: Serverless movie ranking with AWS Athena |
|
|
383 | (11) |
|
|
383 | (1) |
|
|
384 | (3) |
|
|
387 | (6) |
|
|
393 | (1) |
|
|
394 | (18) |
|
17.1 Deploying Airflow in Azure |
|
|
394 | (4) |
|
|
395 | (1) |
|
|
395 | (2) |
|
Scaling with the CeleryExecutor |
|
|
397 | (1) |
|
|
398 | (1) |
|
17.2 Azure-specific hooks/operators |
|
|
398 | (2) |
|
17.3 Example: Serverless movie ranking with Azure Synapse |
|
|
400 | (12) |
|
|
400 | (1) |
|
|
401 | (3) |
|
|
404 | (6) |
|
|
410 | (2) |
|
|
412 | (24) |
|
18.1 Deploying Airflow in GCP |
|
|
413 | (9) |
|
|
413 | (2) |
|
Deploying on GKE with Helm |
|
|
415 | (2) |
|
Integrating with Google services |
|
|
417 | (2) |
|
|
419 | (1) |
|
Scaling with the CeleryExecutor |
|
|
419 | (3) |
|
18.2 GCP-specific hooks and operators |
|
|
422 | (5) |
|
18.3 Use case: Serverless movie ranking on GCP |
|
|
427 | (9) |
|
|
428 | (1) |
|
Getting data into BigQuery |
|
|
429 | (3) |
|
|
432 | (4) |
Appendix A Running code samples |
|
436 | (3) |
Appendix B Package structures Airflow 1 and 2 |
|
439 | (4) |
Appendix C Prometheus metric mapping |
|
443 | (2) |
Index |
|
445 | |