
Data Pipelines with Apache Airflow [Paperback]

  • Format: Paperback / softback, 425 pages, height x width x thickness: 234x186x28 mm, weight: 860 g
  • Publication date: 07-Jun-2021
  • Publisher: Manning Publications
  • ISBN-10: 1617296902
  • ISBN-13: 9781617296901
Pipelines can be challenging to manage, especially when your data has to flow through a collection of application components, servers, and cloud services. Airflow lets you schedule, restart, and backfill pipelines, and its easy-to-use UI and Python-scripted workflows have users praising its incredible flexibility. Data Pipelines with Apache Airflow takes you through best practices for creating pipelines for multiple tasks, including data lakes, cloud deployments, and data science.

Data Pipelines with Apache Airflow teaches you the ins and outs of the Directed Acyclic Graphs (DAGs) that power Airflow, and how to write your own DAGs to meet the needs of your projects. With complete coverage of both foundational and lesser-known features, when you're done you'll be set to start using Airflow for seamless data pipeline development and management.
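
As a rough illustration of what such a DAG looks like in code, here is a minimal sketch of a daily two-task pipeline written against the Airflow 2 Python API; the DAG id, task names, and callables are illustrative placeholders, not examples taken from the book.

import pendulum

from airflow import DAG
from airflow.operators.python import PythonOperator


def _fetch_events(**context):
    # Placeholder: download the raw events for the run's logical date.
    print(f"Fetching events for {context['ds']}")


def _clean_events(**context):
    # Placeholder: transform the fetched events into a clean dataset.
    print(f"Cleaning events for {context['ds']}")


# A DAG groups tasks and tells the scheduler when to run them.
with DAG(
    dag_id="example_event_pipeline",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
):
    fetch_events = PythonOperator(task_id="fetch_events", python_callable=_fetch_events)
    clean_events = PythonOperator(task_id="clean_events", python_callable=_clean_events)

    # The >> operator declares the dependency: fetch before clean.
    fetch_events >> clean_events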

Key Features

  • Framework foundation and best practices
  • Airflow's execution and dependency system
  • Testing Airflow DAGs
  • Running Airflow in production

For data-savvy developers, DevOps and data engineers, and system administrators with intermediate Python skills.

About the technology

Data pipelines are used to extract, transform, and load data to and from multiple sources, routing it wherever it's needed, whether that's visualisation tools, business intelligence dashboards, or machine learning models. Airflow streamlines the whole process, giving you one tool for programmatically developing and monitoring batch data pipelines, and integrating all the pieces you use in your data stack.
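
For instance, each scheduled run receives a logical date that tasks can reference through Airflow's Jinja templating, so every run, including backfilled runs for past dates, processes only its own slice of data. A minimal sketch, assuming a hypothetical load_events.py script and output path:

import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_incremental_load",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule_interval="@daily",
):
    # {{ ds }} is rendered per run as YYYY-MM-DD, so each run loads
    # exactly one day's partition; /scripts/load_events.py is made up.
    load_partition = BashOperator(
        task_id="load_partition",
        bash_command=(
            "python /scripts/load_events.py "
            "--date {{ ds }} --output /data/events/{{ ds }}.json"
        ),
    )

Backfilling a past date range then simply re-executes the same templated task once for each missed day.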

Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies including Heineken, Unilever, and Booking.com. Bas is a committer, and both Bas and Julian are active contributors to Apache Airflow.
Preface xv
Acknowledgments xvii
About this book xix
About the authors xxiii
About the cover illustration xxiv
PART 1 Getting started
1(112)
1 Meet Apache Airflow
3(10)
1.1 Introducing data pipelines
4(6)
Data pipelines as graphs
4(2)
Executing a pipeline graph
6(1)
Pipeline graphs vs. sequential scripts
6(3)
Running pipelines using workflow managers
9(1)
1.2 Introducing Airflow
10(3)
Defining pipelines flexibly in (Python) code
10(1)
Scheduling and executing pipelines
11(2)
Monitoring and handling failures
13(2)
Incremental loading and backfilling
15(2)
1.3 When to use Airflow
17(1)
Reasons to choose Airflow
17(1)
Reasons not to choose Airflow
17(1)
1.4 The rest of this book
18(3)
2 Anatomy of an Airflow DAG
20(20)
2.1 Collecting data from numerous sources
21(1)
Exploring the data
21(1)
2.2 Writing your first Airflow DAG
22(7)
Tasks vs. Operators
26(1)
Running arbitrary Python code
27(2)
2.3 Running a DAG in Airflow
29(4)
Running Airflow in a Python environment
29(1)
Running Airflow in Docker containers
30(1)
Inspecting the Airflow UI
31(2)
2.4 Running at regular intervals
33(3)
2.5 Handling failing tasks
36(4)
3 Scheduling in Airflow
40(20)
3.1 An example: Processing user events
41(1)
3.2 Running at regular intervals
42(4)
Defining scheduling intervals
42(2)
Cron-based intervals
44(2)
Frequency-based intervals
46(1)
3.3 Processing data incrementally
46(6)
Fetching events incrementally
46(2)
Dynamic time references using execution dates
48(2)
Partitioning your data
50(2)
3.4 Understanding Airflow's execution dates
52(2)
Executing work in fixed-length intervals
52(2)
3.5 Using backfilling to fill in past gaps
54(1)
Executing work back in time
54(1)
3.6 Best practices for designing tasks
55(5)
Atomicity
55(2)
Idempotency
57(3)
4 Templating tasks using the Airflow context
60(25)
4.1 Inspecting data for processing with Airflow
61(2)
Determining how to load incremental data
61(2)
4.2 Task context and Jinja templating
63(14)
Templating operator arguments
64(2)
What is available for templating?
66(2)
Templating the PythonOperator
68(5)
Providing variables to the PythonOperator
73(2)
Inspecting templated arguments
75(2)
4.3 Hooking up other systems
77(8)
5 Defining dependencies between tasks
85(28)
5.1 Basic dependencies
86(4)
Linear dependencies
86(1)
Fan-in/-out dependencies
87(3)
5.2 Branching
90(7)
Branching within tasks
90(2)
Branching within the DAG
92(5)
5.3 Conditional tasks
97(3)
Conditions within tasks
97(1)
Making tasks conditional
98(2)
Using built-in operators
100(1)
5.4 More about trigger rules
100(4)
What is a trigger rule?
101(1)
The effect of failures
102(1)
Other trigger rules
103(1)
5.5 Sharing data between tasks
104(4)
Sharing data using XComs
104(3)
When (not) to use XComs
107(1)
Using custom XCom backends
108(1)
5.6 Chaining Python tasks with the Taskflow API
108(5)
Simplifying Python tasks with the Taskflow API
109(2)
When (not) to use the Taskflow API
111(2)
PART 2 Beyond the basics
113(140)
6 Triggering workflows
115(20)
6.1 Polling conditions with sensors
116(6)
Polling custom conditions
119(1)
Sensors outside the happy flow
120(2)
6.2 Triggering other DAGs
122(9)
Backfilling with the TriggerDagRunOperator
126(1)
Polling the state of other DAGs
127(4)
6.3 Starting workflows with REST/CLI
131(4)
7 Communicating with external systems
135(22)
7.1 Connecting to cloud services
136(14)
Installing extra dependencies
137(1)
Developing a machine learning model
137(6)
Developing locally with external systems
143(7)
7.2 Moving data between systems
150(7)
Implementing a PostgresToS3Operator
151(4)
Outsourcing the heavy work
155(2)
8 Building custom components
157(29)
8.1 Starting with a PythonOperator
158(8)
Simulating a movie rating API
158(3)
Fetching ratings from the API
161(3)
Building the actual DAG
164(2)
8.2 Building a custom hook
166(7)
Designing a custom hook
166(6)
Building our DAG with the MovielensHook
172(1)
8.3 Building a custom operator
173(5)
Defining a custom operator
174(1)
Building an operator for fetching ratings
175(3)
8.4 Building custom sensors
178(3)
8.5 Packaging your components
181(5)
Bootstrapping a Python package
182(2)
Installing your package
184(2)
9 Testing
186(34)
9.1 Getting started with testing
187(16)
Integrity testing all DAGs
187(6)
Setting up a CI/CD pipeline
193(2)
Writing unit tests
195(1)
Pytest project structure
196(5)
Testing with files on disk
201(2)
9.2 Working with DAGs and task context in tests
203(12)
Working with external systems
208(7)
9.3 Using tests for development
215(3)
Testing complete DAGs
217(1)
9.4 Emulate production environments with Whirl
218(1)
9.5 Create DTAP environments
219(1)
10 Running tasks in containers
220(33)
10.1 Challenges of many different operators
221(2)
Operator interfaces and implementations
221(1)
Complex and conflicting dependencies
222(1)
Moving toward a generic operator
223(1)
10.2 Introducing containers
223(7)
What are containers?
223(1)
Running our first Docker container
224(1)
Creating a Docker image
225(2)
Persisting data using volumes
227(3)
10.3 Containers and Airflow
230(2)
Tasks in containers
230(1)
Why use containers?
231(1)
10.4 Running tasks in Docker
232(8)
Introducing the DockerOperator
232(1)
Creating container images for tasks
233(3)
Building a DAG with Docker tasks
236(3)
Docker-based workflow
239(1)
10.5 Running tasks in Kubernetes
240(13)
Introducing Kubernetes
240(2)
Setting up Kubernetes
242(3)
Using the KubernetesPodOperator
245(3)
Diagnosing Kubernetes-related issues
248(2)
Differences with Docker-based workflows
250(3)
PART 3 Airflow in practice
253(112)
11 Best practices
255(26)
11.1 Writing clean DAGs
256(14)
Use style conventions
256(4)
Manage credentials centrally
260(1)
Specify configuration details consistently
261(2)
Avoid doing any computation in your DAG definition
263(2)
Use factories to generate common patterns
265(4)
Group related tasks using task groups
269(1)
Create new DAGs for big changes
270(1)
11.2 Designing reproducible tasks
270(2)
Always require tasks to be idempotent
271(1)
Task results should be deterministic
271(1)
Design tasks using functional paradigms
272(1)
11.3 Handling data efficiently
272(4)
Limit the amount of data being processed
272(2)
Incremental loading/processing
274(1)
Cache intermediate data
275(1)
Don't store data on local file systems
275(1)
Offload work to external/source systems
276(1)
11.4 Managing your resources
276(5)
Managing concurrency using pools
276(2)
Detecting long-running tasks using SLAs and alerts
278(3)
12 Operating Airflow in production
281(41)
12.1 Airflow architectures
282(8)
Which executor is right for me?
284(1)
Configuring a metastore for Airflow
284(2)
A closer look at the scheduler
286(4)
12.2 Installing each executor
290(12)
Setting up the SequentialExecutor
291(1)
Setting up the LocalExecutor
292(1)
Setting up the CeleryExecutor
293(3)
Setting up the KubernetesExecutor
296(6)
12.3 Capturing logs of all Airflow processes
302(3)
Capturing the webserver output
303(1)
Capturing the scheduler output
303(1)
Capturing task logs
304(1)
Sending logs to remote storage
305(1)
12.4 Visualizing and monitoring Airflow metrics
305(9)
Collecting metrics from Airflow
306(1)
Configuring Airflow to send metrics
307(1)
Configuring Prometheus to collect metrics
308(2)
Creating dashboards with Grafana
310(2)
What should you monitor?
312(2)
12.5 How to get notified of a failing task
314(8)
Alerting within DAGs and operators
314(2)
Defining service-level agreements
316(2)
12.6 Scalability and performance
318(1)
Controlling the maximum number of running tasks
318(1)
System performance configurations
319(1)
Running multiple schedulers
320(2)
13 Securing Airflow
322(22)
13.1 Securing the Airflow web interface
323(4)
Adding users to the RBAC interface
324(3)
Configuring the RBAC interface
327(1)
13.2 Encrypting data at rest
327(3)
Creating a Fernet key
328(2)
13.3 Connecting with an LDAP service
330(3)
Understanding LDAP
330(3)
Fetching users from an LDAP service
333(1)
13.4 Encrypting traffic to the webserver
333(6)
Understanding HTTPS
334(2)
Configuring a certificate for HTTPS
336(3)
13.5 Fetching credentials from secret management systems
339(5)
14 Project: Finding the fastest way to get around NYC
344(21)
14.1 Understanding the data
347(3)
Yellow Cab file share
348(1)
Citi Bike REST API
348(2)
Deciding on a plan of approach
350(1)
14.2 Extracting the data
350(6)
Downloading Citi Bike data
351(2)
Downloading Yellow Cab data
353(3)
14.3 Applying similar transformations to data
356(4)
14.4 Structuring a data pipeline
360(1)
14.5 Developing idempotent data pipelines
361(4)
PART 4 In the clouds
365(71)
15 Airflow in the clouds
367(8)
15.1 Designing (cloud) deployment strategies
368(1)
15.2 Cloud-specific operators and hooks
369(1)
15.3 Managed services
370(2)
Astronomer.io
371(1)
Google Cloud Composer
371(1)
Amazon Managed Workflows for Apache Airflow
372(1)
15.4 Choosing a deployment strategy
372(3)
16 Airflow on AWS
375(19)
16.1 Deploying Airflow in AWS
375(6)
Picking cloud services
376(1)
Designing the network
377(1)
Adding DAG syncing
378(1)
Scaling with the CeleryExecutor
378(2)
Further steps
380(1)
16.2 AWS-specific hooks and operators
381(2)
16.3 Use case: Serverless movie ranking with AWS Athena
383(11)
Overview
383(1)
Setting up resources
384(3)
Building the DAG
387(6)
Cleaning up
393(1)
17 Airflow on Azure
394(18)
17.1 Deploying Airflow in Azure
394(4)
Picking services
395(1)
Designing the network
395(2)
Scaling with the CeleryExecutor
397(1)
Further steps
398(1)
17.2 Azure-specific hooks/operators
398(2)
17.3 Example: Serverless movie ranking with Azure Synapse
400(12)
Overview
400(1)
Setting up resources
401(3)
Building the DAG
404(6)
Cleaning up
410(2)
18 Airflow in GCP
412(24)
18.1 Deploying Airflow in GCP
413(9)
Picking services
413(2)
Deploying on GKE with Helm
415(2)
Integrating with Google services
417(2)
Designing the network
419(1)
Scaling with the CeleryExecutor
419(3)
18.2 GCP-specific hooks and operators
422(5)
18.3 Use case: Serverless movie ranking on GCP
427(9)
Uploading to GCS
428(1)
Getting data into BigQuery
429(3)
Extracting top ratings
432(4)
Appendix A Running code samples 436(3)
Appendix B Package structures Airflow 1 and 2 439(4)
Appendix C Prometheus metric mapping 443(2)
Index 445