Preface |
|
xiii | |
|
Part I Foundation and Building Blocks |
|
|
|
1 Data Engineering Described |
|
|
3 | (32) |
|
What Is Data Engineering? |
|
|
3 | (1) |
|
|
4 | (1) |
|
The Data Engineering Lifecycle |
|
|
5 | (1) |
|
Evolution of the Data Engineer |
|
|
6 | (5) |
|
Data Engineering and Data Science |
|
|
11 | (2) |
|
Data Engineering Skills and Activities |
|
|
13 | (1) |
|
Data Maturity and the Data Engineer |
|
|
13 | (4) |
|
The Background and Skills of a Data Engineer |
|
|
17 | (1) |
|
Business Responsibilities |
|
|
18 | (1) |
|
Technical Responsibilities |
|
|
19 | (2) |
|
The Continuum of Data Engineering Roles, from A to B |
|
|
21 | (1) |
|
Data Engineers Inside an Organization |
|
|
22 | (1) |
|
Internal-Facing Versus External-Facing Data Engineers |
|
|
23 | (1) |
|
Data Engineers and Other Technical Roles |
|
|
24 | (4) |
|
Data Engineers and Business Leadership |
|
|
28 | (3) |
|
|
31 | (1) |
|
|
32 | (3) |
|
2 The Data Engineering Lifecycle |
|
|
35 | (38) |
|
What Is the Data Engineering Lifecycle? |
|
|
35 | (1) |
|
The Data Lifecycle Versus the Data Engineering Lifecycle |
|
|
36 | (1) |
|
Generation: Source Systems |
|
|
37 | (3) |
|
|
40 | (1) |
|
|
41 | (4) |
|
|
45 | (1) |
|
|
46 | (4) |
|
Major Undercurrents Across the Data Engineering Lifecycle |
|
|
50 | (1) |
|
|
51 | (1) |
|
|
52 | (9) |
|
|
61 | (5) |
|
|
66 | (1) |
|
|
66 | (2) |
|
|
68 | (2) |
|
|
70 | (1) |
|
|
71 | (2) |
|
3 Designing Good Data Architecture |
|
|
73 | (46) |
|
What Is Data Architecture? |
|
|
73 | (1) |
|
Enterprise Architecture Defined |
|
|
74 | (3) |
|
Data Architecture Defined |
|
|
77 | (1) |
|
|
78 | (1) |
|
Principles of Good Data Architecture |
|
|
79 | (1) |
|
Principle 1: Choose Common Components Wisely |
|
|
80 | (1) |
|
Principle 2: Plan for Failure |
|
|
81 | (1) |
|
Principle 3: Architect for Scalability |
|
|
82 | (1) |
|
Principle 4: Architecture Is Leadership |
|
|
82 | (1) |
|
Principle 5: Always Be Architecting |
|
|
83 | (1) |
|
Principle 6: Build Loosely Coupled Systems |
|
|
83 | (2) |
|
Principle 7: Make Reversible Decisions |
|
|
85 | (1) |
|
Principle 8: Prioritize Security |
|
|
86 | (1) |
|
Principle 9: Embrace FinOps |
|
|
87 | (2) |
|
Major Architecture Concepts |
|
|
89 | (1) |
|
|
89 | (1) |
|
Distributed Systems, Scalability, and Designing for Failure |
|
|
90 | (2) |
|
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices |
|
|
92 | (4) |
|
User Access: Single Versus Multitenant |
|
|
96 | (1) |
|
Event-Driven Architecture |
|
|
97 | (1) |
|
Brownfield Versus Greenfield Projects |
|
|
98 | (2) |
|
Examples and Types of Data Architecture |
|
|
100 | (1) |
|
|
100 | (3) |
|
|
103 | (1) |
|
Convergence, Next-Generation Data Lakes, and the Data Platform |
|
|
104 | (1) |
|
|
105 | (1) |
|
|
106 | (1) |
|
|
107 | (1) |
|
The Dataflow Model and Unified Batch and Streaming |
|
|
107 | (1) |
|
|
108 | (3) |
|
|
111 | (1) |
|
Other Data Architecture Examples |
|
|
112 | (1) |
|
Who's Involved with Designing a Data Architecture? |
|
|
113 | (1) |
|
|
113 | (1) |
|
|
113 | (6) |
|
4 Choosing Technologies Across the Data Engineering Lifecycle |
|
|
119 | (40) |
|
Team Size and Capabilities |
|
|
120 | (1) |
|
|
121 | (1) |
|
|
121 | (1) |
|
Cost Optimization and Business Value |
|
|
122 | (1) |
|
|
122 | (1) |
|
Total Opportunity Cost of Ownership |
|
|
123 | (1) |
|
|
124 | (1) |
|
Today Versus the Future: Immutable Versus Transitory Technologies |
|
|
124 | (2) |
|
|
126 | (1) |
|
|
127 | (1) |
|
|
127 | (1) |
|
|
128 | (3) |
|
|
131 | (1) |
|
|
132 | (1) |
|
Decentralized: Blockchain and the Edge |
|
|
133 | (1) |
|
|
133 | (1) |
|
Cloud Repatriation Arguments |
|
|
134 | (2) |
|
|
136 | (1) |
|
|
137 | (4) |
|
Proprietary Walled Gardens |
|
|
141 | (1) |
|
|
142 | (1) |
|
|
143 | (1) |
|
|
143 | (1) |
|
|
144 | (2) |
|
The Distributed Monolith Pattern |
|
|
146 | (1) |
|
|
146 | (1) |
|
Serverless Versus Servers |
|
|
147 | (1) |
|
|
147 | (1) |
|
|
148 | (1) |
|
How to Evaluate Server Versus Serverless |
|
|
149 | (1) |
|
|
150 | (1) |
|
Optimization, Performance, and the Benchmark Wars |
|
|
151 | (1) |
|
|
152 | (1) |
|
Nonsensical Cost Comparisons |
|
|
152 | (1) |
|
|
152 | (1) |
|
|
153 | (1) |
|
Undercurrents and Their Impacts on Choosing Technologies |
|
|
153 | (1) |
|
|
153 | (1) |
|
|
153 | (1) |
|
|
154 | (1) |
|
Orchestration Example: Airflow |
|
|
154 | (1) |
|
|
155 | (1) |
|
|
155 | (1) |
|
|
155 | (4) |
|
Part II The Data Engineering Lifecycle in Depth |
|
|
|
5 Data Generation in Source Systems |
|
|
159 | (34) |
|
Sources of Data: How Is Data Created? |
|
|
160 | (1) |
|
Source Systems: Main Ideas |
|
|
160 | (1) |
|
Files and Unstructured Data |
|
|
160 | (1) |
|
|
161 | (1) |
|
Application Databases (OLTP Systems) |
|
|
161 | (2) |
|
Online Analytical Processing System |
|
|
163 | (1) |
|
|
163 | (1) |
|
|
164 | (1) |
|
|
165 | (1) |
|
|
166 | (1) |
|
|
166 | (1) |
|
|
167 | (1) |
|
|
168 | (1) |
|
Source System Practical Details |
|
|
169 | (1) |
|
|
170 | (8) |
|
|
178 | (2) |
|
|
180 | (1) |
|
|
181 | (1) |
|
Message Queues and Event-Streaming Platforms |
|
|
181 | (4) |
|
|
185 | (2) |
|
Undercurrents and Their Impact on Source Systems |
|
|
187 | (1) |
|
|
187 | (1) |
|
|
188 | (1) |
|
|
188 | (1) |
|
|
189 | (1) |
|
|
190 | (1) |
|
|
191 | (1) |
|
|
191 | (1) |
|
|
192 | (1) |
|
|
193 | (44) |
|
Raw Ingredients of Data Storage |
|
|
195 | (1) |
|
|
195 | (2) |
|
|
197 | (1) |
|
|
198 | (1) |
|
|
199 | (1) |
|
|
199 | (1) |
|
|
200 | (1) |
|
|
201 | (1) |
|
|
201 | (1) |
|
Single Machine Versus Distributed Storage |
|
|
202 | (1) |
|
Eventual Versus Strong Consistency |
|
|
202 | (1) |
|
|
203 | (3) |
|
|
206 | (3) |
|
|
209 | (6) |
|
Cache and Memory-Based Storage Systems |
|
|
215 | (1) |
|
The Hadoop Distributed File System |
|
|
215 | (1) |
|
|
216 | (1) |
|
Indexes, Partitioning, and Clustering |
|
|
217 | (2) |
|
Data Engineering Storage Abstractions |
|
|
219 | (1) |
|
|
219 | (1) |
|
|
220 | (1) |
|
|
220 | (1) |
|
|
221 | (1) |
|
Stream-to-Batch Storage Architecture |
|
|
221 | (1) |
|
Big Ideas and Trends in Storage |
|
|
222 | (1) |
|
|
222 | (1) |
|
|
223 | (1) |
|
|
223 | (1) |
|
Separation of Compute from Storage |
|
|
224 | (3) |
|
Data Storage Lifecycle and Data Retention |
|
|
227 | (3) |
|
Single-Tenant Versus Multitenant Storage |
|
|
230 | (1) |
|
|
231 | (1) |
|
|
232 | (1) |
|
|
232 | (1) |
|
|
232 | (1) |
|
|
233 | (1) |
|
|
234 | (1) |
|
|
234 | (1) |
|
|
234 | (1) |
|
|
234 | (1) |
|
|
235 | (2) |
|
|
237 | (38) |
|
|
238 | (1) |
|
Key Engineering Considerations for the Ingestion Phase |
|
|
239 | (1) |
|
Bounded Versus Unbounded Data |
|
|
240 | (1) |
|
|
241 | (1) |
|
Synchronous Versus Asynchronous Ingestion |
|
|
242 | (1) |
|
Serialization and Deserialization |
|
|
243 | (1) |
|
Throughput and Scalability |
|
|
243 | (1) |
|
Reliability and Durability |
|
|
244 | (1) |
|
|
245 | (3) |
|
Push Versus Pull Versus Poll Patterns |
|
|
248 | (1) |
|
Batch Ingestion Considerations |
|
|
248 | (2) |
|
Snapshot or Differential Extraction |
|
|
250 | (1) |
|
File-Based Export and Ingestion |
|
|
250 | (1) |
|
|
250 | (1) |
|
Inserts, Updates, and Batch Size |
|
|
251 | (1) |
|
|
251 | (1) |
|
Message and Stream Ingestion Considerations |
|
|
252 | (1) |
|
|
252 | (1) |
|
|
252 | (1) |
|
Ordering and Multiple Delivery |
|
|
252 | (1) |
|
|
253 | (1) |
|
|
253 | (1) |
|
|
253 | (1) |
|
Error Handling and Dead-Letter Queues |
|
|
253 | (1) |
|
|
254 | (1) |
|
|
254 | (1) |
|
|
254 | (1) |
|
Direct Database Connection |
|
|
255 | (1) |
|
|
256 | (2) |
|
|
258 | (1) |
|
Message Queues and Event-Streaming Platforms |
|
|
259 | (1) |
|
|
260 | (1) |
|
Moving Data with Object Storage |
|
|
261 | (1) |
|
|
261 | (1) |
|
Databases and File Export |
|
|
261 | (1) |
|
Practical Issues with Common File Formats |
|
|
262 | (1) |
|
|
262 | (1) |
|
|
263 | (1) |
|
|
263 | (1) |
|
|
263 | (1) |
|
|
264 | (1) |
|
|
264 | (1) |
|
Transfer Appliances for Data Migration |
|
|
265 | (1) |
|
|
266 | (1) |
|
|
266 | (1) |
|
|
266 | (1) |
|
|
267 | (1) |
|
|
267 | (1) |
|
|
268 | (1) |
|
|
268 | (2) |
|
|
270 | (2) |
|
|
272 | (1) |
|
|
272 | (1) |
|
|
272 | (1) |
|
|
273 | (2) |
|
8 Queries, Modeling, and Transformation |
|
|
275 | (66) |
|
|
276 | (1) |
|
|
277 | (1) |
|
|
278 | (1) |
|
|
279 | (1) |
|
Improving Query Performance |
|
|
279 | (6) |
|
Queries on Streaming Data |
|
|
285 | (6) |
|
|
291 | (1) |
|
|
292 | (1) |
|
Conceptual, Logical, and Physical Data Models |
|
|
293 | (1) |
|
|
294 | (4) |
|
Techniques for Modeling Batch Analytical Data |
|
|
298 | (13) |
|
|
311 | (2) |
|
|
313 | (1) |
|
|
314 | (13) |
|
Materialized Views, Federation, and Query Virtualization |
|
|
327 | (3) |
|
Streaming Transformations and Processing |
|
|
330 | (3) |
|
|
333 | (1) |
|
|
333 | (1) |
|
|
334 | (1) |
|
|
334 | (1) |
|
|
334 | (1) |
|
|
335 | (1) |
|
|
336 | (1) |
|
|
337 | (1) |
|
|
337 | (1) |
|
|
337 | (1) |
|
|
338 | (1) |
|
|
339 | (2) |
|
9 Serving Data for Analytics, Machine Learning, and Reverse ETL |
|
|
341 | (32) |
|
General Considerations for Serving Data |
|
|
342 | (1) |
|
|
342 | (1) |
|
What's the Use Case, and Who's the User? |
|
|
343 | (1) |
|
|
344 | (1) |
|
|
345 | (1) |
|
Data Definitions and Logic |
|
|
346 | (1) |
|
|
347 | (1) |
|
|
348 | (1) |
|
|
348 | (2) |
|
|
350 | (2) |
|
|
352 | (1) |
|
|
353 | (1) |
|
What a Data Engineer Should Know About ML |
|
|
354 | (1) |
|
Ways to Serve Data for Analytics and ML |
|
|
355 | (1) |
|
|
355 | (1) |
|
|
356 | (2) |
|
|
358 | (1) |
|
|
358 | (1) |
|
|
359 | (1) |
|
Semantic and Metrics Layers |
|
|
359 | (1) |
|
Serving Data in Notebooks |
|
|
360 | (2) |
|
|
362 | (2) |
|
|
364 | (1) |
|
|
364 | (1) |
|
|
365 | (1) |
|
|
366 | (1) |
|
|
366 | (1) |
|
|
367 | (1) |
|
|
367 | (1) |
|
|
368 | (1) |
|
|
369 | (1) |
|
|
369 | (4) |
|
Part III Security, Privacy, and the Future of Data Engineering |
|
|
|
|
373 | (10) |
|
|
374 | (1) |
|
The Power of Negative Thinking |
|
|
374 | (1) |
|
|
374 | (1) |
|
|
375 | (1) |
|
Security Theater Versus Security Habit |
|
|
375 | (1) |
|
|
375 | (1) |
|
The Principle of Least Privilege |
|
|
376 | (1) |
|
Shared Responsibility in the Cloud |
|
|
376 | (1) |
|
|
376 | (1) |
|
An Example Security Policy |
|
|
377 | (1) |
|
|
378 | (1) |
|
|
378 | (1) |
|
|
379 | (1) |
|
Logging, Monitoring, and Alerting |
|
|
379 | (1) |
|
|
380 | (1) |
|
Security for Low-Level Data Engineering |
|
|
381 | (1) |
|
|
382 | (1) |
|
|
382 | (1) |
|
11 The Future of Data Engineering |
|
|
383 | (12) |
|
The Data Engineering Lifecycle Isn't Going Away |
|
|
384 | (1) |
|
The Decline of Complexity and the Rise of Easy-to-Use Data Tools |
|
|
384 | (1) |
|
The Cloud-Scale Data OS and Improved Interoperability |
|
|
385 | (2) |
|
"Enterprisey" Data Engineering |
|
|
387 | (1) |
|
Titles and Responsibilities Will Morph... |
|
|
388 | (1) |
|
Moving Beyond the Modern Data Stack, Toward the Live Data Stack |
|
|
389 | (1) |
|
|
389 | (1) |
|
Streaming Pipelines and Real-Time Analytical Databases |
|
|
390 | (1) |
|
The Fusion of Data with Applications |
|
|
391 | (1) |
|
The Tight Feedback Between Applications and ML |
|
|
392 | (1) |
|
Dark Matter Data and the Rise of...Spreadsheets?! |
|
|
392 | (1) |
|
|
393 | (2) |
Appendix A Serialization and Compression Technical Details |
|
395 | (8) |
Appendix B Cloud Networking |
|
403 | (4) |
Index |
|
407 | |