Fundamentals of Data Engineering: Plan and Build Robust Data Systems [Paperback]

4.19/5 (1509 ratings from Goodreads)
Contributions by Joe Reis, Matt Housley
  • Format: Paperback / softback, 446 pages
  • Publication date: 05-Jul-2022
  • Publisher: O'Reilly Media
  • ISBN-10: 1098108302
  • ISBN-13: 9781098108304
  • Paperback
  • Price: 75,81 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 89,19 €
  • You save 15%
  • Delivery from the publisher takes approximately 2-4 weeks
  • Free shipping
  • Order time: 2-4 weeks
Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you will learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available in the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You will understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, governance, and deployment that are critical in any data environment regardless of the underlying technology.

This book will help you:
  • Assess data engineering problems using an end-to-end data framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle

Preface
Part I Foundation and Building Blocks
1 Data Engineering Described
What Is Data Engineering?
Data Engineering Defined
The Data Engineering Lifecycle
Evolution of the Data Engineer
Data Engineering and Data Science
Data Engineering Skills and Activities
Data Maturity and the Data Engineer
The Background and Skills of a Data Engineer
Business Responsibilities
Technical Responsibilities
The Continuum of Data Engineering Roles, from A to B
Data Engineers Inside an Organization
Internal-Facing Versus External-Facing Data Engineers
Data Engineers and Other Technical Roles
Data Engineers and Business Leadership
Conclusion
Additional Resources
2 The Data Engineering Lifecycle
What Is the Data Engineering Lifecycle?
The Data Lifecycle Versus the Data Engineering Lifecycle
Generation: Source Systems
Storage
Ingestion
Transformation
Serving Data
Major Undercurrents Across the Data Engineering Lifecycle
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
3 Designing Good Data Architecture
What Is Data Architecture?
Enterprise Architecture Defined
Data Architecture Defined
"Good" Data Architecture
Principles of Good Data Architecture
Principle 1: Choose Common Components Wisely
Principle 2: Plan for Failure
Principle 3: Architect for Scalability
Principle 4: Architecture Is Leadership
Principle 5: Always Be Architecting
Principle 6: Build Loosely Coupled Systems
Principle 7: Make Reversible Decisions
Principle 8: Prioritize Security
Principle 9: Embrace FinOps
Major Architecture Concepts
Domains and Services
Distributed Systems, Scalability, and Designing for Failure
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
User Access: Single Versus Multitenant
Event-Driven Architecture
Brownfield Versus Greenfield Projects
Examples and Types of Data Architecture
Data Warehouse
Data Lake
Convergence, Next-Generation Data Lakes, and the Data Platform
Modern Data Stack
Lambda Architecture
Kappa Architecture
The Dataflow Model and Unified Batch and Streaming
Architecture for IoT
Data Mesh
Other Data Architecture Examples
Who's Involved with Designing a Data Architecture?
Conclusion
Additional Resources
4 Choosing Technologies Across the Data Engineering Lifecycle
Team Size and Capabilities
Speed to Market
Interoperability
Cost Optimization and Business Value
Total Cost of Ownership
Total Opportunity Cost of Ownership
FinOps
Today Versus the Future: Immutable Versus Transitory Technologies
Our Advice
Location
On Premises
Cloud
Hybrid Cloud
Multicloud
Decentralized: Blockchain and the Edge
Our Advice
Cloud Repatriation Arguments
Build Versus Buy
Open Source Software
Proprietary Walled Gardens
Our Advice
Monolith Versus Modular
Monolith
Modularity
The Distributed Monolith Pattern
Our Advice
Serverless Versus Servers
Serverless
Containers
How to Evaluate Server Versus Serverless
Our Advice
Optimization, Performance, and the Benchmark Wars
Big Data...for the 1990s
Nonsensical Cost Comparisons
Asymmetric Optimization
Caveat Emptor
Undercurrents and Their Impacts on Choosing Technologies
Data Management
DataOps
Data Architecture
Orchestration Example: Airflow
Software Engineering
Conclusion
Additional Resources
Part II The Data Engineering Lifecycle in Depth
5 Data Generation in Source Systems
Sources of Data: How Is Data Created?
Source Systems: Main Ideas
Files and Unstructured Data
APIs
Application Databases (OLTP Systems)
Online Analytical Processing System
Change Data Capture
Logs
Database Logs
CRUD
Insert-Only
Messages and Streams
Types of Time
Source System Practical Details
Databases
APIs
Data Sharing
Third-Party Data Sources
Message Queues and Event-Streaming Platforms
Whom You'll Work With
Undercurrents and Their Impact on Source Systems
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
6 Storage
Raw Ingredients of Data Storage
Magnetic Disk Drive
Solid-State Drive
Random Access Memory
Networking and CPU
Serialization
Compression
Caching
Data Storage Systems
Single Machine Versus Distributed Storage
Eventual Versus Strong Consistency
File Storage
Block Storage
Object Storage
Cache and Memory-Based Storage Systems
The Hadoop Distributed File System
Streaming Storage
Indexes, Partitioning, and Clustering
Data Engineering Storage Abstractions
The Data Warehouse
The Data Lake
The Data Lakehouse
Data Platforms
Stream-to-Batch Storage Architecture
Big Ideas and Trends in Storage
Data Catalog
Data Sharing
Schema
Separation of Compute from Storage
Data Storage Lifecycle and Data Retention
Single-Tenant Versus Multitenant Storage
Whom You'll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
7 Ingestion
What Is Data Ingestion?
Key Engineering Considerations for the Ingestion Phase
Bounded Versus Unbounded Data
Frequency
Synchronous Versus Asynchronous Ingestion
Serialization and Deserialization
Throughput and Scalability
Reliability and Durability
Payload
Push Versus Pull Versus Poll Patterns
Batch Ingestion Considerations
Snapshot or Differential Extraction
File-Based Export and Ingestion
ETL Versus ELT
Inserts, Updates, and Batch Size
Data Migration
Message and Stream Ingestion Considerations
Schema Evolution
Late-Arriving Data
Ordering and Multiple Delivery
Replay
Time to Live
Message Size
Error Handling and Dead-Letter Queues
Consumer Pull and Push
Location
Ways to Ingest Data
Direct Database Connection
Change Data Capture
APIs
Message Queues and Event-Streaming Platforms
Managed Data Connectors
Moving Data with Object Storage
EDI
Databases and File Export
Practical Issues with Common File Formats
Shell
SSH
SFTP and SCP
Webhooks
Web Interface
Web Scraping
Transfer Appliances for Data Migration
Data Sharing
Whom You'll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Orchestration
Software Engineering
Conclusion
Additional Resources
8 Queries, Modeling, and Transformation
Queries
What Is a Query?
The Life of a Query
The Query Optimizer
Improving Query Performance
Queries on Streaming Data
Data Modeling
What Is a Data Model?
Conceptual, Logical, and Physical Data Models
Normalization
Techniques for Modeling Batch Analytical Data
Modeling Streaming Data
Transformations
Batch Transformations
Materialized Views, Federation, and Query Virtualization
Streaming Transformations and Processing
Whom You'll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
9 Serving Data for Analytics, Machine Learning, and Reverse ETL
General Considerations for Serving Data
Trust
What's the Use Case, and Who's the User?
Data Products
Self-Service or Not?
Data Definitions and Logic
Data Mesh
Analytics
Business Analytics
Operational Analytics
Embedded Analytics
Machine Learning
What a Data Engineer Should Know About ML
Ways to Serve Data for Analytics and ML
File Exchange
Databases
Streaming Systems
Query Federation
Data Sharing
Semantic and Metrics Layers
Serving Data in Notebooks
Reverse ETL
Whom You'll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
Part III Security, Privacy, and the Future of Data Engineering
10 Security and Privacy
People
The Power of Negative Thinking
Always Be Paranoid
Processes
Security Theater Versus Security Habit
Active Security
The Principle of Least Privilege
Shared Responsibility in the Cloud
Always Back Up Your Data
An Example Security Policy
Technology
Patch and Update Systems
Encryption
Logging, Monitoring, and Alerting
Network Access
Security for Low-Level Data Engineering
Conclusion
Additional Resources
11 The Future of Data Engineering
The Data Engineering Lifecycle Isn't Going Away
The Decline of Complexity and the Rise of Easy-to-Use Data Tools
The Cloud-Scale Data OS and Improved Interoperability
"Enterprisey" Data Engineering
Titles and Responsibilities Will Morph...
Moving Beyond the Modern Data Stack, Toward the Live Data Stack
The Live Data Stack
Streaming Pipelines and Real-Time Analytical Databases
The Fusion of Data with Applications
The Tight Feedback Between Applications and ML
Dark Matter Data and the Rise of...Spreadsheets?!
Conclusion
A Serialization and Compression Technical Details
B Cloud Networking
Index
Joe Reis is a business-minded data nerd who's worked in the data industry for 20 years, with responsibilities ranging from statistical modeling, forecasting, machine learning, data engineering, data architecture, and almost everything else in between. Joe is the CEO and cofounder of Ternary Data, a data engineering and architecture consulting firm based in Salt Lake City, Utah. In addition, he volunteers with several technology groups and teaches at the University of Utah. In his spare time, Joe likes to rock climb, produce electronic music, and take his kids on crazy adventures.

Matt Housley is a data engineering consultant and cloud specialist. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. He cofounded Ternary Data with Joe Reis, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. Matt and Joe also pontificate on all things data on The Monday Morning Data Chat.