Preface |
|
xi | |
Acknowledgments |
|
xiii | |
About this book |
|
xv | |
About the authors |
|
xviii | |
About the cover illustration |
|
xix | |
|
1 Introducing the data platform |
|
|
1 | (17) |
|
1.1 The trends behind the change from data warehouses to data platforms |
|
|
2 | (1) |
|
1.2 Data warehouses struggle with data variety, volume, and velocity |
|
|
3 | (3) |
|
|
4 | (1) |
|
|
5 | (1) |
|
|
5 | (1) |
|
|
6 | (1) |
|
1.3 Data lakes to the rescue? |
|
|
6 | (1) |
|
|
7 | (2) |
|
1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms |
|
|
9 | (1) |
|
1.6 Building blocks of a cloud data platform |
|
|
10 | (4) |
|
|
10 | (1) |
|
|
11 | (1) |
|
|
12 | (1) |
|
|
13 | (1) |
|
1.7 How the cloud data platform deals with the three V's |
|
|
14 | (2) |
|
|
14 | (1) |
|
|
15 | (1) |
|
|
15 | (1) |
|
|
16 | (1) |
|
|
16 | (2) |
|
2 Why a data platform and not just a data warehouse |
|
|
18 | (19) |
|
2.1 Cloud data platforms and cloud data warehouses: The practical aspects |
|
|
19 | (5) |
|
A closer look at the data sources |
|
|
20 | (2) |
|
An example cloud data warehouse-only architecture |
|
|
22 | (1) |
|
An example cloud data platform architecture |
|
|
23 | (1) |
|
|
24 | (4) |
|
Ingesting data directly into Azure Synapse |
|
|
25 | (1) |
|
Ingesting data into an Azure data platform |
|
|
26 | (1) |
|
Managing changes in upstream data sources |
|
|
26 | (2) |
|
|
28 | (5) |
|
Processing data in the warehouse |
|
|
29 | (2) |
|
Processing data in the data platform |
|
|
31 | (2) |
|
|
33 | (1) |
|
2.5 Cloud cost considerations |
|
|
34 | (2) |
|
|
36 | (1) |
|
3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google |
|
|
37 | (41) |
|
3.1 Cloud data platform layered architecture |
|
|
38 | (21) |
|
|
40 | (4) |
|
|
44 | (2) |
|
|
46 | (1) |
|
|
47 | (2) |
|
The serving layer and data consumers |
|
|
49 | (4) |
|
Orchestration and ETL overlay layers |
|
|
53 | (6) |
|
3.2 The importance of layers in a data platform architecture |
|
|
59 | (1) |
|
3.3 Mapping cloud data platform layers to specific tools |
|
|
60 | (14) |
|
|
62 | (4) |
|
|
66 | (4) |
|
|
70 | (4) |
|
3.4 Open source and commercial alternatives |
|
|
74 | (3) |
|
|
74 | (1) |
|
Streaming data ingestion and real-time analytics |
|
|
75 | (1) |
|
|
75 | (2) |
|
|
77 | (1) |
|
4 Getting data into the platform |
|
|
78 | (49) |
|
4.1 Databases, files, APIs, and streams |
|
|
79 | (4) |
|
|
80 | (1) |
|
|
81 | (1) |
|
|
82 | (1) |
|
|
82 | (1) |
|
4.2 Ingesting data from relational databases |
|
|
83 | (24) |
|
Ingesting data from RDBMSs using a SQL interface |
|
|
84 | (2) |
|
|
86 | (5) |
|
Incremental table ingestion |
|
|
91 | (3) |
|
Change data capture (CDC) |
|
|
94 | (4) |
|
|
98 | (2) |
|
|
100 | (3) |
|
Ingesting data from NoSQL databases |
|
|
103 | (1) |
|
Capturing important metadata for RDBMS or NoSQL ingestion pipelines |
|
|
104 | (3) |
|
4.3 Ingesting data from files |
|
|
107 | (7) |
|
|
109 | (3) |
|
Capturing file ingestion metadata |
|
|
112 | (2) |
|
4.4 Ingesting data from streams |
|
|
114 | (6) |
|
Differences between batch and streaming ingestion |
|
|
117 | (2) |
|
Capturing streaming pipeline metadata |
|
|
119 | (1) |
|
4.5 Ingesting data from SaaS applications |
|
|
120 | (3) |
|
No standard approach to API design |
|
|
121 | (1) |
|
No standard way to deal with full vs. incremental data exports |
|
|
122 | (1) |
|
Resulting data is typically highly nested JSON |
|
|
122 | (1) |
|
4.6 Network and security considerations for data ingestion into the cloud |
|
|
123 | (3) |
|
Connecting other networks to your cloud data platform |
|
|
123 | (3) |
|
|
126 | (1) |
|
5 Organizing and processing data |
|
|
127 | (29) |
|
5.1 Processing as a separate layer in the data platform |
|
|
129 | (2) |
|
5.2 Data processing stages |
|
|
131 | (1) |
|
5.3 Organizing your cloud storage |
|
|
132 | (8) |
|
Cloud storage containers and folders |
|
|
134 | (6) |
|
5.4 Common data processing steps |
|
|
140 | (12) |
|
|
140 | (5) |
|
|
145 | (5) |
|
|
150 | (2) |
|
5.5 Configurable pipelines |
|
|
152 | (3) |
|
|
155 | (1) |
|
6 Real-time data processing and analytics |
|
|
156 | (41) |
|
6.1 Real-time ingestion vs. real-time processing |
|
|
157 | (3) |
|
6.2 Use cases for real-time data processing |
|
|
160 | (4) |
|
Retail use case: Real-time ingestion |
|
|
160 | (1) |
|
Online gaming use case: Real-time ingestion and real-time processing |
|
|
161 | (3) |
|
Summary of real-time ingestion vs. real-time processing |
|
|
164 | (1) |
|
6.3 When should you use real-time ingestion and/or real-time processing? |
|
|
164 | (3) |
|
6.4 Organizing data for real-time use |
|
|
167 | (11) |
|
The anatomy of fast storage |
|
|
167 | (3) |
|
How does fast storage scale? |
|
|
170 | (2) |
|
Organizing data in the real-time storage |
|
|
172 | (6) |
|
6.5 Common data transformations in real time |
|
|
178 | (12) |
|
Causes of duplicates in real-time systems |
|
|
178 | (3) |
|
Deduplicating data in real-time systems |
|
|
181 | (5) |
|
Converting message formats in real-time pipelines |
|
|
186 | (1) |
|
Real-time data quality checks |
|
|
187 | (1) |
|
Combining batch and real-time data |
|
|
188 | (2) |
|
6.6 Cloud services for real-time data processing |
|
|
190 | (5) |
|
AWS real-time processing services |
|
|
190 | (2) |
|
Google Cloud real-time processing services |
|
|
192 | (1) |
|
Azure real-time processing services |
|
|
193 | (2) |
|
|
195 | (2) |
|
7 Metadata layer architecture |
|
|
197 | (31) |
|
7.1 What we mean by metadata |
|
|
198 | (1) |
|
|
198 | (1) |
|
Data platform internal metadata or "pipeline metadata" |
|
|
199 | (1) |
|
7.2 Taking advantage of pipeline metadata |
|
|
199 | (4) |
|
|
203 | (10) |
|
|
204 | (9) |
|
7.4 Metadata layer implementation options |
|
|
213 | (7) |
|
Metadata layer as a collection of configuration files |
|
|
214 | (3) |
|
|
217 | (1) |
|
|
218 | (2) |
|
7.5 Overview of existing solutions |
|
|
220 | (7) |
|
|
221 | (2) |
|
Open source metadata layer implementations |
|
|
223 | (4) |
|
|
227 | (1) |
|
|
228 | (33) |
|
8.1 Why schema management |
|
|
229 | (3) |
|
Schema changes in a traditional data warehouse architecture |
|
|
230 | (1) |
|
|
231 | (1) |
|
8.2 Schema-management approaches |
|
|
232 | (11) |
|
|
233 | (2) |
|
Schema management in the data platform |
|
|
235 | (6) |
|
Monitoring schema changes |
|
|
241 | (2) |
|
8.3 Schema Registry Implementation |
|
|
243 | (5) |
|
|
243 | (2) |
|
Existing Schema Registry implementations |
|
|
245 | (1) |
|
Schema Registry as part of a Metadata layer |
|
|
246 | (2) |
|
8.4 Schema evolution scenarios |
|
|
248 | (7) |
|
Schema compatibility rules |
|
|
249 | (2) |
|
Schema evolution and data transformation pipelines |
|
|
251 | (4) |
|
8.5 Schema evolution and data warehouses |
|
|
255 | (5) |
|
Schema-management features of cloud data warehouses |
|
|
257 | (3) |
|
|
260 | (1) |
|
9 Data access and security |
|
|
261 | (28) |
|
9.1 Different types of data consumers |
|
|
262 | (1) |
|
9.2 Cloud data warehouses |
|
|
263 | (11) |
|
|
264 | (4) |
|
|
268 | (2) |
|
|
270 | (3) |
|
Choosing the right data warehouse |
|
|
273 | (1) |
|
9.3 Application data access |
|
|
274 | (4) |
|
Cloud relational databases |
|
|
275 | (1) |
|
Cloud key/value data stores |
|
|
276 | (1) |
|
Full-text search services |
|
|
277 | (1) |
|
|
278 | (1) |
|
9.4 Machine learning on the data platform |
|
|
278 | (5) |
|
Machine learning model lifecycle on a cloud data platform |
|
|
279 | (3) |
|
ML cloud collaboration tools |
|
|
282 | (1) |
|
9.5 Business intelligence and reporting tools |
|
|
283 | (2) |
|
Traditional BI tools and cloud data platform integration |
|
|
283 | (1) |
|
|
284 | (1) |
|
BI tools that are external to the cloud provider |
|
|
284 | (1) |
|
|
285 | (3) |
|
|
285 | (1) |
|
Credentials and configuration management |
|
|
286 | (1) |
|
|
286 | (1) |
|
|
287 | (1) |
|
|
288 | (1) |
|
10 Fueling business value with data platforms |
|
|
289 | (15) |
|
10.1 Why you need a data strategy |
|
|
290 | (1) |
|
10.2 The analytics maturity journey |
|
|
291 | (5) |
|
SEE: Getting insights from data |
|
|
292 | (1) |
|
PREDICT: Using data to predict what to do |
|
|
293 | (1) |
|
DO: Making your analytics actionable |
|
|
294 | (1) |
|
CREATE: Going beyond analytics into products |
|
|
295 | (1) |
|
10.3 The data platform: The engine that powers analytics maturity |
|
|
296 | (1) |
|
10.4 Platform project stoppers |
|
|
297 | (7) |
|
|
297 | (1) |
|
|
298 | (1) |
|
User trust and the need for data governance |
|
|
299 | (1) |
|
Operating in a platform silo |
|
|
300 | (1) |
|
|
301 | (3) |
Index |
|
304 | |