About the Authors | xi | |
About the Technical Reviewer | xiii | |
Acknowledgments | xv | |
Foreword | xvii | |
Chapter 1 Introduction to Enterprise Data Lakes | 1 | (32) |
Data explosion: the beginning | 3 | (3) |
| 6 | (5) |
Hadoop and MapReduce - Early days | 7 | (1) |
| 8 | (3) |
| 11 | (2) |
| 12 | (1) |
| 13 | (6) |
| 15 | (1) |
Data Lake Characteristics | 16 | (3) |
Data lake vs. Data warehouse | 19 | (2) |
How to achieve success with Data Lake? | 21 | (4) |
Data governance and data operations | 22 | (3) |
Data democratization with data lake | 25 | (3) |
Fast Data - Life beyond Big Data | 28 | (2) |
| 30 | (3) |
Chapter 2 Data lake ingestion strategies | 33 | (54) |
| 34 | (13) |
Understand the data sources | 35 | (2) |
Structured vs. Semi-structured vs. Unstructured data | 37 | (2) |
Data ingestion framework parameters | 39 | (6) |
| 45 | (2) |
Big Data Integration with Data Lake | 47 | (4) |
Hadoop Distributed File System (HDFS) | 48 | (1) |
Copy files directly into HDFS | 49 | (1) |
| 49 | (2) |
Challenges and design considerations | 51 | (13) |
| 52 | (5) |
| 57 | (1) |
| 58 | (2) |
CDC design considerations | 60 | (1) |
Example of CDC pipeline: Databus, LinkedIn's open-source solution | 61 | (3) |
| 64 | (7) |
| 64 | (1) |
| 65 | (1) |
| 66 | (1) |
Sqoop design considerations | 67 | (4) |
Native ingestion utilities | 71 | (6) |
| 72 | (1) |
| 73 | (3) |
Data transfer from Greenplum to using gpfdist | 76 | (1) |
Ingest unstructured data into Hadoop | 77 | (8) |
| 77 | (2) |
Tiered architecture for convergent flow of events | 79 | (1) |
Features and design considerations | 80 | (5) |
| 85 | (2) |
Chapter 3 Capture Streaming Data with Change-Data-Capture | 87 | (38) |
Change Data Capture Concepts | 88 | (1) |
Strategies for Data Capture | 89 | (4) |
| 91 | (1) |
| 92 | (1) |
| 93 | (2) |
| 94 | (1) |
| 94 | (1) |
| 95 | (1) |
| 95 | (2) |
| 97 | (4) |
| 98 | (1) |
| 98 | (1) |
| 99 | (1) |
Centralization of Change Data | 100 | (1) |
Analyzing a Centralized Data Store | 101 | (14) |
Metadata: Data about Data | 102 | (2) |
| 104 | (1) |
Privacy/Sensitivity Information | 104 | (1) |
| 104 | (1) |
| 105 | (1) |
| 105 | (1) |
| 106 | (1) |
Consumption and Checkpointing | 107 | (1) |
Simple Checkpoint Mechanism | 107 | (1) |
| 107 | (1) |
Merging and Consolidation | 108 | (1) |
Design Considerations for Merge and Consolidate | 109 | (1) |
| 110 | (1) |
| 111 | (1) |
| 112 | (1) |
| 112 | (3) |
| 115 | (2) |
| 117 | (6) |
| 118 | (1) |
| 119 | (1) |
Multiple Topics and Partitioning | 120 | (1) |
| 121 | (1) |
| 122 | (1) |
| 123 | (2) |
Chapter 4 Data Processing Strategies in Data Lakes | 125 | (76) |
MapReduce Processing Framework | 126 | (15) |
Motivation: Why MapReduce? | 127 | (1) |
MapReduce V1 Refresher and Design Considerations | 128 | (8) |
Yet Another Resource Negotiator - YARN | 136 | (5) |
| 141 | (19) |
| 143 | (3) |
Hive Metastore (a.k.a. HCatalog) | 146 | (2) |
Hive - Design Considerations | 148 | (10) |
| 158 | (2) |
| 160 | (6) |
Pig Execution Architecture | 161 | (5) |
| 166 | (18) |
| 167 | (2) |
Resilient Distributed Datasets (RDD) | 169 | (2) |
| 171 | (3) |
| 174 | (1) |
| 175 | (3) |
Deployment Modes of Spark Application | 178 | (2) |
| 180 | (2) |
Caching and Persistence of an RDD in Spark | 182 | (1) |
| 183 | (1) |
| 184 | (15) |
| 186 | (8) |
| 194 | (3) |
| 197 | (2) |
| 199 | (2) |
Chapter 5 Data Archiving Strategies in Data Lakes | 201 | (24) |
The Act of Data Governance | 202 | (3) |
| 204 | (1) |
Introduction to Data Archival | 205 | (12) |
Data Lifecycle Management (DLM) | 208 | (2) |
| 210 | (1) |
| 211 | (2) |
DLM design considerations | 213 | (4) |
Amazon S3 and Glacier storage classes | 217 | (5) |
| 219 | (1) |
DLM Case Study - Archiving with Amazon | 220 | (2) |
| 222 | (3) |
Chapter 6 Data Security in Data Lakes | 225 | (36) |
| 226 | (6) |
| 227 | (3) |
Hadoop Roles within a cluster | 230 | (2) |
Host Firewalls for operating system security | 232 | (1) |
| 233 | (4) |
| 233 | (4) |
| 237 | (7) |
Procedure to generate and verify key in LUKS | 238 | (1) |
| 238 | (5) |
| 243 | (1) |
Multiple passphrases with LUKS | 243 | (1) |
| 244 | (12) |
Kerberos Protocol overview | 244 | (2) |
| 246 | (1) |
| 247 | (2) |
| 249 | (7) |
| 256 | (1) |
HDFS Authorization with Apache Ranger | 257 | (2) |
| 258 | (1) |
| 259 | (2) |
Chapter 7 Ensure High Availability of Data Lake | 261 | (36) |
Scale Hadoop through HDFS federation | 262 | (5) |
High availability of Hadoop components | 267 | (20) |
| 267 | (1) |
HiveServer2 and Zookeeper integration | 268 | (1) |
| 269 | (3) |
NameNode high availability | 272 | (1) |
| 273 | (7) |
Data Center disaster recovery strategies | 280 | (7) |
Data replication strategies | 287 | (8) |
Active-passive data center replication | 289 | (1) |
Active-active data center replication | 290 | (5) |
| 295 | (2) |
Chapter 8 Managing Data Lake Operations | 297 | (20) |
| 299 | (2) |
Hadoop metrics architecture | 300 | (1) |
Identification of source components | 301 | (6) |
| 301 | (1) |
| 302 | (1) |
| 302 | (1) |
| 303 | (2) |
| 305 | (2) |
Logs and Metrics visualization | 307 | (2) |
| 308 | (1) |
| 309 | (2) |
Data lake operationalization | 311 | (4) |
| 315 | (2) |
Index | 317 | |