Foreword  xi
Preface  xiii
Acknowledgments  xvii
About This Book  xix
About The Author  xxv
About The Cover Illustration  xxvi
|
PART 1  THE THEORY CRIPPLED BY AWESOME EXAMPLES  1
|
1  So, what is Spark, anyway?  3
1.1  The big picture: What Spark is and what it does  4
      What is Spark?  4
      The four pillars of mana  6
1.2  How can you use Spark?  8
      Spark in a data processing/engineering scenario  8
      Spark in a data science scenario  9
1.3  What can you do with Spark?  10
      Spark predicts restaurant quality at NC eateries  11
      Spark allows fast data transfer for Lumeris  11
      Spark analyzes equipment logs for CERN  12
      Spark is used in many other dynamic industries  12
1.4  Why you will love the dataframe  12
      The dataframe from a Java perspective  13
      The dataframe from an RDBMS perspective  13
      A graphical representation of the dataframe  14
1.5  Your first example  14
      Recommended software  15
      Downloading the code  15
      Running your first application  15
      Your first code  17
|
|
2  Architecture and flow  19
2.1  Building your mental model  20
2.2  Using Java code to build your mental model  21
2.3  Walking through your application  23
      Connecting to a master  24
      Loading, or ingesting, the CSV file  25
      Transforming your data  28
      Saving the work done in your dataframe to a database  29
|
3  The majestic role of the dataframe  33
3.1  The essential role of the dataframe in Spark  34
      Organization of a dataframe  35
      Immutability is not a swear word  36
3.2  Using dataframes through examples  37
      A dataframe after a simple CSV ingestion  39
      Data is stored in partitions  44
      Digging in the schema  45
      A dataframe after a JSON ingestion  46
      Combining two dataframes  52
3.3  The dataframe is a Dataset<Row>  57
      Reusing your POJOs  58
      Creating a dataset of strings  59
      Converting back and forth  60
3.4  Dataframe's ancestor: the RDD  66
|
|
4  Fundamentally lazy  68
4.1  A real-life example of efficient laziness  69
4.2  A Spark example of efficient laziness  70
      Looking at the results of transformations and actions  70
      The transformation process, step by step  72
      The code behind the transformation/action process  74
      The mystery behind the creation of 7 million datapoints in 182 ms  77
      The mystery behind the timing of actions  79
4.3  Comparing to RDBMS and traditional applications  83
      Working with the teen birth rates dataset  83
      Analyzing differences between a traditional app and a Spark app  84
4.4  Spark is amazing for data-focused applications  86
4.5  Catalyst is your app catalyzer  86
|
5  Building a simple app for deployment  90
5.1  An ingestionless example  91
      Calculating π  91
      The code to approximate π  93
      What are lambda functions in Java?  99
      Approximating π by using lambda functions  101
5.2  Interacting with Spark  102
      Local mode  103
      Cluster mode  104
      Interactive mode in Scala and Python  107
|
6  Deploying your simple app  114
6.1  Beyond the example: The role of the components  116
      Quick overview of the components and their interactions  116
      Troubleshooting tips for the Spark architecture  120
|
|
|
|
6.2  Building a cluster  121
      Building a cluster that works for you  122
      Setting up the environment  123
6.3  Building your application to run on the cluster  126
      Building your application's uber JAR  127
      Building your application by using Git and Maven  129
6.4  Running your application on the cluster  132
      Submitting the uber JAR  132
      Running the application  133
      Analyzing the Spark user interface  133
|
|
PART 2  INGESTION  137

7  Ingestion from files  139
7.1  Common behaviors of parsers  141
7.2  Complex ingestion from CSV  141
      Desired output  142
      Code  143
7.3  Ingesting a CSV with a known schema  144
      Desired output  145
      Code  145
7.4  Ingesting a JSON file  146
      Desired output  148
      Code  149
7.5  Ingesting a multiline JSON file  150
      Desired output  151
      Code  152
7.6  Ingesting an XML file  153
      Desired output  155
      Code  155
7.7  Ingesting a text file  157
      Desired output  158
      Code  158
7.8  File formats for big data  159
      The problem with traditional file formats  159
      Avro is a schema-based serialization format  160
      ORC is a columnar storage format  161
      Parquet is also a columnar storage format  161
      Comparing Avro, ORC, and Parquet  161
7.9  Ingesting Avro, ORC, and Parquet files  162
      Ingesting Avro  162
      Ingesting ORC  164
      Ingesting Parquet  165
      Reference table for ingesting Avro, ORC, or Parquet  167
|
8  Ingestion from databases  168
8.1  Ingestion from relational databases  169
      Database connection checklist  170
      Understanding the data used in the examples  170
      Desired output  172
      Code  173
      Alternative code  175
8.2  The role of the dialect  176
      What is a dialect, anyway?  177
      JDBC dialects provided with Spark  177
      Building your own dialect  177
8.3  Advanced queries and ingestion  180
      Filtering by using a WHERE clause  180
      Joining data in the database  183
      Performing ingestion and partitioning  185
      Summary of advanced features  188
8.4  Ingestion from Elasticsearch  188
|
|
|
      The New York restaurants dataset digested by Spark  189
      Code to ingest the restaurant dataset from Elasticsearch  191
|
9  Advanced ingestion: finding data sources and building your own  194
9.1  What is a data source?  196
9.2  Benefits of a direct connection to a data source  197
|
|
      Temporary files  198
      Data quality scripts  198
      Data on demand  199
|
9.3  Finding data sources at Spark Packages  199
9.4  Building your own data source  199
      Scope of the example project  200
      Your data source API and options  202
9.5  Behind the scenes: Building the data source itself  203
9.6  Using the register file and the advertiser class  204
9.7  Understanding the relationship between the data and schema  207
      The data source builds the relation  207
|
|
      Inside the relation  210
|
9.8  Building the schema from a JavaBean  213
9.9  Building the dataframe is magic with the utilities  215
|
|
9.10  The other classes  220
|
10  Ingestion through structured streaming  222
10.1  What is streaming?  224
10.2  Creating your first stream  225
      Generating a file stream  226
      Consuming the records  229
      Getting records, not lines  234
10.3  Ingesting data from network streams  235
10.4  Dealing with multiple streams  237
10.5  Differentiating discretized and structured streaming  242
|
PART 3  TRANSFORMING YOUR DATA  245

11  Working with SQL  247
11.1  Working with Spark SQL  248
11.2  The difference between local and global views  251
11.3  Mixing the dataframe API and Spark SQL  253
11.4  Don't DELETE it!  256
11.5  Going further with SQL  258
|
12  Transforming your data  260
12.1  What is data transformation?  261
12.2  Process and example of record-level transformation  262
      Data discovery to understand the complexity  264
      Data mapping to draw the process  265
      Writing the transformation code  268
      Reviewing your data transformation to ensure a quality process  274
|
|
|
      Wrapping up your first Spark transformation  275
12.3  Joining datasets  276
      A closer look at the datasets to join  276
      Building the list of higher education institutions per county  278
      Performing the joins  283
12.4  Performing more transformations  289
|
13  Transforming entire documents  291
13.1  Transforming entire documents and their structure  292
      Flattening your JSON document  293
      Building nested documents for transfer and storage  298
13.2  The magic behind static functions  301
13.3  Performing more transformations  302
|
|
|
14  Extending transformations with user-defined functions  304
14.1  Extending Apache Spark  305
14.2  Registering and calling a UDF  306
      Registering the UDF with Spark  309
      Using the UDF with the dataframe API  310
      Manipulating UDFs with SQL  312
|
|
|
      Writing the service itself  314
14.3  Using UDFs to ensure a high level of data quality  316
14.4  Considering UDFs' constraints  318
|
|
15  Aggregating your data  320
15.1  Aggregating data with Spark  321
      A quick reminder on aggregations  321
      Performing basic aggregations with Spark  324
15.2  Performing aggregations with live data  327
      Preparing your dataset  327
      Aggregating data to better understand the schools  332
15.3  Building custom aggregations with UDAFs  338
|
|
PART 4  GOING FURTHER  345

16  Cache and checkpoint: Enhancing Spark's performances  347
16.1  Caching and checkpointing can increase performance  348
      The usefulness of Spark caching  350
      The subtle effectiveness of Spark checkpointing  351
      Using caching and checkpointing  352
|
|
16.2  Caching and checkpointing in action  361
|
16.3  Going further in performance optimization  371
|
17  Exporting data and building full data pipelines  373
17.1  Exporting data  374
      Building a pipeline with NASA datasets  374
      Transforming columns to datetime  378
      Transforming the confidence percentage to confidence level  379
      Exporting the data  379
      Exporting the data: What really happened?  382
17.2  Delta Lake: Enjoying a database close to your system  383
      Understanding why a database is needed  384
      Using Delta Lake in your data pipeline  385
      Consuming data from Delta Lake  389
17.3  Accessing cloud storage services from Spark  392
|
18  Exploring deployment constraints: Understanding the ecosystem  395
18.1  Managing resources with YARN, Mesos, and Kubernetes  396
      The built-in standalone mode manages resources  397
      YARN manages resources in a Hadoop environment  398
      Mesos is a standalone resource manager  399
      Kubernetes orchestrates containers  401
      Choosing the right resource manager  402
18.2  Sharing files with Spark  403
      Accessing the data contained in files  404
      Sharing files through distributed filesystems  404
      Accessing files on shared drives or a file server  405
      Using file-sharing services to distribute files  406
      Other options for accessing files in Spark  407
      Hybrid solution for sharing files with Spark  408
18.3  Making sure your Spark application is secure  408
      Securing the network components of your infrastructure  408
      Securing Spark's disk usage  409

Appendix A  Installing Eclipse  411
Appendix B  Installing Maven  418
Appendix C  Installing Git  422
Appendix D  Downloading the code and getting started with Eclipse  424
Appendix E  A history of enterprise data  430
Appendix F  Getting help with relational databases  434
Appendix G  Static functions ease your transformations  438
Appendix H  Maven quick cheat sheet  446
Appendix I  Reference for transformations and actions  450
Appendix J  Enough Scala  460
Appendix K  Installing Spark in production and a few tips  462
Appendix L  Reference for ingestion  476
Appendix M  Reference for joins  488
Appendix N  Installing Elasticsearch and sample data  499
Appendix O  Generating streaming data  505
Appendix P  Reference for streaming  510
Appendix Q  Reference for exporting data  520
Appendix R  Finding help when you're stuck  528

Index  533