About the Author | xi
About the Technical Reviewers | xiii
Acknowledgments | xv
Introduction | xvii
Chapter 1 Introduction to Apache Spark | 1 | (16)
| 1 | (1)
| 2 | (1)
Spark Core Concepts and Architecture | 3 | (7)
Spark Cluster and Resource Management System | 4 | (1)
| 4 | (1)
Spark Drivers and Executors | 5 | (1)
| 6 | (4)
| 10 | (1)
Adaptive Query Execution Framework | 11 | (1)
Dynamic Partition Pruning (DPP) | 11 | (1)
Accelerator-aware Scheduler | 11 | (1)
Apache Spark Applications | 11 | (1)
Spark Example Applications | 12 | (1)
| 13 | (1)
| 13 | (1)
| 13 | (1)
| 14 | (1)
| 14 | (3)
Chapter 2 Working with Apache Spark | 17 | (34)
Downloading and Installation | 17 | (4)
| 17 | (1)
| 18 | (3)
Having Fun with the Spark Scala Shell | 21 | (11)
Useful Spark Scala Shell Command and Tips | 21 | (3)
Basic Interactions with Scala and Spark | 24 | (8)
Introduction to Collaborative Notebooks | 32 | (15)
| 35 | (3)
| 38 | (2)
| 40 | (7)
Setting up Spark Source Code | 47 | (1)
| 48 | (3)
Chapter 3 Spark SQL: Foundation | 51 | (60)
| 52 | (1)
Introduction to the DataFrame API | 53 | (1)
| 54 | (40)
Creating a DataFrame from RDD | 54 | (3)
Creating a DataFrame from a Range of Numbers | 57 | (3)
Creating a DataFrame from Data Sources | 60 | (14)
Working with Structured Operations | 74 | (20)
| 94 | (5)
| 96 | (1)
| 97 | (2)
| 99 | (4)
| 99 | (4)
Writing Data Out to Storage Systems | 103 | (3)
The Trio: DataFrame, Dataset, and SQL | 106 | (1)
| 107 | (1)
| 108 | (3)
Chapter 4 Spark SQL: Advanced | 111 | (72)
| 111 | (17)
| 112 | (9)
Aggregation with Grouping | 121 | (4)
Aggregation with Pivoting | 125 | (3)
| 128 | (14)
Join Expression and Join Types | 128 | (2)
| 130 | (7)
Dealing with Duplicate Column Names | 137 | (2)
Overview of Join Implementation | 139 | (3)
| 142 | (18)
Working with Built-in Functions | 142 | (16)
Working with User-Defined Functions (UDFs) | 158 | (2)
Advanced Analytics Functions | 160 | (15)
Aggregation with Rollups and Cubes | 160 | (1)
| 161 | (2)
| 163 | (12)
Exploring Catalyst Optimizer | 175 | (7)
| 175 | (1)
| 176 | (1)
| 176 | (4)
| 180 | (2)
| 182 | (1)
Chapter 5 Optimizing Spark Applications | 183 | (38)
Common Performance Issues | 183 | (10)
| 184 | (3)
| 187 | (6)
Leverage In-Memory Computation | 193 | (5)
When to Persist and Cache Data | 193 | (1)
Persistence and Caching APIs | 193 | (2)
Persistence and Caching Example | 195 | (3)
Understanding Spark Joins | 198 | (6)
| 199 | (2)
| 201 | (3)
| 204 | (14)
Dynamically Coalescing Shuffle Partitions | 206 | (5)
Dynamically Switching Join Strategies | 211 | (2)
Dynamically Optimizing Skew Joins | 213 | (5)
| 218 | (3)
Chapter 6 Spark Streaming | 221 | (66)
| 222 | (8)
| 224 | (4)
Stream Processing Engine Landscape | 228 | (2)
| 230 | (1)
| 231 | (53)
Spark Structured Streaming | 234 | (1)
| 234 | (2)
| 236 | (6)
Structured Streaming Applications | 242 | (7)
Streaming DataFrame Operations | 249 | (3)
Working with Data Sources | 252 | (12)
| 264 | (10)
| 274 | (5)
| 279 | (5)
| 284 | (3)
Chapter 7 Advanced Spark Streaming | 287 | (44)
| 287 | (13)
Fixed Window Aggregation over an Event Time | 289 | (2)
Sliding Window Aggregation over Event Time | 291 | (4)
| 295 | (1)
Watermarking: Limit State and Handle Late Data | 296 | (4)
Arbitrary Stateful Processing | 300 | (16)
Arbitrary Stateful Processing with Structured Streaming | 300 | (3)
| 303 | (1)
Arbitrary State Processing in Action | 304 | (12)
| 316 | (2)
| 318 | (2)
Streaming Application Code Change | 319 | (1)
| 320 | (1)
Streaming Query Metrics and Monitoring | 320 | (8)
| 320 | (3)
Monitoring Streaming Queries via Callback | 323 | (1)
Monitoring Streaming Queries via Visualization UI | 324 | (1)
Streaming Query Summary Information | 325 | (1)
Streaming Query Detailed Statistics Information | 326 | (1)
Troubleshooting Streaming Query | 327 | (1)
| 328 | (3)
Chapter 8 Machine Learning with Spark | 331 | (64)
Machine Learning Overview | 332 | (9)
Machine Learning Terminologies | 333 | (2)
| 335 | (4)
Machine Learning Development Process | 339 | (2)
Spark Machine Learning Library | 341 | (34)
Machine Learning Pipelines | 341 | (34)
Machine Learning Tasks in Action | 375 | (16)
| 375 | (4)
| 379 | (3)
| 382 | (9)
| 391 | (1)
| 392 | (3)
Chapter 9 Managing the Machine Learning Life Cycle | 395 | (36)
| 396 | (2)
| 396 | (2)
| 398 | (29)
| 399 | (1)
| 400 | (27)
Model Deployment and Prediction | 427 | (1)
| 428 | (3)
Index | 431