Foreword |
|
xiii | |
Preface |
|
xv | |
|
1 Introduction to Apache Spark: A Unified Analytics Engine |
|
|
1 | (18) |
|
|
1 | (3) |
|
Big Data and Distributed Computing at Google |
|
|
1 | (1) |
|
|
2 | (1) |
|
Spark's Early Years at AMPLab |
|
|
3 | (1) |
|
|
4 | (2) |
|
|
4 | (1) |
|
|
5 | (1) |
|
|
5 | (1) |
|
|
5 | (1) |
|
|
6 | (8) |
|
Apache Spark Components as a Unified Stack |
|
|
6 | (4) |
|
Apache Spark's Distributed Execution |
|
|
10 | (4) |
|
The Developer's Experience |
|
|
14 | (5) |
|
Who Uses Spark, and for What? |
|
|
14 | (2) |
|
Community Adoption and Expansion |
|
|
16 | (3) |
|
2 Downloading Apache Spark and Getting Started |
|
|
19 | (24) |
|
Step 1: Downloading Apache Spark |
|
|
19 | (3) |
|
Spark's Directories and Files |
|
|
21 | (1) |
|
Step 2: Using the Scala or PySpark Shell |
|
|
22 | (3) |
|
|
23 | (2) |
|
Step 3: Understanding Spark Application Concepts |
|
|
25 | (3) |
|
Spark Application and SparkSession |
|
|
26 | (1) |
|
|
27 | (1) |
|
|
28 | (1) |
|
|
28 | (1) |
|
Transformations, Actions, and Lazy Evaluation |
|
|
28 | (3) |
|
Narrow and Wide Transformations |
|
|
30 | (1) |
|
|
31 | (3) |
|
Your First Standalone Application |
|
|
34 | (8) |
|
Counting M&Ms for the Cookie Monster |
|
|
35 | (5) |
|
Building Standalone Applications in Scala |
|
|
40 | (2) |
|
|
42 | (1) |
|
3 Apache Spark's Structured APIs |
|
|
43 | (40) |
|
Spark: What's Underneath an RDD? |
|
|
43 | (1) |
|
|
44 | (3) |
|
|
45 | (2) |
|
|
47 | (22) |
|
|
48 | (1) |
|
Spark's Structured and Complex Data Types |
|
|
49 | (1) |
|
Schemas and Creating DataFrames |
|
|
50 | (4) |
|
|
54 | (3) |
|
|
57 | (1) |
|
Common DataFrame Operations |
|
|
58 | (10) |
|
End-to-End DataFrame Example |
|
|
68 | (1) |
|
|
69 | (5) |
|
Typed Objects, Untyped Objects, and Generic Rows |
|
|
69 | (2) |
|
|
71 | (1) |
|
|
72 | (2) |
|
End-to-End Dataset Example |
|
|
74 | (1) |
|
DataFrames Versus Datasets |
|
|
74 | (2) |
|
|
75 | (1) |
|
Spark SQL and the Underlying Engine |
|
|
76 | (6) |
|
|
77 | (5) |
|
|
82 | (1) |
|
4 Spark SQL and DataFrames: Introduction to Built-in Data Sources |
|
|
83 | (30) |
|
Using Spark SQL in Spark Applications |
|
|
84 | (5) |
|
|
85 | (4) |
|
|
89 | (5) |
|
Managed Versus Unmanaged Tables |
|
|
89 | (1) |
|
Creating SQL Databases and Tables |
|
|
90 | (1) |
|
|
91 | (2) |
|
|
93 | (1) |
|
|
93 | (1) |
|
Reading Tables into DataFrames |
|
|
93 | (1) |
|
Data Sources for DataFrames and SQL Tables |
|
|
94 | (17) |
|
|
94 | (2) |
|
|
96 | (1) |
|
|
97 | (3) |
|
|
100 | (2) |
|
|
102 | (2) |
|
|
104 | (2) |
|
|
106 | (2) |
|
|
108 | (2) |
|
|
110 | (1) |
|
|
111 | (2) |
|
5 Spark SQL and DataFrames: Interacting with External Data Sources |
|
|
113 | (44) |
|
Spark SQL and Apache Hive |
|
|
113 | (6) |
|
|
114 | (5) |
|
Querying with the Spark SQL Shell, Beeline, and Tableau |
|
|
119 | (10) |
|
Using the Spark SQL Shell |
|
|
119 | (1) |
|
|
120 | (2) |
|
|
122 | (7) |
|
|
129 | (9) |
|
|
129 | (3) |
|
|
132 | (1) |
|
|
133 | (1) |
|
|
134 | (2) |
|
|
136 | (1) |
|
|
137 | (1) |
|
Higher-Order Functions in DataFrames and Spark SQL |
|
|
138 | (6) |
|
Option 1: Explode and Collect |
|
|
138 | (1) |
|
Option 2: User-Defined Function |
|
|
138 | (1) |
|
Built-in Functions for Complex Data Types |
|
|
139 | (2) |
|
|
141 | (3) |
|
Common DataFrames and Spark SQL Operations |
|
|
144 | (11) |
|
|
147 | (1) |
|
|
148 | (1) |
|
|
149 | (2) |
|
|
151 | (4) |
|
|
155 | (2) |
|
|
157 | (16) |
|
Single API for Java and Scala |
|
|
157 | (3) |
|
Scala Case Classes and JavaBeans for Datasets |
|
|
158 | (2) |
|
|
160 | (7) |
|
|
160 | (2) |
|
|
162 | (5) |
|
Memory Management for Datasets and DataFrames |
|
|
167 | (1) |
|
|
168 | (2) |
|
Spark's Internal Format Versus Java Object Format |
|
|
168 | (1) |
|
Serialization and Deserialization (SerDe) |
|
|
169 | (1) |
|
|
170 | (2) |
|
Strategies to Mitigate Costs |
|
|
170 | (2) |
|
|
172 | (1) |
|
7 Optimizing and Tuning Spark Applications |
|
|
173 | (34) |
|
Optimizing and Tuning Spark for Efficiency |
|
|
173 | (10) |
|
Viewing and Setting Apache Spark Configurations |
|
|
173 | (4) |
|
Scaling Spark for Large Workloads |
|
|
177 | (6) |
|
Caching and Persistence of Data |
|
|
183 | (4) |
|
|
183 | (1) |
|
|
184 | (3) |
|
When to Cache and Persist |
|
|
187 | (1) |
|
When Not to Cache and Persist |
|
|
187 | (1) |
|
|
187 | (10) |
|
|
188 | (1) |
|
|
189 | (8) |
|
|
197 | (8) |
|
Journey Through the Spark UI Tabs |
|
|
197 | (8) |
|
|
205 | (2) |
|
|
207 | (58) |
|
Evolution of the Apache Spark Stream Processing Engine |
|
|
207 | (4) |
|
The Advent of Micro-Batch Stream Processing |
|
|
208 | (1) |
|
Lessons Learned from Spark Streaming (DStreams) |
|
|
209 | (1) |
|
The Philosophy of Structured Streaming |
|
|
210 | (1) |
|
The Programming Model of Structured Streaming |
|
|
211 | (2) |
|
The Fundamentals of a Structured Streaming Query |
|
|
213 | (13) |
|
Five Steps to Define a Streaming Query |
|
|
213 | (6) |
|
Under the Hood of an Active Streaming Query |
|
|
219 | (2) |
|
Recovering from Failures with Exactly-Once Guarantees |
|
|
221 | (2) |
|
Monitoring an Active Query |
|
|
223 | (3) |
|
Streaming Data Sources and Sinks |
|
|
226 | (8) |
|
|
226 | (2) |
|
|
228 | (2) |
|
Custom Streaming Sources and Sinks |
|
|
230 | (4) |
|
|
234 | (4) |
|
Incremental Execution and Streaming State |
|
|
234 | (1) |
|
Stateless Transformations |
|
|
235 | (1) |
|
|
235 | (3) |
|
Stateful Streaming Aggregations |
|
|
238 | (8) |
|
Aggregations Not Based on Time |
|
|
238 | (1) |
|
Aggregations with Event-Time Windows |
|
|
239 | (7) |
|
|
246 | (7) |
|
|
246 | (2) |
|
|
248 | (5) |
|
Arbitrary Stateful Computations |
|
|
253 | (9) |
|
Modeling Arbitrary Stateful Operations with mapGroupsWithState() |
|
|
254 | (3) |
|
Using Timeouts to Manage Inactive Groups |
|
|
257 | (4) |
|
Generalization with flatMapGroupsWithState() |
|
|
261 | (1) |
|
|
262 | (2) |
|
|
264 | (1) |
|
9 Building Reliable Data Lakes with Apache Spark |
|
|
265 | (20) |
|
The Importance of an Optimal Storage Solution |
|
|
265 | (1) |
|
|
266 | (2) |
|
A Brief Introduction to Databases |
|
|
266 | (1) |
|
Reading from and Writing to Databases Using Apache Spark |
|
|
267 | (1) |
|
|
267 | (1) |
|
|
268 | (3) |
|
A Brief Introduction to Data Lakes |
|
|
268 | (1) |
|
Reading from and Writing to Data Lakes Using Apache Spark |
|
|
269 | (1) |
|
Limitations of Data Lakes |
|
|
270 | (1) |
|
Lakehouses: The Next Step in the Evolution of Storage Solutions |
|
|
271 | (3) |
|
|
272 | (1) |
|
|
272 | (1) |
|
|
273 | (1) |
|
Building Lakehouses with Apache Spark and Delta Lake |
|
|
274 | (10) |
|
Configuring Apache Spark with Delta Lake |
|
|
274 | (1) |
|
Loading Data into a Delta Lake Table |
|
|
275 | (2) |
|
Loading Data Streams into a Delta Lake Table |
|
|
277 | (1) |
|
Enforcing Schema on Write to Prevent Data Corruption |
|
|
278 | (1) |
|
Evolving Schemas to Accommodate Changing Data |
|
|
279 | (1) |
|
Transforming Existing Data |
|
|
279 | (3) |
|
Auditing Data Changes with Operation History |
|
|
282 | (1) |
|
Querying Previous Snapshots of a Table with Time Travel |
|
|
283 | (1) |
|
|
284 | (1) |
|
10 Machine Learning with MLlib |
|
|
285 | (38) |
|
What Is Machine Learning? |
|
|
286 | (3) |
|
|
286 | (2) |
|
|
288 | (1) |
|
Why Spark for Machine Learning? |
|
|
289 | (1) |
|
Designing Machine Learning Pipelines |
|
|
289 | (18) |
|
Data Ingestion and Exploration |
|
|
290 | (1) |
|
Creating Training and Test Data Sets |
|
|
291 | (2) |
|
Preparing Features with Transformers |
|
|
293 | (1) |
|
Understanding Linear Regression |
|
|
294 | (1) |
|
Using Estimators to Build Models |
|
|
295 | (1) |
|
|
296 | (6) |
|
|
302 | (4) |
|
Saving and Loading Models |
|
|
306 | (1) |
|
|
307 | (14) |
|
|
307 | (9) |
|
|
316 | (4) |
|
|
320 | (1) |
|
|
321 | (2) |
|
11 Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark |
|
|
323 | (20) |
|
|
323 | (7) |
|
|
324 | (6) |
|
Model Deployment Options with MLlib |
|
|
330 | (6) |
|
|
332 | (1) |
|
|
333 | (1) |
|
Model Export Patterns for Real-Time Inference |
|
|
334 | (2) |
|
Leveraging Spark for Non-MLlib Models |
|
|
336 | (5) |
|
|
336 | (1) |
|
Spark for Distributed Hyperparameter Tuning |
|
|
337 | (4) |
|
|
341 | (2) |
|
12 Epilogue: Apache Spark 3.0 |
|
|
343 | (18) |
|
|
343 | (9) |
|
Dynamic Partition Pruning |
|
|
343 | (2) |
|
|
345 | (3) |
|
|
348 | (1) |
|
Catalog Plugin API and DataSourceV2 |
|
|
349 | (2) |
|
Accelerator-Aware Scheduler |
|
|
351 | (1) |
|
|
352 | (2) |
|
PySpark, Pandas UDFs, and Pandas Function APIs |
|
|
354 | (3) |
|
Redesigned Pandas UDFs with Python Type Hints |
|
|
354 | (1) |
|
Iterator Support in Pandas UDFs |
|
|
355 | (1) |
|
|
356 | (1) |
|
|
357 | (3) |
|
Languages Supported and Deprecated |
|
|
357 | (1) |
|
Changes to the DataFrame and Dataset APIs |
|
|
357 | (1) |
|
DataFrame and SQL Explain Commands |
|
|
358 | (2) |
|
|
360 | (1) |
Index |
|
361 | |