Foreword  xiii
Preface  xv

1 Introduction to Spark and PySpark  1
Why Spark for Data Analytics  2
Launching the PySpark Shell  25
Creating an RDD from a Collection  26
Aggregating and Merging Values of Keys  26
Filtering an RDD's Elements  28
Aggregating Values for Similar Keys  29
ETL Example with DataFrames  30

2 Transformations in Action  35
The DNA Base Count Example  36
The DNA Base Count Problem  38
DNA Base Count Solution 1  40
Step 1: Create an RDD[String] from the Input  41
Step 2: Define a Mapper Function  42
Step 3: Find the Frequencies of DNA Letters  44
Pros and Cons of Solution 1  47
DNA Base Count Solution 2  47
Step 1: Create an RDD[String] from the Input  49
Step 2: Define a Mapper Function  49
Step 3: Find the Frequencies of DNA Letters  51
Pros and Cons of Solution 2  52
DNA Base Count Solution 3  52
The mapPartitions() Transformation  52
Step 1: Create an RDD[String] from the Input  60
Step 2: Define a Function to Handle a Partition  60
Step 3: Apply the Custom Function to Each Partition  62
Pros and Cons of Solution 3  64

3 Mapper Transformations  65
Data Abstractions and Mappers  65
What Are Transformations?  67
The flatMap() Transformation  80
Apply flatMap() to a DataFrame  86
The mapValues() Transformation  89
The flatMapValues() Transformation  90
The mapPartitions() Transformation  91
Handling Empty Partitions  95
DataFrames and mapPartitions() Transformation  99

4 Reductions in Spark  103
Reduction Transformations  105
Solving with reduceByKey()  111
Solving with groupByKey()  112
Solving with aggregateByKey()  112
Solving with combineByKey()  113
Monoid and Non-Monoid Examples  117
The aggregateByKey() Transformation  122
First Solution Using aggregateByKey()  124
Second Solution Using aggregateByKey()  127
Complete PySpark Solution Using groupByKey()  129
Complete PySpark Solution Using reduceByKey()  131
Complete PySpark Solution Using combineByKey()  134
The Shuffle Step in Reductions  137
Shuffle Step for groupByKey()  138
Shuffle Step for reduceByKey()  139

Part II Working with Data

5 Partitioning Data  145
Introduction to Partitions  146
Physical Partitioning for SQL Queries  153
Physical Partitioning of Data in Spark  156
Partition as Parquet Format  157
How to Query Partitioned Data  158

6 Graph Algorithms  161
GraphFrames Functions and Attributes  168

7 Interacting with External Data Sources  203
Writing a DataFrame to a Database  213
Reading and Writing CSV Files  220
Reading and Writing JSON Files  225
Reading from and Writing to Amazon S3  228
Reading and Writing Hadoop Files  232
Reading Hadoop Text Files  233
Writing Hadoop Text Files  236
Reading and Writing HDFS SequenceFiles  238
Reading and Writing Parquet Files  239
Reading and Writing Avro Files  242
Reading from and Writing to MS SQL Server  243
Reading from MS SQL Server  244
Creating a DataFrame from Images  244

8 Ranking Algorithms  247
Calculation of the Rank Product  249
PageRank's Iterative Computation  259
Custom PageRank in PySpark Using RDDs  261
Custom PageRank in PySpark Using an Adjacency Matrix  263
PageRank with GraphFrames  266

Part III Data Design Patterns

9 Classic Data Design Patterns  271
Flat Mapper functionality  277
Input-Multiple-Maps-Reduce-Output  287
Input-Map-Combiner-Reduce-Output  291
Input-MapPartitions-Reduce-Output  294

10 Practical Data Design Patterns  303
Basic MapReduce Algorithm  305
In-Mapper Combining per Record  307
In-Mapper Combining per Partition  309
Solution 1: Classic MapReduce  319
Solution 2  319
Solution 3: Spark's mapPartitions()  320
The Composite Pattern and Monoids  323
Monoidal and Non-Monoidal Examples  328
Non-Monoid MapReduce Example  331
PySpark Implementation of Monoidal Mean  334
Conclusion on Using Monoids  338

11 Join Design Patterns  345
Introduction to the Join Operation  345
Implementation in PySpark  350
Map-Side Join Using DataFrames  355
Step 1: Create Cache for Airports  357
Step 2: Create Cache for Airlines  357
Step 3: Create Facts Table  358
Step 4: Apply Map-Side Join  358
Efficient Joins Using Bloom Filters  359
Introduction to Bloom Filters  359
A Simple Bloom Filter Example  361
Using Bloom Filters in PySpark  362

12 Feature Engineering in PySpark  365
Introduction to Feature Engineering  366
Tokenization with a Pipeline  377
Scaling a Column Using a Pipeline  382
Using MinMaxScaler on Multiple Columns  383
Normalization Using Normalizer  384
Applying StringIndexer to a Single Column  385
Applying StringIndexer to Several Columns  386
Summary  403

Index  405