
Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark [Paperback]

  • Format: Paperback / softback, 500 pages, height x width: 232x178 mm
  • Publication date: 30-Apr-2022
  • Publisher: O'Reilly Media
  • ISBN-10: 1492082384
  • ISBN-13: 9781492082385
  • Price: 75.81 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 89.19 €
  • You save 15%
  • Delivery from the publisher takes approximately 2-4 weeks

Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.

In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.

With this book, you will:

  • Learn how to select Spark transformations for optimized solutions
  • Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions() (a brief sketch follows this list)
  • Understand data partitioning for optimized queries
  • Design machine learning algorithms including Naive Bayes, linear regression, and logistic regression
  • Build and apply a model using PySpark design patterns
  • Apply motif-finding algorithms to graph data
  • Analyze graph data by using the GraphFrames API
  • Apply PySpark algorithms to clinical and genomics data (such as DNA-Seq)
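
To give a flavor of these topics, here is a minimal PySpark sketch (not taken from the book) that applies the reductions named above: reduceByKey(), combineByKey(), and mapPartitions(). The input pairs, the application name, and all variable names are made-up illustrations, assuming a local Spark installation.

```python
from pyspark.sql import SparkSession

# Hypothetical application name; any name works for a local run.
spark = SparkSession.builder.appName("reductions-demo").getOrCreate()
sc = spark.sparkContext

# Made-up (key, value) pairs: movie id -> rating
ratings = sc.parallelize(
    [("m1", 4.0), ("m2", 3.0), ("m1", 5.0), ("m2", 4.0), ("m1", 3.0)]
)

# reduceByKey(): sum of ratings per movie
sums = ratings.reduceByKey(lambda a, b: a + b)

# combineByKey(): build a (sum, count) pair per movie, then compute the mean
sum_count = ratings.combineByKey(
    lambda v: (v, 1),                         # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners
)
means = sum_count.mapValues(lambda p: p[0] / p[1])

# mapPartitions(): do work once per partition instead of once per element,
# here simply counting the records in each partition
def count_partition(iterator):
    yield sum(1 for _ in iterator)

per_partition_counts = ratings.mapPartitions(count_partition)

print(sums.collect())
print(means.collect())
print(per_partition_counts.collect())

spark.stop()
```

combineByKey() is used for the mean because its merge functions can carry a (sum, count) pair per key; with reduceByKey() alone you would first map each value to such a pair, since the reduce function must keep the value type unchanged.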
Foreword xiii
Preface xv
Part I Fundamentals
1 Introduction to Spark and PySpark 1(34)
Why Spark for Data Analytics 2(3)
The Spark Ecosystem 5(1)
Spark Architecture 6(6)
The Power of PySpark 12(3)
PySpark Architecture 15(2)
Spark Data Abstractions 17(1)
RDD Examples 17(1)
Spark RDD Operations 18(3)
DataFrame Examples 21(3)
Using the PySpark Shell 24(1)
Launching the PySpark Shell 25(1)
Creating an RDD from a Collection 26(1)
Aggregating and Merging Values of Keys 26(2)
Filtering an RDD's Elements 28(1)
Grouping Similar Keys 28(1)
Aggregating Values for Similar Keys 29(1)
ETL Example with DataFrames 30(1)
Extraction 31(1)
Transformation 32(1)
Loading 33(1)
Summary 33(2)
2 Transformations in Action 35(30)
The DNA Base Count Example 36(2)
The DNA Base Count Problem 38(1)
FASTA Format 39(1)
Sample Data 39(1)
DNA Base Count Solution 1 40(1)
Step 1 Create an RDD[String] from the Input 41(1)
Step 2 Define a Mapper Function 42(2)
Step 3 Find the Frequencies of DNA Letters 44(3)
Pros and Cons of Solution 1 47(1)
DNA Base Count Solution 2 47(2)
Step 1 Create an RDD[String] from the Input 49(1)
Step 2 Define a Mapper Function 49(2)
Step 3 Find the Frequencies of DNA Letters 51(1)
Pros and Cons of Solution 2 52(1)
DNA Base Count Solution 3 52(1)
The mapPartitions() Transformation 52(8)
Step 1 Create an RDD[String] from the Input 60(1)
Step 2 Define a Function to Handle a Partition 60(2)
Step 3 Apply the Custom Function to Each Partition 62(2)
Pros and Cons of Solution 3 64(1)
Summary 64(1)
3 Mapper Transformations 65(38)
Data Abstractions and Mappers 65(2)
What Are Transformations? 67(5)
Lazy Transformations 72(1)
The map() Transformation 73(5)
DataFrame Mapper 78(2)
The flatMap() Transformation 80(5)
map() Versus flatMap() 85(1)
Apply flatMap() to a DataFrame 86(3)
The mapValues() Transformation 89(1)
The flatMapValues() Transformation 90(1)
The mapPartitions() Transformation 91(4)
Handling Empty Partitions 95(3)
Benefits and Drawbacks 98(1)
DataFrames and mapPartitions() Transformation 99(3)
Summary 102(1)
4 Reductions in Spark 103(42)
Creating Pair RDDs 104(1)
Reduction Transformations 105(3)
Spark's Reductions 108(2)
Simple Warmup Example 110(1)
Solving with reduceByKey() 111(1)
Solving with groupByKey() 112(1)
Solving with aggregateByKey() 112(1)
Solving with combineByKey() 113(2)
What Is a Monoid? 115(2)
Monoid and Non-Monoid Examples 117(1)
The Movie Problem 118(3)
Input Dataset to Analyze 121(1)
The aggregateByKey() Transformation 122(2)
First Solution Using aggregateByKey() 124(3)
Second Solution Using aggregateByKey() 127(2)
Complete PySpark Solution Using groupByKey() 129(2)
Complete PySpark Solution Using reduceByKey() 131(3)
Complete PySpark Solution Using combineByKey() 134(3)
The Shuffle Step in Reductions 137(1)
Shuffle Step for groupByKey() 138(1)
Shuffle Step for reduceByKey() 139(1)
Summary 140(5)
Part II Working with Data
5 Partitioning Data 145(16)
Introduction to Partitions 146(1)
Partitions in Spark 146(4)
Managing Partitions 150(1)
Default Partitioning 151(1)
Explicit Partitioning 152(1)
Physical Partitioning for SQL Queries 153(3)
Physical Partitioning of Data in Spark 156(1)
Partition as Text Format 156(1)
Partition as Parquet Format 157(1)
How to Query Partitioned Data 158(1)
Amazon Athena Example 158(2)
Summary 160(1)
6 Graph Algorithms 161(42)
Introduction to Graphs 162(2)
The GraphFrames API 164(1)
How to Use GraphFrames 165(3)
GraphFrames Functions and Attributes 168(1)
GraphFrames Algorithms 169(1)
Finding Triangles 169(3)
Motif Finding 172(9)
Real-World Applications 181(1)
Gene Analysis 181(2)
Social Recommendations 183(4)
Facebook Circles 187(4)
Connected Components 191(2)
Analyzing Flight Data 193(9)
Summary 202(1)
7 Interacting with External Data Sources 203(44)
Relational Databases 204(1)
Reading from a Database 205(8)
Writing a DataFrame to a Database 213(5)
Reading Text Files 218(2)
Reading and Writing CSV Files 220(1)
Reading CSV Files 220(4)
Writing CSV Files 224(1)
Reading and Writing JSON Files 225(1)
Reading JSON Files 226(1)
Writing JSON Files 227(1)
Reading from and Writing to Amazon S3 228(1)
Reading from Amazon S3 229(2)
Writing to Amazon S3 231(1)
Reading and Writing Hadoop Files 232(1)
Reading Hadoop Text Files 233(3)
Writing Hadoop Text Files 236(2)
Reading and Writing HDFS SequenceFiles 238(1)
Reading and Writing Parquet Files 239(1)
Writing Parquet Files 239(2)
Reading Parquet Files 241(1)
Reading and Writing Avro Files 242(1)
Reading Avro Files 242(1)
Writing Avro Files 242(1)
Reading from and Writing to MS SQL Server 243(1)
Writing to MS SQL Server 243(1)
Reading from MS SQL Server 244(1)
Reading Image Files 244(1)
Creating a DataFrame from Images 244(2)
Summary 246(1)
8 Ranking Algorithms 247(24)
Rank Product 248(1)
Calculation of the Rank Product 249(1)
Formalizing Rank Product 249(1)
Rank Product Example 250(1)
PySpark Solution 251(6)
PageRank 257(2)
PageRank's Iterative Computation 259(2)
Custom PageRank in PySpark Using RDDs 261(2)
Custom PageRank in PySpark Using an Adjacency Matrix 263(3)
PageRank with GraphFrames 266(1)
Summary 267(4)
Part III Data Design Patterns
9 Classic Data Design Patterns 271(32)
Input-Map-Output 272(1)
RDD Solution 273(2)
DataFrame Solution 275(2)
Flat Mapper Functionality 277(1)
Input-Filter-Output 278(1)
RDD Solution 279(1)
DataFrame Solution 280(1)
DataFrame Filter 280(2)
Input-Map-Reduce-Output 282(1)
RDD Solution 282(3)
DataFrame Solution 285(2)
Input-Multiple-Maps-Reduce-Output 287(1)
RDD Solution 288(2)
DataFrame Solution 290(1)
Input-Map-Combiner-Reduce-Output 291(3)
Input-MapPartitions-Reduce-Output 294(4)
Inverted Index 298(1)
Problem Statement 298(1)
Input 298(1)
Output 299(1)
PySpark Solution 299(3)
Summary 302(1)
10 Practical Data Design Patterns 303(42)
In-Mapper Combining 304(1)
Basic MapReduce Algorithm 305(2)
In-Mapper Combining per Record 307(2)
In-Mapper Combining per Partition 309(3)
Top-10 312(2)
Top-N Formalized 314(2)
PySpark Solution 316(2)
Finding the Bottom 10 318(1)
MinMax 319(1)
Solution 1 Classic MapReduce 319(1)
Solution 2 Sorting 319(1)
Solution 3 Spark's mapPartitions() 320(3)
The Composite Pattern and Monoids 323(1)
Monoids 324(4)
Monoidal and Non-Monoidal Examples 328(3)
Non-Monoid MapReduce Example 331(1)
Monoid MapReduce Example 332(2)
PySpark Implementation of Monoidal Mean 334(2)
Functors and Monoids 336(2)
Conclusion on Using Monoids 338(1)
Binning 338(4)
Sorting 342(1)
Summary 342(3)
11 Join Design Patterns 345(20)
Introduction to the Join Operation 345(3)
Join in MapReduce 348(1)
Map Phase 348(1)
Reducer Phase 349(1)
Implementation in PySpark 350(1)
Map-Side Join Using RDDs 351(4)
Map-Side Join Using DataFrames 355(2)
Step 1 Create Cache for Airports 357(1)
Step 2 Create Cache for Airlines 357(1)
Step 3 Create Facts Table 358(1)
Step 4 Apply Map-Side Join 358(1)
Efficient Joins Using Bloom Filters 359(1)
Introduction to Bloom Filters 359(2)
A Simple Bloom Filter Example 361(1)
Bloom Filters in Python 362(1)
Using Bloom Filters in PySpark 362(1)
Summary 363(2)
12 Feature Engineering in PySpark 365(38)
Introduction to Feature Engineering 366(2)
Adding New Features 368(1)
Applying UDFs 369(1)
Creating Pipelines 370(2)
Binarizing Data 372(1)
Imputation 373(2)
Tokenization 375(1)
Tokenizer 376(1)
RegexTokenizer 376(1)
Tokenization with a Pipeline 377(1)
Standardization 377(3)
Normalization 380(2)
Scaling a Column Using a Pipeline 382(1)
Using MinMaxScaler on Multiple Columns 383(1)
Normalization Using Normalizer 384(1)
String Indexing 385(1)
Applying StringIndexer to a Single Column 385(1)
Applying StringIndexer to Several Columns 386(1)
Vector Assembly 386(1)
Bucketing 387(1)
Bucketizer 388(1)
QuantileDiscretizer 389(1)
Logarithm Transformation 390(1)
One-Hot Encoding 391(6)
TF-IDF 397(4)
FeatureHasher 401(1)
SQLTransformer 402(1)
Summary 403(2)
Index 405
Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side development, databases, MapReduce, Spark, PySpark, and distributed computing.