
Learning Spark, 2nd Edition [Paperback / softback]

4.32/5 (221 ratings by Goodreads)
  • Format: Paperback / softback, 300 pages, height x width: 233x178 mm
  • Pub. Date: 31-Aug-2020
  • Publisher: O'Reilly Media
  • ISBN-10: 1492050040
  • ISBN-13: 9781492050049
  • Price: 75,81 €*
  • * The price is final, i.e. no additional discount will apply.
  • Regular price: 89,19 €
  • Save 15%
  • This book is not in stock and will arrive in about 2-4 weeks. Please allow another 2 weeks for shipping outside Estonia.
  • Delivery time: 4-6 weeks

Data is getting bigger, arriving faster, and coming in varied formats; it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through explanations, code snippets, and notebooks, you'll be able to:

  • Learn Python, SQL, Scala, or Java high-level APIs: DataFrames and Datasets (a short sketch follows this list)
  • Peek under the hood of the Spark SQL engine to understand Spark transformations and performance
  • Inspect, tune, and debug your Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow
  • Use Koalas, the open source pandas-like framework on Spark, for data transformation and feature engineering
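
To give a flavor of the DataFrame API highlighted above, here is a minimal PySpark sketch (an illustration, not an excerpt from the book); it assumes PySpark is installed locally, and the data and names are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg

    # Start (or reuse) a local SparkSession; the application name is arbitrary.
    spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

    # Build a small DataFrame in memory rather than reading from a data source.
    rows = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]
    df = spark.createDataFrame(rows, ["name", "age"])

    # groupBy/agg are lazy transformations; show() is the action that runs the job.
    df.groupBy("name").agg(avg("age").alias("avg_age")).show()

    spark.stop()

The same query can be expressed in SQL, Scala, or Java against the same engine, which is the unification the book emphasizes.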
Table of Contents

Foreword
Preface
1. Introduction to Apache Spark: A Unified Analytics Engine
   The Genesis of Spark
      Big Data and Distributed Computing at Google
      Hadoop at Yahoo!
      Spark's Early Years at AMPLab
   What Is Apache Spark?
      Speed
      Ease of Use
      Modularity
      Extensibility
   Unified Analytics
      Apache Spark Components as a Unified Stack
      Apache Spark's Distributed Execution
   The Developer's Experience
      Who Uses Spark, and for What?
      Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
   Step 1: Downloading Apache Spark
      Spark's Directories and Files
   Step 2: Using the Scala or PySpark Shell
      Using the Local Machine
   Step 3: Understanding Spark Application Concepts
      Spark Application and SparkSession
      Spark Jobs
      Spark Stages
      Spark Tasks
   Transformations, Actions, and Lazy Evaluation
      Narrow and Wide Transformations
   The Spark UI
   Your First Standalone Application
      Counting M&Ms for the Cookie Monster
      Building Standalone Applications in Scala
   Summary
3. Apache Spark's Structured APIs
   Spark: What's Underneath an RDD?
   Structuring Spark
      Key Merits and Benefits
   The DataFrame API
      Spark's Basic Data Types
      Spark's Structured and Complex Data Types
      Schemas and Creating DataFrames
      Columns and Expressions
      Rows
      Common DataFrame Operations
      End-to-End DataFrame Example
   The Dataset API
      Typed Objects, Untyped Objects, and Generic Rows
      Creating Datasets
      Dataset Operations
      End-to-End Dataset Example
   DataFrames Versus Datasets
      When to Use RDDs
   Spark SQL and the Underlying Engine
      The Catalyst Optimizer
   Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
   Using Spark SQL in Spark Applications
      Basic Query Examples
   SQL Tables and Views
      Managed Versus Unmanaged Tables
      Creating SQL Databases and Tables
      Creating Views
      Viewing the Metadata
      Caching SQL Tables
      Reading Tables into DataFrames
   Data Sources for DataFrames and SQL Tables
      DataFrameReader
      DataFrameWriter
      Parquet
      JSON
      CSV
      Avro
      ORC
      Images
      Binary Files
   Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
   Spark SQL and Apache Hive
      User-Defined Functions
   Querying with the Spark SQL Shell, Beeline, and Tableau
      Using the Spark SQL Shell
      Working with Beeline
      Working with Tableau
   External Data Sources
      JDBC and SQL Databases
      PostgreSQL
      MySQL
      Azure Cosmos DB
      MS SQL Server
      Other External Sources
   Higher-Order Functions in DataFrames and Spark SQL
      Option 1: Explode and Collect
      Option 2: User-Defined Function
      Built-in Functions for Complex Data Types
      Higher-Order Functions
   Common DataFrames and Spark SQL Operations
      Unions
      Joins
      Windowing
      Modifications
   Summary
6. Spark SQL and Datasets
   Single API for Java and Scala
      Scala Case Classes and JavaBeans for Datasets
   Working with Datasets
      Creating Sample Data
      Transforming Sample Data
   Memory Management for Datasets and DataFrames
   Dataset Encoders
      Spark's Internal Format Versus Java Object Format
      Serialization and Deserialization (SerDe)
   Costs of Using Datasets
      Strategies to Mitigate Costs
   Summary
7. Optimizing and Tuning Spark Applications
   Optimizing and Tuning Spark for Efficiency
      Viewing and Setting Apache Spark Configurations
      Scaling Spark for Large Workloads
   Caching and Persistence of Data
      DataFrame.cache()
      DataFrame.persist()
      When to Cache and Persist
      When Not to Cache and Persist
   A Family of Spark Joins
      Broadcast Hash Join
      Shuffle Sort Merge Join
   Inspecting the Spark UI
      Journey Through the Spark UI Tabs
   Summary
8. Structured Streaming
   Evolution of the Apache Spark Stream Processing Engine
      The Advent of Micro-Batch Stream Processing
      Lessons Learned from Spark Streaming (DStreams)
      The Philosophy of Structured Streaming
   The Programming Model of Structured Streaming
   The Fundamentals of a Structured Streaming Query
      Five Steps to Define a Streaming Query
      Under the Hood of an Active Streaming Query
      Recovering from Failures with Exactly-Once Guarantees
      Monitoring an Active Query
   Streaming Data Sources and Sinks
      Files
      Apache Kafka
      Custom Streaming Sources and Sinks
   Data Transformations
      Incremental Execution and Streaming State
      Stateless Transformations
      Stateful Transformations
   Stateful Streaming Aggregations
      Aggregations Not Based on Time
      Aggregations with Event-Time Windows
   Streaming Joins
      Stream-Static Joins
      Stream-Stream Joins
   Arbitrary Stateful Computations
      Modeling Arbitrary Stateful Operations with mapGroupsWithState()
      Using Timeouts to Manage Inactive Groups
      Generalization with flatMapGroupsWithState()
   Performance Tuning
   Summary
9. Building Reliable Data Lakes with Apache Spark
   The Importance of an Optimal Storage Solution
   Databases
      A Brief Introduction to Databases
      Reading from and Writing to Databases Using Apache Spark
      Limitations of Databases
   Data Lakes
      A Brief Introduction to Data Lakes
      Reading from and Writing to Data Lakes Using Apache Spark
      Limitations of Data Lakes
   Lakehouses: The Next Step in the Evolution of Storage Solutions
      Apache Hudi
      Apache Iceberg
      Delta Lake
   Building Lakehouses with Apache Spark and Delta Lake
      Configuring Apache Spark with Delta Lake
      Loading Data into a Delta Lake Table
      Loading Data Streams into a Delta Lake Table
      Enforcing Schema on Write to Prevent Data Corruption
      Evolving Schemas to Accommodate Changing Data
      Transforming Existing Data
      Auditing Data Changes with Operation History
      Querying Previous Snapshots of a Table with Time Travel
   Summary
10. Machine Learning with MLlib
   What Is Machine Learning?
      Supervised Learning
      Unsupervised Learning
      Why Spark for Machine Learning?
   Designing Machine Learning Pipelines
      Data Ingestion and Exploration
      Creating Training and Test Data Sets
      Preparing Features with Transformers
      Understanding Linear Regression
      Using Estimators to Build Models
      Creating a Pipeline
      Evaluating Models
      Saving and Loading Models
   Hyperparameter Tuning
      Tree-Based Models
      k-Fold Cross-Validation
      Optimizing Pipelines
   Summary
11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
   Model Management
      MLflow
   Model Deployment Options with MLlib
      Batch
      Streaming
      Model Export Patterns for Real-Time Inference
   Leveraging Spark for Non-MLlib Models
      Pandas UDFs
      Spark for Distributed Hyperparameter Tuning
   Summary
12. Epilogue: Apache Spark 3.0
   Spark Core and Spark SQL
      Dynamic Partition Pruning
      Adaptive Query Execution
      SQL Join Hints
      Catalog Plugin API and DataSourceV2
      Accelerator-Aware Scheduler
   Structured Streaming
   PySpark, Pandas UDFs, and Pandas Function APIs
      Redesigned Pandas UDFs with Python Type Hints
      Iterator Support in Pandas UDFs
      New Pandas Function APIs
   Changed Functionality
      Languages Supported and Deprecated
      Changes to the DataFrame and Dataset APIs
      DataFrame and SQL Explain Commands
   Summary
Index
Jules S. Damji is an Apache Spark Community and Developer Advocate at Databricks. He is a hands-on developer with over 20 years of experience and has worked at leading companies such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, VeriSign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in Computer Science and an MA in Political Advocacy and Communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.

Denny Lee is a Technical Product Manager at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He holds a master's in Biomedical Informatics from Oregon Health and Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include distributed systems, Apache Spark, deep learning, machine learning, and genomics.

Brooke Wenig is the Machine Learning Practice Lead at Databricks. She guides and assists customers in implementing machine learning pipelines and teaches courses on distributed machine learning and deep learning. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a graduate student in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.