
E-book: Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library

  • Format: PDF+DRM
  • Publication date: 22-Oct-2021
  • Publisher: Apress
  • Language: English
  • ISBN-13: 9781484273838
  • Price: €67.91*
  • * The price is final; no further discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste): not allowed

  • Printing: not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Take a journey toward discovering, learning, and using Apache Spark 3.0. In this book, you will gain expertise in the powerful and efficient distributed data processing engine inside Apache Spark; its user-friendly, comprehensive, and flexible programming model for processing data in batch and streaming; and the scalable machine learning algorithms and practical utilities for building machine learning applications.

Beginning Apache Spark 3 begins by introducing Spark's core concepts and architecture and the Spark unified stack, and by explaining different ways of interacting with Apache Spark. Next, it offers an overview of Spark SQL before moving on to its advanced features. It covers tips and techniques for dealing with performance issues, followed by an overview of the Structured Streaming processing engine. It concludes with a demonstration of how to develop machine learning applications using Spark MLlib and how to manage the machine learning development lifecycle. The book is packed with practical examples and code snippets to help you master concepts and features immediately after they are covered in each section.

After reading this book, you will have the knowledge required to build your own big data pipelines, applications, and machine learning applications.

What You Will Learn

  • Master the Spark unified data analytics engine and its various components
  • Work in tandem to provide a scalable, fault-tolerant, and performant data processing engine
  • Leverage the user-friendly and flexible programming model to perform simple to complex data analytics using DataFrame and Spark SQL
  • Develop machine learning applications using Spark MLlib
  • Manage the machine learning development lifecycle using MLflow

Who This Book Is For

Data scientists, data engineers and software developers.

Table of Contents

About the Author
About the Technical Reviewers
Acknowledgments
Introduction
Chapter 1: Introduction to Apache Spark
  • Overview
  • History
  • Spark Core Concepts and Architecture
  • Spark Cluster and Resource Management System
  • Spark Applications
  • Spark Drivers and Executors
  • Spark Unified Stack
  • Apache Spark 3.0
  • Adaptive Query Execution Framework
  • Dynamic Partition Pruning (DPP)
  • Accelerator-aware Scheduler
  • Apache Spark Applications
  • Spark Example Applications
  • Apache Spark Ecosystem
  • Delta Lake
  • Koalas
  • MLflow
  • Summary
Chapter 2: Working with Apache Spark
  • Downloading and Installation
  • Downloading Spark
  • Installing Spark
  • Having Fun with the Spark Scala Shell
  • Useful Spark Scala Shell Command and Tips
  • Basic Interactions with Scala and Spark
  • Introduction to Collaborative Notebooks
  • Create a Cluster
  • Create a Folder
  • Create a Notebook
  • Setting up Spark Source Code
  • Summary
Chapter 3: Spark SQL: Foundation
  • Understanding RDD
  • Introduction to the DataFrame API
  • Creating a DataFrame
  • Creating a DataFrame from RDD
  • Creating a DataFrame from a Range of Numbers
  • Creating a DataFrame from Data Sources
  • Working with Structured Operations
  • Introduction to Datasets
  • Creating Datasets
  • Working with Datasets
  • Using SQL in Spark SQL
  • Running SQL in Spark
  • Writing Data Out to Storage Systems
  • The Trio: DataFrame, Dataset, and SQL
  • DataFrame Persistence
  • Summary
Chapter 4: Spark SQL: Advanced
  • Aggregations
  • Aggregation Functions
  • Aggregation with Grouping
  • Aggregation with Pivoting
  • Joins
  • Join Expression and Join Types
  • Working with Joins
  • Dealing with Duplicate Column Names
  • Overview of Join Implementation
  • Functions
  • Working with Built-in Functions
  • Working with User-Defined Functions (UDFs)
  • Advanced Analytics Functions
  • Aggregation with Rollups and Cubes
  • Rollups
  • Cubes
  • Exploring Catalyst Optimizer
  • Logical Plan
  • Physical Plan
  • Catalyst in Action
  • Project Tungsten
  • Summary
Chapter 5: Optimizing Spark Applications
  • Common Performance Issues
  • Spark Configurations
  • Spark Memory Management
  • Leverage In-Memory Computation
  • When to Persist and Cache Data
  • Persistence and Caching APIs
  • Persistence and Caching Example
  • Understanding Spark Joins
  • Broadcast Hash Join
  • Shuffle Sort Merge Join
  • Adaptive Query Execution
  • Dynamically Coalescing Shuffle Partitions
  • Dynamically Switching Join Strategies
  • Dynamically Optimizing Skew Joins
  • Summary
Chapter 6: Spark Streaming
  • Stream Processing
  • Concepts
  • Stream Processing Engine Landscape
  • Spark Streaming Overview
  • Spark DStream
  • Spark Structured Streaming
  • Overview
  • Core Concepts
  • Structured Streaming Applications
  • Streaming DataFrame Operations
  • Working with Data Sources
  • Working with Data Sinks
  • Output Modes
  • Triggers
  • Summary
Chapter 7: Advanced Spark Streaming
  • Event Time
  • Fixed Window Aggregation over an Event Time
  • Sliding Window Aggregation over Event Time
  • Aggregation State
  • Watermarking: Limit State and Handle Late Data
  • Arbitrary Stateful Processing
  • Arbitrary Stateful Processing with Structured Streaming
  • Handling State Timeouts
  • Arbitrary State Processing in Action
  • Handling Duplicate Data
  • Fault Tolerance
  • Streaming Application Code Change
  • Spark Runtime Change
  • Streaming Query Metrics and Monitoring
  • Streaming Query Metrics
  • Monitoring Streaming Queries via Callback
  • Monitoring Streaming Queries via Visualization UI
  • Streaming Query Summary Information
  • Streaming Query Detailed Statistics Information
  • Troubleshooting Streaming Query
  • Summary
Chapter 8: Machine Learning with Spark
  • Machine Learning Overview
  • Machine Learning Terminologies
  • Machine Learning Types
  • Machine Learning Development Process
  • Spark Machine Learning Library
  • Machine Learning Pipelines
  • Machine Learning Tasks in Action
  • Classification
  • Regression
  • Recommendation
  • Deep Learning Pipeline
  • Summary
Chapter 9: Managing the Machine Learning Life Cycle
  • The Rise of MLOps
  • MLOps Overview
  • MLflow Overview
  • MLflow Components
  • MLflow in Action
  • Model Deployment and Prediction
  • Summary
Index

Hien Luu has extensive experience in designing and building big data applications and machine learning infrastructure. He is particularly passionate about the intersection between big data and machine learning. Hien enjoys working with open source software and has contributed to Apache Pig and Azkaban. Teaching is also one of his passions, and he serves as an instructor at the UCSC Silicon Valley Extension school teaching Apache Spark. He has given presentations at various conferences such as Data+AI Summit, MLOps World, QCon SF, QCon London, Hadoop Summit, and JavaOne.