Spark in Action, Second Edition [Paperback / softback]

3.92/5 (49 ratings by Goodreads)
  • Format: Paperback / softback, 576 pages, height x width x depth: 235x185x32 mm, weight: 1040 g
  • Pub. Date: 22-Jun-2020
  • Publisher: Manning Publications
  • ISBN-10: 1617295523
  • ISBN-13: 9781617295522

Unlike many Spark books written for data scientists, Spark in Action, Second Edition is designed for data engineers and software engineers who want to master data processing using Spark without having to learn a complex new ecosystem of languages and tools. You'll instead learn to apply your existing Java and SQL skills to take on practical, real-world challenges.

Summary
The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you'll learn to take advantage of Spark's core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark's powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.
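
A quick illustration of one claim above, delayed (lazy) evaluation: in Spark's Java API, transformations such as withColumn and filter only build an execution plan, and nothing is computed until an action such as count is called. The snippet below is an illustrative sketch, not code from the book; the column names, sizes, and the local[*] master are assumptions.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LazyEvaluationSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Lazy evaluation sketch")
        .master("local[*]")                     // assumption: run locally
        .getOrCreate();

    // A small dataframe of numbers to work with.
    Dataset<Row> df = spark.range(1, 1_000_000).toDF("id");

    // Transformations only describe the work; Spark records a plan here.
    Dataset<Row> planned = df
        .withColumn("square", col("id").multiply(col("id")))
        .filter(col("square").gt(1_000));

    // The action below is what actually triggers the computation.
    System.out.println("Matching rows: " + planned.count());

    spark.stop();
  }
}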

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.
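
To make the "reading, filtering, and merging" claim concrete, here is a minimal sketch of the kind of Java code this workflow involves: ingest a CSV into a dataframe, filter it with the dataframe API, and query it with Spark SQL. It is not an example from the book; the file path and the status and borough columns are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvIngestionSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion and SQL sketch")
        .master("local[*]")                         // assumption: local run
        .getOrCreate();

    // Ingest a CSV file into a dataframe (a Dataset<Row>).
    Dataset<Row> restaurants = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("data/restaurants.csv");              // hypothetical file

    // Filter with the dataframe API.
    Dataset<Row> open = restaurants.filter(
        restaurants.col("status").equalTo("OPEN")); // hypothetical column

    // Or register a temporary view and use plain SQL.
    restaurants.createOrReplaceTempView("restaurants");
    Dataset<Row> perBorough = spark.sql(
        "SELECT borough, COUNT(*) AS total FROM restaurants GROUP BY borough");

    open.show(5);
    perBorough.show(5);

    spark.stop();
  }
}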

About the book
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you'll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you'll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

What's inside

    Writing Spark applications in Java
    Spark application architecture
    Ingestion through files, databases, streaming, and Elasticsearch
    Querying distributed datasets with Spark SQL

About the reader
This book does not assume previous experience with Spark, Scala, or Hadoop.

About the author
Jean-Georges Perrin is an experienced data and software architect. He is France's first IBM Champion and has been honored for 12 consecutive years.

Table of Contents

PART 1 - THE THEORY CRIPPLED BY AWESOME EXAMPLES

1 So, what is Spark, anyway?

2 Architecture and flow

3 The majestic role of the dataframe

4 Fundamentally lazy

5 Building a simple app for deployment

6 Deploying your simple app

PART 2 - INGESTION

7 Ingestion from files

8 Ingestion from databases

9 Advanced ingestion: finding data sources and building your own

10 Ingestion through structured streaming

PART 3 - TRANSFORMING YOUR DATA

11 Working with SQL

12 Transforming your data

13 Transforming entire documents

14 Extending transformations with user-defined functions

15 Aggregating your data

PART 4 - GOING FURTHER

16 Cache and checkpoint: Enhancing Spark's performances

17 Exporting data and building full data pipelines

18 Exploring deployment
Foreword xiii
Preface xi
Acknowledgments xvii
About This Book xix
About The Author xxv
About The Cover Illustration xxvi
PART 1 THE THEORY CRIPPLED BY AWESOME EXAMPLES
1(136)
1 So, what is Spark, anyway?
3(16)
1.1 The big picture: What Spark is and what it does
4(4)
What is Spark?
4(2)
The four pillars of mana
6(2)
1.2 How can you use Spark?
8(2)
Spark in a data processing/engineering scenario
8(1)
Spark in a data science scenario
9(1)
1.3 What can you do with Spark?
10(2)
Spark predicts restaurant quality at NC eateries
11(1)
Spark allows fast data transfer for Lumeris
11(1)
Spark analyzes equipment logs for CERN
12(1)
Other use cases
12(1)
1.4 Why you will love the dataframe
12(2)
The dataframe from a Java perspective
13(1)
The dataframe from an RDBMS perspective
13(1)
A graphical representation of the dataframe
14(1)
1.5 Your first example
14(5)
Recommended software
15(1)
Downloading the code
15(2)
Running your first application
15(2)
Your first code
17(2)
2 Architecture and flow
19(14)
2.1 Building your mental model
20(1)
2.2 Using Java code to build your mental model
21(2)
2.3 Walking through your application
23(10)
Connecting to a master
24(1)
Loading, or ingesting, the CSV file
25(3)
Transforming your data
28(1)
Saving the work done in your dataframe to a database
29(4)
3 The majestic role of the dataframe
33(35)
3.1 The essential role of the dataframe in Spark
34(3)
Organization of a dataframe
35(1)
Immutability is not a swear word
36(1)
3.2 Using dataframes through examples
37(20)
A dataframe after a simple CSV ingestion
39(5)
Data is stored in partitions
44(1)
Digging in the schema
45(1)
A dataframe after a JSON ingestion
46(6)
Combining two dataframes
52(5)
3.3 The dataframe is a Dataset<Row>
57(9)
Reusing your POJOs
58(1)
Creating a dataset of strings
59(1)
Converting back and forth
60(6)
3.4 Dataframe's ancestor: the RDD
66(2)
4 Fundamentally lazy
68(22)
4.1 A real-life example of efficient laziness
69(1)
4.2 A Spark example of efficient laziness
70(13)
Looking at the results of transformations and actions
70(2)
The transformation process, step by step
72(2)
The code behind the transformation/action process
74(3)
The mystery behind the creation of 7 million datapoints in 182 ms
77(2)
The mystery behind the timing of actions
79(4)
4.3 Comparing to RDBMS and traditional applications
83(3)
Working with the teen birth rates dataset
83(1)
Analyzing differences between a traditional app and a Spark app
84(2)
4.4 Spark is amazing for data-focused applications
86(1)
4.5 Catalyst is your app catalyzer
86(4)
5 Building a simple app for deployment
90(24)
5.1 An ingestionless example
91(11)
Calculating π
91(2)
The code to approximate π
93(6)
What are lambda functions in Java?
99(2)
Approximating π by using lambda functions
101(1)
5.2 Interacting with Spark
102(12)
Local mode
103(1)
Cluster mode
104(3)
Interactive mode in Scala and Python
107(7)
6 Deploying your simple app
114(23)
6.1 Beyond the example: The role of the components
116(5)
Quick overview of the components and their interactions
116(4)
Troubleshooting tips for the Spark architecture
120(1)
Going further
121(1)
6.2 Building a cluster
121(5)
Building a cluster that works for you
122(1)
Setting up the environment
123(3)
6.3 Building your application to run on the cluster
126(6)
Building your application's uber JAR
127(2)
Building your application by using Git and Maven
129(3)
6.4 Running your application on the cluster
132(5)
Submitting the uber JAR
132(1)
Running the application
133(1)
Analyzing the Spark user interface
133(4)
PART 2 INGESTION
137(103)
7 Ingestion from files
139(29)
7.1 Common behaviors of parsers
141(1)
7.2 Complex ingestion from CSV
141(3)
Desired output
142(1)
Code
143(1)
7.3 Ingesting a CSV with a known schema
144(2)
Desired output
145(1)
Code
145(1)
7.4 Ingesting a JSON file
146(4)
Desired output
148(1)
Code
149(1)
7.5 Ingesting a multiline JSON file
150(3)
Desired output
151(1)
Code
152(1)
7.6 Ingesting an XML file
153(4)
Desired output
155(1)
Code
155(2)
7.7 Ingesting a text file
157(2)
Desired output
158(1)
Code
158(1)
7.8 File formats for big data
159(3)
The problem with traditional file formats
159(1)
Avro is a schema-based serialization format
160(1)
ORC is a columnar storage format
161(1)
Parquet is also a columnar storage format
161(1)
Comparing Avro, ORC, and Parquet
161(1)
7.9 Ingesting Avro, ORC, and Parquet files
162(6)
Ingesting Avro
162(2)
Ingesting ORC
164(1)
Ingesting Parquet
165(2)
Reference table for ingesting Avro, ORC, or Parquet
167(1)
8 Ingestion from databases
168(26)
8.1 Ingestion from relational databases
169(7)
Database connection checklist
170(1)
Understanding the data used in the examples
170(2)
Desired output
172(1)
Code
173(2)
Alternative code
175(1)
8.2 The role of the dialect
176(4)
What is a dialect, anyway?
177(1)
JDBC dialects provided with Spark
177(1)
Building your own dialect
177(3)
8.3 Advanced queries and ingestion
180(8)
Filtering by using a WHERE clause
180(3)
Joining data in the database
183(2)
Performing ingestion and partitioning
185(3)
Summary of advanced features
188(1)
8.4 Ingestion from Elasticsearch
188(6)
Data flow
189(1)
The New York restaurants dataset digested by Spark
189(2)
Code to ingest the restaurant dataset from Elasticsearch
191(3)
9 Advanced ingestion: finding data sources and building your own
194(28)
9.1 What is a data source?
196(1)
9.2 Benefits of a direct connection to a data source
197(2)
Temporary files
198(1)
Data quality scripts
198(1)
Data on demand
199(1)
9.3 Finding data sources at Spark Packages
199(1)
9.4 Building your own data source
199(4)
Scope of the example project
200(2)
Your data source API and options
202(1)
9.5 Behind the scenes: Building the data source itself
203(1)
9.6 Using the register file and the advertiser class
204(3)
9.7 Understanding the relationship between the data and schema
207(6)
The data source builds the relation
207(3)
Inside the relation
210(3)
9.8 Building the schema from a JavaBean
213(2)
9.9 Building the dataframe is magic with the utilities
215(5)
9.10 The other classes
220(2)
10 Ingestion through structured streaming
222(18)
10.1 What's streaming?
224(1)
10.2 Creating your first stream
225(10)
Generating a file stream
226(3)
Consuming the records
229(5)
Getting records, not lines
234(1)
10.3 Ingesting data from network streams
235(2)
10.4 Dealing with multiple streams
237(5)
10.5 Differentiating discretized and structured streaming
242(3)
PART 3 TRANSFORMING YOUR DATA
245(2)
11 Working with SQL
247(1)
11.1 Working with Spark SQL
248(3)
11.2 The difference between local and global views
251(2)
11.3 Mixing the dataframe API and Spark SQL
253(3)
11.4 Don't DELETE it!
256(2)
11.5 Going further with SQL
258(2)
12 Transforming your data
260(31)
12.1 What is data transformation?
261(1)
12.2 Process and example of record-level transformation
262(14)
Data discovery to understand the complexity
264(1)
Data mapping to draw the process
265(3)
Writing the transformation code
268(6)
Reviewing your data transformation to ensure a quality process
274(1)
What about sorting?
275(1)
Wrapping up your first Spark transformation
275(1)
12.3 Joining datasets
276(13)
A closer look at the datasets to join
276(2)
Building the list of higher education institutions per county
278(5)
Performing the joins
283(6)
12.4 Performing more transformations
289(2)
13 Transforming entire documents
291(13)
13.1 Transforming entire documents and their structure
292(9)
Flattening your JSON document
293(5)
Building nested documents for transfer and storage
298(3)
13.2 The magic behind static functions
301(1)
13.3 Performing more transformations
302(1)
13.4 Summary
303(1)
14 Extending transformations with user-defined functions
304(16)
14.1 Extending Apache Spark
305(1)
14.2 Registering and calling a UDF
306(10)
Registering the UDF with Spark
309(1)
Using the UDF with the dataframe API
310(2)
Manipulating UDFs with SQL
312(1)
Implementing the UDF
313(1)
Writing the service itself
314(2)
14.3 Using UDFs to ensure a high level of data quality
316(2)
14.4 Considering UDFs' constraints
318(2)
15 Aggregating your data
320(25)
15.1 Aggregating data with Spark
321(6)
A quick reminder on aggregations
321(3)
Performing basic aggregations with Spark
324(3)
15.2 Performing aggregations with live data
327(11)
Preparing your dataset
327(5)
Aggregating data to better understand the schools
332(6)
15.3 Building custom aggregations with UDAFs
338(7)
PART 4 GOING FURTHER
345(66)
16 Cache and checkpoint: Enhancing Spark's performances
347(26)
16.1 Caching and checkpointing can increase performance
348(13)
The usefulness of Spark caching
350(1)
The subtle effectiveness of Spark checkpointing
351(1)
Using caching and checkpointing
352(9)
16.2 Caching in action
361(10)
16.3 Going further in performance optimization
371(2)
17 Exporting data and building full data pipelines
373(22)
17.1 Exporting data
374(9)
Building a pipeline with NASA datasets
374(4)
Transforming columns to datetime
378(1)
Transforming the confidence percentage to confidence level
379(1)
Exporting the data
379(3)
Exporting the data: What really happened?
382(1)
17.2 Delta Lake: Enjoying a database close to your system
383(9)
Understanding why a database is needed
384(1)
Using Delta Lake in your data pipeline
385(4)
Consuming data from Delta Lake
389(3)
17.3 Accessing cloud storage services from Spark
392(3)
18 Exploring deployment constraints: Understanding the ecosystem
395(16)
18.1 Managing resources with YARN, Mesos, and Kubernetes
396(7)
The built-in standalone mode manages resources
397(1)
YARN manages resources in a Hadoop environment
398(1)
Mesos is a standalone resource manager
399(2)
Kubernetes orchestrates containers
401(1)
Choosing the right resource manager
402(1)
18.2 Sharing files with Spark
403(5)
Accessing the data contained in files
404(1)
Sharing files through distributed filesystems
404(1)
Accessing files on shared drives or file server
405(1)
Using file-sharing services to distribute files
406(1)
Other options for accessing files in Spark
407(1)
Hybrid solution for sharing files with Spark
408(1)
18.3 Making sure your Spark application is secure
408(3)
Securing the network components of your infrastructure
408(1)
Securing Spark's disk usage
409(2)
Appendix A Installing Eclipse 411(7)
Appendix B Installing Maven 418(4)
Appendix C Installing Git 422(2)
Appendix D Downloading the code and getting started with Eclipse 424(6)
Appendix E A history of enterprise data 430(4)
Appendix F Getting help with relational databases 434(4)
Appendix G Static functions ease your transformations 438(8)
Appendix H Maven quick cheat sheet 446(4)
Appendix I Reference for transformations and actions 450(10)
Appendix J Enough Scala 460(2)
Appendix K Installing Spark in production and a few tips 462(14)
Appendix L Reference for ingestion 476(12)
Appendix M Reference for joins 488(11)
Appendix N Installing Elasticsearch and sample data 499(6)
Appendix O Generating streaming data 505(5)
Appendix P Reference for streaming 510(10)
Appendix Q Reference for exporting data 520(8)
Appendix R Finding help when you're stuck 528(5)
Index 533
An experienced consultant and entrepreneur passionate about all things data, Jean-Georges Perrin was the first IBM Champion in France, an honor he has now held for ten consecutive years. Jean-Georges has managed many teams of software and data engineers.