Customer Support: +372 7440010

Help | New account | Log In

Spark in Action, Second Edition 2nd edition [Paperback / softback]

3.92/5 (49 ratings by Goodreads)

Jean-Georges Perrin

Format: Paperback / softback, 576 pages, height x width x depth: 235x185x32 mm, weight: 1040 g
Pub. Date: 22-Jun-2020
Publisher: Manning Publications
ISBN-10: 1617295523
ISBN-13: 9781617295522

Other books in subject:

Machine learning
Data analysis: general - (Currently in stock: 1 items)

Paperback / softback
Price: 67,59 €
This book is not in stock. Book will arrive in about 2-4 weeks. Please allow another 2 weeks for shipping outside Estonia.
Quantity:
- - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  - 8
  - 9
  - 10
Add to basket
Delivery time 4-6 weeks
Add to Wishlist

Format: Paperback / softback, 576 pages, height x width x depth: 235x185x32 mm, weight: 1040 g
Pub. Date: 22-Jun-2020
Publisher: Manning Publications
ISBN-10: 1617295523
ISBN-13: 9781617295522

Other books in subject:

Machine learning
Data analysis: general - (Currently in stock: 1 items)

Permanent link: https://www.kriso.ee/db/9781617295522.html

Keywords:

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you'll learn to take advantage of Spark's core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning.

Unlike many Spark books written for data scientists, Spark in Action, Second Edition is designed for data engineers and software engineers who want to master data processing using Spark without having to learn a complex new ecosystem of languages and tools. You'll instead learn to apply your existing Java and SQL skills to take on practical, real-world challenges.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

Summary
The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you&;ll learn to take advantage of Spark&;s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark&;s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.

About the book
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you&;ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you&;ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

What's inside

    Writing Spark applications in Java
    Spark application architecture
    Ingestion through files, databases, streaming, and Elasticsearch
    Querying distributed datasets with Spark SQL

About the reader
This book does not assume previous experience with Spark, Scala, or Hadoop.

About the author
Jean-Georges Perrin is an experienced data and software architect. He is France&;s first IBM Champion and has been honored for 12 consecutive years.

Table of Contents

PART 1 - THE THEORY CRIPPLED BY AWESOME EXAMPLES

1 So, what is Spark, anyway?

2 Architecture and flow

3 The majestic role of the dataframe

4 Fundamentally lazy

5 Building a simple app for deployment

6 Deploying your simple app

PART 2 - INGESTION

7 Ingestion from files

8 Ingestion from databases

9 Advanced ingestion: finding data sources and building

your own

10 Ingestion through structured streaming

PART 3 - TRANSFORMING YOUR DATA

11 Working with SQL

12 Transforming your data

13 Transforming entire documents

14 Extending transformations with user-defined functions

15 Aggregating your data

PART 4 - GOING FURTHER

16 Cache and checkpoint: Enhancing Spark&;s performances

17 Exporting data and building full data pipelines

18 Exploring deployment

Foreword

xiii

Preface

Acknowledgments

xvii

About This Book

xix

About The Author

xxv

About The Cover Illustration

xxvi

PART 1 THE THEORY CRIPPLED BY AWESOME EXAMPLES

(136)

1 So, what is Spark, anyway?

(16)

1.1 The big picture: What Spark is and what it does

(4)

What is Spark?

(2)

The four pillars of mana

(2)

1.2 How can you use Spark?

(2)

Spark in a data processing/engineering scenario

(1)

Spark in a data science scenario

(1)

1.3 What can you do with Spark?

(2)

Spark predicts restaurant quality at NC eateries

(1)

Spark allows fast data transfer for Lumeris

(1)

Spark analyzes equipment logs for CERN

(1)

Other use cases

(1)

1.4 Why you will love the dataframe

(2)

The dataframe from a Java perspective

(1)

The dataframe from an RDBMS perspective

(1)

A graphical representation of the dataframe

(1)

1.5 Your first example

(5)

Recommended software

(1)

Downloading the code

(2)

Running your first application 15' Your first code

(2)

2 Architecture and flow

(14)

2.1 Building your mental model

(1)

2.2 Usingjava code to build your mental model

(2)

2.3 Walking through your application

(10)

Connecting to a master

(1)

Loading, or ingesting, the CSV file

(3)

Transforming your data

(1)

Saving the work done in your dataframe to a database

(4)

3 The majestic role of the dataframe

(35)

3.1 The essential role of the dataframe in Spark

(3)

Organization of a dataframe

(1)

Immutability is not a swear word

(1)

3.2 Using dataframes through examples

(20)

A dataframe after a simple CSV ingestion

(5)

Data is stored in partitions

(1)

Digging in the schema

(1)

A dataframe after a JSON ingestion

(6)

Combining two dataframes

(5)

3.3 The dataframe is a Dataset<Row>

(9)

Reusing your POJOs

(1)

Creating a dataset of strings

(1)

Converting back and forth

(6)

3.4 Dataframe's ancestor: the RDD

(2)

4 Fundamentally lazy

(22)

4.1 A real-life example of efficient laziness

(1)

4.2 A Spark example of efficient laziness

(13)

Looking at the results of transformations and actions 70' The transformation process, step by step

(2)

The code behind the transformation/action process

(3)

The mystery behind the creation of 7 million datapoints in 182 ms

(2)

The mystery behind the timing of actions

(4)

4.3 Comparing to RDBMS and traditional applications

(3)

Working with the teen birth rates dataset

(1)

Analyzing differences between a traditional app and a Spark app

(2)

4.4 Spark is amazing for data-focused applications

(1)

4.5 Catalyst is your app catalyzer

(4)

5 Building a simple app for deployment

(24)

5.1 An ingestionless example

(11)

Calculating p

(2)

The code to approximate p

(6)

What are lambda functions in Java?

(2)

Approximating p by using lambda functions

101

(1)

5.2 Interacting with Spark

102

(12)

Local mode

103

(1)

Cluster mode

104

(3)

Interactive mode in Scala and Python

107

(7)

6 Deploying your simple app

114

(23)

6.1 Beyond the example: The role of the components

116

(5)

Quick overview of the components and their interactions

116

(4)

Troubleshooting tips for the Spark architecture

120

(1)

Going further

121

(1)

6.2 Building a cluster

121

(5)

Building a cluster that works for you

122

(1)

Setting up the environment

123

(3)

6.3 Building your application to run on the cluster

126

(6)

Building your application's uber JAR

127

(2)

Building your application by using Git and Maven

129

(3)

6.4 Running your application on the cluster

132

(5)

Submitting the uber JAR

132

(1)

Running the application

133

(1)

Analyzing the Spark user interface

133

(4)

PART 2 INGESTION

137

(103)

7 Ingestion from files

139

(29)

7.1 Common behaviors of parsers

141

(1)

7.2 Complex ingestion from CSV

141

(3)

Desired output

142

(1)

Code

143

(1)

7.3 Ingesting a CSV with a known schema

144

(2)

Desired output

145

(1)

Code

145

(1)

7.4 Ingesting a JSON file

146

(4)

Desired output

148

(1)

Code

149

(1)

7.5 Ingesting a multiline JSON file

150

(3)

Desired output

151

(1)

Code

152

(1)

7.6 Ingesting an XML file

153

(4)

Desired output

155

(1)

Code

155

(2)

7.7 Ingesting a text file

157

(2)

Desired output

158

(1)

Code

158

(1)

7.8 File formats for big data

159

(3)

The problem with traditional file formats

159

(1)

Avro is a schema-based serialization format

160

(1)

ORC is a columnar storage format

161

(1)

Parquet is also a columnar storage format

161

(1)

Comparing Avro, ORC, and Parquet

161

(1)

7.9 Ingesting Avro, ORC, and Parquet files

162

(6)

Ingesting Avro

162

(2)

Ingesting ORC

164

(1)

Ingesting Parquet

165

(2)

Reference table for ingesting Avro, ORC, or Parquet

167

(1)

8 Ingestion from databases

168

(26)

8.1 Ingestion from relational databases

169

(7)

Database connection checklist

170

(1)

Understanding the data used in the examples

170

(2)

Desired output

172

(1)

Code

173

(2)

Alternative code

175

(1)

8.2 The role of the dialect

176

(4)

What is a dialect, anyway?

177

(1)

JDBC dialects provided with Spark

177

(1)

Building your own dialect

177

(3)

8.3 Advanced queries and ingestion

180

(8)

Filtering by using a WHERE clause

180

(3)

Joining data in the database

183

(2)

Performing Ingestion and partitioning

185

(3)

Summary of advanced features

188

(1)

8.4 Ingestion from Elasticsearch

188

(6)

Data flow

189

(1)

The New York restaurants dataset digested by Spark

189

(2)

Code to ingest the restaurant dataset from Elasticsearch

191

(3)

9 Advanced ingestion: finding data sources and building your own

194

(28)

9.1 What is a data source?

196

(1)

9.2 Benefits of a direct connection to a data source

197

(2)

Temporary files

198

(1)

Data quality scripts

198

(1)

Data on demand

199

(1)

9.3 Finding data sources at Spark Packages

199

(1)

9.4 Building your own data source

199

(4)

Scope of the example project

200

(2)

Your data source API and options

202

(1)

9.5 Behind the scenes: Building the data source itself

203

(1)

9.6 Using the register file and the advertiser class

204

(3)

9.7 Understanding the relationship between the data and schema

207

(6)

The data source builds the relation

207

(3)

Inside the relation

210

(3)

9.8 Building the schema from aJavaBean

213

(2)

9.9 Building the dataframe is magic with the utilities

215

(5)

9.10 The other classes

220

(2)

10 Ingestion through structured streaming

222

(18)

10.1 what's streaming?

224

(1)

10.2 Creating your first stream

225

(10)

Generating a file stream

226

(3)

Consuming the records

229

(5)

Getting records, not lines

234

(1)

10.3 Ingesting data from network streams

235

(2)

10.4 Dealing with multiple streams

237

(5)

10.5 Differentiating discretized and structured streaming

242

(3)

PART 3 TRANSFORMING YOUR DATA

245

(2)

11 Working with SQL

247

(1)

11.1 Working with Spark SQL

248

(3)

11.2 The difference between local and global views

251

(2)

11.3 Mixing the dataframe API and Spark SQL

253

(3)

11.4 Don't DELETE it!

256

(2)

11.5 Going further with SQL

258

(2)

12 Transforming your data

260

(31)

12.1 What is data transformation?

261

(1)

12.2 Process and example of record-level transformation

262

(14)

Data discovery to understand the complexity

264

(1)

Data mapping to draw the process

265

(3)

Writing the transformation code

268

(6)

Reviewing your data transformation to ensure a quality process

274

(1)

What about sorting1?

275

(1)

Wrapping up your first Spark transformation

275

(1)

12.3 Joining datasets

276

(13)

A closer look at the datasets to join

276

(2)

Building the list of higher education institutions per county

278

(5)

Performing the joins

283

(6)

12.4 Performing more transformations

289

(2)

13 Transforming entire documents

291

(13)

13.1 Transforming entire documents and their structure

292

(9)

Flattening your JSON document

293

(5)

Building nested documents for transfer and storage

298

(3)

13.2 The magic behind static functions

301

(1)

13.3 Performing more transformations

302

(1)

13.4 Summary

303

(1)

14 Extending transformations with user-defined functions

304

(16)

14.1 Extending Apache Spark

305

(1)

14.2 Registering and calling a UDF

306

(10)

Registering the UDF with Spark

309

(1)

Using the UDF with the dataframe API

310

(2)

Manipulating UDFs xvith SQL

312

(1)

Implementing the UDF

313

(1)

Writing the service itself

314

(2)

14.3 Using UDFs to ensure a high level of data quality

316

(2)

14.4 Considering UDFs' constraints

318

(2)

15 Aggregating your data

320

(25)

15.1 Aggregating data with Spark

321

(6)

A quick reminder on aggregations

321

(3)

Performing basic aggregations with Spark

324

(3)

15.2 Performing aggregations with live data

327

(11)

Preparing your dataset

327

(5)

Aggregating data to better understand the schools

332

(6)

15.3 Building custom aggregations with UDAFs

338

(7)

PART 4 GOING FURTHER

345

(66)

16 Cache and checkpoint: Enhancing Spark's performances

347

(26)

16.1 Caching and checkpointing can increase performance

348

(13)

The usefulness of Spark caching

350

(1)

The subtle effectiveness of Spark checkpointing

351

(1)

Using caching and checkpointing

352

(9)

16.2 Caching in action

361

(10)

16.3 Going further in performance optimization

371

(2)

17 Exporting data and building full data pipelines

373

(22)

17.1 Exporting data

374

(9)

Building a pipeline with NASA datasets

374

(4)

Transforming columns to datetime

378

(1)

Transforming the confidence percentage to confidence level

379

(1)

Exporting the data

379

(3)

Exporting the data: What really happened?

382

(1)

17.2 Delta Lake: Enjoying a database close to your system

383

(9)

Understanding why a database is needed

384

(1)

Using Delta Lake in your data pipeline

385

(4)

Consuming data from Delta Lake

389

(3)

17.3 Accessing cloud storage services from Spark

392

(3)

18 Exploring deployment constraints: Understanding the JL O ecosystem

395

(16)

18.1 Managing resources with YARN, Mesos, and Kubernetes

396

(7)

The built-in standalone mode manages resources

397

(1)

YARN manages resources in a Hadoop environment

398

(1)

Mesos is a standalone resource manager

399

(2)

Kubernetes orchestrates containers

401

(1)

Choosing the right resource manager

402

(1)

18.2 Sharing files with Spark

403

(5)

Accessing the data contained in files

404

(1)

Sharing files through distributed filesystems

404

(1)

Accessing files on shared drives orfile server

405

(1)

Using file-sharing services to distribute files

406

(1)

Other options for accessing files in Spark

407

(1)

Hybrid solution for sharing files with Spark

408

(1)

18.3 Making sure your Spark application is secure

408

(3)

Securing the network components of your infrastructure

408

(1)

Securing Spark's disk usage

409

(2)

Appendix A Installing Eclipse

411

(7)

Appendix B Installing Maven

418

(4)

Appendix C Installing Git

422

(2)

Appendix D Downloading the code and getting started with Eclipse

424

(6)

Appendix E A history of enterprise data

430

(4)

Appendix F Getting help with relational databases

434

(4)

Appendix G Static functions ease your transformations

438

(8)

Appendix H Maven quick cheat sheet

446

(4)

Appendix I Reference for transformations and actions

450

(10)

Appendix J Enough Scala

460

(2)

Appendix K Installing Spark in production and a few tips

462

(14)

Appendix L Reference for ingestion

476

(12)

Appendix M Reference for joins

488

(11)

Appendix N Installing Elasticsearch and sample data

499

(6)

Appendix O (Generating streaming data

505

(5)

Appendix P Reference for streaming

510

(10)

Appendix Q Reference for exporting data

520

(8)

Appendix R Finding help when you're stuck

528

(5)

Index

533

An experienced consultant and entrepreneur passionate about all things data, Jean-Georges Perrin was the first IBM Champion in France, an honor hes now held for ten consecutive years. Jean-Georges has managed many teams of software and data engineers.

Spark in Action, Second Edition 2nd edition [Paperback / softback]

Account & settings

Search

Search database

Refine By

Subjects English Books

Choose shopping cart