Learning Spark 2nd edition [Paperback / softback]

4.32/5 (221 ratings by Goodreads)

Brooke Wenig, Tathagata Das, Jules Damji, Denny Lee

Format: Paperback / softback, 300 pages, height x width: 233x178 mm
Pub. Date: 31-Aug-2020
Publisher: O'Reilly Media
ISBN-10: 1492050040
ISBN-13: 9781492050049

Paperback / softback
Price: 75,81 €*
* the price is final i.e. no additional discount will apply
Regular price: 89,19 €
Save 15%
This book is not in stock. Book will arrive in about 2-4 weeks. Please allow another 2 weeks for shipping outside Estonia.
Delivery time 4-6 weeks

Permanent link: https://www.kriso.ee/db/9781492050049.html

Data is getting bigger, arriving faster, and coming in varied formats, and it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark.

Updated to emphasize new features in Spark 2.x, this second edition shows data engineers and scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine-learning algorithms. Through discourse, code snippets, and notebooks, you'll be able to:

Learn Python, SQL, Scala, or Java high-level APIs: DataFrames and Datasets
Peek under the hood of the Spark SQL engine to understand Spark transformations and performance
Inspect, tune, and debug your Spark operations with Spark configurations and Spark UI
Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
Perform analytics on batch and streaming data using Structured Streaming
Build reliable data pipelines with open source Delta Lake and Spark
Develop machine learning pipelines with MLlib and productionize models using MLflow
Use open source Pandas framework Koalas and Spark for data transformation and feature engineering

Foreword
Preface

1 Introduction to Apache Spark: A Unified Analytics Engine
The Genesis of Spark
Big Data and Distributed Computing at Google
Hadoop at Yahoo!
Spark's Early Years at AMPLab
What Is Apache Spark?
Speed
Ease of Use
Modularity
Extensibility
Unified Analytics
Apache Spark Components as a Unified Stack
Apache Spark's Distributed Execution
The Developer's Experience
Who Uses Spark, and for What?
Community Adoption and Expansion

2 Downloading Apache Spark and Getting Started
Step 1: Downloading Apache Spark
Spark's Directories and Files
Step 2: Using the Scala or PySpark Shell
Using the Local Machine
Step 3: Understanding Spark Application Concepts
Spark Application and SparkSession
Spark Jobs
Spark Stages
Spark Tasks
Transformations, Actions, and Lazy Evaluation
Narrow and Wide Transformations
The Spark UI
Your First Standalone Application
Counting M&Ms for the Cookie Monster
Building Standalone Applications in Scala
Summary

3 Apache Spark's Structured APIs
Spark: What's Underneath an RDD?
Structuring Spark
Key Merits and Benefits
The DataFrame API
Spark's Basic Data Types
Spark's Structured and Complex Data Types
Schemas and Creating DataFrames
Columns and Expressions
Rows
Common DataFrame Operations
End-to-End DataFrame Example
The Dataset API
Typed Objects, Untyped Objects, and Generic Rows
Creating Datasets
Dataset Operations
End-to-End Dataset Example
DataFrames Versus Datasets
When to Use RDDs
Spark SQL and the Underlying Engine
The Catalyst Optimizer
Summary

4 Spark SQL and DataFrames: Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications
Basic Query Examples
SQL Tables and Views
Managed Versus Unmanaged Tables
Creating SQL Databases and Tables
Creating Views
Viewing the Metadata
Caching SQL Tables
Reading Tables into DataFrames
Data Sources for DataFrames and SQL Tables
DataFrameReader
DataFrameWriter
Parquet
JSON
CSV
Avro
ORC
Images
Binary Files
Summary

5 Spark SQL and DataFrames: Interacting with External Data Sources
Spark SQL and Apache Hive
User-Defined Functions
Querying with the Spark SQL Shell, Beeline, and Tableau
Using the Spark SQL Shell
Working with Beeline
Working with Tableau
External Data Sources
JDBC and SQL Databases
PostgreSQL
MySQL
Azure Cosmos DB
MS SQL Server
Other External Sources
Higher-Order Functions in DataFrames and Spark SQL
Option 1: Explode and Collect
Option 2: User-Defined Function
Built-in Functions for Complex Data Types
Higher-Order Functions
Common DataFrames and Spark SQL Operations
Unions
Joins
Windowing
Modifications
Summary

6 Spark SQL and Datasets
Single API for Java and Scala
Scala Case Classes and JavaBeans for Datasets
Working with Datasets
Creating Sample Data
Transforming Sample Data
Memory Management for Datasets and DataFrames
Dataset Encoders
Spark's Internal Format Versus Java Object Format
Serialization and Deserialization (SerDe)
Costs of Using Datasets
Strategies to Mitigate Costs
Summary

7 Optimizing and Tuning Spark Applications
Optimizing and Tuning Spark for Efficiency
Viewing and Setting Apache Spark Configurations
Scaling Spark for Large Workloads
Caching and Persistence of Data
DataFrame.cache()
DataFrame.persist()
When to Cache and Persist
When Not to Cache and Persist
A Family of Spark Joins
Broadcast Hash Join
Shuffle Sort Merge Join
Inspecting the Spark UI
Journey Through the Spark UI Tabs
Summary

8 Structured Streaming
Evolution of the Apache Spark Stream Processing Engine
The Advent of Micro-Batch Stream Processing
Lessons Learned from Spark Streaming (DStreams)
The Philosophy of Structured Streaming
The Programming Model of Structured Streaming
The Fundamentals of a Structured Streaming Query
Five Steps to Define a Streaming Query
Under the Hood of an Active Streaming Query
Recovering from Failures with Exactly-Once Guarantees
Monitoring an Active Query
Streaming Data Sources and Sinks
Files
Apache Kafka
Custom Streaming Sources and Sinks
Data Transformations
Incremental Execution and Streaming State
Stateless Transformations
Stateful Transformations
Stateful Streaming Aggregations
Aggregations Not Based on Time
Aggregations with Event-Time Windows
Streaming Joins
Stream-Static Joins
Stream-Stream Joins
Arbitrary Stateful Computations
Modeling Arbitrary Stateful Operations with mapGroupsWithState()
Using Timeouts to Manage Inactive Groups
Generalization with flatMapGroupsWithState()
Performance Tuning
Summary

9 Building Reliable Data Lakes with Apache Spark
The Importance of an Optimal Storage Solution
Databases
A Brief Introduction to Databases
Reading from and Writing to Databases Using Apache Spark
Limitations of Databases
Data Lakes
A Brief Introduction to Data Lakes
Reading from and Writing to Data Lakes Using Apache Spark
Limitations of Data Lakes
Lakehouses: The Next Step in the Evolution of Storage Solutions
Apache Hudi
Apache Iceberg
Delta Lake
Building Lakehouses with Apache Spark and Delta Lake
Configuring Apache Spark with Delta Lake
Loading Data into a Delta Lake Table
Loading Data Streams into a Delta Lake Table
Enforcing Schema on Write to Prevent Data Corruption
Evolving Schemas to Accommodate Changing Data
Transforming Existing Data
Auditing Data Changes with Operation History
Querying Previous Snapshots of a Table with Time Travel
Summary

10 Machine Learning with MLlib
What Is Machine Learning?
Supervised Learning
Unsupervised Learning
Why Spark for Machine Learning?
Designing Machine Learning Pipelines
Data Ingestion and Exploration
Creating Training and Test Data Sets
Preparing Features with Transformers
Understanding Linear Regression
Using Estimators to Build Models
Creating a Pipeline
Evaluating Models
Saving and Loading Models
Hyperparameter Tuning
Tree-Based Models
k-Fold Cross-Validation
Optimizing Pipelines
Summary

11 Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Model Management
MLflow
Model Deployment Options with MLlib
Batch
Streaming
Model Export Patterns for Real-Time Inference
Leveraging Spark for Non-MLlib Models
Pandas UDFs
Spark for Distributed Hyperparameter Tuning
Summary

12 Epilogue: Apache Spark 3.0
Spark Core and Spark SQL
Dynamic Partition Pruning
Adaptive Query Execution
SQL Join Hints
Catalog Plugin API and DataSourceV2
Accelerator-Aware Scheduler
Structured Streaming
PySpark, Pandas UDFs, and Pandas Function APIs
Redesigned Pandas UDFs with Python Type Hints
Iterator Support in Pandas UDFs
New Pandas Function APIs
Changed Functionality
Languages Supported and Deprecated
Changes to the DataFrame and Dataset APIs
DataFrame and SQL Explain Commands
Summary

Index

Jules S. Damji is an Apache Spark Community and Developer Advocate at Databricks. He is a hands-on developer with over 20 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, VeriSign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in Computer Science and an MA in Political Advocacy and Communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.

Denny Lee is a Technical Product Manager at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include distributed systems, Apache Spark, deep learning, machine learning, and genomics.

Brooke Wenig is the Machine Learning Practice Lead at Databricks. She guides and assists customers in implementing machine learning pipelines and teaches Distributed Machine Learning & Deep Learning courses. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.

Tathagata Das is an Apache Spark committer and a member of the PMC. He's the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a grad student at UC Berkeley in the AMPLab, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.
