Foreword  xiii
Preface  xv

1 Introduction to Spark and PySpark  1
Why Spark for Data Analytics  2
Launching the PySpark Shell  25
Creating an RDD from a Collection  26
Aggregating and Merging Values of Keys  26
Filtering an RDD's Elements  28
Aggregating Values for Similar Keys  29
ETL Example with DataFrames  30

2 Transformations in Action  35
The DNA Base Count Example  36
The DNA Base Count Problem  38
DNA Base Count Solution 1  40
Step 1: Create an RDD[String] from the Input  41
Step 2: Define a Mapper Function  42
Step 3: Find the Frequencies of DNA Letters  44
Pros and Cons of Solution 1  47
DNA Base Count Solution 2  47
Step 1: Create an RDD[String] from the Input  49
Step 2: Define a Mapper Function  49
Step 3: Find the Frequencies of DNA Letters  51
Pros and Cons of Solution 2  52
DNA Base Count Solution 3  52
The mapPartitions() Transformation  52
Step 1: Create an RDD[String] from the Input  60
Step 2: Define a Function to Handle a Partition  60
Step 3: Apply the Custom Function to Each Partition  62
Pros and Cons of Solution 3  64

3 Mapper Transformations  65
Data Abstractions and Mappers  65
What Are Transformations?  67
The flatMap() Transformation  80
Apply flatMap() to a DataFrame  86
The mapValues() Transformation  89
The flatMapValues() Transformation  90
The mapPartitions() Transformation  91
Handling Empty Partitions  95
DataFrames and mapPartitions() Transformation  99

4 Reductions in Spark  103
Reduction Transformations  105
Solving with reduceByKey()  111
Solving with groupByKey()  112
Solving with aggregateByKey()  112
Solving with combineByKey()  113
Monoid and Non-Monoid Examples  117
The aggregateByKey() Transformation  122
First Solution Using aggregateByKey()  124
Second Solution Using aggregateByKey()  127
Complete PySpark Solution Using groupByKey()  129
Complete PySpark Solution Using reduceByKey()  131
Complete PySpark Solution Using combineByKey()  134
The Shuffle Step in Reductions  137
Shuffle Step for groupByKey()  138
Shuffle Step for reduceByKey()  139

Part II Working with Data

5 Partitioning Data  145
Introduction to Partitions  146
Physical Partitioning for SQL Queries  153
Physical Partitioning of Data in Spark  156
Partition as Parquet Format  157
How to Query Partitioned Data  158

6 Graph Algorithms  161
GraphFrames Functions and Attributes  168

7 Interacting with External Data Sources  203
Writing a DataFrame to a Database  213
Reading and Writing CSV Files  220
Reading and Writing JSON Files  225
Reading from and Writing to Amazon S3  228
Reading and Writing Hadoop Files  232
Reading Hadoop Text Files  233
Writing Hadoop Text Files  236
Reading and Writing HDFS SequenceFiles  238
Reading and Writing Parquet Files  239
Reading and Writing Avro Files  242
Reading from and Writing to MS SQL Server  243
Reading from MS SQL Server  244
Creating a DataFrame from Images  244

8 Ranking Algorithms  247
Calculation of the Rank Product  249
PageRank's Iterative Computation  259
Custom PageRank in PySpark Using RDDs  261
Custom PageRank in PySpark Using an Adjacency Matrix  263
PageRank with GraphFrames  266

Part III Data Design Patterns

9 Classic Data Design Patterns  271
Flat Mapper functionality  277
Input-Multiple-Maps-Reduce-Output  287
Input-Map-Combiner-Reduce-Output  291
Input-MapPartitions-Reduce-Output  294

10 Practical Data Design Patterns  303
Basic MapReduce Algorithm  305
In-Mapper Combining per Record  307
In-Mapper Combining per Partition  309
Solution 1: Classic MapReduce  319
Solution 2  319
Solution 3: Spark's mapPartitions()  320
The Composite Pattern and Monoids  323
Monoidal and Non-Monoidal Examples  328
Non-Monoid MapReduce Example  331
PySpark Implementation of Monoidal Mean  334
Conclusion on Using Monoids  338

11 Join Design Patterns  345
Introduction to the Join Operation  345
Implementation in PySpark  350
Map-Side Join Using DataFrames  355
Step 1: Create Cache for Airports  357
Step 2: Create Cache for Airlines  357
Step 3: Create Facts Table  358
Step 4: Apply Map-Side Join  358
Efficient Joins Using Bloom Filters  359
Introduction to Bloom Filters  359
A Simple Bloom Filter Example  361
Using Bloom Filters in PySpark  362

12 Feature Engineering in PySpark  365
Introduction to Feature Engineering  366
Tokenization with a Pipeline  377
Scaling a Column Using a Pipeline  382
Using MinMaxScaler on Multiple Columns  383
Normalization Using Normalizer  384
Applying StringIndexer to a Single Column  385
Applying StringIndexer to Several Columns  386
Summary  403

Index  405