Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood [Paperback]

  • Format: Paperback / softback, 416 pages, height x width x thickness: 231x188x25 mm, weight: 590 g
  • Publication date: 15-Nov-2021
  • Publisher: John Wiley & Sons Inc
  • ISBN-10: 1119713021
  • ISBN-13: 9781119713029
  • Price: 51,15 €*
  • * this price is final, i.e., no further discounts apply
  • Regular price: 60,18 €
  • You save 15%
  • Free shipping
  • Delivery from the publisher takes approximately 2-4 weeks

There is an ever-increasing need to store data, process it, and incorporate the resulting knowledge into the everyday business operations of companies. Before big data systems, there were high-performance systems designed to do large calculations. Around the time big data became popular, high-performance computing systems were mature enough to support the scientific community, but they were not ready for the enterprise needs of data analytics. Because of this lack of system support at the time, a large number of systems were created to store and process data. These systems were built according to different design principles; some of them thrived through the years, while others did not succeed. Because of the diverse nature of the systems and tools available for data analytics, there is a need to understand them and their applications from a theoretical perspective. These systems mask the underlying details, and users work with them without knowing how they work. This is fine for simple applications, but when developing more complex applications that need to scale, users find themselves without the foundational knowledge required to reason about issues. That knowledge is currently hidden inside the systems and in research papers.

The underlying principles behind data processing systems originate from the parallel and distributed computing paradigms. Among the many systems and APIs for data processing, the same fundamental ideas are used under the hood with slightly different variations. We can break down data analytics systems according to these principles and study them to understand the inner workings of applications.

This book defines the foundational components of large-scale, distributed data processing systems and goes into detail on each, independently of specific frameworks. It draws examples from current systems to explain how these principles are used in practice. Major design decisions around these foundational components define a system's performance, the types of applications it supports, and its usability. One of the goals of the book is to explain these differences so that readers can make informed decisions when developing applications. Further, it helps readers acquire in-depth knowledge and recognize problems in their applications, such as performance issues, distributed operation issues, and fault tolerance aspects.

The book also draws on state-of-the-art research, where appropriate, to discuss ideas and the future of data analytics tools.

Introduction xxvii
Chapter 1 Data Intensive Applications 1
Anatomy of a Data-Intensive Application 1
A Histogram Example 2
Program 2
Process Management 3
Communication 4
Execution 5
Data Structures 6
Putting It Together 6
Application 6
Resource Management 6
Messaging 7
Data Structures 7
Tasks and Execution 8
Fault Tolerance 8
Remote Execution 8
Parallel Applications 9
Serial Applications 9
Lloyd's K-Means Algorithm 9
Parallelizing Algorithms 11
Decomposition 11
Task Assignment 12
Orchestration 12
Mapping 13
K-Means Algorithm 13
Parallel and Distributed Computing 15
Memory Abstractions 16
Shared Memory 16
Distributed Memory 18
Hybrid (Shared + Distributed) Memory 20
Partitioned Global Address Space Memory 21
Application Classes and Frameworks 22
Parallel Interaction Patterns 22
Pleasingly Parallel 23
Dataflow 23
Iterative 23
Irregular 23
Data Abstractions 24
Data-Intensive Frameworks 24
Components 24
Workflows 25
An Example 25
What Makes It Difficult? 26
Developing Applications 27
Concurrency 27
Data Partitioning 28
Debugging 28
Diverse Environments 28
Computer Networks 29
Synchronization 29
Thread Synchronization 29
Data Synchronization 30
Ordering of Events 31
Faults 31
Consensus 31
Summary 32
References 32
Chapter 2 Data and Storage 35
Storage Systems 35
Storage for Distributed Systems 36
Direct-Attached Storage 37
Storage Area Network 37
Network-Attached Storage 38
DAS or SAN or NAS? 38
Storage Abstractions 39
Block Storage 39
File Systems 40
Object Storage 41
Data Formats 41
XML 42
JSON 43
CSV 44
Apache Parquet 45
Apache Avro 47
Avro Data Definitions (Schema) 48
Code Generation 49
Without Code Generation 49
Avro File 49
Schema Evolution 49
Protocol Buffers, Flat Buffers, and Thrift 50
Data Replication 51
Synchronous and Asynchronous Replication 52
Single-Leader and Multileader Replication 52
Data Locality 53
Disadvantages of Replication 54
Data Partitioning 54
Vertical Partitioning 55
Horizontal Partitioning (Sharding) 55
Hybrid Partitioning 56
Considerations for Partitioning 57
NoSQL Databases 58
Data Models 58
Key-Value Databases 58
Document Databases 59
Wide Column Databases 59
Graph Databases 59
CAP Theorem 60
Message Queuing 61
Message Processing Guarantees 63
Durability of Messages 64
Acknowledgments 64
Storage First Brokers and Transient Brokers 65
Summary 66
References 66
Chapter 3 Computing Resources 69
A Demonstration 71
Computer Clusters 72
Anatomy of a Computer Cluster 73
Data Analytics in Clusters 74
Dedicated Clusters 76
Classic Parallel Systems 76
Big Data Systems 77
Shared Clusters 79
OpenMPI on a Slurm Cluster 79
Spark on a Yarn Cluster 80
Distributed Application Life Cycle 80
Life Cycle Steps 80
Step 1 Preparation of the Job Package 81
Step 2 Resource Acquisition 81
Step 3 Distributing the Application (Job) Artifacts 81
Step 4 Bootstrapping the Distributed Environment 82
Step 5 Monitoring 82
Step 6 Termination 83
Computing Resources 83
Data Centers 83
Physical Machines 85
Network 85
Virtual Machines 87
Containers 87
Processor, Random Access Memory, and Cache 88
Cache 89
Multiple Processors in a Computer 90
Nonuniform Memory Access 90
Uniform Memory Access 91
Hard Disk 92
GPUs 92
Mapping Resources to Applications 92
Cluster Resource Managers 93
Kubernetes 94
Kubernetes Architecture 94
Kubernetes Application Concepts 96
Data-Intensive Applications on Kubernetes 96
Slurm 98
Yarn 99
Job Scheduling 99
Scheduling Policy 101
Objective Functions 101
Throughput and Latency 101
Priorities 102
Lowering Distance Among the Processes 102
Data Locality 102
Completion Deadline 102
Algorithms 103
First in First Out 103
Gang Scheduling 103
List Scheduling 103
Backfill Scheduling 104
Summary 104
References 104
Chapter 4 Data Structures 107
Virtual Memory 108
Paging and TLB 109
Cache 111
The Need for Data Structures 112
Cache and Memory Layout 112
Memory Fragmentation 114
Data Transfer 115
Data Transfer Between Frameworks 115
Cross-Language Data Transfer 115
Object and Text Data 116
Serialization 116
Vectors and Matrices 117
1D Vectors 118
Matrices 118
Row-Major and Column-Major Formats 119
N-Dimensional Arrays/Tensors 122
NumPy 123
Memory Representation 125
K-means with NumPy 126
Sparse Matrices 127
Table 128
Table Formats 129
Column Data Format 129
Row Data Format 130
Apache Arrow 130
Arrow Data Format 131
Primitive Types 131
Variable-Length Data 132
Arrow Serialization 133
Arrow Example 133
Pandas DataFrame 134
Column vs. Row Tables 136
Summary 136
References 136
Chapter 5 Programming Models 139
Introduction 139
Parallel Programming Models 140
Parallel Process Interaction 140
Problem Decomposition 140
Data Structures 140
Data Structures and Operations 141
Data Types 141
Local Operations 143
Distributed Operations 143
Array 144
Tensor 145
Indexing 145
Slicing 146
Broadcasting 146
Table 146
Graph Data 148
Message Passing Model 150
Model 151
Message Passing Frameworks 151
Message Passing Interface 151
Bulk Synchronous Parallel 153
K-Means 154
Distributed Data Model 157
Eager Model 157
Dataflow Model 158
Data Frames, Datasets, and Tables 159
Input and Output 160
Task Graphs (Dataflow Graphs) 160
Model 161
User Program to Task Graph 161
Tasks and Functions 162
Source Task 162
Compute Task 163
Implicit vs. Explicit Parallel Models 163
Remote Execution 163
Components 164
Batch Dataflow 165
Data Abstractions 165
Table Abstraction 165
Matrix/Tensors 165
Functions 166
Source 166
Compute 167
Sink 168
An Example 168
Caching State 169
Evaluation Strategy 170
Lazy Evaluation 171
Eager Evaluation 171
Iterative Computations 172
DOALL Parallel 172
DOACROSS Parallel 172
Pipeline Parallel 173
Task Graph Models for Iterative Computations 173
K-Means Algorithm 174
Streaming Dataflow 176
Data Abstractions 177
Streams 177
Distributed Operations 178
Streaming Functions 178
Sources 178
Compute 179
Sink 179
An Example 179
Windowing 180
Windowing Strategies 181
Operations on Windows 182
Handling Late Events 182
SQL 182
Queries 183
Summary 184
References 184
Chapter 6 Messaging 187
Network Services 188
TCP/IP 188
RDMA 189
Messaging for Data Analytics 189
Anatomy of a Message 190
Data Packing 190
Protocol 191
Message Types 192
Control Messages 192
External Data Sources 192
Data Transfer Messages 192
Distributed Operations 194
How Are They Used? 194
Task Graph 194
Parallel Processes 195
Anatomy of a Distributed Operation 198
Data Abstractions 198
Distributed Operation API 198
Streaming and Batch Operations 199
Streaming Operations 199
Batch Operations 199
Distributed Operations on Arrays 200
Broadcast 200
Reduce and AllReduce 201
Gather and AllGather 202
Scatter 203
AllToAll 204
Optimized Operations 204
Broadcast 205
Reduce 206
AllReduce 206
Gather and AllGather Collective Algorithms 208
Scatter and AllToAll Collective Algorithms 208
Distributed Operations on Tables 209
Shuffle 209
Partitioning Data 211
Handling Large Data 212
Fetch-Based Algorithm (Asynchronous Algorithm) 213
Distributed Synchronization Algorithm 214
GroupBy 214
Aggregate 215
Join 216
Join Algorithms 219
Distributed Joins 221
Performance of Joins 223
More Operations 223
Advanced Topics 224
Data Packing 224
Memory Considerations 224
Message Coalescing 224
Compression 225
Stragglers 225
Nonblocking vs. Blocking Operations 225
Blocking Operations 226
Nonblocking Operations 226
Summary 227
References 227
Chapter 7 Parallel Tasks 229
CPUs 229
Cache 229
False Sharing 230
Vectorization 231
Threads and Processes 234
Concurrency and Parallelism 234
Context Switches and Scheduling 234
Mutual Exclusion 235
User-Level Threads 236
Process Affinity 236
NUMA-Aware Programming 237
Accelerators 237
Task Execution 238
Scheduling 240
Static Scheduling 240
Dynamic Scheduling 240
Loosely Synchronous and Asynchronous Execution 241
Loosely Synchronous Parallel System 242
Asynchronous Parallel System (Fully Distributed) 243
Actor Model 244
Actor 244
Asynchronous Messages 244
Actor Frameworks 245
Execution Models 245
Process Model 246
Thread Model 246
Remote Execution 246
Tasks for Data Analytics 248
SPMD and MPMD Execution 248
Batch Tasks 249
Data Partitions 249
Operations 251
Task Graph Scheduling 253
Threads, CPU Cores, and Partitions 254
Data Locality 255
Execution 257
Streaming Execution 257
State 257
Immutable Data 258
State in Driver 258
Distributed State 259
Streaming Tasks 259
Streams and Data Partitioning 260
Partitions 260
Operations 261
Scheduling 262
Uniform Resources 263
Resource-Aware Scheduling 264
Execution 264
Dynamic Scaling 264
Back Pressure (Flow Control) 265
Rate-Based Flow Control 266
Credit-Based Flow Control 266
State 267
Summary 268
References 268
Chapter 8 Case Studies 271
Apache Hadoop 271
Programming Model 272
Architecture 274
Cluster Resource Management 275
Apache Spark 275
Programming Model 275
RDD API 276
SQL, DataFrames, and DataSets 277
Architecture 278
Resource Managers 278
Task Schedulers 279
Executors 279
Communication Operations 280
Apache Spark Streaming 280
Apache Storm 282
Programming Model 282
Architecture 284
Cluster Resource Managers 285
Communication Operations 286
Kafka Streams 286
Programming Model 286
Architecture 287
PyTorch 288
Programming Model 288
Execution 292
Cylon 295
Programming Model 296
Architecture 296
Execution 297
Communication Operations 298
Rapids cuDF 298
Programming Model 298
Architecture 299
Summary 300
References 300
Chapter 9 Fault Tolerance 303
Dependable Systems and Failures 303
Fault Tolerance Is Not Free 304
Dependable Systems 305
Failures 306
Process Failures 306
Network Failures 307
Node Failures 307
Byzantine Faults 307
Failure Models 308
Failure Detection 308
Recovering from Faults 309
Recovery Methods 310
Stateless Programs 310
Batch Systems 311
Streaming Systems 311
Processing Guarantees 311
Role of Cluster Resource Managers 312
Checkpointing 313
State 313
Consistent Global State 313
Uncoordinated Checkpointing 314
Coordinated Checkpointing 315
Chandy-Lamport Algorithm 315
Batch Systems 316
When to Checkpoint? 317
Snapshot Data 318
Streaming Systems 319
Case Study: Apache Storm 319
Message Tracking 320
Failure Recovery 321
Case Study: Apache Flink 321
Checkpointing 322
Failure Recovery 324
Batch Systems 324
Iterative Programs 324
Case Study: Apache Spark 325
RDD Recomputing 326
Checkpointing 326
Recovery from Failures 327
Summary 327
References 327
Chapter 10 Performance and Productivity 329
Performance Metrics 329
System Performance Metrics 330
Parallel Performance Metrics 330
Speedup 330
Strong Scaling 331
Weak Scaling 332
Parallel Efficiency 332
Amdahl's Law 333
Gustafson's Law 334
Throughput 334
Latency 335
Benchmarks 336
LINPACK Benchmark 336
NAS Parallel Benchmark 336
BigDataBench 336
TPC Benchmarks 337
HiBench 337
Performance Factors 337
Memory 337
Execution 338
Distributed Operators 338
Disk I/O 339
Garbage Collection 339
Finding Issues 342
Serial Programs 342
Profiling 342
Scaling 343
Strong Scaling 343
Weak Scaling 344
Debugging Distributed Applications 344
Programming Languages 345
C/C++ 346
Java 346
Memory Management 347
Data Structures 348
Interfacing with Python 348
Python 350
C/C++ Code Integration 350
Productivity 351
Choice of Frameworks 351
Operating Environment 353
CPUs and GPUs 353
Public Clouds 355
Future of Data-Intensive Applications 358
Summary 358
References 359
Index 361
SUPUN KAMBURUGAMUVE, PhD, is a computer scientist researching and designing large-scale data analytics tools. He received his doctorate in Computer Science from Indiana University, Bloomington, and architected the data processing systems Twister2 and Cylon.

SALIYA EKANAYAKE, PhD, is a Senior Software Engineer at Microsoft working at the intersection of scaling deep learning systems and parallel computing. He is also a research affiliate at Berkeley Lab. He received his doctorate in Computer Science from Indiana University, Bloomington.