Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood [Paperback]

  • Format: Paperback / softback, 416 pages, height x width x thickness: 231x188x25 mm, weight: 590 g
  • Publication date: 15-Nov-2021
  • Publisher: John Wiley & Sons Inc
  • ISBN-10: 1119713021
  • ISBN-13: 9781119713029
  • Price: 51,15 €*
  • * this price is final, i.e., no further discounts apply
  • Regular price: 60,18 €
  • You save 15%
  • Free shipping
  • Delivery from the publisher takes approximately 2-4 weeks

There is an ever-increasing need to store data, process it, and incorporate the resulting knowledge into the everyday business operations of companies. Before big data systems, there were high-performance systems designed to do large calculations. Around the time big data became popular, high-performance computing systems were mature enough to support the scientific community, but they were not ready for the enterprise needs of data analytics. Because of this lack of system support at the time, a large number of systems were created to store and process data. These systems were built according to different design principles; some of them thrived through the years, while others did not succeed. Because of the diverse nature of the systems and tools available for data analytics, there is a need to understand them and their applications from a theoretical perspective. These systems mask the underlying details, and users work with them without knowing how they work. This is fine for simple applications, but when developing more complex applications that need to scale, users find themselves without the foundational knowledge required to reason about issues. That knowledge is currently hidden inside the systems and in research papers.

The underlying principles behind data processing systems originate from the parallel and distributed computing paradigms. Among the many systems and APIs for data processing, the same fundamental ideas are used under the hood with slightly different variations. We can break down data analytics systems according to these principles and study them to understand the inner workings of applications.

This book defines the foundational components of large-scale, distributed data processing systems and goes into detail on each, independently of specific frameworks. It draws examples from current systems to explain how these principles are used in practice. Major design decisions around these foundational components define a system's performance, the types of applications it supports, and its usability. One of the goals of the book is to explain these differences so that readers can make informed decisions when developing applications. Further, it helps readers acquire in-depth knowledge and recognize problems in their applications, such as performance issues, distributed operation issues, and fault tolerance aspects.

The book also draws on state-of-the-art research, where appropriate, to discuss ideas and the future of data analytics tools.

Introduction xxvii
Chapter 1 Data Intensive Applications 1
Anatomy of a Data-Intensive Application 1
A Histogram Example 2
Program 2
Process Management 3
Communication 4
Execution 5
Data Structures 6
Putting It Together 6
Application 6
Resource Management 6
Messaging 7
Data Structures 7
Tasks and Execution 8
Fault Tolerance 8
Remote Execution 8
Parallel Applications 9
Serial Applications 9
Lloyd's K-Means Algorithm 9
Parallelizing Algorithms 11
Decomposition 11
Task Assignment 12
Orchestration 12
Mapping 13
K-Means Algorithm 13
Parallel and Distributed Computing 15
Memory Abstractions 16
Shared Memory 16
Distributed Memory 18
Hybrid (Shared + Distributed) Memory 20
Partitioned Global Address Space Memory 21
Application Classes and Frameworks 22
Parallel Interaction Patterns 22
Pleasingly Parallel 23
Dataflow 23
Iterative 23
Irregular 23
Data Abstractions 24
Data-Intensive Frameworks 24
Components 24
Workflows 25
An Example 25
What Makes It Difficult? 26
Developing Applications 27
Concurrency 27
Data Partitioning 28
Debugging 28
Diverse Environments 28
Computer Networks 29
Synchronization 29
Thread Synchronization 29
Data Synchronization 30
Ordering of Events 31
Faults 31
Consensus 31
Summary 32
References 32
Chapter 2 Data and Storage 35
Storage Systems 35
Storage for Distributed Systems 36
Direct-Attached Storage 37
Storage Area Network 37
Network-Attached Storage 38
DAS or SAN or NAS? 38
Storage Abstractions 39
Block Storage 39
File Systems 40
Object Storage 41
Data Formats 41
XML 42
JSON 43
CSV 44
Apache Parquet 45
Apache Avro 47
Avro Data Definitions (Schema) 48
Code Generation 49
Without Code Generation 49
Avro File 49
Schema Evolution 49
Protocol Buffers, Flat Buffers, and Thrift 50
Data Replication 51
Synchronous and Asynchronous Replication 52
Single-Leader and Multileader Replication 52
Data Locality 53
Disadvantages of Replication 54
Data Partitioning 54
Vertical Partitioning 55
Horizontal Partitioning (Sharding) 55
Hybrid Partitioning 56
Considerations for Partitioning 57
NoSQL Databases 58
Data Models 58
Key-Value Databases 58
Document Databases 59
Wide Column Databases 59
Graph Databases 59
CAP Theorem 60
Message Queuing 61
Message Processing Guarantees 63
Durability of Messages 64
Acknowledgments 64
Storage First Brokers and Transient Brokers 65
Summary 66
References 66
Chapter 3 Computing Resources 69
A Demonstration 71
Computer Clusters 72
Anatomy of a Computer Cluster 73
Data Analytics in Clusters 74
Dedicated Clusters 76
Classic Parallel Systems 76
Big Data Systems 77
Shared Clusters 79
OpenMPI on a Slurm Cluster 79
Spark on a Yarn Cluster 80
Distributed Application Life Cycle 80
Life Cycle Steps 80
Step 1 Preparation of the Job Package 81
Step 2 Resource Acquisition 81
Step 3 Distributing the Application (Job) Artifacts 81
Step 4 Bootstrapping the Distributed Environment 82
Step 5 Monitoring 82
Step 6 Termination 83
Computing Resources 83
Data Centers 83
Physical Machines 85
Network 85
Virtual Machines 87
Containers 87
Processor, Random Access Memory, and Cache 88
Cache 89
Multiple Processors in a Computer 90
Nonuniform Memory Access 90
Uniform Memory Access 91
Hard Disk 92
GPUs 92
Mapping Resources to Applications 92
Cluster Resource Managers 93
Kubernetes 94
Kubernetes Architecture 94
Kubernetes Application Concepts 96
Data-Intensive Applications on Kubernetes 96
Slurm 98
Yarn 99
Job Scheduling 99
Scheduling Policy 101
Objective Functions 101
Throughput and Latency 101
Priorities 102
Lowering Distance Among the Processes 102
Data Locality 102
Completion Deadline 102
Algorithms 103
First in First Out 103
Gang Scheduling 103
List Scheduling 103
Backfill Scheduling 104
Summary 104
References 104
Chapter 4 Data Structures 107
Virtual Memory 108
Paging and TLB 109
Cache 111
The Need for Data Structures 112
Cache and Memory Layout 112
Memory Fragmentation 114
Data Transfer 115
Data Transfer Between Frameworks 115
Cross-Language Data Transfer 115
Object and Text Data 116
Serialization 116
Vectors and Matrices 117
1D Vectors 118
Matrices 118
Row-Major and Column-Major Formats 119
N-Dimensional Arrays/Tensors 122
NumPy 123
Memory Representation 125
K-means with NumPy 126
Sparse Matrices 127
Table 128
Table Formats 129
Column Data Format 129
Row Data Format 130
Apache Arrow 130
Arrow Data Format 131
Primitive Types 131
Variable-Length Data 132
Arrow Serialization 133
Arrow Example 133
Pandas DataFrame 134
Column vs. Row Tables 136
Summary 136
References 136
Chapter 5 Programming Models 139
Introduction 139
Parallel Programming Models 140
Parallel Process Interaction 140
Problem Decomposition 140
Data Structures 140
Data Structures and Operations 141
Data Types 141
Local Operations 143
Distributed Operations 143
Array 144
Tensor 145
Indexing 145
Slicing 146
Broadcasting 146
Table 146
Graph Data 148
Message Passing Model 150
Model 151
Message Passing Frameworks 151
Message Passing Interface 151
Bulk Synchronous Parallel 153
K-Means 154
Distributed Data Model 157
Eager Model 157
Dataflow Model 158
Data Frames, Datasets, and Tables 159
Input and Output 160
Task Graphs (Dataflow Graphs) 160
Model 161
User Program to Task Graph 161
Tasks and Functions 162
Source Task 162
Compute Task 163
Implicit vs. Explicit Parallel Models 163
Remote Execution 163
Components 164
Batch Dataflow 165
Data Abstractions 165
Table Abstraction 165
Matrix/Tensors 165
Functions 166
Source 166
Compute 167
Sink 168
An Example 168
Caching State 169
Evaluation Strategy 170
Lazy Evaluation 171
Eager Evaluation 171
Iterative Computations 172
DOALL Parallel 172
DOACROSS Parallel 172
Pipeline Parallel 173
Task Graph Models for Iterative Computations 173
K-Means Algorithm 174
Streaming Dataflow 176
Data Abstractions 177
Streams 177
Distributed Operations 178
Streaming Functions 178
Sources 178
Compute 179
Sink 179
An Example 179
Windowing 180
Windowing Strategies 181
Operations on Windows 182
Handling Late Events 182
SQL 182
Queries 183
Summary 184
References 184
Chapter 6 Messaging 187
Network Services 188
TCP/IP 188
RDMA 189
Messaging for Data Analytics 189
Anatomy of a Message 190
Data Packing 190
Protocol 191
Message Types 192
Control Messages 192
External Data Sources 192
Data Transfer Messages 192
Distributed Operations 194
How Are They Used? 194
Task Graph 194
Parallel Processes 195
Anatomy of a Distributed Operation 198
Data Abstractions 198
Distributed Operation API 198
Streaming and Batch Operations 199
Streaming Operations 199
Batch Operations 199
Distributed Operations on Arrays 200
Broadcast 200
Reduce and AllReduce 201
Gather and AllGather 202
Scatter 203
AllToAll 204
Optimized Operations 204
Broadcast 205
Reduce 206
AllReduce 206
Gather and AllGather Collective Algorithms 208
Scatter and AllToAll Collective Algorithms 208
Distributed Operations on Tables 209
Shuffle 209
Partitioning Data 211
Handling Large Data 212
Fetch-Based Algorithm (Asynchronous Algorithm) 213
Distributed Synchronization Algorithm 214
GroupBy 214
Aggregate 215
Join 216
Join Algorithms 219
Distributed Joins 221
Performance of Joins 223
More Operations 223
Advanced Topics 224
Data Packing 224
Memory Considerations 224
Message Coalescing 224
Compression 225
Stragglers 225
Nonblocking vs. Blocking Operations 225
Blocking Operations 226
Nonblocking Operations 226
Summary 227
References 227
Chapter 7 Parallel Tasks 229
CPUs 229
Cache 229
False Sharing 230
Vectorization 231
Threads and Processes 234
Concurrency and Parallelism 234
Context Switches and Scheduling 234
Mutual Exclusion 235
User-Level Threads 236
Process Affinity 236
NUMA-Aware Programming 237
Accelerators 237
Task Execution 238
Scheduling 240
Static Scheduling 240
Dynamic Scheduling 240
Loosely Synchronous and Asynchronous Execution 241
Loosely Synchronous Parallel System 242
Asynchronous Parallel System (Fully Distributed) 243
Actor Model 244
Actor 244
Asynchronous Messages 244
Actor Frameworks 245
Execution Models 245
Process Model 246
Thread Model 246
Remote Execution 246
Tasks for Data Analytics 248
SPMD and MPMD Execution 248
Batch Tasks 249
Data Partitions 249
Operations 251
Task Graph Scheduling 253
Threads, CPU Cores, and Partitions 254
Data Locality 255
Execution 257
Streaming Execution 257
State 257
Immutable Data 258
State in Driver 258
Distributed State 259
Streaming Tasks 259
Streams and Data Partitioning 260
Partitions 260
Operations 261
Scheduling 262
Uniform Resources 263
Resource-Aware Scheduling 264
Execution 264
Dynamic Scaling 264
Back Pressure (Flow Control) 265
Rate-Based Flow Control 266
Credit-Based Flow Control 266
State 267
Summary 268
References 268
Chapter 8 Case Studies 271
Apache Hadoop 271
Programming Model 272
Architecture 274
Cluster Resource Management 275
Apache Spark 275
Programming Model 275
RDD API 276
SQL, DataFrames, and DataSets 277
Architecture 278
Resource Managers 278
Task Schedulers 279
Executors 279
Communication Operations 280
Apache Spark Streaming 280
Apache Storm 282
Programming Model 282
Architecture 284
Cluster Resource Managers 285
Communication Operations 286
Kafka Streams 286
Programming Model 286
Architecture 287
PyTorch 288
Programming Model 288
Execution 292
Cylon 295
Programming Model 296
Architecture 296
Execution 297
Communication Operations 298
Rapids cuDF 298
Programming Model 298
Architecture 299
Summary 300
References 300
Chapter 9 Fault Tolerance 303
Dependable Systems and Failures 303
Fault Tolerance Is Not Free 304
Dependable Systems 305
Failures 306
Process Failures 306
Network Failures 307
Node Failures 307
Byzantine Faults 307
Failure Models 308
Failure Detection 308
Recovering from Faults 309
Recovery Methods 310
Stateless Programs 310
Batch Systems 311
Streaming Systems 311
Processing Guarantees 311
Role of Cluster Resource Managers 312
Checkpointing 313
State 313
Consistent Global State 313
Uncoordinated Checkpointing 314
Coordinated Checkpointing 315
Chandy-Lamport Algorithm 315
Batch Systems 316
When to Checkpoint? 317
Snapshot Data 318
Streaming Systems 319
Case Study: Apache Storm 319
Message Tracking 320
Failure Recovery 321
Case Study: Apache Flink 321
Checkpointing 322
Failure Recovery 324
Batch Systems 324
Iterative Programs 324
Case Study: Apache Spark 325
RDD Recomputing 326
Checkpointing 326
Recovery from Failures 327
Summary 327
References 327
Chapter 10 Performance and Productivity 329
Performance Metrics 329
System Performance Metrics 330
Parallel Performance Metrics 330
Speedup 330
Strong Scaling 331
Weak Scaling 332
Parallel Efficiency 332
Amdahl's Law 333
Gustafson's Law 334
Throughput 334
Latency 335
Benchmarks 336
LINPACK Benchmark 336
NAS Parallel Benchmark 336
BigDataBench 336
TPC Benchmarks 337
HiBench 337
Performance Factors 337
Memory 337
Execution 338
Distributed Operators 338
Disk I/O 339
Garbage Collection 339
Finding Issues 342
Serial Programs 342
Profiling 342
Scaling 343
Strong Scaling 343
Weak Scaling 344
Debugging Distributed Applications 344
Programming Languages 345
C/C++ 346
Java 346
Memory Management 347
Data Structures 348
Interfacing with Python 348
Python 350
C/C++ Code Integration 350
Productivity 351
Choice of Frameworks 351
Operating Environment 353
CPUs and GPUs 353
Public Clouds 355
Future of Data-Intensive Applications 358
Summary 358
References 359
Index 361
SUPUN KAMBURUGAMUVE, PhD, is a computer scientist researching and designing large-scale data analytics tools. He received his doctorate in Computer Science from Indiana University, Bloomington, and architected the data processing systems Twister2 and Cylon.

SALIYA EKANAYAKE, PhD, is a Senior Software Engineer at Microsoft working at the intersection of scaling deep learning systems and parallel computing. He is also a research affiliate at Berkeley Lab. He received his doctorate in Computer Science from Indiana University, Bloomington.