Muutke küpsiste eelistusi

Algorithms and Data Structures for Massive Datasets [Pehme köide]

  • Formaat: Paperback / softback, 304 pages, kõrgus x laius x paksus: 234x188x17 mm, kaal: 463 g, Illustrations
  • Ilmumisaeg: 05-Jul-2022
  • Kirjastus: Manning Publications
  • ISBN-10: 1617298034
  • ISBN-13: 9781617298035
Teised raamatud teemal:
  • Formaat: Paperback / softback, 304 pages, kõrgus x laius x paksus: 234x188x17 mm, kaal: 463 g, Illustrations
  • Ilmumisaeg: 05-Jul-2022
  • Kirjastus: Manning Publications
  • ISBN-10: 1617298034
  • ISBN-13: 9781617298035
Teised raamatud teemal:
In Algorithms and Data Structures for Massive Datasets, you'll discover methods for reducing and sketching data so it fits in small memory without losing accuracy, and unlock the algorithms and data structures that form the backbone of a big data system.

Data structures and algorithms that are great for traditional software may quickly slow or fail altogether when applied to huge datasets. Algorithms and Data Structures for Massive Datasets introduces a toolbox of new techniques that are perfect for handling modern big data applications.

In Algorithms and Data Structures for Massive Datasets, you'll discover methods for reducing and sketching data so it fits in small memory without losing accuracy, and unlock the algorithms and data structures that form the backbone of a big data system. Filled with fun illustrations and examples from real-world businesses, you'll learn how each of these complex techniques can be practically applied to maximize the accuracy and throughput of big data processing and analytics.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
Preface xii
Acknowledgments xiv
About this book xvi
About the authors xx
About the cover illustration xxii
1 Introduction
1(16)
1.1 An example
3(5)
An example: How to solve it
4(2)
How to solve it, take two: A book walkthrough
6(2)
1.2 The structure of this book
8(1)
1.3 What makes this book different and whom it is for
9(1)
1.4 Why is massive data so challenging for today's systems?
10(2)
The CPU memory performance gap
10(1)
Memory hierarchy
11(1)
Latency vs. bandwidth
12(1)
What about distributed systems'?
12(1)
1.5 Designing algorithms with hardware in mind
12(5)
Part 1 Hash-based sketches
17(102)
2 Review of hash tables and modern hashing
19(29)
2.1 Ubiquitous hashing
20(2)
2.2 A crash course on data structures
22(2)
2.3 Usage scenarios in modern systems
24(4)
Deduplication in backup/storage solutions
24(2)
Plagiarism detection with MOSS and Rabin-Karp fingerprinting
26(2)
2.4 O(1)--What's the big deal?
28(1)
2.5 Collision resolution: Theory vs. practice
29(3)
2.6 Usage scenario: How Python's diet does it
32(1)
2.7 MurmurHash
33(1)
2.8 Hash tables for distributed systems: Consistent hashing
34(14)
A typical hashing problem
35(1)
Hashring
36(2)
Lookup
38(1)
Adding a new node/resource
39(2)
Removing a node
41(3)
Consistent hashing scenario: Chord
44(2)
Consistent hashing: Programming exercises
46(2)
3 Approximate membership: Bloom and quotient filters
48(27)
3.1 How it works
50(3)
Insert
51(1)
Lookup
51(2)
3.2 Use cases
53(2)
Bloom filters in networks: Squid
53(1)
Bitcoin mobile app
54(1)
3.3 A simple implementation
55(1)
3.4 Configuring a Bloom filter
56(3)
Playing with Bloom filters: Mini experiments
59(1)
3.5 A bit of theory
59(3)
Can we do better?
61(1)
3.6 Bloom filter adaptations and alternatives
62(1)
3.7 Quotient filter
63(9)
Quotienting
64(1)
Understanding metadata bits
65(1)
Inserting into a quotient filter: An example
66(3)
Python code fin-lookup
69(2)
Resizing and merging
71(1)
False positive rate and space considerations
72(1)
3.8 Comparison between Bloom filters and quotient filters
72(3)
4 Frequency estimation and count-min sketch
75(23)
4.1 Majority element
76(3)
General heavy hitters
78(1)
4.2 Count-min sketch: How it works
79(3)
Update
80(1)
Estimate
80(2)
4.3 Use cases
82(6)
Top-k restless sleepers
82(3)
Scaling the distributional similarity of words
85(3)
4.4 Error vs. space in count-min sketch
88(1)
4.5 A simple implementation of count-min sketch
88(3)
Exercises
89(1)
Intuition behind the formula: Math bit
90(1)
4.6 Range queries with count-min sketch
91(7)
Dyadic intervals
91(2)
Update phase
93(1)
Estimate phase
94(1)
Computing dyadic intervals
95(3)
5 Cardinality estimation and HyperLogLog
98(21)
5.1 Counting distinct items in databases
99(1)
5.2 HyperLogLog incremental design
100(9)
The first cut: Probabilistic counting
101(2)
Stochastic averaging, or "when life gives you lemons"
103(2)
LogLog
105(1)
HyperLogLog: Stochastic averaging with harmonic mean
106(3)
5.3 Use case: Catching worms with HLL
109(2)
5.4 But how does it work? A mini experiment
111(3)
The effect of the number of buckets (m)
113(1)
5.5 Use case: Aggregation using HyperLogLog
114(5)
Part 2 Real-Time Analytics
119(76)
6 Streaming data: Bringing everything together
121(21)
6.1 Streaming data system: A meta example
126(6)
Bloom-join
126(2)
Deduplication
128(2)
Load balancing and tracking the network traffic
130(2)
6.2 Practical constraints and concepts in data streams
132(3)
In real time
132(1)
Small time and small space
133(1)
Concept shifts and concept drifts
133(1)
Sliding window model
133(2)
6.3 Math bit: Sampling and estimation
135(7)
Biased sampling strategy
136(3)
Estimation from a representative sample
139(3)
7 Sampling from data streams
142(26)
7.1 Sampling from a landmark stream
143(13)
Bernoulli sampling
143(3)
Reservoir sampling
146(5)
Biased reservoir sampling
151(5)
7.2 Sampling from a sliding window
156(7)
Chain sampling
156(4)
Priority sampling
160(3)
7.3 Sampling algorithms comparison
163(5)
Simulation setup: Algorithms and data
163(5)
8 Approximate quantiles on data streams
168(27)
8.1 Exact quantiles
169(3)
8.2 Approximate quantiles
172(2)
Additive error
172(1)
Relative error
173(1)
Relative error in the data domain
174(1)
8.3 T-digest: How it works
174(10)
Digest
175(2)
Scale functions
177(3)
Merging t-digests
180(3)
Space bounds for t-digest
183(1)
8.4 Q-digest
184(5)
Constructing a q-digest from scratch
184(2)
Merging q-digests
186(2)
Error and space considerations in q-digests
188(1)
Quantile queries with q-digests
188(1)
8.5 Simulation code and results
189(6)
Part 3 Data structures for databases and external memory algorithms
195(72)
9 Introducing the external memory model
197(18)
9.1 External memory model: The preliminaries
199(2)
9.2 Example 1: Finding a minimum
201(3)
Use case: Minimum median income
201(3)
9.3 Example 2: Binary search
204(3)
Bioinformatics use case
204(2)
Runtime analysis
206(1)
9.4 Optimal searching
207(2)
9.5 Example 3: Merging K sorted lists
209(4)
Merging time/date logs
210(3)
External memory model: Simple or simplistic?
213(1)
9.6 What's next
213(2)
10 Data structures for databases: B-trees, Bε-trees, and LSM-trees
215(33)
10.1 How Indexing Works
216(2)
10.2 Data structures in this chapter
218(1)
10.3 B-trees
219(11)
B-tree balancing
220(1)
Lookup
221(1)
Insert
221(3)
Delete
224(3)
Bε-trees
227(2)
How operations on a B+-tree are different
229(1)
Use case: B-trees in MySQL (and many other places)
229(1)
10.4 Math bit: Why are B-tree lookups optimal in external memory?
230(2)
Why B-tree inserts/deletes are not optimal in external memory
231(1)
10.5 Bε-trees
232(8)
Bε-tree: How it works
233(1)
Buffering mechanics
233(3)
Inserts and deletes
236(1)
Lookups
236(1)
Cost analysis
236(2)
Bε-tree: The spectrum of data structures
238(1)
Use case: Bε-trees in TokuDB
238(2)
Make haste slowly, the I/O way
240(1)
10.6 Log-structured merge-trees (LSM-trees)
240(8)
The LSM-tree: How it works
241(4)
LSM-tree cost analysis
245(1)
Use case: LSM-trees in Cassandra
245(3)
11 External memory sorting
248(19)
11.1 Sorting use cases
249(2)
Robot motion planning
249(1)
Cancer genomics
250(1)
11.2 Challenges of sorting in external memory: An example
251(4)
Two-way merge-sort in external memory
252(3)
11.3 External memory merge-sort (M/B-way merge-sort)
255(3)
Searching and sorting in RAM vs. external memory
256(2)
11.4 What about external quick-sort?
258(5)
External memory two-way quick-sort
258(1)
Toward external memory multiway quick-sort
259(1)
Finding enough pivots
260(1)
Finding good enough pivots
261(1)
Putting it all back together
262(1)
11.5 Math bit: Why is external memory merge-sort optimal?
263(1)
11.6 Wrapping up
264(3)
References 267(6)
Index 273