Preface
Acknowledgments
About this book
About the authors
About the cover illustration

    An example: How to solve it
    How to solve it, take two: A book walkthrough
1.2 The structure of this book
1.3 What makes this book different and whom it is for
1.4 Why is massive data so challenging for today's systems?
    The CPU memory performance gap
    What about distributed systems?
1.5 Designing algorithms with hardware in mind

Part 1 Hash-based sketches

2 Review of hash tables and modern hashing
2.2 A crash course on data structures
2.3 Usage scenarios in modern systems
    Deduplication in backup/storage solutions
    Plagiarism detection with MOSS and Rabin-Karp fingerprinting
2.4 O(1)--What's the big deal?
2.5 Collision resolution: Theory vs. practice
2.6 Usage scenario: How Python's dict does it
2.8 Hash tables for distributed systems: Consistent hashing
    A typical hashing problem
    Adding a new node/resource
    Consistent hashing scenario: Chord
    Consistent hashing: Programming exercises

3 Approximate membership: Bloom and quotient filters
    Bloom filters in networks: Squid
3.3 A simple implementation
3.4 Configuring a Bloom filter
    Playing with Bloom filters: Mini experiments
3.6 Bloom filter adaptations and alternatives
    Understanding metadata bits
    Inserting into a quotient filter: An example
    False positive rate and space considerations
3.8 Comparison between Bloom filters and quotient filters

4 Frequency estimation and count-min sketch
4.2 Count-min sketch: How it works
    Scaling the distributional similarity of words
4.4 Error vs. space in count-min sketch
4.5 A simple implementation of count-min sketch
    Intuition behind the formula: Math bit
4.6 Range queries with count-min sketch
    Computing dyadic intervals

5 Cardinality estimation and HyperLogLog
5.1 Counting distinct items in databases
5.2 HyperLogLog incremental design
    The first cut: Probabilistic counting
    Stochastic averaging, or "when life gives you lemons"
    HyperLogLog: Stochastic averaging with harmonic mean
5.3 Use case: Catching worms with HLL
5.4 But how does it work? A mini experiment
    The effect of the number of buckets (m)
5.5 Use case: Aggregation using HyperLogLog

Part 2 Real-time analytics

6 Streaming data: Bringing everything together
6.1 Streaming data system: A meta example
    Load balancing and tracking the network traffic
6.2 Practical constraints and concepts in data streams
    Small time and small space
    Concept shifts and concept drifts
6.3 Math bit: Sampling and estimation
    Estimation from a representative sample

7 Sampling from data streams
7.1 Sampling from a landmark stream
    Biased reservoir sampling
7.2 Sampling from a sliding window
7.3 Sampling algorithms comparison
    Simulation setup: Algorithms and data

8 Approximate quantiles on data streams
8.2 Approximate quantiles
    Relative error in the data domain
8.3 T-digest: How it works
    Space bounds for t-digest
    Constructing a q-digest from scratch
    Error and space considerations in q-digests
    Quantile queries with q-digests
8.5 Simulation code and results

Part 3 Data structures for databases and external memory algorithms

9 Introducing the external memory model
9.1 External memory model: The preliminaries
9.2 Example 1: Finding a minimum
    Use case: Minimum median income
9.3 Example 2: Binary search
9.5 Example 3: Merging K sorted lists
    External memory model: Simple or simplistic?

10 Data structures for databases: B-trees, Bε-trees, and LSM-trees
10.2 Data structures in this chapter
    How operations on a B+-tree are different
    Use case: B-trees in MySQL (and many other places)
10.4 Math bit: Why are B-tree lookups optimal in external memory?
    Why B-tree inserts/deletes are not optimal in external memory
    Bε-tree: The spectrum of data structures
    Use case: Bε-trees in TokuDB
    Make haste slowly, the I/O way
10.6 Log-structured merge-trees (LSM-trees)
    The LSM-tree: How it works
    Use case: LSM-trees in Cassandra

11 External memory sorting
11.2 Challenges of sorting in external memory: An example
    Two-way merge-sort in external memory
11.3 External memory merge-sort (M/B-way merge-sort)
    Searching and sorting in RAM vs. external memory
11.4 What about external quick-sort?
    External memory two-way quick-sort
    Toward external memory multiway quick-sort
    Finding good enough pivots
    Putting it all back together
11.5 Math bit: Why is external memory merge-sort optimal?

References
Index