Foreword |
|
xix | |
Preface |
|
xxi | |
Notation |
|
xxv | |
I Foundations |
|
1 | (102) |
|
|
2 | (31) |
|
1.1 What Is Information Retrieval? |
|
|
2 | (3) |
|
|
2 | (1) |
|
1.1.2 Other Search Applications |
|
|
3 | (1) |
|
1.1.3 Other IR Applications |
|
|
4 | (1) |
|
1.2 Information Retrieval Systems |
|
|
5 | (4) |
|
1.2.1 Basic IR System Architecture |
|
|
5 | (2) |
|
1.2.2 Documents and Update |
|
|
7 | (1) |
|
1.2.3 Performance Evaluation |
|
|
8 | (1) |
|
1.3 Working with Electronic Text |
|
|
9 | (14) |
|
|
9 | (4) |
|
1.3.2 A Simple Tokenization of English Text |
|
|
13 | (2) |
|
|
15 | (2) |
|
|
17 | (6) |
|
|
23 | (4) |
|
|
24 | (3) |
|
1.5 Open-Source IR Systems |
|
|
27 | (1) |
|
|
27 | (1) |
|
|
27 | (1) |
|
|
28 | (1) |
|
|
28 | (2) |
|
|
30 | (2) |
|
|
32 | (1) |
|
|
33 | (51) |
|
|
33 | (18) |
|
2.1.1 Extended Example: Phrase Search |
|
|
35 | (4) |
|
2.1.2 Implementing Inverted Indices |
|
|
39 | (6) |
|
2.1.3 Documents and Other Elements |
|
|
45 | (6) |
|
2.2 Retrieval and Ranking |
|
|
51 | (15) |
|
2.2.1 The Vector Space Model |
|
|
54 | (6) |
|
|
60 | (3) |
|
|
63 | (3) |
|
|
66 | (10) |
|
2.3.1 Recall and Precision |
|
|
67 | (1) |
|
2.3.2 Effectiveness Measures for Ranked Retrieval |
|
|
68 | (5) |
|
2.3.3 Building a Test Collection |
|
|
73 | (2) |
|
2.3.4 Efficiency Measures |
|
|
75 | (1) |
|
|
76 | (1) |
|
|
77 | (2) |
|
|
79 | (3) |
|
|
82 | (2) |
|
|
84 | (19) |
|
|
85 | (6) |
|
3.1.1 Punctuation and Capitalization |
|
|
85 | (1) |
|
|
86 | (3) |
|
|
89 | (2) |
|
|
91 | (1) |
|
|
92 | (2) |
|
|
94 | (1) |
|
|
95 | (2) |
|
|
97 | (2) |
|
|
99 | (1) |
|
|
100 | (3) |
II Indexing |
|
103 | (154) |
|
4 Static Inverted Indices |
|
|
104 | (33) |
|
4.1 Index Components and Index Life Cycle |
|
|
104 | (2) |
|
|
106 | (4) |
|
|
110 | (4) |
|
4.4 Interleaving Dictionary and Postings Lists |
|
|
114 | (4) |
|
|
118 | (13) |
|
4.5.1 In-Memory Index Construction |
|
|
119 | (6) |
|
4.5.2 Sort-Based Index Construction |
|
|
125 | (2) |
|
4.5.3 Merge-Based Index Construction |
|
|
127 | (4) |
|
4.6 Other Types of Indices |
|
|
131 | (1) |
|
|
132 | (1) |
|
|
132 | (1) |
|
|
133 | (2) |
|
|
135 | (2) |
|
|
137 | (37) |
|
5.1 Query Processing for Ranked Retrieval |
|
|
137 | (23) |
|
5.1.1 Document-at-a-Time Query Processing |
|
|
139 | (6) |
|
5.1.2 Term-at-a-Time Query Processing |
|
|
145 | (6) |
|
5.1.3 Precomputing Score Contributions |
|
|
151 | (2) |
|
|
153 | (1) |
|
5.1.5 Static Index Pruning |
|
|
153 | (7) |
|
5.2 Lightweight Structure |
|
|
160 | (9) |
|
5.2.1 Generalized Concordance Lists |
|
|
160 | (2) |
|
|
162 | (1) |
|
|
163 | (2) |
|
|
165 | (4) |
|
|
169 | (1) |
|
|
170 | (1) |
|
|
171 | (3) |
|
|
174 | (54) |
|
6.1 General-Purpose Data Compression |
|
|
175 | (1) |
|
6.2 Symbolwise Data Compression |
|
|
176 | (15) |
|
6.2.1 Modeling and Coding |
|
|
177 | (4) |
|
|
181 | (5) |
|
|
186 | (3) |
|
6.2.4 Symbolwise Text Compression |
|
|
189 | (2) |
|
6.3 Compressing Postings Lists |
|
|
191 | (25) |
|
6.3.1 Nonparametric Gap Compression |
|
|
192 | (3) |
|
6.3.2 Parametric Gap Compression |
|
|
195 | (6) |
|
6.3.3 Context-Aware Compression Methods |
|
|
201 | (3) |
|
6.3.4 Index Compression for High Query Performance |
|
|
204 | (5) |
|
6.3.5 Compression Effectiveness |
|
|
209 | (3) |
|
6.3.6 Decoding Performance |
|
|
212 | (2) |
|
6.3.7 Document Reordering |
|
|
214 | (2) |
|
6.4 Compressing the Dictionary |
|
|
216 | (6) |
|
|
222 | (1) |
|
|
223 | (1) |
|
|
224 | (1) |
|
|
225 | (3) |
|
7 Dynamic Inverted Indices |
|
|
228 | (29) |
|
|
229 | (2) |
|
7.2 Incremental Index Updates |
|
|
231 | (12) |
|
7.2.1 Contiguous Inverted Lists |
|
|
233 | (6) |
|
7.2.2 Noncontiguous Inverted Lists |
|
|
239 | (4) |
|
|
243 | (7) |
|
|
243 | (2) |
|
|
245 | (5) |
|
7.4 Document Modifications |
|
|
250 | (1) |
|
7.5 Discussion and Further Reading |
|
|
251 | (2) |
|
|
253 | (1) |
|
|
254 | (3) |
III Retrieval and Ranking |
|
257 | (148) |
|
8 Probabilistic Retrieval |
|
|
258 | (28) |
|
|
259 | (2) |
|
8.2 The Binary Independence Model |
|
|
261 | (3) |
|
8.3 The Robertson/Sparck Jones Weighting Formula |
|
|
264 | (2) |
|
|
266 | (5) |
|
8.4.1 Bookstein's Two-Poisson Model |
|
|
267 | (3) |
|
8.4.2 Approximating the Two-Poisson Model |
|
|
270 | (1) |
|
8.4.3 Query Term Frequency |
|
|
271 | (1) |
|
8.5 Document Length: BM25 |
|
|
271 | (2) |
|
|
273 | (4) |
|
|
274 | (1) |
|
8.6.2 Pseudo-Relevance Feedback |
|
|
275 | (2) |
|
|
277 | (2) |
|
8.8 Experimental Comparison |
|
|
279 | (1) |
|
|
280 | (1) |
|
|
281 | (1) |
|
|
282 | (4) |
|
9 Language Modeling and Related Methods |
|
|
286 | (24) |
|
9.1 Generating Queries from Documents |
|
|
287 | (2) |
|
9.2 Language Models and Smoothing |
|
|
289 | (3) |
|
9.3 Ranking with Language Models |
|
|
292 | (4) |
|
9.4 Kullback-Leibler Divergence |
|
|
296 | (2) |
|
9.5 Divergence from Randomness |
|
|
298 | (4) |
|
9.5.1 A Model of Randomness |
|
|
299 | (2) |
|
|
301 | (1) |
|
9.5.3 Document Length Normalization |
|
|
301 | (1) |
|
9.6 Passage Retrieval and Ranking |
|
|
302 | (4) |
|
|
304 | (1) |
|
|
304 | (2) |
|
9.7 Experimental Comparison |
|
|
306 | (1) |
|
|
306 | (1) |
|
|
307 | (1) |
|
|
307 | (3) |
|
10 Categorization and Filtering |
|
|
310 | (66) |
|
|
313 | (18) |
|
10.1.1 Topic-Oriented Batch Filtering |
|
|
313 | (4) |
|
|
317 | (1) |
|
10.1.3 Learning from Historical Examples |
|
|
318 | (2) |
|
10.1.4 Language Categorization |
|
|
320 | (5) |
|
10.1.5 On-Line Adaptive Spam Filtering |
|
|
325 | (4) |
|
10.1.6 Threshold Choice for Binary Categorization |
|
|
329 | (2) |
|
|
331 | (8) |
|
10.2.1 Odds and Odds Ratios |
|
|
333 | (1) |
|
10.2.2 Building Classifiers |
|
|
334 | (2) |
|
|
336 | (2) |
|
10.2.4 Feature Engineering |
|
|
338 | (1) |
|
10.3 Probabilistic Classifiers |
|
|
339 | (10) |
|
10.3.1 Probability Estimates |
|
|
340 | (3) |
|
10.3.2 Combining Probability Estimates |
|
|
343 | (4) |
|
10.3.3 Practical Considerations |
|
|
347 | (2) |
|
|
349 | (5) |
|
10.4.1 Perceptron Algorithm |
|
|
352 | (1) |
|
10.4.2 Support Vector Machines |
|
|
353 | (1) |
|
10.5 Similarity-Based Classifiers |
|
|
354 | (1) |
|
|
354 | (1) |
|
10.5.2 Memory-Based Methods |
|
|
355 | (1) |
|
10.6 Generalized Linear Models |
|
|
355 | (4) |
|
|
357 | (2) |
|
10.7 Information-Theoretic Models |
|
|
359 | (7) |
|
|
360 | (1) |
|
10.7.2 Sequential Compression Models |
|
|
361 | (3) |
|
10.7.3 Decision Trees and Stumps |
|
|
364 | (2) |
|
10.8 Experimental Comparison |
|
|
366 | (5) |
|
10.8.1 Topic-Oriented On-Line Filtering |
|
|
367 | (2) |
|
10.8.2 On-Line Adaptive Spam Filtering |
|
|
369 | (2) |
|
|
371 | (1) |
|
|
372 | (1) |
|
|
373 | (3) |
|
11 Fusion and Metalearning |
|
|
376 | (29) |
|
11.1 Search-Result Fusion |
|
|
377 | (4) |
|
11.1.1 Fixed-Cutoff Aggregation |
|
|
379 | (1) |
|
11.1.2 Rank and Score Aggregation |
|
|
380 | (1) |
|
11.2 Stacking Adaptive Filters |
|
|
381 | (2) |
|
11.3 Stacking Batch Classifiers |
|
|
383 | (2) |
|
11.3.1 Holdout Validation |
|
|
383 | (1) |
|
|
384 | (1) |
|
|
385 | (2) |
|
|
387 | (1) |
|
11.6 Multicategory Ranking and Classification |
|
|
388 | (6) |
|
11.6.1 Document Versus Category Scores |
|
|
389 | (1) |
|
11.6.2 Document Versus Category Rank Fusion |
|
|
390 | (1) |
|
11.6.3 Multicategory Methods |
|
|
391 | (3) |
|
|
394 | (6) |
|
11.7.1 What Is Learning to Rank? |
|
|
395 | (1) |
|
11.7.2 Learning-to-Rank Methods |
|
|
396 | (1) |
|
|
396 | (1) |
|
11.7.4 Learning to Rank for Categorization |
|
|
397 | (1) |
|
11.7.5 Learning for Ranked IR |
|
|
398 | (1) |
|
11.7.6 The LETOR Data Set |
|
|
399 | (1) |
|
|
400 | (1) |
|
|
401 | (1) |
|
|
401 | (4) |
IV Evaluation |
|
405 | (82) |
|
12 Measuring Effectiveness |
|
|
406 | (62) |
|
12.1 Traditional Effectiveness Measures |
|
|
407 | (3) |
|
12.1.1 Recall and Precision |
|
|
407 | (1) |
|
12.1.2 Precision at k Documents (P@k) |
|
|
408 | (1) |
|
12.1.3 Average Precision (AP) |
|
|
408 | (1) |
|
12.1.4 Reciprocal Rank (RR) |
|
|
409 | (1) |
|
12.1.5 Arithmetic Mean Versus Geometric Mean |
|
|
409 | (1) |
|
|
410 | (1) |
|
12.2 The Text REtrieval Conference (TREC) |
|
|
410 | (2) |
|
12.3 Using Statistics in Evaluation |
|
|
412 | (29) |
|
12.3.1 Foundations and Terminology |
|
|
413 | (3) |
|
12.3.2 Confidence Intervals |
|
|
416 | (8) |
|
12.3.3 Comparative Evaluation |
|
|
424 | (3) |
|
12.3.4 Hypothesis Tests Considered Harmful |
|
|
427 | (2) |
|
12.3.5 Paired and Unpaired Differences |
|
|
429 | (1) |
|
12.3.6 Significance Tests |
|
|
430 | (4) |
|
12.3.7 Validity and Power of Statistical Tests |
|
|
434 | (4) |
|
12.3.8 Reporting the Precision of Measurement |
|
|
438 | (1) |
|
|
439 | (2) |
|
12.4 Minimizing Adjudication Effort |
|
|
441 | (10) |
|
12.4.1 Selecting Documents for Adjudication |
|
|
443 | (6) |
|
|
449 | (2) |
|
12.5 Nontraditional Effectiveness Measures |
|
|
451 | (9) |
|
|
451 | (2) |
|
12.5.2 Incomplete and Biased Judgments |
|
|
453 | (2) |
|
12.5.3 Novelty and Diversity |
|
|
455 | (5) |
|
|
460 | (2) |
|
|
462 | (1) |
|
|
463 | (5) |
|
|
468 | (19) |
|
|
468 | (4) |
|
13.1.1 Throughput and Latency |
|
|
469 | (3) |
|
13.1.2 Aggregate Statistics and User Satisfaction |
|
|
472 | (1) |
|
|
472 | (6) |
|
13.2.1 Kendall's Notation |
|
|
474 | (1) |
|
13.2.2 The M/M/1 Queueing Model |
|
|
475 | (2) |
|
13.2.3 Latency Quantiles and Average Utilization |
|
|
477 | (1) |
|
|
478 | (1) |
|
|
479 | (5) |
|
13.4.1 Three-Level Caching |
|
|
480 | (2) |
|
|
482 | (1) |
|
13.4.3 Prefetching Search Results |
|
|
483 | (1) |
|
|
484 | (1) |
|
|
484 | (1) |
|
|
485 | (2) |
V Applications and Extensions |
|
487 | (104) |
|
14 Parallel Information Retrieval |
|
|
488 | (19) |
|
14.1 Parallel Query Processing |
|
|
488 | (10) |
|
14.1.1 Document Partitioning |
|
|
490 | (3) |
|
|
493 | (2) |
|
|
495 | (1) |
|
14.1.4 Redundancy and Fault Tolerance |
|
|
496 | (2) |
|
|
498 | (5) |
|
14.2.1 The Basic Framework |
|
|
498 | (2) |
|
|
500 | (2) |
|
|
502 | (1) |
|
|
502 | (1) |
|
|
503 | (1) |
|
|
504 | (1) |
|
|
505 | (2) |
|
|
507 | (57) |
|
15.1 The Structure of the Web |
|
|
508 | (5) |
|
|
508 | (2) |
|
15.1.2 Static and Dynamic Pages |
|
|
510 | (1) |
|
|
511 | (1) |
|
15.1.4 The Size of the Web |
|
|
511 | (2) |
|
|
513 | (4) |
|
|
513 | (3) |
|
15.2.2 Clickthrough Curves |
|
|
516 | (1) |
|
|
517 | (18) |
|
|
517 | (5) |
|
|
522 | (6) |
|
15.3.3 Properties of PageRank |
|
|
528 | (4) |
|
15.3.4 Other Link Analysis Methods: HITS and SALSA |
|
|
532 | (3) |
|
15.3.5 Other Static Ranking Methods |
|
|
535 | (1) |
|
|
535 | (3) |
|
|
536 | (1) |
|
|
537 | (1) |
|
15.5 Evaluating Web Search |
|
|
538 | (3) |
|
15.5.1 Named Page Finding |
|
|
538 | (2) |
|
15.5.2 Implicit User Feedback |
|
|
540 | (1) |
|
|
541 | (12) |
|
15.6.1 Components of a Crawler |
|
|
542 | (5) |
|
|
547 | (2) |
|
15.6.3 Duplicates and Near-Duplicates |
|
|
549 | (4) |
|
|
553 | (1) |
|
|
553 | (3) |
|
|
554 | (1) |
|
|
555 | (1) |
|
|
555 | (1) |
|
|
556 | (1) |
|
|
556 | (2) |
|
|
558 | (6) |
|
|
564 | (27) |
|
|
565 | (6) |
|
16.1.1 Document Type Definitions |
|
|
568 | (2) |
|
|
570 | (1) |
|
16.2 Paths, Trees, and FLWORs |
|
|
571 | (5) |
|
|
571 | (1) |
|
|
572 | (2) |
|
|
574 | (2) |
|
16.3 Indexing and Query Processing |
|
|
576 | (3) |
|
|
579 | (4) |
|
|
580 | (2) |
|
16.4.2 Overlapping Elements |
|
|
582 | (1) |
|
16.4.3 Retrievable Elements |
|
|
583 | (1) |
|
|
583 | (2) |
|
|
583 | (1) |
|
16.5.2 Effectiveness Measures |
|
|
584 | (1) |
|
|
585 | (2) |
|
|
587 | (1) |
|
|
587 | (4) |
VI Appendix |
|
591 | (6) |
|
|
592 | (5) |
|
A.1 Sequential Versus Random Access on Disk |
|
|
592 | (1) |
|
A.2 Sequential Versus Random Access in RAM |
|
|
593 | (1) |
|
A.3 Pipelined Execution and Branch Prediction |
|
|
594 | (3) |
Index |
|
597 | |