Muutke küpsiste eelistusi

Information Retrieval: Implementing and Evaluating Search Engines [Pehme köide]

, (University of Waterloo), (University of Waterloo)
  • Formaat: Paperback / softback, 632 pages, kõrgus x laius x paksus: 229x203x27 mm, 127 b&w illus.; 254 Illustrations
  • Sari: The MIT Press
  • Ilmumisaeg: 12-Feb-2016
  • Kirjastus: MIT Press
  • ISBN-10: 0262528878
  • ISBN-13: 9780262528870
  • Formaat: Paperback / softback, 632 pages, kõrgus x laius x paksus: 229x203x27 mm, 127 b&w illus.; 254 Illustrations
  • Sari: The MIT Press
  • Ilmumisaeg: 12-Feb-2016
  • Kirjastus: MIT Press
  • ISBN-10: 0262528878
  • ISBN-13: 9780262528870

Information retrieval is the foundation for modern search engines. This textbook offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus -- a multiuser open-source information retrieval system developed by one of the authors and available online -- provides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. In addition to its classroom use,Information Retrieval will be a valuable reference for professionals in computer science, computer engineering, and software engineering.

Muu info

Winner of Honorable Mention, 2010 American Publishers Award for Professional and Scholarly Excellence (PROSE) in the Computing and Information Sciences category 2010.
Foreword xix
Preface xxi
Notation xxv
I Foundations 1(102)
1 Introduction
2(31)
1.1 What Is Information Retrieval?
2(3)
1.1.1 Web Search
2(1)
1.1.2 Other Search Applications
3(1)
1.1.3 Other IR Applications
4(1)
1.2 Information Retrieval Systems
5(4)
1.2.1 Basic IR System Architecture
5(2)
1.2.2 Documents and Update
7(1)
1.2.3 Performance Evaluation
8(1)
1.3 Working with Electronic Text
9(14)
1.3.1 Text Formats
9(4)
1.3.2 A Simple Tokenization of English Text
13(2)
1.3.3 Term Distributions
15(2)
1.3.4 Language Modeling
17(6)
1.4 Test Collections
23(4)
1.4.1 TREC Tasks
24(3)
1.5 Open-Source IR Systems
27(1)
1.5.1 Lucene
27(1)
1.5.2 Indri
27(1)
1.5.3 Wumpus
28(1)
1.6 Further Reading
28(2)
1.7 Exercises
30(2)
1.8 Bibliography
32(1)
2 Basic Techniques
33(51)
2.1 Inverted Indices
33(18)
2.1.1 Extended Example: Phrase Search
35(4)
2.1.2 Implementing Inverted Indices
39(6)
2.1.3 Documents and Other Elements
45(6)
2.2 Retrieval and Ranking
51(15)
2.2.1 The Vector Space Model
54(6)
2.2.2 Proximity Ranking
60(3)
2.2.3 Boolean Retrieval
63(3)
2.3 Evaluation
66(10)
2.3.1 Recall and Precision
67(1)
2.3.2 Effectiveness Measures for Ranked Retrieval
68(5)
2.3.3 Building a Test Collection
73(2)
2.3.4 Efficiency Measures
75(1)
2.4 Summary
76(1)
2.5 Further Reading
77(2)
2.6 Exercises
79(3)
2.7 Bibliography
82(2)
3 Tokens and Terms
84(19)
3.1 English
85(6)
3.1.1 Punctuation and Capitalization
85(1)
3.1.2 Stemming
86(3)
3.1.3 Stopping
89(2)
3.2 Characters
91(1)
3.3 Character N-Grams
92(2)
3.4 European Languages
94(1)
3.5 CJK Languages
95(2)
3.6 Further Reading
97(2)
3.7 Exercises
99(1)
3.8 Bibliography
100(3)
II Indexing 103(154)
4 Static Inverted Indices
104(33)
4.1 Index Components and Index Life Cycle
104(2)
4.2 The Dictionary
106(4)
4.3 Postings Lists
110(4)
4.4 Interleaving Dictionary and Postings Lists
114(4)
4.5 Index Construction
118(13)
4.5.1 In-Memory Index Construction
119(6)
4.5.2 Sort-Based Index Construction
125(2)
4.5.3 Merge-Based Index Construction
127(4)
4.6 Other Types of Indices
131(1)
4.7 Summary
132(1)
4.8 Further Reading
132(1)
4.9 Exercises
133(2)
4.10 Bibliography
135(2)
5 Query Processing
137(37)
5.1 Query Processing for Ranked Retrieval
137(23)
5.1.1 Document-at-a-Time Query Processing
139(6)
5.1.2 Term-at-a-Time Query Processing
145(6)
5.1.3 Precomputing Score Contributions
151(2)
5.1.4 Impact Ordering
153(1)
5.1.5 Static Index Pruning
153(7)
5.2 Lightweight Structure
160(9)
5.2.1 Generalized Concordance Lists
160(2)
5.2.2 Operators
162(1)
5.2.3 Examples
163(2)
5.2.4 Implementation
165(4)
5.3 Further Reading
169(1)
5.4 Exercises
170(1)
5.5 Bibliography
171(3)
6 Index Compression
174(54)
6.1 General-Purpose Data Compression
175(1)
6.2 Symbolwise Data Compression
176(15)
6.2.1 Modeling and Coding
177(4)
6.2.2 Huffman Coding
181(5)
6.2.3 Arithmetic Coding
186(3)
6.2.4 Symbolwise Text Compression
189(2)
6.3 Compressing Postings Lists
191(25)
6.3.1 Nonparametric Gap Compression
192(3)
6.3.2 Parametric Gap Compression
195(6)
6.3.3 Context-Aware Compression Methods
201(3)
6.3.4 Index Compression for High Query Performance
204(5)
6.3.5 Compression Effectiveness
209(3)
6.3.6 Decoding Performance
212(2)
6.3.7 Document Reordering
214(2)
6.4 Compressing the Dictionary
216(6)
6.5 Summary
222(1)
6.6 Further Reading
223(1)
6.7 Exercises
224(1)
6.8 Bibliography
225(3)
7 Dynamic Inverted Indices
228(29)
7.1 Batch Updates
229(2)
7.2 Incremental Index Updates
231(12)
7.2.1 Contiguous Inverted Lists
233(6)
7.2.2 Noncontiguous Inverted Lists
239(4)
7.3 Document Deletions
243(7)
7.3.1 Invalidation List
243(2)
7.3.2 Garbage Collection
245(5)
7.4 Document Modifications
250(1)
7.5 Discussion and Further Reading
251(2)
7.6 Exercises
253(1)
7.7 Bibliography
254(3)
III Retrieval and Ranking 257(148)
8 Probabilistic Retrieval
258(28)
8.1 Modeling Relevance
259(2)
8.2 The Binary Independence Model
261(3)
8.3 The Robertson/Sparck Jones Weighting Formula
264(2)
8.4 Term Frequency
266(5)
8.4.1 Bookstein's Two-Poisson Model
267(3)
8.4.2 Approximating the Two-Poisson Model
270(1)
8.4.3 Query Term Frequency
271(1)
8.5 Document Length: BM25
271(2)
8.6 Relevance Feedback
273(4)
8.6.1 Term Selection
274(1)
8.6.2 Pseudo-Relevance Feedback
275(2)
8.7 Field Weights: BM25F
277(2)
8.8 Experimental Comparison
279(1)
8.9 Further Reading
280(1)
8.10 Exercises
281(1)
8.11 Bibliography
282(4)
9 Language Modeling and Related Methods
286(24)
9.1 Generating Queries from Documents
287(2)
9.2 Language Models and Smoothing
289(3)
9.3 Ranking with Language Models
292(4)
9.4 Kullback-Leibler Divergence
296(2)
9.5 Divergence from Randomness
298(4)
9.5.1 A Model of Randomness
299(2)
9.5.2 Eliteness
301(1)
9.5.3 Document Length Normalization
301(1)
9.6 Passage Retrieval and Ranking
302(4)
9.6.1 Passage Scoring
304(1)
9.6.2 Implementation
304(2)
9.7 Experimental Comparison
306(1)
9.8 Further Reading
306(1)
9.9 Exercises
307(1)
9.10 Bibliography
307(3)
10 Categorization and Filtering
310(66)
10.1 Detailed Examples
313(18)
10.1.1 Topic-Oriented Batch Filtering
313(4)
10.1.2 On-Line Filtering
317(1)
10.1.3 Learning from Historical Examples
318(2)
10.1.4 Language Categorization
320(5)
10.1.5 On-Line Adaptive Spam Filtering
325(4)
10.1.6 Threshold Choice for Binary Categorization
329(2)
10.2 Classification
331(8)
10.2.1 Odds and Odds Ratios
333(1)
10.2.2 Building Classifiers
334(2)
10.2.3 Learning Modes
336(2)
10.2.4 Feature Engineering
338(1)
10.3 Probabilistic Classifiers
339(10)
10.3.1 Probability Estimates
340(3)
10.3.2 Combining Probability Estimates
343(4)
10.3.3 Practical Considerations
347(2)
10.4 Linear Classifiers
349(5)
10.4.1 Perceptron Algorithm
352(1)
10.4.2 Support Vector Machines
353(1)
10.5 Similarity-Based Classifiers
354(1)
10.5.1 Rocchio's Method
354(1)
10.5.2 Memory-Based Methods
355(1)
10.6 Generalized Linear Models
355(4)
10.6.1 Kernel Methods
357(2)
10.7 Information-Theoretic Models
359(7)
10.7.1 Comparing Models
360(1)
10.7.2 Sequential Compression Models
361(3)
10.7.3 Decision Trees and Stumps
364(2)
10.8 Experimental Comparison
366(5)
10.8.1 Topic-Oriented On-Line Filtering
367(2)
10.8.2 On-Line Adaptive Spam Filtering
369(2)
10.9 Further Reading
371(1)
10.10 Exercises
372(1)
10.11 Bibliography
373(3)
11 Fusion and Metalearning
376(29)
11.1 Search-Result Fusion
377(4)
11.1.1 Fixed-Cutoff Aggregation
379(1)
11.1.2 Rank and Score Aggregation
380(1)
11.2 Stacking Adaptive Filters
381(2)
11.3 Stacking Batch Classifiers
383(2)
11.3.1 Holdout Validation
383(1)
11.3.2 Cross-Validation
384(1)
11.4 Bagging
385(2)
11.5 Boosting
387(1)
11.6 Multicategory Ranking and Classification
388(6)
11.6.1 Document Versus Category Scores
389(1)
11.6.2 Document Versus Category Rank Fusion
390(1)
11.6.3 Multicategory Methods
391(3)
11.7 Learning to Rank
394(6)
11.7.1 What Is Learning to Rank?
395(1)
11.7.2 Learning-to-Rank Methods
396(1)
11.7.3 What to Optimize?
396(1)
11.7.4 Learning to Rank for Categorization
397(1)
11.7.5 Learning for Ranked IR
398(1)
11.7.6 The LETOR Data Set
399(1)
11.8 Further Reading
400(1)
11.9 Exercises
401(1)
11.10 Bibliography
401(4)
IV Evaluation 405(82)
12 Measuring Effectiveness
406(62)
12.1 Traditional Effectiveness Measures
407(3)
12.1.1 Recall and Precision
407(1)
12.1.2 Precision at k Documents (P@k)
408(1)
12.1.3 Average Precision (AP)
408(1)
12.1.4 Reciprocal Rank (RR)
409(1)
12.1.5 Arithmetic Mean Versus Geometric Mean
409(1)
12.1.6 User Satisfaction
410(1)
12.2 The Text REtrieval Conference (TREC)
410(2)
12.3 Using Statistics in Evaluation
412(29)
12.3.1 Foundations and Terminology
413(3)
12.3.2 Confidence Intervals
416(8)
12.3.3 Comparative Evaluation
424(3)
12.3.4 Hypothesis Tests Considered Harmful
427(2)
12.3.5 Paired and Unpaired Differences
429(1)
12.3.6 Significance Tests
430(4)
12.3.7 Validity and Power of Statistical Tests
434(4)
12.3.8 Reporting the Precision of Measurement
438(1)
12.3.9 Meta-Analysis
439(2)
12.4 Minimizing Adjudication Effort
441(10)
12.4.1 Selecting Documents for Adjudication
443(6)
12.4.2 Sampling the Pool
449(2)
12.5 Nontraditional Effectiveness Measures
451(9)
12.5.1 Graded Relevance
451(2)
12.5.2 Incomplete and Biased Judgments
453(2)
12.5.3 Novelty and Diversity
455(5)
12.6 Further Reading
460(2)
12.7 Exercises
462(1)
12.8 Bibliography
463(5)
13 Measuring Efficiency
468(19)
13.1 Efficiency Criteria
468(4)
13.1.1 Throughput and Latency
469(3)
13.1.2 Aggregate Statistics and User Satisfaction
472(1)
13.2 Queueing Theory
472(6)
13.2.1 Kendall's Notation
474(1)
13.2.2 The M/M/1 Queueing Model
475(2)
13.2.3 Latency Quantiles and Average Utilization
477(1)
13.3 Query Scheduling
478(1)
13.4 Caching
479(5)
13.4.1 Three-Level Caching
480(2)
13.4.2 Cache Policies
482(1)
13.4.3 Prefetching Search Results
483(1)
13.5 Further Reading
484(1)
13.6 Exercises
484(1)
13.7 Bibliography
485(2)
V Applications and Extensions 487(104)
14 Parallel Information Retrieval
488(19)
14.1 Parallel Query Processing
488(10)
14.1.1 Document Partitioning
490(3)
14.1.2 Term Partitioning
493(2)
14.1.3 Hybrid Schemes
495(1)
14.1.4 Redundancy and Fault Tolerance
496(2)
14.2 MapReduce
498(5)
14.2.1 The Basic Framework
498(2)
14.2.2 Combiners
500(2)
14.2.3 Secondary Keys
502(1)
14.2.4 Machine Failures
502(1)
14.3 Further Reading
503(1)
14.4 Exercises
504(1)
14.5 Bibliography
505(2)
15 Web Search
507(57)
15.1 The Structure of the Web
508(5)
15.1.1 The Web Graph
508(2)
15.1.2 Static and Dynamic Pages
510(1)
15.1.3 The Hidden Web
511(1)
15.1.4 The Size of the Web
511(2)
15.2 Queries and Users
513(4)
15.2.1 User Intent
513(3)
15.2.2 Clickthrough Curves
516(1)
15.3 Static Ranking
517(18)
15.3.1 Basic PageRank
517(5)
15.3.2 Extended PageRank
522(6)
15.3.3 Properties of PageRank
528(4)
15.3.4 Other Link Analysis Methods: HITS and SALSA
532(3)
15.3.5 Other Static Ranking Methods
535(1)
15.4 Dynamic Ranking
535(3)
15.4.1 Anchor Text
536(1)
15.4.2 Novelty
537(1)
15.5 Evaluating Web Search
538(3)
15.5.1 Named Page Finding
538(2)
15.5.2 Implicit User Feedback
540(1)
15.6 Web Crawlers
541(12)
15.6.1 Components of a Crawler
542(5)
15.6.2 Crawl Order
547(2)
15.6.3 Duplicates and Near-Duplicates
549(4)
15.7 Summary
553(1)
15.8 Further Reading
553(3)
15.8.1 Link Analysis
554(1)
15.8.2 Anchor Text
555(1)
15.8.3 Implicit Feedback
555(1)
15.8.4 Web Crawlers
556(1)
15.9 Exercises
556(2)
15.10 Bibliography
558(6)
16 XML Retrieval
564(27)
16.1 The Essence of XML
565(6)
16.1.1 Document Type Definitions
568(2)
16.1.2 XML Schema
570(1)
16.2 Paths, Trees, and FLWORs
571(5)
16.2.1 XPath
571(1)
16.2.2 NEXT
572(2)
16.2.3 XQuery
574(2)
16.3 Indexing and Query Processing
576(3)
16.4 Ranked Retrieval
579(4)
16.4.1 Ranking Elements
580(2)
16.4.2 Overlapping Elements
582(1)
16.4.3 Retrievable Elements
583(1)
16.5 Evaluation
583(2)
16.5.1 Test Collections
583(1)
16.5.2 Effectiveness Measures
584(1)
16.6 Further Reading
585(2)
16.7 Exercises
587(1)
16.8 Bibliography
587(4)
VI Appendix 591(6)
A Computer Performance
592(5)
A.1 Sequential Versus Random Access on Disk
592(1)
A.2 Sequential Versus Random Access in RAM
593(1)
A.3 Pipelined Execution and Branch Prediction
594(3)
Index 597