Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining [Paperback]

  • Format: Paperback / softback, 530 pages, height x width x thickness: 234x190x27 mm, weight: 1055 g
  • Series: ACM Books
  • Publication date: 30-Jun-2016
  • Publisher: Morgan & Claypool Publishers
  • ISBN-10: 197000116X
  • ISBN-13: 9781970001167

Recent years have seen dramatic growth in natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful software tools that help people analyze and manage vast amounts of text data effectively and efficiently. Unlike data generated by computer systems or sensors, text data are usually generated directly by humans and carry semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to the many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas and are thus relatively easy for computers to handle, text has less explicit structure, so a computer must process it to understand the content it encodes. Current natural language processing technology cannot yet enable a computer to understand natural language text precisely, but a wide range of statistical and heuristic approaches to the analysis and management of text data have been developed over the past few decades. These approaches are usually very robust and can be applied to text data in any natural language and on any topic.

This book provides a systematic introduction to all these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. The focus is on text mining applications that can help users analyze patterns in text data to extract and reveal useful knowledge. Information retrieval systems, including search engines and recommender systems, are also covered as supporting technology for text mining applications. The book covers the major concepts, techniques, and ideas in text data mining and information retrieval from a practical viewpoint. It includes many hands-on exercises designed with a companion software toolkit, MeTA, to help readers learn how to apply text mining and information retrieval techniques to real-world text data and how to experiment with and improve some of the algorithms for interesting application tasks. The book can be used as a textbook for a computer science undergraduate course or as a reference book for practitioners working on relevant problems in analyzing and managing text data.

Preface
Acknowledgments
PART I OVERVIEW AND BACKGROUND
Chapter 1 Introduction
1.1 Functions of Text Information Systems
1.2 Conceptual Framework for Text Information Systems
1.3 Organization of the Book
1.4 How to Use this Book
Bibliographic Notes and Further Reading
Chapter 2 Background
2.1 Basics of Probability and Statistics
2.2 Information Theory
2.3 Machine Learning
Bibliographic Notes and Further Reading
Exercises
Chapter 3 Text Data Understanding
3.1 History and State of the Art in NLP
3.2 NLP and Text Information Systems
3.3 Text Representation
3.4 Statistical Language Models
Bibliographic Notes and Further Reading
Exercises
Chapter 4 MeTA: A Unified Toolkit for Text Data Management and Analysis
4.1 Design Philosophy
4.2 Setting up MeTA
4.3 Architecture
4.4 Tokenization with MeTA
4.5 Related Toolkits
Exercises
PART II TEXT DATA ACCESS
Chapter 5 Overview of Text Data Access
5.1 Access Mode: Pull vs. Push
5.2 Multimode Interactive Access
5.3 Text Retrieval
5.4 Text Retrieval vs. Database Retrieval
5.5 Document Selection vs. Document Ranking
Bibliographic Notes and Further Reading
Exercises
Chapter 6 Retrieval Models
6.1 Overview
6.2 Common Form of a Retrieval Function
6.3 Vector Space Retrieval Models
6.4 Probabilistic Retrieval Models
Bibliographic Notes and Further Reading
Exercises
Chapter 7 Feedback
7.1 Feedback in the Vector Space Model
7.2 Feedback in Language Models
Bibliographic Notes and Further Reading
Exercises
Chapter 8 Search Engine Implementation
8.1 Tokenizer
8.2 Indexer
8.3 Scorer
8.4 Feedback Implementation
8.5 Compression
8.6 Caching
Bibliographic Notes and Further Reading
Exercises
Chapter 9 Search Engine Evaluation
9.1 Introduction
9.2 Evaluation of Set Retrieval
9.3 Evaluation of a Ranked List
9.4 Evaluation with Multi-level Judgements
9.5 Practical Issues in Evaluation
Bibliographic Notes and Further Reading
Exercises
Chapter 10 Web Search
10.1 Web Crawling
10.2 Web Indexing
10.3 Link Analysis
10.4 Learning to Rank
10.5 The Future of Web Search
Bibliographic Notes and Further Reading
Exercises
Chapter 11 Recommender Systems
11.1 Content-based Recommendation
11.2 Collaborative Filtering
11.3 Evaluation of Recommender Systems
Bibliographic Notes and Further Reading
Exercises
PART III TEXT DATA ANALYSIS
Chapter 12 Overview of Text Data Analysis
12.1 Motivation: Applications of Text Data Analysis
12.2 Text vs. Non-text Data: Humans as Subjective Sensors
12.3 Landscape of Text Mining Tasks
Chapter 13 Word Association Mining
13.1 General Idea of Word Association Mining
13.2 Discovery of Paradigmatic Relations
13.3 Discovery of Syntagmatic Relations
13.4 Evaluation of Word Association Mining
Bibliographic Notes and Further Reading
Exercises
Chapter 14 Text Clustering
14.1 Overview of Clustering Techniques
14.2 Document Clustering
14.3 Term Clustering
14.4 Evaluation of Text Clustering
Bibliographic Notes and Further Reading
Exercises
Chapter 15 Text Categorization
15.1 Introduction
15.2 Overview of Text Categorization Methods
15.3 Text Categorization Problem
15.4 Features for Text Categorization
15.5 Classification Algorithms
15.6 Evaluation of Text Categorization
Bibliographic Notes and Further Reading
Exercises
Chapter 16 Text Summarization
16.1 Overview of Text Summarization Techniques
16.2 Extractive Text Summarization
16.3 Abstractive Text Summarization
16.4 Evaluation of Text Summarization
16.5 Applications of Text Summarization
Bibliographic Notes and Further Reading
Exercises
Chapter 17 Topic Analysis
17.1 Topics as Terms
17.2 Topics as Word Distributions
17.3 Mining One Topic from Text
17.4 Probabilistic Latent Semantic Analysis
17.5 Extension of PLSA and Latent Dirichlet Allocation
17.6 Evaluating Topic Analysis
17.7 Summary of Topic Models
Bibliographic Notes and Further Reading
Exercises
Chapter 18 Opinion Mining and Sentiment Analysis
18.1 Sentiment Classification
18.2 Ordinal Regression
18.3 Latent Aspect Rating Analysis
18.4 Evaluation of Opinion Mining and Sentiment Analysis
Bibliographic Notes and Further Reading
Exercises
Chapter 19 Joint Analysis of Text and Structured Data
19.1 Introduction
19.2 Contextual Text Mining
19.3 Contextual Probabilistic Latent Semantic Analysis
19.4 Topic Analysis with Social Networks as Context
19.5 Topic Analysis with Time Series Context
19.6 Summary
Bibliographic Notes and Further Reading
Exercises
PART IV UNIFIED TEXT DATA MANAGEMENT ANALYSIS SYSTEM
Chapter 20 Toward A Unified System for Text Management and Analysis
20.1 Text Analysis Operators
20.2 System Architecture
20.3 MeTA as a Unified System
Appendix A Bayesian Statistics
A.1 Binomial Estimation and the Beta Distribution
A.2 Pseudo Counts, Smoothing, and Setting Hyperparameters
A.3 Generalizing to a Multinomial Distribution
A.4 The Dirichlet Distribution
A.5 Bayesian Estimate of Multinomial Parameters
A.6 Conclusion
Appendix B Expectation-Maximization
B.1 A Simple Mixture Unigram Language Model
B.2 Maximum Likelihood Estimation
B.3 Incomplete vs. Complete Data
B.4 A Lower Bound of Likelihood
B.5 The General Procedure of EM
Appendix C KL-divergence and Dirichlet Prior Smoothing
C.1 Using KL-divergence for Retrieval
C.2 Using Dirichlet Prior Smoothing
C.3 Computing the Query Model p(w | θ_Q)
References
Index
Authors' Biographies
ChengXiang Zhai is a Professor of Computer Science and Willett Faculty Scholar at the University of Illinois at Urbana-Champaign, where he is also affiliated with the Graduate School of Library and Information Science, Institute for Genomic Biology, and Department of Statistics. He received a Ph.D. in Computer Science from Nanjing University in 1990, and a Ph.D. in Language and Information Technologies from Carnegie Mellon University in 2002. He worked at Clairvoyance Corp. as a Research Scientist and then Senior Research Scientist from 1997 to 2000. His research interests include information retrieval, text mining, natural language processing, machine learning, biomedical and health informatics, and intelligent education information systems. He has published over 200 research papers in major conferences and journals. He served as an Associate Editor for Information Processing and Management, as an Associate Editor of ACM Transactions on Information Systems, and on the editorial board of Information Retrieval Journal. He was a conference program co-chair of ACM CIKM 2004, NAACL HLT 2007, ACM SIGIR 2009, ECIR 2014, ICTIR 2015, and WWW 2015, and conference general co-chair for ACM CIKM 2016. He is an ACM Distinguished Scientist and a recipient of multiple awards, including the ACM SIGIR 2004 Best Paper Award, the ACM SIGIR 2014 Test of Time Paper Award, Alfred P. Sloan Research Fellowship, IBM Faculty Award, HP Innovation Research Program Award, Microsoft Beyond Search Research Award, and the Presidential Early Career Award for Scientists and Engineers (PECASE).

Sean Massung is a Ph.D. candidate in computer science at the University of Illinois at Urbana-Champaign, where he also received both his B.S. and M.S. degrees. He is a co-founder of MeTA and uses it in all of his research. He has been an instructor for CS 225: Data Structures and Programming Principles, CS 410: Text Information Systems, and CS 591txt: Text Mining Seminar. He is included in the 2014 List of Teachers Ranked as Excellent at the University of Illinois and has received an Outstanding Teaching Assistant Award and a CS@Illinois Outstanding Research Project Award. He has given talks at Jump Labs Champaign and at UIUC for the Data and Information Systems Seminar, Intro to Big Data, and the Teaching Assistant Seminar. His research interests include text mining applications in information retrieval, natural language processing, and education.