E-book: Practical Text Mining with Perl

Roger Bilisoly (Central Connecticut State University)
  • Format: PDF+DRM
  • Price: 137,02 €*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on mobile devices (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on Amazon Kindle.

Bilisoly (Central Connecticut State U.) has written this introductory guide to text mining through the use of Perl, an open-source programming tool that can be downloaded from online sources at no cost to the user. By covering such basics as regular expressions, text pattern methodology and quantitative text summaries, the author provides a tutorial on efficient and thorough text mining applications, including the bag-of-words model, the TF-IDF similarity measure, concordance lines and corpus linguistics. Designed primarily for text mining students and professionals who wish to enhance their information access, this book also explores the use of multivariate techniques such as correlation, principal components analysis and clustering. Annotation ©2008 Book News, Inc., Portland, OR (booknews.com)

Provides readers with the methods, algorithms, and means to perform text mining tasks

This book is devoted to the fundamentals of text mining using Perl, an open-source programming tool that is freely available via the Internet (www.perl.org). It covers mining ideas from several perspectives--statistics, data mining, linguistics, and information retrieval--and provides readers with the means to successfully complete text mining tasks on their own.

The book begins with an introduction to regular expressions, a text pattern methodology, and quantitative text summaries, all of which are fundamental tools for analyzing text. Then, it builds upon this foundation to explore:

  • Probability and texts, including the bag-of-words model
  • Information retrieval techniques such as the TF-IDF similarity measure
  • Concordance lines and corpus linguistics
  • Multivariate techniques such as correlation, principal components analysis, and clustering
  • Perl modules, German, and permutation tests

Each chapter is devoted to a single key topic, and the author carefully and thoughtfully introduces mathematical concepts as they arise, allowing readers to learn as they go without having to refer to additional books. The inclusion of numerous exercises and worked-out examples further complements the book's student-friendly format.

Practical Text Mining with Perl is ideal as a textbook for undergraduate and graduate courses in text mining and as a reference for a variety of professionals who are interested in extracting information from text documents.

Reviews

"Practical Text Mining with Perl is an excellent book for readers at a variety of different programming skill levels ... Bilisoly's book would serve as a good text for an introductory text mining course, and could be supplemented with lecture notes for Web mining or data mining courses." (Journal of Statistical Software, January 2009)

List of Figures
List of Tables
Preface
Acknowledgments
Introduction
  • Overview of this Book
  • Text Mining and Related Fields
  • Pattern Matching
  • Data Structures
  • Probability
  • Information Retrieval
  • Corpus Linguistics
  • Multivariate Statistics
  • Clustering
  • Three Additional Topics
  • Advice for Reading this Book
Text Patterns
  • Introduction
  • Regular Expressions
  • First Regex: Finding the Word Cat
  • Character Ranges and Finding Telephone Numbers
  • Testing Regexes with Perl
  • Finding Words in a Text
  • Regex Summary
  • Nineteenth-Century Literature
  • Perl Variables and the Function split
  • Match Variables
  • Decomposing Poe's "The Tell-Tale Heart" into Words
  • Dashes and String Substitutions
  • Hyphens
  • Apostrophes
  • A Simple Concordance
  • Command Line Arguments
  • Writing to Files
  • First Attempt at Extracting Sentences
  • Sentence Segmentation Preliminaries
  • Sentence Segmentation for A Christmas Carol
  • Leftmost Greediness and Sentence Segmentation
  • Regex Odds and Ends
  • Match Variables and Backreferences
  • Regular Expression Operators and Their Output
  • Lookaround
  • References
  • Problems
Quantitative Text Summaries
  • Introduction
  • Scalars, Interpolation, and Context in Perl
  • Arrays and Context in Perl
  • Word Lengths in Poe's "The Tell-Tale Heart"
  • Arrays and Functions
  • Adding and Removing Entries from Arrays
  • Selecting Subsets of an Array
  • Sorting an Array
  • Hashes
  • Using a Hash
  • Two Text Applications
  • Zipf's Law for A Christmas Carol
  • Perl for Word Games
  • An Aid to Crossword Puzzles
  • Word Anagrams
  • Finding Words in a Set of Letters
  • Complex Data Structures
  • References and Pointers
  • Arrays of Arrays and Beyond
  • Application: Comparing the Words in Two Poe Stories
  • References
  • First Transition
  • Problems
Probability and Text Sampling
  • Introduction
  • Probability
  • Probability and Coin Flipping
  • Probabilities and Texts
  • Estimating Letter Probabilities for Poe and Dickens
  • Estimating Letter Bigram Probabilities
  • Conditional Probability
  • Independence
  • Mean and Variance of Random Variables
  • Sampling and Error Estimates
  • The Bag-of-Words Model for Poe's "The Black Cat"
  • The Effect of Sample Size
  • Tokens vs. Types in Poe's "Hans Pfaall"
  • References
  • Problems
Applying Information Retrieval to Text Mining
  • Introduction
  • Counting Letters and Words
  • Counting Letters in Poe with Perl
  • Counting Pronouns Occurring in Poe
  • Text Counts and Vectors
  • Vectors and Angles for Two Poe Stories
  • Computing Angles Between Vectors
  • Subroutines in Perl
  • Computing the Angle between Vectors
  • The Term-Document Matrix Applied to Poe
  • Matrix Multiplication
  • Matrix Multiplications Applied to Poe
  • Function of Counts
  • Document Similarity
  • Inverse Document Frequency
  • Poe Story Angles Revisited
  • References
  • Problems
Concordance Lines and Corpus Linguistics
  • Introduction
  • Sampling
  • Statistical Survey Sampling
  • Text Sampling
  • Corpus as Baseline
  • Function vs. Content Words in Dickens, London, and Shelley
  • Concordancing
  • Sorting Concordance Lines
  • Code for Sorting Concordance Lines
  • Applications: Word Usage Differences between London and Shelley
  • Application: Word Morphology of Adverbs
  • Collocations and Concordance Lines
  • More Ways to Sort Concordance Lines
  • Application: Phrasal Verbs in The Call of the Wild
  • Grouping Words: Colors in The Call of the Wild
  • Applications with References
  • Second Transition
  • Problems
Multivariate Techniques with Text
  • Introduction
  • Basic Statistics
  • z-Scores Applied to Poe
  • Word Correlations among Poe's Short Stories
  • Correlations and Cosines
  • Correlations and Covariances
  • Basic Linear Algebra
  • 2 by 2 Correlation Matrices
  • Principal Components Analysis
  • Finding the Principal Components
  • PCA Applied to the 68 Poe Short Stories
  • Another PCA Example with Poe's Short Stories
  • Rotations
  • Text Applications
  • A Word on Factor Analysis
  • Applications and References
  • Problems
Text Clustering
  • Introduction
  • Clustering
  • Two-Variable Example of k-Means
  • k-Means with R
  • He versus She in Poe's Short Stories
  • Poe Clusters Using Eight Pronouns
  • Clustering Poe Using Principal Components
  • Hierarchical Clustering of Poe's Short Stories
  • A Note on Classification
  • Decision Trees and Overfitting
  • References
  • Last Transition
  • Problems
A Sample of Additional Topics
  • Introduction
  • Perl Modules
  • Modules for Number Words
  • The StopWords Module
  • The Sentence Segmentation Module
  • An Object-Oriented Module for Tagging
  • Miscellaneous Modules
  • Other Languages: Analyzing Goethe in German
  • Permutation Tests
  • Runs and Hypothesis Testing
  • Distribution of Character Names in Dickens and London
  • References
Appendix A: Overview of Perl for Text Mining
  • Basic Data Structures
  • Special Variables and Arrays
  • Operators
  • Branching and Looping
  • A Few Perl Functions
  • Introduction to Regular Expressions
Appendix B: Summary of R used in this Book
  • Basics of R
  • Data Entry
  • Basic Operators
  • Matrix Manipulation
  • This Book's R Code
References
Index

Roger Bilisoly, PhD, is an Assistant Professor of Statistics at Central Connecticut State University, where he developed and teaches a new graduate-level course in text mining for the school's data mining program.