  • Format: PDF+DRM
  • Price: 59,79 €*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Textual Statistics with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written package in R. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. It will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.

Reviews

"Even though textual data science cannot be considered as the youngest sibling of other data science fields, there is still quite a big space to be filled with up-to-date textbooks describing and analyzing various methods and facets of this very interesting topic. In this book, Mónica Bécue-Bertaut tries to fill this gap, giving theoretical and practical instructions about one of the relatively little known, but powerful methods in textual data science: Correspondence Analysis (CA)... Extensive graphical images and visualizations represented by various types of plot and diagram are used throughout the material, which provides an even better aid to the reader for grasping the main ideas of the topic... separate mention should be drawn to the language used in the book. It is clear, simple, and even fun to read, providing an understandable way of covering complex topics... Mónica Bécue-Bertaut achieved a good blend of theory and practice in her book, which can be used as a handy resource for students and beginners in data science, as well as for specialists in textual data analysis." - Gia Jgarkava, ISCB December 2019

Foreword
Preface
1 Encoding: from a corpus to statistical tables
1.1 Textual and contextual data
1.1.1 Textual data
1.1.2 Contextual data
1.1.3 Documents and aggregate documents
1.2 Examples and notation
1.3 Choosing textual units
1.3.1 Graphical forms
1.3.2 Lemmas
1.3.3 Stems
1.3.4 Repeated segments
1.3.5 In practice
1.4 Preprocessing
1.4.1 Unique spelling
1.4.2 Partially automated preprocessing
1.4.3 Word selection
1.5 Word and segment indexes
1.6 The Life_UK corpus: preliminary results
1.6.1 Verbal content through word and repeated segment indexes
1.6.2 Univariate description of contextual variables
1.6.3 A note on the frequency range
1.7 Implementation with Xplortext
1.8 Summary
2 Correspondence analysis of textual data
2.1 Data and goals
2.1.1 Correspondence analysis: a tool for linguistic data analysis
2.1.2 Data: a small example
2.1.3 Objectives
2.2 Associations between documents and words
2.2.1 Profile comparisons
2.2.2 Independence of documents and words
2.2.3 The χ² test
2.2.4 Association rates between documents and words
2.3 Active row and column clouds
2.3.1 Row and column profile spaces
2.3.2 Distributional equivalence and the χ² distance
2.3.3 Inertia of a cloud
2.4 Fitting document and word clouds
2.4.1 Factorial axes
2.4.2 Visualizing rows and columns
2.4.2.1 Category representation
2.4.2.2 Word representation
2.4.2.3 Transition formulas
2.4.2.4 Simultaneous representation of rows and columns
2.5 Interpretation aids
2.5.1 Eigenvalues and representation quality of the clouds
2.5.2 Contribution of documents and words to axis inertia
2.5.3 Representation quality of a point
2.6 Supplementary rows and columns
2.6.1 Supplementary tables
2.6.2 Supplementary frequency rows and columns
2.6.3 Supplementary quantitative and qualitative variables
2.7 Validating the visualization
2.8 Interpretation scheme for textual CA results
2.9 Implementation with Xplortext
2.10 Summary of the CA approach
3 Applications of correspondence analysis
3.1 Choosing the level of detail for analyses
3.2 Correspondence analysis on aggregate free text answers
3.2.1 Data and objectives
3.2.2 Word selection
3.2.3 CA on the aggregate table
3.2.3.1 Document representation
3.2.3.2 Word representation
3.2.3.3 Simultaneous interpretation of the plots
3.2.4 Supplementary elements
3.2.4.1 Supplementary words
3.2.4.2 Supplementary repeated segments
3.2.4.3 Supplementary categories
3.2.5 Implementation with Xplortext
3.3 Direct analysis
3.3.1 Data and objectives
3.3.2 The main features of direct analysis
3.3.3 Direct analysis of the culture question
3.3.4 Implementation with Xplortext
4 Clustering in textual data science
4.1 Clustering documents
4.2 Dissimilarity measures between documents
4.3 Measuring partition quality
4.3.1 Document clusters in the factorial space
4.3.2 Partition quality
4.4 Dissimilarity measures between document clusters
4.4.1 The single-linkage method
4.4.2 The complete-linkage method
4.4.3 Ward's method
4.5 Agglomerative hierarchical clustering
4.5.1 Hierarchical tree construction algorithm
4.5.2 Selecting the final partition
4.5.3 Interpreting clusters
4.6 Direct partitioning
4.7 Combining clustering methods
4.7.1 Consolidating partitions
4.7.2 Direct partitioning followed by AHC
4.8 A procedure for combining CA and clustering
4.9 Example: joint use of CA and AHC
4.9.1 Data and objectives
4.9.1.1 Data preprocessing using CA
4.9.1.2 Constructing the hierarchical tree
4.9.1.3 Choosing the final partition
4.10 Contiguity-constrained hierarchical clustering
4.10.1 Principles and algorithm
4.10.2 AHC of age groups with a chronological constraint
4.10.3 Implementation with Xplortext
4.11 Example: clustering free text answers
4.11.1 Data and objectives
4.11.2 Data preprocessing
4.11.2.1 CA: eigenvalues and total inertia
4.11.2.2 Interpreting the first axes
4.11.3 AHC: building the tree and choosing the final partition
4.12 Describing cluster features
4.12.1 Lexical features of clusters
4.12.1.1 Describing clusters in terms of characteristic words
4.12.1.2 Describing clusters in terms of characteristic documents
4.12.2 Describing clusters using contextual variables
4.12.2.1 Describing clusters using contextual qualitative variables
4.12.2.2 Describing clusters using quantitative contextual variables
4.12.3 Implementation with Xplortext
4.13 Summary of the use of AHC on factorial coordinates coming from CA
5 Lexical characterization of parts of a corpus
5.1 Characteristic words
5.2 Characteristic words and CA
5.3 Characteristic words and clustering
5.3.1 Clustering based on verbal content
5.3.2 Clustering based on contextual variables
5.3.3 Hierarchical words
5.4 Characteristic documents
5.5 Example: characteristic elements and CA
5.5.1 Characteristic words for the categories
5.5.2 Characteristic words and factorial planes
5.5.3 Documents that characterize categories
5.6 Characteristic words in addition to clustering
5.7 Implementation with Xplortext
6 Multiple factor analysis for textual data
6.1 Multiple tables in textual data analysis
6.2 Data and objectives
6.2.1 Data preprocessing
6.2.2 Problems posed by lemmatization
6.2.3 Description of the corpora data
6.2.4 Indexes of the most frequent words
6.2.5 Notation
6.2.6 Objectives
6.3 Introduction to MFACT
6.3.1 The limits of CA on multiple contingency tables
6.3.2 How MFACT works
6.3.3 Integrating contextual variables
6.4 Analysis of multilingual free text answers
6.4.1 MFACT: eigenvalues of the global analysis
6.4.2 Representation of documents and words
6.4.3 Superimposed representation of the global and partial configurations
6.4.4 Links between the axes of the global analysis and the separate analyses
6.4.5 Representation of the groups of words
6.4.6 Implementation with Xplortext
6.5 Simultaneous analysis of two open-ended questions: impact of lemmatization
6.5.1 Objectives
6.5.2 Preliminary steps
6.5.3 MFACT on the left and right: lemmatized or non-lemmatized
6.5.4 Implementation with Xplortext
6.6 Other applications of MFACT in textual data science
6.7 MFACT summary
7 Applications and analysis workflows
7.1 General rules for presenting results
7.2 Analyzing bibliographic databases
7.2.1 Introduction
7.2.2 The lupus data
7.2.2.1 The corpus
7.2.2.2 Exploratory analysis of the corpus
7.2.3 CA of the documents × words table
7.2.3.1 The eigenvalues
7.2.3.2 Meta-keys and doc-keys
7.2.4 Analysis of the year-aggregate table
7.2.4.1 Eigenvalues and CA of the lexical table
7.2.5 Chronological study of drug names
7.2.6 Implementation with Xplortext
7.2.7 Conclusions from the study
7.3 Badinter's speech: a discursive strategy
7.3.1 Introduction
7.3.2 Methods
7.3.2.1 Breaking up the corpus into documents
7.3.2.2 The speech trajectory unveiled by CA
7.3.3 Results
7.3.4 Argument flow
7.3.5 Conclusions on the study of Badinter's speech
7.3.6 Implementation with Xplortext
7.4 Political speeches
7.4.1 Introduction
7.4.2 Data and objectives
7.4.3 Methodology
7.4.4 Results
7.4.4.1 Data preprocessing
7.4.4.2 Lexicometric characteristics of the 11 speeches and lexical table coding
7.4.4.3 Eigenvalues and Cramér's V
7.4.4.4 Speech trajectory
7.4.4.5 Word representation
7.4.4.6 Remarks
7.4.4.7 Hierarchical structure of the corpus
7.4.4.8 Conclusions
7.4.5 Implementation with Xplortext
7.5 Corpus of sensory descriptions
7.5.1 Introduction
7.5.2 Data
7.5.2.1 Eight Catalan wines
7.5.2.2 Jury
7.5.2.3 Verbal categorization
7.5.2.4 Encoding the data
7.5.3 Objectives
7.5.4 Statistical methodology
7.5.4.1 MFACT and constructing the mean configuration
7.5.4.2 Determining consensual words
7.5.5 Results
7.5.5.1 Data preprocessing
7.5.5.2 Some initial results
7.5.5.3 Individual configurations
7.5.5.4 MFACT: directions of inertia common to the majority of groups
7.5.5.5 MFACT: representing words and documents on the first plane
7.5.5.6 Word contributions
7.5.5.7 MFACT: group representation
7.5.5.8 Consensual words
7.5.6 Conclusion
7.5.7 Implementation with Xplortext
Appendix: Textual data science packages in R
Bibliography
Index
Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and offered numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut published several books (in French or Spanish) and chapters (in English) on this last topic. She also participated in the design of software related to textual data science, such as SPAD.T and Xplortext, the latter being an R package.