Foreword |
|
xiii | |
Preface |
|
xv | |
1 Encoding: from a corpus to statistical tables |
|
1 | (16) |
|
1.1 Textual and contextual data |
|
|
1 | (2) |
|
|
1 | (1) |
|
|
2 | (1) |
|
1.1.3 Documents and aggregate documents |
|
|
2 | (1) |
|
1.2 Examples and notation |
|
|
3 | (2) |
|
1.3 Choosing textual units |
|
|
5 | (4) |
|
|
6 | (1) |
|
|
6 | (1) |
|
|
7 | (1) |
|
|
7 | (1) |
|
|
7 | (2) |
|
|
9 | (1) |
|
|
9 | (1) |
|
1.4.2 Partially automated preprocessing |
|
|
9 | (1) |
|
|
10 | (1) |
|
1.5 Word and segment indexes |
|
|
10 | (1) |
|
1.6 The Life_UK corpus: preliminary results |
|
|
10 | (4) |
|
1.6.1 Verbal content through word and repeated segment indexes |
|
|
10 | (3) |
|
1.6.2 Univariate description of contextual variables |
|
|
13 | (1) |
|
1.6.3 A note on the frequency range |
|
|
13 | (1) |
|
1.7 Implementation with Xplortext |
|
|
14 | (1) |
|
|
15 | (2) |
2 Correspondence analysis of textual data |
|
17 | (26) |
|
|
17 | (2) |
|
2.1.1 Correspondence analysis: a tool for linguistic data analysis |
|
|
17 | (1) |
|
2.1.2 Data: a small example |
|
|
17 | (1) |
|
|
18 | (1) |
|
2.2 Associations between documents and words |
|
|
19 | (5) |
|
2.2.1 Profile comparisons |
|
|
19 | (1) |
|
2.2.2 Independence of documents and words |
|
|
20 | (2) |
|
|
22 | (1) |
|
2.2.4 Association rates between documents and words |
|
|
23 | (1) |
|
2.3 Active row and column clouds |
|
|
24 | (2) |
|
2.3.1 Row and column profile spaces |
|
|
24 | (1) |
|
2.3.2 Distributional equivalence and the χ² distance |
|
|
24 | (1) |
|
|
25 | (1) |
|
2.4 Fitting document and word clouds |
|
|
26 | (6) |
|
|
26 | (2) |
|
2.4.2 Visualizing rows and columns |
|
|
28 | (4) |
|
2.4.2.1 Category representation |
|
|
30 | (1) |
|
2.4.2.2 Word representation |
|
|
30 | (2) |
|
2.4.2.3 Transition formulas |
|
|
32 | (1) |
|
2.4.2.4 Simultaneous representation of rows and columns |
|
|
32 | (1) |
|
|
32 | (4) |
|
2.5.1 Eigenvalues and representation quality of the clouds |
|
|
33 | (1) |
|
2.5.2 Contribution of documents and words to axis inertia |
|
|
34 | (1) |
|
2.5.3 Representation quality of a point |
|
|
35 | (1) |
|
2.6 Supplementary rows and columns |
|
|
36 | (1) |
|
2.6.1 Supplementary tables |
|
|
36 | (1) |
|
2.6.2 Supplementary frequency rows and columns |
|
|
36 | (1) |
|
2.6.3 Supplementary quantitative and qualitative variables |
|
|
37 | (1) |
|
2.7 Validating the visualization |
|
|
37 | (1) |
|
2.8 Interpretation scheme for textual CA results |
|
|
38 | (3) |
|
2.9 Implementation with Xplortext |
|
|
41 | (1) |
|
2.10 Summary of the CA approach |
|
|
41 | (2) |
3 Applications of correspondence analysis |
|
43 | (18) |
|
3.1 Choosing the level of detail for analyses |
|
|
43 | (1) |
|
3.2 Correspondence analysis on aggregate free text answers |
|
|
44 | (8) |
|
3.2.1 Data and objectives |
|
|
44 | (1) |
|
|
44 | (1) |
|
3.2.3 CA on the aggregate table |
|
|
44 | (5) |
|
3.2.3.1 Document representation |
|
|
45 | (1) |
|
3.2.3.2 Word representation |
|
|
46 | (1) |
|
3.2.3.3 Simultaneous interpretation of the plots |
|
|
46 | (3) |
|
3.2.4 Supplementary elements |
|
|
49 | (2) |
|
3.2.4.1 Supplementary words |
|
|
49 | (1) |
|
3.2.4.2 Supplementary repeated segments |
|
|
49 | (1) |
|
3.2.4.3 Supplementary categories |
|
|
50 | (1) |
|
3.2.5 Implementation with Xplortext |
|
|
51 | (1) |
|
|
52 | (9) |
|
3.3.1 Data and objectives |
|
|
52 | (1) |
|
3.3.2 The main features of direct analysis |
|
|
53 | (1) |
|
3.3.3 Direct analysis of the culture question |
|
|
53 | (5) |
|
3.3.4 Implementation with Xplortext |
|
|
58 | (3) |
4 Clustering in textual data science |
|
61 | (36) |
|
|
61 | (1) |
|
4.2 Dissimilarity measures between documents |
|
|
62 | (1) |
|
4.3 Measuring partition quality |
|
|
63 | (1) |
|
4.3.1 Document clusters in the factorial space |
|
|
63 | (1) |
|
|
63 | (1) |
|
4.4 Dissimilarity measures between document clusters |
|
|
64 | (1) |
|
4.4.1 The single-linkage method |
|
|
64 | (1) |
|
4.4.2 The complete-linkage method |
|
|
64 | (1) |
|
|
64 | (1) |
|
4.5 Agglomerative hierarchical clustering |
|
|
65 | (2) |
|
4.5.1 Hierarchical tree construction algorithm |
|
|
65 | (1) |
|
4.5.2 Selecting the final partition |
|
|
66 | (1) |
|
4.5.3 Interpreting clusters |
|
|
66 | (1) |
|
|
67 | (1) |
|
4.7 Combining clustering methods |
|
|
68 | (1) |
|
4.7.1 Consolidating partitions |
|
|
68 | (1) |
|
4.7.2 Direct partitioning followed by AHC |
|
|
68 | (1) |
|
4.8 A procedure for combining CA and clustering |
|
|
69 | (1) |
|
4.9 Example: joint use of CA and AHC |
|
|
69 | (5) |
|
4.9.1 Data and objectives |
|
|
69 | (5) |
|
4.9.1.1 Data preprocessing using CA |
|
|
70 | (1) |
|
4.9.1.2 Constructing the hierarchical tree |
|
|
70 | (2) |
|
4.9.1.3 Choosing the final partition |
|
|
72 | (2) |
|
4.10 Contiguity-constrained hierarchical clustering |
|
|
74 | (2) |
|
4.10.1 Principles and algorithm |
|
|
74 | (1) |
|
4.10.2 AHC of age groups with a chronological constraint |
|
|
75 | (1) |
|
4.10.3 Implementation with Xplortext |
|
|
76 | (1) |
|
4.11 Example: clustering free text answers |
|
|
76 | (12) |
|
4.11.1 Data and objectives |
|
|
76 | (2) |
|
4.11.2 Data preprocessing |
|
|
78 | (6) |
|
4.11.2.1 CA: eigenvalues and total inertia |
|
|
78 | (2) |
|
4.11.2.2 Interpreting the first axes |
|
|
80 | (4) |
|
4.11.3 AHC: building the tree and choosing the final partition |
|
|
84 | (4) |
|
4.12 Describing cluster features |
|
|
88 | (7) |
|
4.12.1 Lexical features of clusters |
|
|
89 | (2) |
|
4.12.1.1 Describing clusters in terms of characteristic words |
|
|
89 | (2) |
|
4.12.1.2 Describing clusters in terms of characteristic documents |
|
|
91 | (1) |
|
4.12.2 Describing clusters using contextual variables |
|
|
91 | (3) |
|
4.12.2.1 Describing clusters using contextual qualitative variables |
|
|
91 | (2) |
|
4.12.2.2 Describing clusters using quantitative contextual variables |
|
|
93 | (1) |
|
4.12.3 Implementation with Xplortext |
|
|
94 | (1) |
|
4.13 Summary of the use of AHC on factorial coordinates coming from CA |
|
|
95 | (2) |
5 Lexical characterization of parts of a corpus |
|
97 | (12) |
|
|
98 | (1) |
|
5.2 Characteristic words and CA |
|
|
98 | (1) |
|
5.3 Characteristic words and clustering |
|
|
99 | (2) |
|
5.3.1 Clustering based on verbal content |
|
|
99 | (1) |
|
5.3.2 Clustering based on contextual variables |
|
|
100 | (1) |
|
|
100 | (1) |
|
5.4 Characteristic documents |
|
|
101 | (1) |
|
5.5 Example: characteristic elements and CA |
|
|
101 | (3) |
|
5.5.1 Characteristic words for the categories |
|
|
101 | (3) |
|
5.5.2 Characteristic words and factorial planes |
|
|
104 | (1) |
|
5.5.3 Documents that characterize categories |
|
|
104 | (1) |
|
5.6 Characteristic words in addition to clustering |
|
|
104 | (3) |
|
5.7 Implementation with Xplortext |
|
|
107 | (2) |
6 Multiple factor analysis for textual data |
|
109 | (26) |
|
6.1 Multiple tables in textual data analysis |
|
|
109 | (1) |
|
|
110 | (4) |
|
|
110 | (1) |
|
6.2.2 Problems posed by lemmatization |
|
|
110 | (1) |
|
6.2.3 Description of the corpora data |
|
|
111 | (1) |
|
6.2.4 Indexes of the most frequent words |
|
|
111 | (1) |
|
|
112 | (1) |
|
|
113 | (1) |
|
6.3 Introduction to MFACT |
|
|
114 | (2) |
|
6.3.1 The limits of CA on multiple contingency tables |
|
|
114 | (1) |
|
|
115 | (1) |
|
6.3.3 Integrating contextual variables |
|
|
115 | (1) |
|
6.4 Analysis of multilingual free text answers |
|
|
116 | (10) |
|
6.4.1 MFACT: eigenvalues of the global analysis |
|
|
116 | (1) |
|
6.4.2 Representation of documents and words |
|
|
117 | (4) |
|
6.4.3 Superimposed representation of the global and partial configurations |
|
|
121 | (3) |
|
6.4.4 Links between the axes of the global analysis and the separate analyses |
|
|
124 | (1) |
|
6.4.5 Representation of the groups of words |
|
|
125 | (1) |
|
6.4.6 Implementation with Xplortext |
|
|
125 | (1) |
|
6.5 Simultaneous analysis of two open-ended questions: impact of lemmatization |
|
|
126 | (6) |
|
|
127 | (1) |
|
|
127 | (1) |
|
6.5.3 MFACT on the left and right: lemmatized or non-lemmatized |
|
|
128 | (3) |
|
6.5.4 Implementation with Xplortext |
|
|
131 | (1) |
|
6.6 Other applications of MFACT in textual data science |
|
|
132 | (1) |
|
|
132 | (3) |
7 Applications and analysis workflows |
|
135 | (52) |
|
7.1 General rules for presenting results |
|
|
135 | (2) |
|
7.2 Analyzing bibliographic databases |
|
|
137 | (12) |
|
|
137 | (1) |
|
|
137 | (2) |
|
|
138 | (1) |
|
7.2.2.2 Exploratory analysis of the corpus |
|
|
138 | (1) |
|
7.2.3 CA of the documents × words table |
|
|
139 | (4) |
|
|
139 | (1) |
|
7.2.3.2 Meta-keys and doc-keys |
|
|
139 | (4) |
|
7.2.4 Analysis of the year-aggregate table |
|
|
143 | (1) |
|
7.2.4.1 Eigenvalues and CA of the lexical table |
|
|
144 | (1) |
|
7.2.5 Chronological study of drug names |
|
|
144 | (3) |
|
7.2.6 Implementation with Xplortext |
|
|
147 | (1) |
|
7.2.7 Conclusions from the study |
|
|
148 | (1) |
|
7.3 Badinter's speech: a discursive strategy |
|
|
149 | (8) |
|
|
149 | (1) |
|
|
149 | (1) |
|
7.3.2.1 Breaking up the corpus into documents |
|
|
149 | (1) |
|
7.3.2.2 The speech trajectory unveiled by CA |
|
|
149 | (1) |
|
|
150 | (2) |
|
|
152 | (4) |
|
7.3.5 Conclusions on the study of Badinter's speech |
|
|
156 | (1) |
|
7.3.6 Implementation with Xplortext |
|
|
156 | (1) |
|
|
157 | (16) |
|
|
157 | (1) |
|
7.4.2 Data and objectives |
|
|
157 | (2) |
|
|
159 | (1) |
|
|
160 | (13) |
|
7.4.4.1 Data preprocessing |
|
|
160 | (1) |
|
7.4.4.2 Lexicometric characteristics of the 11 speeches and lexical table coding |
|
|
160 | (1) |
|
7.4.4.3 Eigenvalues and Cramér's V |
|
|
160 | (4) |
|
7.4.4.4 Speech trajectory |
|
|
164 | (3) |
|
7.4.4.5 Word representation |
|
|
167 | (2) |
|
|
169 | (1) |
|
7.4.4.7 Hierarchical structure of the corpus |
|
|
170 | (1) |
|
|
171 | (2) |
|
7.4.5 Implementation with Xplortext |
|
|
173 | (1) |
|
7.5 Corpus of sensory descriptions |
|
|
173 | (14) |
|
|
173 | (1) |
|
|
174 | (2) |
|
7.5.2.1 Eight Catalan wines |
|
|
174 | (1) |
|
|
175 | (1) |
|
7.5.2.3 Verbal categorization |
|
|
175 | (1) |
|
7.5.2.4 Encoding the data |
|
|
175 | (1) |
|
|
176 | (1) |
|
7.5.4 Statistical methodology |
|
|
176 | (1) |
|
7.5.4.1 MFACT and constructing the mean configuration |
|
|
176 | (1) |
|
7.5.4.2 Determining consensual words |
|
|
177 | (1) |
|
|
177 | (7) |
|
7.5.5.1 Data preprocessing |
|
|
177 | (1) |
|
7.5.5.2 Some initial results |
|
|
178 | (1) |
|
7.5.5.3 Individual configurations |
|
|
178 | (1) |
|
7.5.5.4 MFACT: directions of inertia common to the majority of groups |
|
|
178 | (2) |
|
7.5.5.5 MFACT: representing words and documents on the first plane |
|
|
180 | (2) |
|
7.5.5.6 Word contributions |
|
|
182 | (2) |
|
7.5.5.7 MFACT: group representation |
|
|
184 | (1) |
|
|
184 | (1) |
|
|
184 | (2) |
|
7.5.7 Implementation with Xplortext |
|
|
186 | (1) |
Appendix: Textual data science packages in R |
|
187 | (2) |
Bibliography |
|
189 | (2) |
Index |
|
191 | |