E-book: Textual Data Science with R

5.00/5 (2 ratings from Goodreads)

Mónica Bécue-Bertaut

Format: 204 pages
Series: Chapman & Hall/CRC Computer Science & Data Analysis
Publication date: 11-Mar-2019
Publisher: CRC Press
Language: eng
ISBN-13: 9781351816366

Format: PDF+DRM
Price: 59,79 €*
* The price is final, i.e. no further discounts apply.
This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

Copying (copy/paste): not allowed
Printing: not allowed

Usage:

Digital rights management (DRM)
The publisher has released this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

Required software
To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; not to be confused with Adobe Reader, which is probably already installed on your computer).

This e-book cannot be read on an Amazon Kindle.

Textual Statistics with R comprehensively covers the main multidimensional methods in textual statistics supported by a specially-written package in R. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. The book will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.

Reviews

"Even though textual data science cannot be considered as the youngest sibling of other data science fields, there is still quite a big space to be filled with up-to-date textbooks describing and analyzing various methods and facets of this very interesting topic. In this book, Mónica Bécue-Bertaut tries to fill this gap, giving theoretical and practical instructions about one of the relatively little known, but powerful methods in textual data science: Correspondence Analysis (CA)... Extensive graphical images and visualizations represented by various types of plot and diagram are used throughout the material, which provides an even better aid to the reader for grasping the main ideas of the topic... separate mention should be drawn to the language used in the book. It is clear, simple, and even fun to read, providing an understandable way of covering complex topics... Mónica Bécue-Bertaut achieved a good blend of theory and practice in her book, which can be used as a handy resource for students and beginners in data science, as well as for specialists in textual data analysis." - Gia Jgarkava, ISCB December 2019

Foreword

Preface

1 Encoding: from a corpus to statistical tables
1.1 Textual and contextual data
1.1.1 Textual data
1.1.2 Contextual data
1.1.3 Documents and aggregate documents
1.2 Examples and notation
1.3 Choosing textual units
1.3.1 Graphical forms
1.3.2 Lemmas
1.3.3 Stems
1.3.4 Repeated segments
1.3.5 In practice
1.4 Preprocessing
1.4.1 Unique spelling
1.4.2 Partially automated preprocessing
1.4.3 Word selection
1.5 Word and segment indexes
1.6 The Life_UK corpus: preliminary results
1.6.1 Verbal content through word and repeated segment indexes
1.6.2 Univariate description of contextual variables
1.6.3 A note on the frequency range
1.7 Implementation with Xplortext
1.8 Summary

2 Correspondence analysis of textual data
2.1 Data and goals
2.1.1 Correspondence analysis: a tool for linguistic data analysis
2.1.2 Data: a small example
2.1.3 Objectives
2.2 Associations between documents and words
2.2.1 Profile comparisons
2.2.2 Independence of documents and words
2.2.3 The χ² test
2.2.4 Association rates between documents and words
2.3 Active row and column clouds
2.3.1 Row and column profile spaces
2.3.2 Distributional equivalence and the χ² distance
2.3.3 Inertia of a cloud
2.4 Fitting document and word clouds
2.4.1 Factorial axes
2.4.2 Visualizing rows and columns
2.4.2.1 Category representation
2.4.2.2 Word representation
2.4.2.3 Transition formulas
2.4.2.4 Simultaneous representation of rows and columns
2.5 Interpretation aids
2.5.1 Eigenvalues and representation quality of the clouds
2.5.2 Contribution of documents and words to axis inertia
2.5.3 Representation quality of a point
2.6 Supplementary rows and columns
2.6.1 Supplementary tables
2.6.2 Supplementary frequency rows and columns
2.6.3 Supplementary quantitative and qualitative variables
2.7 Validating the visualization
2.8 Interpretation scheme for textual CA results
2.9 Implementation with Xplortext
2.10 Summary of the CA approach

3 Applications of correspondence analysis
3.1 Choosing the level of detail for analyses
3.2 Correspondence analysis on aggregate free text answers
3.2.1 Data and objectives
3.2.2 Word selection
3.2.3 CA on the aggregate table
3.2.3.1 Document representation
3.2.3.2 Word representation
3.2.3.3 Simultaneous interpretation of the plots
3.2.4 Supplementary elements
3.2.4.1 Supplementary words
3.2.4.2 Supplementary repeated segments
3.2.4.3 Supplementary categories
3.2.5 Implementation with Xplortext
3.3 Direct analysis
3.3.1 Data and objectives
3.3.2 The main features of direct analysis
3.3.3 Direct analysis of the culture question
3.3.4 Implementation with Xplortext

4 Clustering in textual data science
4.1 Clustering documents
4.2 Dissimilarity measures between documents
4.3 Measuring partition quality
4.3.1 Document clusters in the factorial space
4.3.2 Partition quality
4.4 Dissimilarity measures between document clusters
4.4.1 The single-linkage method
4.4.2 The complete-linkage method
4.4.3 Ward's method
4.5 Agglomerative hierarchical clustering
4.5.1 Hierarchical tree construction algorithm
4.5.2 Selecting the final partition
4.5.3 Interpreting clusters
4.6 Direct partitioning
4.7 Combining clustering methods
4.7.1 Consolidating partitions
4.7.2 Direct partitioning followed by AHC
4.8 A procedure for combining CA and clustering
4.9 Example: joint use of CA and AHC
4.9.1 Data and objectives
4.9.1.1 Data preprocessing using CA
4.9.1.2 Constructing the hierarchical tree
4.9.1.3 Choosing the final partition
4.10 Contiguity-constrained hierarchical clustering
4.10.1 Principles and algorithm
4.10.2 AHC of age groups with a chronological constraint
4.10.3 Implementation with Xplortext
4.11 Example: clustering free text answers
4.11.1 Data and objectives
4.11.2 Data preprocessing
4.11.2.1 CA: eigenvalues and total inertia
4.11.2.2 Interpreting the first axes
4.11.3 AHC: building the tree and choosing the final partition
4.12 Describing cluster features
4.12.1 Lexical features of clusters
4.12.1.1 Describing clusters in terms of characteristic words
4.12.1.2 Describing clusters in terms of characteristic documents
4.12.2 Describing clusters using contextual variables
4.12.2.1 Describing clusters using contextual qualitative variables
4.12.2.2 Describing clusters using quantitative contextual variables
4.12.3 Implementation with Xplortext
4.13 Summary of the use of AHC on factorial coordinates coming from CA

5 Lexical characterization of parts of a corpus
5.1 Characteristic words
5.2 Characteristic words and CA
5.3 Characteristic words and clustering
5.3.1 Clustering based on verbal content
5.3.2 Clustering based on contextual variables
5.3.3 Hierarchical words
5.4 Characteristic documents
5.5 Example: characteristic elements and CA
5.5.1 Characteristic words for the categories
5.5.2 Characteristic words and factorial planes
5.5.3 Documents that characterize categories
5.6 Characteristic words in addition to clustering
5.7 Implementation with Xplortext

6 Multiple factor analysis for textual data
6.1 Multiple tables in textual data analysis
6.2 Data and objectives
6.2.1 Data preprocessing
6.2.2 Problems posed by lemmatization
6.2.3 Description of the corpora data
6.2.4 Indexes of the most frequent words
6.2.5 Notation
6.2.6 Objectives
6.3 Introduction to MFACT
6.3.1 The limits of CA on multiple contingency tables
6.3.2 How MFACT works
6.3.3 Integrating contextual variables
6.4 Analysis of multilingual free text answers
6.4.1 MFACT: eigenvalues of the global analysis
6.4.2 Representation of documents and words
6.4.3 Superimposed representation of the global and partial configurations
6.4.4 Links between the axes of the global analysis and the separate analyses
6.4.5 Representation of the groups of words
6.4.6 Implementation with Xplortext
6.5 Simultaneous analysis of two open-ended questions: impact of lemmatization
6.5.1 Objectives
6.5.2 Preliminary steps
6.5.3 MFACT on the left and right: lemmatized or non-lemmatized
6.5.4 Implementation with Xplortext
6.6 Other applications of MFACT in textual data science
6.7 MFACT summary

7 Applications and analysis workflows
7.1 General rules for presenting results
7.2 Analyzing bibliographic databases
7.2.1 Introduction
7.2.2 The lupus data
7.2.2.1 The corpus
7.2.2.2 Exploratory analysis of the corpus
7.2.3 CA of the documents × words table
7.2.3.1 The eigenvalues
7.2.3.2 Meta-keys and doc-keys
7.2.4 Analysis of the year-aggregate table
7.2.4.1 Eigenvalues and CA of the lexical table
7.2.5 Chronological study of drug names
7.2.6 Implementation with Xplortext
7.2.7 Conclusions from the study
7.3 Badinter's speech: a discursive strategy
7.3.1 Introduction
7.3.2 Methods
7.3.2.1 Breaking up the corpus into documents
7.3.2.2 The speech trajectory unveiled by CA
7.3.3 Results
7.3.4 Argument flow
7.3.5 Conclusions on the study of Badinter's speech
7.3.6 Implementation with Xplortext
7.4 Political speeches
7.4.1 Introduction
7.4.2 Data and objectives
7.4.3 Methodology
7.4.4 Results
7.4.4.1 Data preprocessing
7.4.4.2 Lexicometric characteristics of the 11 speeches and lexical table coding
7.4.4.3 Eigenvalues and Cramer's V
7.4.4.4 Speech trajectory
7.4.4.5 Word representation
7.4.4.6 Remarks
7.4.4.7 Hierarchical structure of the corpus
7.4.4.8 Conclusions
7.4.5 Implementation with Xplortext
7.5 Corpus of sensory descriptions
7.5.1 Introduction
7.5.2 Data
7.5.2.1 Eight Catalan wines
7.5.2.2 Jury
7.5.2.3 Verbal categorization
7.5.2.4 Encoding the data
7.5.3 Objectives
7.5.4 Statistical methodology
7.5.4.1 MFACT and constructing the mean configuration
7.5.4.2 Determining consensual words
7.5.5 Results
7.5.5.1 Data preprocessing
7.5.5.2 Some initial results
7.5.5.3 Individual configurations
7.5.5.4 MFACT: directions of inertia common to the majority of groups
7.5.5.5 MFACT: representing words and documents on the first plane
7.5.5.6 Word contributions
7.5.5.7 MFACT: group representation
7.5.5.8 Consensual words
7.5.6 Conclusion
7.5.7 Implementation with Xplortext

Appendix: Textual data science packages in R
Bibliography
Index

Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and gave numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut has published several books (in French or Spanish) and book chapters (in English) on textual data science. She also participated in the design of software for textual data science, such as SPAD.T and Xplortext, the latter being an R package.

Permalink: https://www.kriso.ee/db/97813518163662e.html
