List of Figures |
|
xiii | |
List of Tables |
|
xv | |
Acknowledgements |
|
xvii | |
1 Introduction |
|
1 | (12) |
|
1.1 Linguistic Data Analysis |
|
|
3 | (5) |
|
|
3 | (1) |
|
|
3 | (4) |
|
1.1.3 Collecting and analysing data |
|
|
7 | (1) |
|
|
8 | (2) |
|
1.3 Conventions Used in this Book |
|
|
10 | (1) |
|
|
11 | (1) |
|
|
11 | (2) |
2 What's Out There? |
|
13 | (16) |
|
|
13 | (1) |
|
|
13 | (2) |
|
2.3 Synchronic vs. Diachronic Corpora |
|
|
15 | (6) |
|
2.3.1 'Early' synchronic corpora |
|
|
15 | (3) |
|
|
18 | (2) |
|
2.3.3 Examples of diachronic corpora |
|
|
20 | (1) |
|
2.4 General vs. Specific Corpora |
|
|
21 | (4) |
|
2.4.1 Examples of specific corpora |
|
|
22 | (3) |
|
2.5 Static Versus Dynamic Corpora |
|
|
25 | (1) |
|
2.6 Other Sources for Corpora |
|
|
26 | (1) |
|
Solutions to/Comments on the Exercises |
|
|
26 | (2) |
|
|
28 | (1) |
|
Sources and Further Reading |
|
|
28 | (1) |
3 Understanding Corpus Design |
|
29 | (14) |
|
3.1 Food for Thought - General Issues in Corpus Design |
|
|
29 | (4) |
|
|
30 | (1) |
|
|
31 | (1) |
|
3.1.3 Balance and representativeness |
|
|
32 | (1) |
|
|
32 | (1) |
|
3.2 What's in a Text? - Understanding Document Structure |
|
|
33 | (5) |
|
3.2.1 Headers, 'footers' and meta-data |
|
|
34 | (2) |
|
3.2.2 The structure of the (text) body |
|
|
36 | (1) |
|
3.2.3 What's (in) an electronic text? - understanding file formats and their properties |
|
|
37 | (1) |
|
3.3 Understanding Encoding: Character Sets, File Size, etc. |
|
|
38 | (3) |
|
3.3.1 ASCII and legacy encodings |
|
|
38 | (1) |
|
|
39 | (1) |
|
|
40 | (1) |
|
Solutions to/Comments on the Exercises |
|
|
41 | (1) |
|
Sources and Further Reading |
|
|
42 | (1) |
4 Finding and Preparing Your Data |
|
43 | (24) |
|
4.1 Finding Suitable Materials for Analysis |
|
|
44 | (2) |
|
4.1.1 Retrieving data from text archives |
|
|
44 | (1) |
|
4.1.2 Obtaining materials from Project Gutenberg |
|
|
44 | (1) |
|
4.1.3 Obtaining materials from the Oxford Text Archive |
|
|
45 | (1) |
|
4.2 Collecting Written Materials Yourself ('Web as Corpus') |
|
|
46 | (7) |
|
4.2.1 A brief note on plain-text editors |
|
|
46 | (2) |
|
4.2.2 Browser text export |
|
|
48 | (1) |
|
4.2.3 Browser HTML export |
|
|
49 | (1) |
|
4.2.4 Getting web data using ICEweb |
|
|
50 | (2) |
|
4.2.5 Downloading other types of files |
|
|
52 | (1) |
|
4.3 Collecting Spoken Data |
|
|
53 | (3) |
|
4.4 Preparing Written Data for Analysis |
|
|
56 | (6) |
|
4.4.1 'Cleaning up' your data |
|
|
56 | (2) |
|
4.4.2 Extracting text from proprietary document formats |
|
|
58 | (1) |
|
4.4.3 Removing unnecessary header and 'footer' information |
|
|
58 | (1) |
|
4.4.4 Documenting what you've collected |
|
|
59 | (1) |
|
4.4.5 Preparing your data for distribution or archiving |
|
|
60 | (2) |
|
Solutions to/Comments on the Exercises |
|
|
62 | (4) |
|
Sources and Further Reading |
|
|
66 | (1) |
5 Concordancing |
|
67 | (15) |
|
5.1 What's Concordancing? |
|
|
67 | (2) |
|
5.2 Concordancing with AntConc |
|
|
69 | (9) |
|
|
74 | (1) |
|
5.2.2 Saving, pruning and reusing your results |
|
|
75 | (3) |
|
Solutions to/Comments on the Exercises |
|
|
78 | (3) |
|
Sources and Further Reading |
|
|
81 | (1) |
6 Regular Expressions |
|
82 | (19) |
|
|
84 | (2) |
|
6.2 Negative Character Classes |
|
|
86 | (1) |
|
|
86 | (1) |
|
6.4 Anchoring, Grouping and Alternation |
|
|
87 | (5) |
|
|
87 | (1) |
|
6.4.2 Grouping and alternation |
|
|
88 | (2) |
|
6.4.3 Quoting and using special characters |
|
|
90 | (1) |
|
6.4.4 Constraining the context further |
|
|
91 | (1) |
|
|
92 | (1) |
|
Solutions to/Comments on the Exercises |
|
|
93 | (7) |
|
Sources and Further Reading |
|
|
100 | (1) |
7 Understanding Part-of-Speech Tagging and Its Uses |
|
101 | (20) |
|
7.1 A Brief Introduction to (Morpho-Syntactic) Tagsets |
|
|
103 | (6) |
|
7.2 Tagging Your Own Data |
|
|
109 | (4) |
|
Solutions to/Comments on the Exercises |
|
|
113 | (7) |
|
Sources and Further Reading |
|
|
120 | (1) |
8 Using Online Interfaces to Query Mega Corpora |
|
121 | (25) |
|
8.1 Searching the BNC with BNCweb |
|
|
122 | (10) |
|
|
122 | (1) |
|
8.1.2 Basic standard queries |
|
|
123 | (1) |
|
8.1.3 Navigating through and exploring search results |
|
|
124 | (2) |
|
8.1.4 More advanced standard query options |
|
|
126 | (1) |
|
|
126 | (2) |
|
8.1.6 Word and phrase alternation |
|
|
128 | (1) |
|
8.1.7 Restricting searches through PoS tags |
|
|
129 | (2) |
|
8.1.8 Headword and lemma queries |
|
|
131 | (1) |
|
8.2 Exploring COCA through the BYU Web-Interface |
|
|
132 | (5) |
|
|
133 | (2) |
|
8.2.2 Comparing corpora in the BYU interface |
|
|
135 | (2) |
|
Solutions to/Comments on the Exercises |
|
|
137 | (8) |
|
Sources and Further Reading |
|
|
145 | (1) |
9 Basic Frequency Analysis - or What Can (Single) Words Tell Us About Texts? |
|
146 | (47) |
|
9.1 Understanding Basic Units in Texts |
|
|
146 | (5) |
|
|
147 | (2) |
|
|
149 | (2) |
|
9.2 Word (Frequency) Lists in AntConc |
|
|
151 | (9) |
|
9.2.1 Stop words - good or bad? |
|
|
156 | (2) |
|
9.2.2 Defining and using stop words in AntConc |
|
|
158 | (2) |
|
|
160 | (9) |
|
|
160 | (2) |
|
9.3.2 Investigating subcorpora |
|
|
162 | (7) |
|
|
169 | (1) |
|
9.4 Keyword Lists in AntConc and BNCweb |
|
|
169 | (6) |
|
9.4.1 Keyword lists in AntConc |
|
|
169 | (3) |
|
9.4.2 Keyword lists in BNCweb |
|
|
172 | (3) |
|
9.5 Comparing and Reporting Frequency Counts |
|
|
175 | (3) |
|
9.6 Investigating Genre-Specific Distributions in COCA |
|
|
178 | (1) |
|
Solutions to/Comments on the Exercises |
|
|
179 | (13) |
|
Sources and Further Reading |
|
|
192 | (1) |
10 Exploring Words in Context |
|
193 | (34) |
|
10.1 Understanding Extended Units of Text |
|
|
194 | (1) |
|
|
195 | (1) |
|
10.3 N-Grams, Word Clusters and Lexical Bundles |
|
|
196 | (2) |
|
10.4 Exploring (Relatively) Fixed Sequences in BNCweb |
|
|
198 | (1) |
|
10.5 Simple, Sequential Collocations and Colligations |
|
|
198 | (4) |
|
10.5.1 'Simple' collocations |
|
|
198 | (2) |
|
|
200 | (1) |
|
10.5.3 Contextually constrained and proximity searches |
|
|
201 | (1) |
|
10.6 Exploring Colligations in COCA |
|
|
202 | (3) |
|
10.7 N-grams and Clusters in AntConc |
|
|
205 | (2) |
|
10.8 Investigating Collocations Based on Statistical Measures in AntConc, BNCweb and COCA |
|
|
207 | (5) |
|
10.8.1 Calculating collocations |
|
|
207 | (2) |
|
10.8.2 Computing collocations in AntConc |
|
|
209 | (1) |
|
10.8.3 Computing collocations in BNCweb |
|
|
210 | (1) |
|
10.8.4 Computing collocations in COCA |
|
|
211 | (1) |
|
Solutions to/Comments on the Exercises |
|
|
212 | (14) |
|
Sources and Further Reading |
|
|
226 | (1) |
11 Understanding Markup and Annotation |
|
227 | (27) |
|
11.1 From SGML to XML - A Brief Timeline |
|
|
229 | (1) |
|
|
230 | (6) |
|
|
230 | (1) |
|
11.2.2 What does markup/annotation look like? |
|
|
230 | (2) |
|
11.2.3 The 'history' and development of (linguistic) markup |
|
|
232 | (2) |
|
11.2.4 XML and style sheets |
|
|
234 | (2) |
|
11.3 'Simple XML' for Linguistic Annotation |
|
|
236 | (4) |
|
11.4 Colour Coding and Visualisation |
|
|
240 | (6) |
|
11.5 More Complex Forms of Annotation |
|
|
246 | (2) |
|
Solutions to/Comments on the Exercises |
|
|
248 | (5) |
|
Sources and Further Reading |
|
|
253 | (1) |
12 Conclusion and Further Perspectives |
|
254 | (5) |
Appendix A: The CLAWS C5 Tagset |
|
259 | (2) |
Appendix B: The Annotated Dialogue File |
|
261 | (8) |
Appendix C: The CSS Style Sheet |
|
269 | (2) |
Glossary |
|
271 | (6) |
References |
|
277 | (6) |
Index |
|
283 | |