Preface |
|
xv | |
|
|
1 | (14) |
|
1.1 Case study: World Heritage Sites in Danger |
|
|
1 | (6) |
|
1.2 Some remarks on web data quality |
|
|
7 | (2) |
|
1.3 Technologies for disseminating, extracting, and storing web data |
|
|
9 | (4) |
|
1.3.1 Technologies for disseminating content on the Web |
|
|
9 | (2) |
|
1.3.2 Technologies for information extraction from web documents |
|
|
11 | (1) |
|
1.3.3 Technologies for data storage |
|
|
12 | (1) |
|
1.4 Structure of the book |
|
|
13 | (2) |
Part One A Primer on Web and Data Technologies |
|
15 | (204) |
|
|
17 | (24) |
|
2.1 Browser presentation and source code |
|
|
18 | (1) |
|
|
19 | (5) |
|
2.2.1 Tags, elements, and attributes |
|
|
20 | (1) |
|
|
21 | (1) |
|
|
22 | (1) |
|
2.2.4 Reserved and special characters |
|
|
22 | (1) |
|
2.2.5 Document type definition |
|
|
23 | (1) |
|
2.2.6 Spaces and line breaks |
|
|
23 | (1) |
|
|
24 | (8) |
|
|
24 | (1) |
|
2.3.2 The metadata tag <meta> |
|
|
25 | (1) |
|
2.3.3 The external reference tag <link> |
|
|
26 | (1) |
|
2.3.4 Emphasizing tags <b>, <i>, <strong> |
|
|
26 | (1) |
|
2.3.5 The paragraphs tag <p> |
|
|
27 | (1) |
|
2.3.6 Heading tags <h1>, <h2>, <h3>, ... |
|
|
27 | (1) |
|
2.3.7 Listing content with <ul>, <ol>, and <dl> |
|
|
27 | (1) |
|
2.3.8 The organizational tags <div> and <span> |
|
|
27 | (2) |
|
2.3.9 The <form> tag and its companions |
|
|
29 | (1) |
|
2.3.10 The foreign script tag <script> |
|
|
30 | (2) |
|
2.3.11 Table tags <table>, <tr>, <td>, and <th> |
|
|
32 | (1) |
|
|
32 | (6) |
|
|
33 | (2) |
|
|
35 | (2) |
|
2.4.3 Extracting information in the building process |
|
|
37 | (1) |
|
|
38 | (1) |
|
|
38 | (1) |
|
|
39 | (2) |
|
|
41 | (38) |
|
3.1 A short example XML document |
|
|
42 | (1) |
|
|
43 | (8) |
|
3.2.1 Elements and attributes |
|
|
44 | (2) |
|
|
46 | (2) |
|
3.2.3 Naming and special characters |
|
|
48 | (1) |
|
3.2.4 Comments and character data |
|
|
49 | (1) |
|
|
50 | (1) |
|
3.3 When is an XML document well formed or valid? |
|
|
51 | (2) |
|
3.4 XML extensions and technologies |
|
|
53 | (7) |
|
|
53 | (1) |
|
|
54 | (1) |
|
3.4.3 Example: Really Simple Syndication |
|
|
55 | (3) |
|
3.4.4 Example: scalable vector graphics |
|
|
58 | (2) |
|
3.5 XML and R in practice |
|
|
60 | (8) |
|
|
60 | (3) |
|
3.5.2 Basic operations on XML documents |
|
|
63 | (2) |
|
3.5.3 From XML to data frames or lists |
|
|
65 | (1) |
|
3.5.4 Event-driven parsing |
|
|
66 | (2) |
|
3.6 A short example JSON document |
|
|
68 | (1) |
|
|
69 | (2) |
|
3.8 JSON and R in practice |
|
|
71 | (5) |
|
|
76 | (1) |
|
|
76 | (1) |
|
|
76 | (3) |
|
|
79 | (22) |
|
4.1 XPath—a query language for web documents |
|
|
80 | (1) |
|
4.2 Identifying node sets with XPath |
|
|
81 | (12) |
|
4.2.1 Basic structure of an XPath query |
|
|
81 | (3) |
|
|
84 | (2) |
|
|
86 | (7) |
|
4.3 Extracting node elements |
|
|
93 | (5) |
|
4.3.1 Extending the fun argument |
|
|
94 | (2) |
|
|
96 | (1) |
|
4.3.3 Little XPath helper tools |
|
|
97 | (1) |
|
|
98 | (1) |
|
|
99 | (1) |
|
|
99 | (2) |
|
|
101 | (48) |
|
|
102 | (14) |
|
5.1.1 A short conversation with a web server |
|
|
102 | (2) |
|
|
104 | (2) |
|
|
106 | (2) |
|
|
108 | (1) |
|
|
108 | (1) |
|
|
109 | (7) |
|
5.2 Advanced features of HTTP |
|
|
116 | (8) |
|
|
116 | (5) |
|
|
121 | (2) |
|
|
123 | (1) |
|
5.3 Protocols beyond HTTP |
|
|
124 | (2) |
|
|
124 | (2) |
|
|
126 | (1) |
|
|
126 | (18) |
|
5.4.1 The libcurl library |
|
|
127 | (1) |
|
5.4.2 Basic request methods |
|
|
128 | (3) |
|
5.4.3 A low-level function of RCurl |
|
|
131 | (1) |
|
5.4.4 Maintaining connections across multiple requests |
|
|
132 | (1) |
|
|
133 | (6) |
|
|
139 | (4) |
|
|
143 | (1) |
|
5.4.8 RCurl or httr—what to use? |
|
|
144 | (1) |
|
|
144 | (1) |
|
|
144 | (2) |
|
|
146 | (3) |
|
|
149 | (15) |
|
|
150 | (4) |
|
6.1.1 How JavaScript is used |
|
|
150 | (1) |
|
|
151 | (3) |
|
|
154 | (4) |
|
6.2.1 Loading external HTML/XML documents |
|
|
155 | (2) |
|
|
157 | (1) |
|
6.3 Exploring AJAX with Web Developer Tools |
|
|
158 | (3) |
|
6.3.1 Getting started with Chrome's Web Developer Tools |
|
|
159 | (1) |
|
|
159 | (1) |
|
|
160 | (1) |
|
|
161 | (1) |
|
|
162 | (1) |
|
|
162 | (2) |
|
7 SQL and relational databases |
|
|
164 | (32) |
|
7.1 Overview and terminology |
|
|
165 | (2) |
|
|
167 | (8) |
|
7.2.1 Storing data in tables |
|
|
167 | (3) |
|
|
170 | (4) |
|
7.2.3 Advanced features of relational databases and DBMS |
|
|
174 | (1) |
|
7.3 SQL: a language to communicate with databases |
|
|
175 | (13) |
|
7.3.1 General remarks on SQL, syntax, and our running example |
|
|
175 | (2) |
|
7.3.2 Data control language—DCL |
|
|
177 | (1) |
|
7.3.3 Data definition language—DDL |
|
|
178 | (2) |
|
7.3.4 Data manipulation language—DML |
|
|
180 | (4) |
|
|
184 | (3) |
|
7.3.6 Transaction control language—TCL |
|
|
187 | (1) |
|
|
188 | (4) |
|
7.4.1 R packages to manage databases |
|
|
188 | (1) |
|
7.4.2 Speaking R-SQL via DBI-based packages |
|
|
189 | (2) |
|
7.4.3 Speaking R-SQL via RODBC |
|
|
191 | (1) |
|
|
192 | (1) |
|
|
193 | (1) |
|
|
193 | (3) |
|
8 Regular expressions and essential string functions |
|
|
196 | (23) |
|
|
198 | (9) |
|
8.1.1 Exact character matching |
|
|
198 | (2) |
|
8.1.2 Generalizing regular expressions |
|
|
200 | (6) |
|
8.1.3 The introductory example reconsidered |
|
|
206 | (1) |
|
|
207 | (7) |
|
8.2.1 The stringr package |
|
|
207 | (4) |
|
8.2.2 A couple more handy functions |
|
|
211 | (3) |
|
8.3 A word on character encodings |
|
|
214 | (2) |
|
|
216 | (1) |
|
|
217 | (1) |
|
|
217 | (2) |
Part Two A Practical Toolbox for Web Scraping and Text Mining |
|
219 | (122) |
|
|
221 | (74) |
|
|
222 | (48) |
|
9.1.1 Downloading ready-made files |
|
|
223 | (3) |
|
9.1.2 Downloading multiple files from an FTP index |
|
|
226 | (2) |
|
9.1.3 Manipulating URLs to access multiple pages |
|
|
228 | (4) |
|
9.1.4 Convenient functions to gather links, lists, and tables from HTML documents |
|
|
232 | (3) |
|
9.1.5 Dealing with HTML forms |
|
|
235 | (10) |
|
9.1.6 HTTP authentication |
|
|
245 | (1) |
|
9.1.7 Connections via HTTPS |
|
|
246 | (1) |
|
|
247 | (4) |
|
9.1.9 Scraping data from AJAX-enriched webpages with Selenium/Rwebdriver |
|
|
251 | (8) |
|
9.1.10 Retrieving data from APIs |
|
|
259 | (7) |
|
9.1.11 Authentication with OAuth |
|
|
266 | (4) |
|
9.2 Extraction strategies |
|
|
270 | (8) |
|
9.2.1 Regular expressions |
|
|
270 | (3) |
|
|
273 | (3) |
|
9.2.3 Application Programming Interfaces |
|
|
276 | (2) |
|
9.3 Web scraping: Good practice |
|
|
278 | (12) |
|
9.3.1 Is web scraping legal? |
|
|
278 | (2) |
|
9.3.2 What is robots.txt? |
|
|
280 | (4) |
|
|
284 | (6) |
|
9.4 Valuable sources of inspiration |
|
|
290 | (1) |
|
|
291 | (1) |
|
|
292 | (1) |
|
|
293 | (2) |
|
10 Statistical text processing |
|
|
295 | (27) |
|
10.1 The running example: Classifying press releases of the British government |
|
|
296 | (2) |
|
10.2 Processing textual data |
|
|
298 | (9) |
|
10.2.1 Large-scale text operations—The tm package |
|
|
298 | (5) |
|
10.2.2 Building a term-document matrix |
|
|
303 | (1) |
|
|
304 | (1) |
|
10.2.4 Sparsity and n-grams |
|
|
305 | (2) |
|
10.3 Supervised learning techniques |
|
|
307 | (6) |
|
10.3.1 Support vector machines |
|
|
309 | (1) |
|
|
309 | (1) |
|
|
309 | (1) |
|
10.3.4 The RTextTools package |
|
|
309 | (1) |
|
10.3.5 Application: Government press releases |
|
|
310 | (3) |
|
10.4 Unsupervised learning techniques |
|
|
313 | (7) |
|
10.4.1 Latent Dirichlet Allocation and correlated topic models |
|
|
314 | (1) |
|
10.4.2 Application: Government press releases |
|
|
314 | (6) |
|
|
320 | (1) |
|
|
320 | (2) |
|
11 Managing data projects |
|
|
322 | (19) |
|
11.1 Interacting with the file system |
|
|
322 | (1) |
|
11.2 Processing multiple documents/links |
|
|
323 | (5) |
|
|
324 | (2) |
|
11.2.2 Using while-loops and control structures |
|
|
326 | (1) |
|
11.2.3 Using the plyr package |
|
|
327 | (1) |
|
11.3 Organizing scraping procedures |
|
|
328 | (6) |
|
11.3.1 Implementation of progress feedback: Messages and progress bars |
|
|
331 | (2) |
|
11.3.2 Error and exception handling |
|
|
333 | (1) |
|
11.4 Executing R scripts on a regular basis |
|
|
334 | (9) |
|
11.4.1 Scheduling tasks on Mac OS and Linux |
|
|
335 | (2) |
|
11.4.2 Scheduling tasks on Windows platforms |
|
|
337 | (4) |
Part Three A Bag of Case Studies |
|
341 | (94) |
|
12 Collaboration networks in the US Senate |
|
|
343 | (16) |
|
12.1 Information on the bills |
|
|
344 | (6) |
|
12.2 Information on the senators |
|
|
350 | (3) |
|
12.3 Analyzing the network structure |
|
|
353 | (5) |
|
12.3.1 Descriptive statistics |
|
|
354 | (2) |
|
|
356 | (2) |
|
|
358 | (1) |
|
13 Parsing information from semistructured documents |
|
|
359 | (12) |
|
13.1 Downloading data from the FTP server |
|
|
360 | (1) |
|
13.2 Parsing semistructured text data |
|
|
361 | (7) |
|
13.3 Visualizing station and temperature data |
|
|
368 | (3) |
|
14 Predicting the 2014 Academy Awards using Twitter |
|
|
371 | (9) |
|
14.1 Twitter APIs: Overview |
|
|
372 | (2) |
|
|
372 | (1) |
|
14.1.2 The Streaming APIs |
|
|
373 | (1) |
|
14.1.3 Collecting and preparing the data |
|
|
373 | (1) |
|
14.2 Twitter-based forecast of the 2014 Academy Awards |
|
|
374 | (5) |
|
14.2.1 Visualizing the data |
|
|
374 | (1) |
|
14.2.2 Mining tweets for predictions |
|
|
375 | (4) |
|
|
379 | (1) |
|
15 Mapping the geographic distribution of names |
|
|
380 | (16) |
|
15.1 Developing a data collection strategy |
|
|
381 | (1) |
|
|
382 | (2) |
|
15.3 Data retrieval and information extraction |
|
|
384 | (3) |
|
|
387 | (2) |
|
15.5 Automating the process |
|
|
389 | (6) |
|
|
395 | (1) |
|
16 Gathering data on mobile phones |
|
|
396 | (20) |
|
|
396 | (8) |
|
16.1.1 Searching mobile phones of a specific brand |
|
|
396 | (4) |
|
16.1.2 Extracting product information |
|
|
400 | (4) |
|
|
404 | (2) |
|
16.2.1 Retrieving data on several producers |
|
|
404 | (1) |
|
|
405 | (1) |
|
|
406 | (2) |
|
|
408 | (8) |
|
16.4.1 General considerations |
|
|
408 | (1) |
|
16.4.2 Table definitions for storage |
|
|
409 | (1) |
|
16.4.3 Table definitions for future storage |
|
|
410 | (1) |
|
16.4.4 View definitions for convenient data access |
|
|
411 | (2) |
|
16.4.5 Functions for storing data |
|
|
413 | (2) |
|
16.4.6 Data storage and inspection |
|
|
415 | (1) |
|
17 Analyzing sentiments of product reviews |
|
|
416 | (19) |
|
|
416 | (1) |
|
|
417 | (9) |
|
17.2.1 Downloading the files |
|
|
417 | (4) |
|
17.2.2 Information extraction |
|
|
421 | (3) |
|
|
424 | (2) |
|
|
426 | (8) |
|
|
426 | (1) |
|
17.3.2 Dictionary-based sentiment analysis |
|
|
427 | (5) |
|
17.3.3 Mining the content of reviews |
|
|
432 | (2) |
|
|
434 | (1) |
References |
|
435 | (7) |
General index |
|
442 | (6) |
Package index |
|
448 | (1) |
Function index |
|
449 | |