E-book: Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

  • Format: EPUB+DRM
  • Publication date: 18-Dec-2014
  • Publisher: John Wiley & Sons Inc
  • Language: eng
  • ISBN-13: 9781118834800
  • Price: 67,86 €*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

"This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences"--

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

  • Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, SQL.
  • Provides basic techniques to query web documents and data sets (XPath and regular expressions).
  • An extensive set of exercises is presented to guide the reader through each technique.
  • Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
  • Case studies are featured throughout along with examples for each technique presented.
  • R code and solutions to exercises featured in the book are provided on a supporting website.
Preface
1 Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.3.1 Technologies for disseminating content on the Web
1.3.2 Technologies for information extraction from web documents
1.3.3 Technologies for data storage
1.4 Structure of the book
Part One: A Primer on Web and Data Technologies
2 HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.2.1 Tags, elements, and attributes
2.2.2 Tree structure
2.2.3 Comments
2.2.4 Reserved and special characters
2.2.5 Document type definition
2.2.6 Spaces and line breaks
2.3 Tags and attributes
2.3.1 The anchor tag <a>
2.3.2 The metadata tag <meta>
2.3.3 The external reference tag <link>
2.3.4 Emphasizing tags <b>, <i>, <strong>
2.3.5 The paragraphs tag <p>
2.3.6 Heading tags <h1>, <h2>, <h3>, ...
2.3.7 Listing content with <ul>, <ol>, and <dl>
2.3.8 The organizational tags <div> and <span>
2.3.9 The <form> tag and its companions
2.3.10 The foreign script tag <script>
2.3.11 Table tags <table>, <tr>, <td>, and <th>
2.4 Parsing
2.4.1 What is parsing?
2.4.2 Discarding nodes
2.4.3 Extracting information in the building process
Summary
Further reading
Problems
3 XML and JSON
3.1 A short example XML document
3.2 XML syntax rules
3.2.1 Elements and attributes
3.2.2 XML structure
3.2.3 Naming and special characters
3.2.4 Comments and character data
3.2.5 XML syntax summary
3.3 When is an XML document well formed or valid?
3.4 XML extensions and technologies
3.4.1 Namespaces
3.4.2 Extensions of XML
3.4.3 Example: Really Simple Syndication
3.4.4 Example: scalable vector graphics
3.5 XML and R in practice
3.5.1 Parsing XML
3.5.2 Basic operations on XML documents
3.5.3 From XML to data frames or lists
3.5.4 Event-driven parsing
3.6 A short example JSON document
3.7 JSON syntax rules
3.8 JSON and R in practice
Summary
Further reading
Problems
4 XPath
4.1 XPath—a query language for web documents
4.2 Identifying node sets with XPath
4.2.1 Basic structure of an XPath query
4.2.2 Node relations
4.2.3 XPath predicates
4.3 Extracting node elements
4.3.1 Extending the fun argument
4.3.2 XML namespaces
4.3.3 Little XPath helper tools
Summary
Further reading
Problems
5 HTTP
5.1 HTTP fundamentals
5.1.1 A short conversation with a web server
5.1.2 URL syntax
5.1.3 HTTP messages
5.1.4 Request methods
5.1.5 Status codes
5.1.6 Header fields
5.2 Advanced features of HTTP
5.2.1 Identification
5.2.2 Authentication
5.2.3 Proxies
5.3 Protocols beyond HTTP
5.3.1 HTTP Secure
5.3.2 FTP
5.4 HTTP in action
5.4.1 The libcurl library
5.4.2 Basic request methods
5.4.3 A low-level function of RCurl
5.4.4 Maintaining connections across multiple requests
5.4.5 Options
5.4.6 Debugging
5.4.7 Error handling
5.4.8 RCurl or httr—what to use?
Summary
Further reading
Problems
6 AJAX
6.1 JavaScript
6.1.1 How JavaScript is used
6.1.2 DOM manipulation
6.2 XHR
6.2.1 Loading external HTML/XML documents
6.2.2 Loading JSON
6.3 Exploring AJAX with Web Developer Tools
6.3.1 Getting started with Chrome's Web Developer Tools
6.3.2 The Elements panel
6.3.3 The Network panel
Summary
Further reading
Problems
7 SQL and relational databases
7.1 Overview and terminology
7.2 Relational Databases
7.2.1 Storing data in tables
7.2.2 Normalization
7.2.3 Advanced features of relational databases and DBMS
7.3 SQL: a language to communicate with Databases
7.3.1 General remarks on SQL, syntax, and our running example
7.3.2 Data control language—DCL
7.3.3 Data definition language—DDL
7.3.4 Data manipulation language—DML
7.3.5 Clauses
7.3.6 Transaction control language—TCL
7.4 Databases in action
7.4.1 R packages to manage databases
7.4.2 Speaking R-SQL via DBI-based packages
7.4.3 Speaking R-SQL via RODBC
Summary
Further reading
Problems
8 Regular expressions and essential string functions
8.1 Regular expressions
8.1.1 Exact character matching
8.1.2 Generalizing regular expressions
8.1.3 The introductory example reconsidered
8.2 String processing
8.2.1 The stringr package
8.2.2 A couple more handy functions
8.3 A word on character encodings
Summary
Further reading
Problems
Part Two: A Practical Toolbox for Web Scraping and Text Mining
9 Scraping the Web
9.1 Retrieval scenarios
9.1.1 Downloading ready-made files
9.1.2 Downloading multiple files from an FTP index
9.1.3 Manipulating URLs to access multiple pages
9.1.4 Convenient functions to gather links, lists, and tables from HTML documents
9.1.5 Dealing with HTML forms
9.1.6 HTTP authentication
9.1.7 Connections via HTTPS
9.1.8 Using cookies
9.1.9 Scraping data from AJAX-enriched webpages with Selenium/Rwebdriver
9.1.10 Retrieving data from APIs
9.1.11 Authentication with OAuth
9.2 Extraction strategies
9.2.1 Regular expressions
9.2.2 XPath
9.2.3 Application Programming Interfaces
9.3 Web scraping: Good practice
9.3.1 Is web scraping legal?
9.3.2 What is robots.txt?
9.3.3 Be friendly!
9.4 Valuable sources of inspiration
Summary
Further reading
Problems
10 Statistical text processing
10.1 The running example: Classifying press releases of the British government
10.2 Processing textual data
10.2.1 Large-scale text operations—The tm package
10.2.2 Building a term-document matrix
10.2.3 Data cleansing
10.2.4 Sparsity and n-grams
10.3 Supervised learning techniques
10.3.1 Support vector machines
10.3.2 Random Forest
10.3.3 Maximum entropy
10.3.4 The RTextTools package
10.3.5 Application: Government press releases
10.4 Unsupervised learning techniques
10.4.1 Latent Dirichlet Allocation and correlated topic models
10.4.2 Application: Government press releases
Summary
Further reading
11 Managing data projects
11.1 Interacting with the file system
11.2 Processing multiple documents/links
11.2.1 Using for-loops
11.2.2 Using while-loops and control structures
11.2.3 Using the plyr package
11.3 Organizing scraping procedures
11.3.1 Implementation of progress feedback: Messages and progress bars
11.3.2 Error and exception handling
11.4 Executing R scripts on a regular basis
11.4.1 Scheduling tasks on Mac OS and Linux
11.4.2 Scheduling tasks on Windows platforms
Part Three: A Bag of Case Studies
12 Collaboration networks in the US Senate
12.1 Information on the bills
12.2 Information on the senators
12.3 Analyzing the network structure
12.3.1 Descriptive statistics
12.3.2 Network analysis
12.4 Conclusion
13 Parsing information from semistructured documents
13.1 Downloading data from the FTP server
13.2 Parsing semistructured text data
13.3 Visualizing station and temperature data
14 Predicting the 2014 Academy Awards using Twitter
14.1 Twitter APIs: Overview
14.1.1 The REST API
14.1.2 The Streaming APIs
14.1.3 Collecting and preparing the data
14.2 Twitter-based forecast of the 2014 Academy Awards
14.2.1 Visualizing the data
14.2.2 Mining tweets for predictions
14.3 Conclusion
15 Mapping the geographic distribution of names
15.1 Developing a data collection strategy
15.2 Website inspection
15.3 Data retrieval and information extraction
15.4 Mapping names
15.5 Automating the process
Summary
16 Gathering data on mobile phones
16.1 Page exploration
16.1.1 Searching mobile phones of a specific brand
16.1.2 Extracting product information
16.2 Scraping procedure
16.2.1 Retrieving data on several producers
16.2.2 Data cleansing
16.3 Graphical analysis
16.4 Data storage
16.4.1 General considerations
16.4.2 Table definitions for storage
16.4.3 Table definitions for future storage
16.4.4 View definitions for convenient data access
16.4.5 Functions for storing data
16.4.6 Data storage and inspection
17 Analyzing sentiments of product reviews
17.1 Introduction
17.2 Collecting the data
17.2.1 Downloading the files
17.2.2 Information extraction
17.2.3 Database storage
17.3 Analyzing the data
17.3.1 Data preparation
17.3.2 Dictionary-based sentiment analysis
17.3.3 Mining the content of reviews
17.4 Conclusion
References
General index
Package index
Function index

Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis are the authors of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.