E-book: Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

3.87/5 (29 ratings from Goodreads)

Simon Munzert, Dominic Nyhuis, Christian Rubba, Peter Meißner

Other formats

Other digital carrier (Price: 76,98 €) - 12-Dec-2014

Format: EPUB+DRM
Publication date: 18-Dec-2014
Publisher: John Wiley & Sons Inc
Language: English
ISBN-13: 9781118834800

Other books on the subject:

Data mining

Format - EPUB+DRM
Price: 67,86 €*
* The price is final, i.e. no further discounts apply.
This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

Copying (copy/paste): not allowed
Printing: not allowed

Usage:

Digital rights management (DRM)
The publisher has issued this e-book in encrypted form, which means that you need to install special software to read it. You also need to create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

Required software
To read on mobile devices (phone or tablet), you need to install this free app: PocketBook Reader (iOS / Android)

To read on a PC or Mac, you need to install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

This e-book cannot be read on an Amazon Kindle.

"This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences"

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, and SQL.
Provides basic techniques to query web documents and data sets (XPath and regular expressions).
An extensive set of exercises is presented to guide the reader through each technique.
Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
Case studies are featured throughout along with examples for each technique presented.
R code and solutions to exercises featured in the book are provided on a supporting website.

Preface

1 Introduction (14)
1.1 Case study: World Heritage Sites in Danger (6)
1.2 Some remarks on web data quality (2)
1.3 Technologies for disseminating, extracting, and storing web data (4)
1.3.1 Technologies for disseminating content on the Web (2)
1.3.2 Technologies for information extraction from web documents (1)
1.3.3 Technologies for data storage (1)
1.4 Structure of the book (2)

Part One: A Primer on Web and Data Technologies (204)

2 HTML (24)
2.1 Browser presentation and source code (1)
2.2 Syntax rules (5)
2.2.1 Tags, elements, and attributes (1)
2.2.2 Tree structure (1)
2.2.3 Comments (1)
2.2.4 Reserved and special characters (1)
2.2.5 Document type definition (1)
2.2.6 Spaces and line breaks (1)
2.3 Tags and attributes (8)
2.3.1 The anchor tag <a> (1)
2.3.2 The metadata tag <meta> (1)
2.3.3 The external reference tag <link> (1)
2.3.4 Emphasizing tags <b>, <i>, <strong> (1)
2.3.5 The paragraphs tag <p> (1)
2.3.6 Heading tags <h1>, <h2>, <h3>, ... (1)
2.3.7 Listing content with <ul>, <ol>, and <dl> (1)
2.3.8 The organizational tags <div> and <span> (2)
2.3.9 The <form> tag and its companions (1)
2.3.10 The foreign script tag <script> (2)
2.3.11 Table tags <table>, <tr>, <td>, and <th> (1)
2.4 Parsing (6)
2.4.1 What is parsing? (2)
2.4.2 Discarding nodes (2)
2.4.3 Extracting information in the building process (1)
Summary (1)
Further reading (1)
Problems (2)

3 XML and JSON (38)
3.1 A short example XML document (1)
3.2 XML syntax rules (8)
3.2.1 Elements and attributes (2)
3.2.2 XML structure (2)
3.2.3 Naming and special characters (1)
3.2.4 Comments and character data (1)
3.2.5 XML syntax summary (1)
3.3 When is an XML document well formed or valid? (2)
3.4 XML extensions and technologies (7)
3.4.1 Namespaces (1)
3.4.2 Extensions of XML (1)
3.4.3 Example: Really Simple Syndication (3)
3.4.4 Example: scalable vector graphics (2)
3.5 XML and R in practice (8)
3.5.1 Parsing XML (3)
3.5.2 Basic operations on XML documents (2)
3.5.3 From XML to data frames or lists (1)
3.5.4 Event-driven parsing (2)
3.6 A short example JSON document (1)
3.7 JSON syntax rules (2)
3.8 JSON and R in practice (5)
Summary (1)
Further reading (1)
Problems (3)

4 XPath (22)
4.1 XPath—a query language for web documents (1)
4.2 Identifying node sets with XPath (12)
4.2.1 Basic structure of an XPath query (3)
4.2.2 Node relations (2)
4.2.3 XPath predicates (7)
4.3 Extracting node elements (5)
4.3.1 Extending the fun argument (2)
4.3.2 XML namespaces (1)
4.3.3 Little XPath helper tools (1)
Summary (1)
Further reading (1)
Problems (2)

5 HTTP (48)
5.1 HTTP fundamentals (14)
5.1.1 A short conversation with a web server (2)
5.1.2 URL syntax (2)
5.1.3 HTTP messages (2)
5.1.4 Request methods (1)
5.1.5 Status codes (1)
5.1.6 Header fields (7)
5.2 Advanced features of HTTP (8)
5.2.1 Identification (5)
5.2.2 Authentication (2)
5.2.3 Proxies (1)
5.3 Protocols beyond HTTP (2)
5.3.1 HTTP Secure (2)
5.3.2 FTP (1)
5.4 HTTP in action (18)
5.4.1 The libcurl library (1)
5.4.2 Basic request methods (3)
5.4.3 A low-level function of RCurl (1)
5.4.4 Maintaining connections across multiple requests (1)
5.4.5 Options (6)
5.4.6 Debugging (4)
5.4.7 Error handling (1)
5.4.8 RCurl or httr—what to use? (1)
Summary (1)
Further reading (2)
Problems (3)

6 AJAX (15)
6.1 JavaScript (4)
6.1.1 How JavaScript is used (1)
6.1.2 DOM manipulation (3)
6.2 XHR (4)
6.2.1 Loading external HTML/XML documents (2)
6.2.2 Loading JSON (1)
6.3 Exploring AJAX with Web Developer Tools (3)
6.3.1 Getting started with Chrome's Web Developer Tools (1)
6.3.2 The Elements panel (1)
6.3.3 The Network panel (1)
Summary (1)
Further reading (1)
Problems (2)

7 SQL and relational databases (32)
7.1 Overview and terminology (2)
7.2 Relational Databases (8)
7.2.1 Storing data in tables (3)
7.2.2 Normalization (4)
7.2.3 Advanced features of relational databases and DBMS (1)
7.3 SQL: a language to communicate with Databases (13)
7.3.1 General remarks on SQL, syntax, and our running example (2)
7.3.2 Data control language—DCL (1)
7.3.3 Data definition language—DDL (2)
7.3.4 Data manipulation language—DML (4)
7.3.5 Clauses (3)
7.3.6 Transaction control language—TCL (1)
7.4 Databases in action (4)
7.4.1 R packages to manage databases (1)
7.4.2 Speaking R-SQL via DBI-based packages (2)
7.4.3 Speaking R-SQL via RODBC (1)
Summary (1)
Further reading (1)
Problems (3)

8 Regular expressions and essential string functions (23)
8.1 Regular expressions (9)
8.1.1 Exact character matching (2)
8.1.2 Generalizing regular expressions (6)
8.1.3 The introductory example reconsidered (1)
8.2 String processing (7)
8.2.1 The stringr package (4)
8.2.2 A couple more handy functions (3)
8.3 A word on character encodings (2)
Summary (1)
Further reading (1)
Problems (2)

Part Two: A Practical Toolbox for Web Scraping and Text Mining (122)

9 Scraping the Web (74)
9.1 Retrieval scenarios (48)
9.1.1 Downloading ready-made files (3)
9.1.2 Downloading multiple files from an FTP index (2)
9.1.3 Manipulating URLs to access multiple pages (4)
9.1.4 Convenient functions to gather links, lists, and tables from HTML documents (3)
9.1.5 Dealing with HTML forms (10)
9.1.6 HTTP authentication (1)
9.1.7 Connections via HTTPS (1)
9.1.8 Using cookies (4)
9.1.9 Scraping data from AJAX-enriched webpages with Selenium/Rwebdriver (8)
9.1.10 Retrieving data from APIs (7)
9.1.11 Authentication with OAuth (4)
9.2 Extraction strategies (8)
9.2.1 Regular expressions (3)
9.2.2 XPath (3)
9.2.3 Application Programming Interfaces (2)
9.3 Web scraping: Good practice (12)
9.3.1 Is web scraping legal? (2)
9.3.2 What is robots.txt? (4)
9.3.3 Be friendly! (6)
9.4 Valuable sources of inspiration (1)
Summary (1)
Further reading (1)
Problems (2)

10 Statistical text processing (27)
10.1 The running example: Classifying press releases of the British government (2)
10.2 Processing textual data (9)
10.2.1 Large-scale text operations—The tm package (5)
10.2.2 Building a term-document matrix (1)
10.2.3 Data cleansing (1)
10.2.4 Sparsity and n-grams (2)
10.3 Supervised learning techniques (6)
10.3.1 Support vector machines (1)
10.3.2 Random Forest (1)
10.3.3 Maximum entropy (1)
10.3.4 The RTextTools package (1)
10.3.5 Application: Government press releases (3)
10.4 Unsupervised learning techniques (7)
10.4.1 Latent Dirichlet Allocation and correlated topic models (1)
10.4.2 Application: Government press releases (6)
Summary (1)
Further reading (2)

11 Managing data projects (19)
11.1 Interacting with the file system (1)
11.2 Processing multiple documents/links (5)
11.2.1 Using for-loops (2)
11.2.2 Using while-loops and control structures (1)
11.2.3 Using the plyr package (1)
11.3 Organizing scraping procedures (6)
11.3.1 Implementation of progress feedback: Messages and progress bars (2)
11.3.2 Error and exception handling (1)
11.4 Executing R scripts on a regular basis (9)
11.4.1 Scheduling tasks on Mac OS and Linux (2)
11.4.2 Scheduling tasks on Windows platforms (4)

Part Three: A Bag of Case Studies (94)

12 Collaboration networks in the US Senate (16)
12.1 Information on the bills (6)
12.2 Information on the senators (3)
12.3 Analyzing the network structure (5)
12.3.1 Descriptive statistics (2)
12.3.2 Network analysis (2)
12.4 Conclusion (1)

13 Parsing information from semistructured documents (12)
13.1 Downloading data from the FTP server (1)
13.2 Parsing semistructured text data (7)
13.3 Visualizing station and temperature data (3)

14 Predicting the 2014 Academy Awards using Twitter (9)
14.1 Twitter APIs: Overview (2)
14.1.1 The REST API (1)
14.1.2 The Streaming APIs (1)
14.1.3 Collecting and preparing the data (1)
14.2 Twitter-based forecast of the 2014 Academy Awards (5)
14.2.1 Visualizing the data (1)
14.2.2 Mining tweets for predictions (4)
14.3 Conclusion (1)

15 Mapping the geographic distribution of names (16)
15.1 Developing a data collection strategy (1)
15.2 Website inspection (2)
15.3 Data retrieval and information extraction (3)
15.4 Mapping names (2)
15.5 Automating the process (6)
Summary (1)

16 Gathering data on mobile phones (20)
16.1 Page exploration (8)
16.1.1 Searching mobile phones of a specific brand (4)
16.1.2 Extracting product information (4)
16.2 Scraping procedure (2)
16.2.1 Retrieving data on several producers (1)
16.2.2 Data cleansing (1)
16.3 Graphical analysis (2)
16.4 Data storage (8)
16.4.1 General considerations (1)
16.4.2 Table definitions for storage (1)
16.4.3 Table definitions for future storage (1)
16.4.4 View definitions for convenient data access (2)
16.4.5 Functions for storing data (2)
16.4.6 Data storage and inspection (1)

17 Analyzing sentiments of product reviews (19)
17.1 Introduction (1)
17.2 Collecting the data (9)
17.2.1 Downloading the files (4)
17.2.2 Information extraction (3)
17.2.3 Database storage (2)
17.3 Analyzing the data (8)
17.3.1 Data preparation (1)
17.3.2 Dictionary-based sentiment analysis (5)
17.3.3 Mining the content of reviews (2)
17.4 Conclusion (1)

References (7)
General index (6)
Package index (1)
Function index

Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis are the authors of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Permalink: https://www.kriso.ee/db/97811188348006e.html
