E-book: Webbots, Spiders, and Screen Scrapers, 2nd Edition

  • Format: EPUB+DRM
  • Publication date: 01-Mar-2012
  • Publisher: No Starch Press, US
  • Language: English
  • ISBN-13: 9781593274320
  • Price: €29.03*
  • * The price is final, i.e. no further discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means you must install special software to read it. You also need to create an Adobe ID (more information here). The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions. (This is a free application designed specifically for reading e-books. It should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

The Internet is bigger and better than a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. The book first outlines the deficiencies of browsers, and then explains how these deficiencies can be exploited in the design and deployment of task-specific webbots.

As they follow along, readers learn how to write stealthy webbots that send and receive email and text messages, manage cookies, and decode encrypted files. Sample projects reinforce these new skills so that readers can create more sophisticated bots to track online prices, download entire websites, and bid on auctions in their closing moments. This second edition of Webbots, Spiders, and Screen Scrapers has been completely updated and revised to cover the latest trends in web crawling, including new chapters on text parsing, browser macros, anonymizers, and more.
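
As a taste of the book's approach, here is a minimal sketch of a page-downloading webbot written with PHP/CURL, the library on which the book's LIB_http routines are built. The target URL, agent name, and timeout below are illustrative placeholders, not examples taken from the book:

    <?php
    // Minimal page-fetching webbot sketch using PHP/CURL.
    // The URL and agent name are placeholders, not values from the book.
    $target = "http://www.example.com/";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);               // page to download
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);               // ...but not indefinitely
    curl_setopt($ch, CURLOPT_USERAGENT, "Test Webbot");   // identify the agent
    curl_setopt($ch, CURLOPT_TIMEOUT, 25);                // give up after 25 seconds

    $page = curl_exec($ch);
    if ($page === false) {
        echo "Download error: " . curl_error($ch) . "\n";
    } else {
        echo "Downloaded " . strlen($page) . " bytes\n";
    }
    curl_close($ch);
    ?>

Chapter 3 and the LIB_http library cover this ground in depth; the sketch above only hints at the pattern that the later projects (price monitors, link checkers, spiders) build on.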

About the Author xxiii
About the Technical Reviewer xxiii
Acknowledgments xxv
Introduction 1(6)
Old-School Client-Server Technology
2(1)
The Problem with Browsers
2(1)
What to Expect from This Book
2(1)
Learn from My Mistakes
3(1)
Master Webbot Techniques
3(1)
Leverage Existing Scripts
3(1)
About the Website
3(1)
About the Code
4(1)
Requirements
5(1)
Hardware
5(1)
Software
6(1)
Internet Access
6(1)
A Disclaimer (This Is Important)
6(1)
PART I FUNDAMENTAL CONCEPTS AND TECHNIQUES
7(84)
1 What's in It for You?
9(6)
Uncovering the Internet's True Potential
9(1)
What's in It for Developers?
10(1)
Webbot Developers Are in Demand
10(1)
Webbots Are Fun to Write
11(1)
Webbots Facilitate "Constructive Hacking"
11(1)
What's in It for Business Leaders?
11(1)
Customize the Internet for Your Business
12(1)
Capitalize on the Public's Inexperience with Webbots
12(1)
Accomplish a Lot with a Small Investment
12(1)
Final Thoughts
12(3)
2 Ideas for Webbot Projects
15(8)
Inspiration from Browser Limitations
15(3)
Webbots That Aggregate and Filter Information for Relevance
16(1)
Webbots That Interpret What They Find Online
17(1)
Webbots That Act on Your Behalf
17(1)
A Few Crazy Ideas to Get You Started
18(4)
Help Out a Busy Executive
18(1)
Save Money by Automating Tasks
19(1)
Protect Intellectual Property
19(1)
Monitor Opportunities
20(1)
Verify Access Rights on a Website
20(1)
Create an Online Clipping Service
20(1)
Plot Unauthorized Wi-Fi Networks
21(1)
Track Web Technologies
21(1)
Allow Incompatible Systems to Communicate
21(1)
Final Thoughts
22(1)
3 Downloading Web Pages
23(14)
Think About Files, Not Web Pages
24(1)
Downloading Files with PHP's Built-in Functions
25(3)
Downloading Files with fopen() and fgets()
25(2)
Downloading Files with file()
27(1)
Introducing PHP/CURL
28(2)
Multiple Transfer Protocols
28(1)
Form Submission
28(1)
Basic Authentication
28(1)
Cookies
29(1)
Redirection
29(1)
Agent Name Spoofing
29(1)
Referer Management
30(1)
Socket Management
30(1)
Installing PHP/CURL
30(1)
LIB_http
30(5)
Familiarizing Yourself with the Default Values
31(1)
Using LIB_http
31(3)
Learning More About HTTP Headers
34(1)
Examining LIB_http's Source Code
35(1)
Final Thoughts
35(2)
4 Basic Parsing Techniques
37(12)
Content Is Mixed with Markup
37(1)
Parsing Poorly Written HTML
38(1)
Standard Parse Routines
38(1)
Using LIB_parse
39(5)
Splitting a String at a Delimiter: split_string()
39(1)
Parsing Text Between Delimiters: return_between()
40(1)
Parsing a Data Set into an Array: parse_array()
41(1)
Parsing Attribute Values: get_attribute()
42(1)
Removing Unwanted Text: remove()
43(1)
Useful PHP Functions
44(2)
Detecting Whether a String Is Within Another String
44(1)
Replacing a Portion of a String with Another String
45(1)
Parsing Unformatted Text
45(1)
Measuring the Similarity of Strings
46(1)
Final Thoughts
46(3)
Don't Trust a Poorly Coded Web Page
46(1)
Parse in Small Steps
46(1)
Don't Render Parsed Text While Debugging
47(1)
Use Regular Expressions Sparingly
47(2)
5 Advanced Parsing with Regular Expressions
49(14)
Pattern Matching, the Key to Regular Expressions
50(1)
PHP Regular Expression Types
50(2)
PHP Regular Expression Functions
50(2)
Resemblance to PHP Built-In Functions
52(1)
Learning Patterns Through Examples
52(3)
Parsing Numbers
53(1)
Detecting a Series of Characters
53(1)
Matching Alpha Characters
53(1)
Matching on Wildcards
54(1)
Specifying Alternate Matches
54(1)
Regular Expression Groupings and Ranges
55(1)
Regular Expressions of Particular Interest to Webbot Developers
55(5)
Parsing Phone Numbers
55(4)
Where to Go from Here
59(1)
When Regular Expressions Are (or Aren't) the Right Parsing Tool
60(2)
Strengths of Regular Expressions
60(1)
Disadvantages of Pattern Matching While Parsing Web Pages
60(2)
Which Are Faster: Regular Expressions or PHP's Built-In Functions?
62(1)
Final Thoughts
62(1)
6 Automating Form Submission
63(14)
Reverse Engineering Form Interfaces
64(1)
Form Handlers, Data Fields, Methods, and Event Triggers
65(5)
Form Handlers
65(1)
Data Fields
66(1)
Methods
67(2)
Multipart Encoding
69(1)
Event Triggers
70(1)
Unpredictable Forms
70(1)
JavaScript Can Change a Form Just Before Submission
70(1)
Form HTML Is Often Unreadable by Humans
70(1)
Cookies Aren't Included in the Form, but Can Affect Operation
70(1)
Analyzing a Form
71(3)
Final Thoughts
74(3)
Don't Blow Your Cover
74(1)
Correctly Emulate Browsers
75(1)
Avoid Form Errors
75(2)
7 Managing Large Amounts of Data
77(14)
Organizing Data
77(8)
Naming Conventions
78(1)
Storing Data in Structured Files
79(1)
Storing Text in a Database
80(3)
Storing Images in a Database
83(2)
Database or File?
85(1)
Making Data Smaller
85(4)
Storing References to Image Files
85(1)
Compressing Data
86(2)
Removing Formatting
88(1)
Thumbnailing Images
89(1)
Final Thoughts
90(1)
PART II PROJECTS
91(80)
8 Price-Monitoring Webbots
93(8)
The Target
94(1)
Designing the Parsing Script
95(1)
Initialization and Downloading the Target
95(5)
Further Exploration
100(1)
9 Image-Capturing Webbots
101(8)
Example Image-Capturing Webbot
102(1)
Creating the Image-Capturing Webbot
102(6)
Binary-Safe Download Routine
103(1)
Directory Structure
104(1)
The Main Script
105(3)
Further Exploration
108(1)
Final Thoughts
108(1)
10 Link-Verification Webbots
109(8)
Creating the Link-Verification Webbot
109(5)
Initializing the Webbot and Downloading the Target
109(1)
Setting the Page Base
110(1)
Parsing the Links
111(1)
Running a Verification Loop
111(1)
Generating Fully Resolved URLs
112(1)
Downloading the Linked Page
113(1)
Displaying the Page Status
113(1)
Running the Webbot
114(1)
LIB_http_codes
114(1)
LIB_resolve_addresses
115(1)
Further Exploration
115(2)
11 Search-Ranking Webbots
117(12)
Description of a Search Result Page
118(2)
What the Search-Ranking Webbot Does
120(1)
Running the Search-Ranking Webbot
120(1)
How the Search-Ranking Webbot Works
120(1)
The Search-Ranking Webbot Script
121(5)
Initializing Variables
121(1)
Starting the Loop
122(1)
Fetching the Search Results
123(1)
Parsing the Search Results
123(3)
Final Thoughts
126(1)
Be Kind to Your Sources
126(1)
Search Sites May Treat Webbots Differently Than Browsers
126(1)
Spidering Search Engines Is a Bad Idea
126(1)
Familiarize Yourself with the Google API
127(1)
Further Exploration
127(2)
12 Aggregation Webbots
129(10)
Choosing Data Sources for Webbots
130(1)
Example Aggregation Webbot
131(4)
Familiarizing Yourself with RSS Feeds
131(2)
Writing the Aggregation Webbot
133(2)
Adding Filtering to Your Aggregation Webbot
135(2)
Further Exploration
137(2)
13 FTP Webbots
139(6)
Example FTP Webbot
140(2)
PHP and FTP
142(1)
Further Exploration
143(2)
14 Webbots That Read Email
145(8)
The POP3 Protocol
146(3)
Logging into a POP3 Mail Server
146(1)
Reading Mail from a POP3 Mail Server
146(3)
Executing POP3 Commands with a Webbot
149(2)
Further Exploration
151(2)
Email-Controlled Webbots
151(1)
Email Interfaces
152(1)
15 Webbots That Send Email
153(10)
Email, Webbots, and Spam
153(1)
Sending Mail with SMTP and PHP
154(3)
Configuring PHP to Send Mail
154(1)
Sending an Email with mail()
155(2)
Writing a Webbot That Sends Email Notifications
157(3)
Keeping Legitimate Mail out of Spam Filters
158(1)
Sending HTML-Formatted Email
159(1)
Further Exploration
160(3)
Using Returned Emails to Prune Access Lists
160(1)
Using Email as Notification That Your Webbot Ran
161(1)
Leveraging Wireless Technologies
161(1)
Writing Webbots That Send Text Messages
161(2)
16 Converting a Website into a Function
163(8)
Writing a Function Interface
164(5)
Defining the Interface
165(1)
Analyzing the Target Web Page
165(2)
Using describe_zipcode()
167(2)
Final Thoughts
169(2)
Distributing Resources
169(1)
Using Standard Interfaces
170(1)
Designing a Custom Lightweight "Web Service"
170(1)
PART III ADVANCED TECHNICAL CONSIDERATIONS
171(92)
17 Spiders
173(12)
How Spiders Work
174(1)
Example Spider
175(1)
LIB_simple_spider
176(4)
harvest_links()
177(1)
archive_links()
178(1)
get_domain()
178(1)
exclude_link()
179(1)
Experimenting with the Spider
180(1)
Adding the Payload
181(1)
Further Exploration
181(4)
Save Links in a Database
181(1)
Separate the Harvest and Payload
182(1)
Distribute Tasks Across Multiple Computers
182(1)
Regulate Page Requests
183(2)
18 Procurement Webbots and Snipers
185(8)
Procurement Webbot Theory
186(2)
Get Purchase Criteria
186(1)
Authenticate Buyer
187(1)
Verify Item
187(1)
Evaluate Purchase Triggers
187(1)
Make Purchase
187(1)
Evaluate Results
188(1)
Sniper Theory
188(3)
Get Purchase Criteria
188(1)
Authenticate Buyer
189(1)
Verify Item
189(1)
Synchronize Clocks
189(2)
Time to Bid?
191(1)
Submit Bid
191(1)
Evaluate Results
191(1)
Testing Your Own Webbots and Snipers
191(1)
Further Exploration
191(1)
Final Thoughts
192(1)
19 Webbots and Cryptography
193(4)
Designing Webbots That Use Encryption
194(1)
SSL and PHP Built-in Functions
194(1)
Encryption and PHP/CURL
194(1)
A Quick Overview of Web Encryption
195(1)
Final Thoughts
196(1)
20 Authentication
197(12)
What Is Authentication?
197(2)
Types of Online Authentication
198(1)
Strengthening Authentication by Combining Techniques
198(1)
Authentication and Webbots
199(1)
Example Scripts and Practice Pages
199(1)
Basic Authentication
199(3)
Session Authentication
202(5)
Authentication with Cookie Sessions
202(3)
Authentication with Query Sessions
205(2)
Final Thoughts
207(2)
21 Advanced Cookie Management
209(6)
How Cookies Work
209(2)
PHP/CURL and Cookies
211(1)
How Cookies Challenge Webbot Design
212(2)
Purging Temporary Cookies
212(1)
Managing Multiple Users' Cookies
213(1)
Further Exploration
214(1)
22 Scheduling Webbots and Spiders
215(12)
Preparing Your Webbots to Run as Scheduled Tasks
216(1)
The Windows XP Task Scheduler
216(4)
Scheduling a Webbot to Run Daily
217(1)
Complex Schedules
218(2)
The Windows 7 Task Scheduler
220(3)
Non-calendar-based Triggers
223(2)
Final Thoughts
225(2)
Determine the Webbot's Best Periodicity
225(1)
Avoid Single Points of Failure
225(1)
Add Variety to Your Schedule
225(2)
23 Scraping Difficult Websites with Browser Macros
227(12)
Barriers to Effective Web Scraping
229(1)
AJAX
229(1)
Bizarre JavaScript and Cookie Behavior
229(1)
Flash
229(1)
Overcoming Webscraping Barriers with Browser Macros
230(7)
What Is a Browser Macro?
230(1)
The Ultimate Browser-Like Webbot
230(1)
Installing and Using iMacros
230(1)
Creating Your First Macro
231(6)
Final Thoughts
237(2)
Are Macros Really Necessary?
237(1)
Other Uses
237(2)
24 Hacking iMacros
239(10)
Hacking iMacros for Added Functionality
240(7)
Reasons for Not Using the iMacros Scripting Engine
240(1)
Creating a Dynamic Macro
241(4)
Launching iMacros Automatically
245(2)
Further Exploration
247(2)
25 Deployment and Scaling
249(14)
One-to-Many Environment
250(1)
One-to-One Environment
251(1)
Many-to-Many Environment
251(1)
Many-to-One Environment
252(1)
Scaling and Denial-of-Service Attacks
252(1)
Even Simple Webbots Can Generate a Lot of Traffic
252(1)
Inefficiencies at the Target
252(1)
The Problems with Scaling Too Well
253(1)
Creating Multiple Instances of a Webbot
253(2)
Forking Processes
253(1)
Leveraging the Operating System
254(1)
Distributing the Task over Multiple Computers
254(1)
Managing a Botnet
255(7)
Botnet Communication Methods
255(7)
Further Exploration
262(1)
PART IV LARGER CONSIDERATIONS
263(64)
26 Designing Stealthy Webbots and Spiders
265(8)
Why Design a Stealthy Webbot?
265(4)
Log Files
266(3)
Log-Monitoring Software
269(1)
Stealth Means Simulating Human Patterns
269(1)
Be Kind to Your Resources
269(1)
Run Your Webbot During Busy Hours
270(1)
Don't Run Your Webbot at the Same Time Each Day
270(1)
Don't Run Your Webbot on Holidays and Weekends
270(1)
Use Random, Intra-fetch Delays
270(1)
Final Thoughts
270(3)
27 Proxies
273(12)
What Is a Proxy?
273(1)
Proxies in the Virtual World
274(1)
Why Webbot Developers Use Proxies
274(3)
Using Proxies to Become Anonymous
274(3)
Using a Proxy to Be Somewhere Else
277(1)
Using a Proxy Server
277(1)
Using a Proxy in a Browser
278(1)
Using a Proxy with PHP/CURL
278(1)
Types of Proxy Servers
278(5)
Open Proxies
279(2)
Tor
281(1)
Commercial Proxies
282(1)
Final Thoughts
283(2)
Anonymity Is a Process, Not a Feature
283(1)
Creating Your Own Proxy Service
283(2)
28 Writing Fault-Tolerant Webbots
285(12)
Types of Webbot Fault Tolerance
286(9)
Adapting to Changes in URLs
286(5)
Adapting to Changes in Page Content
291(1)
Adapting to Changes in Forms
292(2)
Adapting to Changes in Cookie Management
294(1)
Adapting to Network Outages and Network Congestion
294(1)
Error Handlers
295(1)
Further Exploration
296(1)
29 Designing Webbot-Friendly Websites
297(12)
Optimizing Web Pages for Search Engine Spiders
297(3)
Well-Defined Links
298(1)
Google Bombs and Spam Indexing
298(1)
Title Tags
298(1)
Meta Tags
299(1)
Header Tags
299(1)
Image alt Attributes
300(1)
Web Design Techniques That Hinder Search Engine Spiders
300(1)
JavaScript
300(1)
Non-ASCII Content
301(1)
Designing Data-Only Interfaces
301(6)
XML
301(1)
Lightweight Data Exchange
302(3)
SOAP
305(1)
REST
306(1)
Final Thoughts
307(2)
30 Killing Spiders
309(8)
Asking Nicely
310(2)
Create a Terms of Service Agreement
310(1)
Use the robots.txt File
311(1)
Use the Robots Meta Tag
312(1)
Building Speed Bumps
312(3)
Selectively Allow Access to Specific Web Agents
312(1)
Use Obfuscation
313(1)
Use Cookies, Encryption, JavaScript, and Redirection
313(1)
Authenticate Users
314(1)
Update Your Site Often
314(1)
Embed Text in Other Media
314(1)
Setting Traps
315(1)
Create a Spider Trap
315(1)
Fun Things to Do with Unwanted Spiders
316(1)
Final Thoughts
316(1)
31 Keeping Webbots Out of Trouble
317(10)
It's All About Respect
318(1)
Copyright
319(3)
Do Consult Resources
319(1)
Don't Be an Armchair Lawyer
319(3)
Trespass to Chattels
322(2)
Internet Law
324(1)
Final Thoughts
325(2)
A PHP/CURL REFERENCE
327(10)
Creating a Minimal PHP/CURL Session
327(1)
Initiating PHP/CURL Sessions
328(1)
Setting PHP/CURL Options
328(5)
CURLOPT_URL
329(1)
CURLOPT_RETURNTRANSFER
329(1)
CURLOPT_REFERER
329(1)
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
329(1)
CURLOPT_USERAGENT
330(1)
CURLOPT_NOBODY and CURLOPT_HEADER
330(1)
CURLOPT_TIMEOUT
331(1)
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
331(1)
CURLOPT_HTTPHEADER
331(1)
CURLOPT_SSL_VERIFYPEER
332(1)
CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
332(1)
CURLOPT_POST and CURLOPT_POSTFIELDS
332(1)
CURLOPT_VERBOSE
333(1)
CURLOPT_PORT
333(1)
Executing the PHP/CURL Command
333(2)
Retrieving PHP/CURL Session Information
334(1)
Viewing PHP/CURL Errors
334(1)
Closing PHP/CURL Sessions
335(2)
B STATUS CODES
337(4)
HTTP Codes
337(2)
NNTP Codes
339(2)
C SMS GATEWAYS
341(4)
Sending Text Messages
342(1)
Reading Text Messages
342(1)
A Sampling of Text Message Email Addresses
342(3)
Index 345