E-book: Webbots, Spiders, and Screen Scrapers, 2nd Edition

  • Format: EPUB+DRM
  • Publication date: 01-Mar-2012
  • Publisher: No Starch Press, US
  • Language: English
  • ISBN-13: 9781593274320
  • Price: €29.03*
  • * The price is final, i.e. no further discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means you must install special software to read it. You also need to create an Adobe ID (more information here). The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions. (This is a free application designed specifically for reading e-books. It should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

The Internet is bigger and better than a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. The book first outlines the deficiencies of browsers, and then explains how these deficiencies can be exploited in the design and deployment of task-specific webbots.

As they follow along, readers learn how to write stealthy webbots that send and receive email and text messages, manage cookies, and decode encrypted files. Sample projects reinforce these new skills so that readers can create more sophisticated bots to track online prices, download entire websites, and bid on auctions in their closing moments. This second edition of Webbots, Spiders, and Screen Scrapers has been completely updated and revised to cover the latest trends in web crawling, including new chapters on text parsing, browser macros, anonymizers, and more.
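
As a taste of the book's approach, here is a minimal sketch of a page-downloading webbot written with PHP/CURL, the library on which the book's LIB_http routines are built. The target URL, agent name, and timeout below are illustrative placeholders, not examples taken from the book:

    <?php
    // Minimal page-fetching webbot sketch using PHP/CURL.
    // The URL and agent name are placeholders, not values from the book.
    $target = "http://www.example.com/";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);               // page to download
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);               // ...but not indefinitely
    curl_setopt($ch, CURLOPT_USERAGENT, "Test Webbot");   // identify the agent
    curl_setopt($ch, CURLOPT_TIMEOUT, 25);                // give up after 25 seconds

    $page = curl_exec($ch);
    if ($page === false) {
        echo "Download error: " . curl_error($ch) . "\n";
    } else {
        echo "Downloaded " . strlen($page) . " bytes\n";
    }
    curl_close($ch);
    ?>

Chapter 3 and the LIB_http library cover this ground in depth; the sketch above only hints at the pattern that the later projects (price monitors, link checkers, spiders) build on.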

About the Author xxiii
About the Technical Reviewer xxiii
Acknowledgments xxv
Introduction 1(6)
Old-School Client-Server Technology
2(1)
The Problem with Browsers
2(1)
What to Expect from This Book
2(1)
Learn from My Mistakes
3(1)
Master Webbot Techniques
3(1)
Leverage Existing Scripts
3(1)
About the Website
3(1)
About the Code
4(1)
Requirements
5(1)
Hardware
5(1)
Software
6(1)
Internet Access
6(1)
A Disclaimer (This Is Important)
6(1)
PART I FUNDAMENTAL CONCEPTS AND TECHNIQUES
7(84)
1 What's in It for You?
9(6)
Uncovering the Internet's True Potential
9(1)
What's in It for Developers?
10(1)
Webbot Developers Are in Demand
10(1)
Webbots Are Fun to Write
11(1)
Webbots Facilitate "Constructive Hacking"
11(1)
What's in It for Business Leaders?
11(1)
Customize the Internet for Your Business
12(1)
Capitalize on the Public's Inexperience with Webbots
12(1)
Accomplish a Lot with a Small Investment
12(1)
Final Thoughts
12(3)
2 Ideas for Webbot Projects
15(8)
Inspiration from Browser Limitations
15(3)
Webbots That Aggregate and Filter Information for Relevance
16(1)
Webbots That Interpret What They Find Online
17(1)
Webbots That Act on Your Behalf
17(1)
A Few Crazy Ideas to Get You Started
18(4)
Help Out a Busy Executive
18(1)
Save Money by Automating Tasks
19(1)
Protect Intellectual Property
19(1)
Monitor Opportunities
20(1)
Verify Access Rights on a Website
20(1)
Create an Online Clipping Service
20(1)
Plot Unauthorized Wi-Fi Networks
21(1)
Track Web Technologies
21(1)
Allow Incompatible Systems to Communicate
21(1)
Final Thoughts
22(1)
3 Downloading Web Pages
23(14)
Think About Files, Not Web Pages
24(1)
Downloading Files with PHP's Built-in Functions
25(3)
Downloading Files with fopen() and fgets()
25(2)
Downloading Files with file()
27(1)
Introducing PHP/CURL
28(2)
Multiple Transfer Protocols
28(1)
Form Submission
28(1)
Basic Authentication
28(1)
Cookies
29(1)
Redirection
29(1)
Agent Name Spoofing
29(1)
Referer Management
30(1)
Socket Management
30(1)
Installing PHP/CURL
30(1)
LIB_http
30(5)
Familiarizing Yourself with the Default Values
31(1)
Using LIB_http
31(3)
Learning More About HTTP Headers
34(1)
Examining LIB_http's Source Code
35(1)
Final Thoughts
35(2)
4 Basic Parsing Techniques
37(12)
Content Is Mixed with Markup
37(1)
Parsing Poorly Written HTML
38(1)
Standard Parse Routines
38(1)
Using LIB_parse
39(5)
Splitting a String at a Delimiter: split_string()
39(1)
Parsing Text Between Delimiters: return_between()
40(1)
Parsing a Data Set into an Array: parse_array()
41(1)
Parsing Attribute Values: get_attribute()
42(1)
Removing Unwanted Text: remove()
43(1)
Useful PHP Functions
44(2)
Detecting Whether a String Is Within Another String
44(1)
Replacing a Portion of a String with Another String
45(1)
Parsing Unformatted Text
45(1)
Measuring the Similarity of Strings
46(1)
Final Thoughts
46(3)
Don't Trust a Poorly Coded Web Page
46(1)
Parse in Small Steps
46(1)
Don't Render Parsed Text While Debugging
47(1)
Use Regular Expressions Sparingly
47(2)
5 Advanced Parsing with Regular Expressions
49(14)
Pattern Matching, the Key to Regular Expressions
50(1)
PHP Regular Expression Types
50(2)
PHP Regular Expression Functions
50(2)
Resemblance to PHP Built-In Functions
52(1)
Learning Patterns Through Examples
52(3)
Parsing Numbers
53(1)
Detecting a Series of Characters
53(1)
Matching Alpha Characters
53(1)
Matching on Wildcards
54(1)
Specifying Alternate Matches
54(1)
Regular Expression Groupings and Ranges
55(1)
Regular Expressions of Particular Interest to Webbot Developers
55(5)
Parsing Phone Numbers
55(4)
Where to Go from Here
59(1)
When Regular Expressions Are (or Aren't) the Right Parsing Tool
60(2)
Strengths of Regular Expressions
60(1)
Disadvantages of Pattern Matching While Parsing Web Pages
60(2)
Which Are Faster: Regular Expressions or PHP's Built-In Functions?
62(1)
Final Thoughts
62(1)
6 Automating Form Submission
63(14)
Reverse Engineering Form Interfaces
64(1)
Form Handlers, Data Fields, Methods, and Event Triggers
65(5)
Form Handlers
65(1)
Data Fields
66(1)
Methods
67(2)
Multipart Encoding
69(1)
Event Triggers
70(1)
Unpredictable Forms
70(1)
JavaScript Can Change a Form Just Before Submission
70(1)
Form HTML Is Often Unreadable by Humans
70(1)
Cookies Aren't Included in the Form, but Can Affect Operation
70(1)
Analyzing a Form
71(3)
Final Thoughts
74(3)
Don't Blow Your Cover
74(1)
Correctly Emulate Browsers
75(1)
Avoid Form Errors
75(2)
7 Managing Large Amounts of Data
77(14)
Organizing Data
77(8)
Naming Conventions
78(1)
Storing Data in Structured Files
79(1)
Storing Text in a Database
80(3)
Storing Images in a Database
83(2)
Database or File?
85(1)
Making Data Smaller
85(4)
Storing References to Image Files
85(1)
Compressing Data
86(2)
Removing Formatting
88(1)
Thumbnailing Images
89(1)
Final Thoughts
90(1)
PART II PROJECTS
91(80)
8 Price-Monitoring Webbots
93(8)
The Target
94(1)
Designing the Parsing Script
95(1)
Initialization and Downloading the Target
95(5)
Further Exploration
100(1)
9 Image-Capturing Webbots
101(8)
Example Image-Capturing Webbot
102(1)
Creating the Image-Capturing Webbot
102(6)
Binary-Safe Download Routine
103(1)
Directory Structure
104(1)
The Main Script
105(3)
Further Exploration
108(1)
Final Thoughts
108(1)
10 Link-Verification Webbots
109(8)
Creating the Link-Verification Webbot
109(5)
Initializing the Webbot and Downloading the Target
109(1)
Setting the Page Base
110(1)
Parsing the Links
111(1)
Running a Verification Loop
111(1)
Generating Fully Resolved URLs
112(1)
Downloading the Linked Page
113(1)
Displaying the Page Status
113(1)
Running the Webbot
114(1)
LIB_http_codes
114(1)
LIB_resolve_addresses
115(1)
Further Exploration
115(2)
11 Search-Ranking Webbots
117(12)
Description of a Search Result Page
118(2)
What the Search-Ranking Webbot Does
120(1)
Running the Search-Ranking Webbot
120(1)
How the Search-Ranking Webbot Works
120(1)
The Search-Ranking Webbot Script
121(5)
Initializing Variables
121(1)
Starting the Loop
122(1)
Fetching the Search Results
123(1)
Parsing the Search Results
123(3)
Final Thoughts
126(1)
Be Kind to Your Sources
126(1)
Search Sites May Treat Webbots Differently Than Browsers
126(1)
Spidering Search Engines Is a Bad Idea
126(1)
Familiarize Yourself with the Google API
127(1)
Further Exploration
127(2)
12 Aggregation Webbots
129(10)
Choosing Data Sources for Webbots
130(1)
Example Aggregation Webbot
131(4)
Familiarizing Yourself with RSS Feeds
131(2)
Writing the Aggregation Webbot
133(2)
Adding Filtering to Your Aggregation Webbot
135(2)
Further Exploration
137(2)
13 FTP Webbots
139(6)
Example FTP Webbot
140(2)
PHP and FTP
142(1)
Further Exploration
143(2)
14 Webbots That Read Email
145(8)
The POP3 Protocol
146(3)
Logging into a POP3 Mail Server
146(1)
Reading Mail from a POP3 Mail Server
146(3)
Executing POP3 Commands with a Webbot
149(2)
Further Exploration
151(2)
Email-Controlled Webbots
151(1)
Email Interfaces
152(1)
15 Webbots That Send Email
153(10)
Email, Webbots, and Spam
153(1)
Sending Mail with SMTP and PHP
154(3)
Configuring PHP to Send Mail
154(1)
Sending an Email with mail()
155(2)
Writing a Webbot That Sends Email Notifications
157(3)
Keeping Legitimate Mail out of Spam Filters
158(1)
Sending HTML-Formatted Email
159(1)
Further Exploration
160(3)
Using Returned Emails to Prune Access Lists
160(1)
Using Email as Notification That Your Webbot Ran
161(1)
Leveraging Wireless Technologies
161(1)
Writing Webbots That Send Text Messages
161(2)
16 Converting a Website into a Function
163(8)
Writing a Function Interface
164(5)
Defining the Interface
165(1)
Analyzing the Target Web Page
165(2)
Using describe_zipcode()
167(2)
Final Thoughts
169(2)
Distributing Resources
169(1)
Using Standard Interfaces
170(1)
Designing a Custom Lightweight "Web Service"
170(1)
PART III ADVANCED TECHNICAL CONSIDERATIONS
171(92)
17 Spiders
173(12)
How Spiders Work
174(1)
Example Spider
175(1)
LIB_simple_spider
176(4)
harvest_links()
177(1)
archive_links()
178(1)
get_domain()
178(1)
exclude_link()
179(1)
Experimenting with the Spider
180(1)
Adding the Payload
181(1)
Further Exploration
181(4)
Save Links in a Database
181(1)
Separate the Harvest and Payload
182(1)
Distribute Tasks Across Multiple Computers
182(1)
Regulate Page Requests
183(2)
18 Procurement Webbots and Snipers
185(8)
Procurement Webbot Theory
186(2)
Get Purchase Criteria
186(1)
Authenticate Buyer
187(1)
Verify Item
187(1)
Evaluate Purchase Triggers
187(1)
Make Purchase
187(1)
Evaluate Results
188(1)
Sniper Theory
188(3)
Get Purchase Criteria
188(1)
Authenticate Buyer
189(1)
Verify Item
189(1)
Synchronize Clocks
189(2)
Time to Bid?
191(1)
Submit Bid
191(1)
Evaluate Results
191(1)
Testing Your Own Webbots and Snipers
191(1)
Further Exploration
191(1)
Final Thoughts
192(1)
19 Webbots and Cryptography
193(4)
Designing Webbots That Use Encryption
194(1)
SSL and PHP Built-in Functions
194(1)
Encryption and PHP/CURL
194(1)
A Quick Overview of Web Encryption
195(1)
Final Thoughts
196(1)
20 Authentication
197(12)
What Is Authentication?
197(2)
Types of Online Authentication
198(1)
Strengthening Authentication by Combining Techniques
198(1)
Authentication and Webbots
199(1)
Example Scripts and Practice Pages
199(1)
Basic Authentication
199(3)
Session Authentication
202(5)
Authentication with Cookie Sessions
202(3)
Authentication with Query Sessions
205(2)
Final Thoughts
207(2)
21 Advanced Cookie Management
209(6)
How Cookies Work
209(2)
PHP/CURL and Cookies
211(1)
How Cookies Challenge Webbot Design
212(2)
Purging Temporary Cookies
212(1)
Managing Multiple Users' Cookies
213(1)
Further Exploration
214(1)
22 Scheduling Webbots and Spiders
215(12)
Preparing Your Webbots to Run as Scheduled Tasks
216(1)
The Windows XP Task Scheduler
216(4)
Scheduling a Webbot to Run Daily
217(1)
Complex Schedules
218(2)
The Windows 7 Task Scheduler
220(3)
Non-calendar-based Triggers
223(2)
Final Thoughts
225(2)
Determine the Webbot's Best Periodicity
225(1)
Avoid Single Points of Failure
225(1)
Add Variety to Your Schedule
225(2)
23 Scraping Difficult Websites with Browser Macros
227(12)
Barriers to Effective Web Scraping
229(1)
AJAX
229(1)
Bizarre JavaScript and Cookie Behavior
229(1)
Flash
229(1)
Overcoming Webscraping Barriers with Browser Macros
230(7)
What Is a Browser Macro?
230(1)
The Ultimate Browser-Like Webbot
230(1)
Installing and Using iMacros
230(1)
Creating Your First Macro
231(6)
Final Thoughts
237(2)
Are Macros Really Necessary?
237(1)
Other Uses
237(2)
24 Hacking iMacros
239(10)
Hacking iMacros for Added Functionality
240(7)
Reasons for Not Using the iMacros Scripting Engine
240(1)
Creating a Dynamic Macro
241(4)
Launching iMacros Automatically
245(2)
Further Exploration
247(2)
25 Deployment and Scaling
249(14)
One-to-Many Environment
250(1)
One-to-One Environment
251(1)
Many-to-Many Environment
251(1)
Many-to-One Environment
252(1)
Scaling and Denial-of-Service Attacks
252(1)
Even Simple Webbots Can Generate a Lot of Traffic
252(1)
Inefficiencies at the Target
252(1)
The Problems with Scaling Too Well
253(1)
Creating Multiple Instances of a Webbot
253(2)
Forking Processes
253(1)
Leveraging the Operating System
254(1)
Distributing the Task over Multiple Computers
254(1)
Managing a Botnet
255(7)
Botnet Communication Methods
255(7)
Further Exploration
262(1)
PART IV LARGER CONSIDERATIONS
263(64)
26 Designing Stealthy Webbots and Spiders
265(8)
Why Design a Stealthy Webbot?
265(4)
Log Files
266(3)
Log-Monitoring Software
269(1)
Stealth Means Simulating Human Patterns
269(1)
Be Kind to Your Resources
269(1)
Run Your Webbot During Busy Hours
270(1)
Don't Run Your Webbot at the Same Time Each Day
270(1)
Don't Run Your Webbot on Holidays and Weekends
270(1)
Use Random, Intra-fetch Delays
270(1)
Final Thoughts
270(3)
27 Proxies
273(12)
What Is a Proxy?
273(1)
Proxies in the Virtual World
274(1)
Why Webbot Developers Use Proxies
274(3)
Using Proxies to Become Anonymous
274(3)
Using a Proxy to Be Somewhere Else
277(1)
Using a Proxy Server
277(1)
Using a Proxy in a Browser
278(1)
Using a Proxy with PHP/CURL
278(1)
Types of Proxy Servers
278(5)
Open Proxies
279(2)
Tor
281(1)
Commercial Proxies
282(1)
Final Thoughts
283(2)
Anonymity Is a Process, Not a Feature
283(1)
Creating Your Own Proxy Service
283(2)
28 Writing Fault-Tolerant Webbots
285(12)
Types of Webbot Fault Tolerance
286(9)
Adapting to Changes in URLs
286(5)
Adapting to Changes in Page Content
291(1)
Adapting to Changes in Forms
292(2)
Adapting to Changes in Cookie Management
294(1)
Adapting to Network Outages and Network Congestion
294(1)
Error Handlers
295(1)
Further Exploration
296(1)
29 Designing Webbot-Friendly Websites
297(12)
Optimizing Web Pages for Search Engine Spiders
297(3)
Well-Defined Links
298(1)
Google Bombs and Spam Indexing
298(1)
Title Tags
298(1)
Meta Tags
299(1)
Header Tags
299(1)
Image alt Attributes
300(1)
Web Design Techniques That Hinder Search Engine Spiders
300(1)
JavaScript
300(1)
Non-ASCII Content
301(1)
Designing Data-Only Interfaces
301(6)
XML
301(1)
Lightweight Data Exchange
302(3)
SOAP
305(1)
REST
306(1)
Final Thoughts
307(2)
30 Killing Spiders
309(8)
Asking Nicely
310(2)
Create a Terms of Service Agreement
310(1)
Use the robots.txt File
311(1)
Use the Robots Meta Tag
312(1)
Building Speed Bumps
312(3)
Selectively Allow Access to Specific Web Agents
312(1)
Use Obfuscation
313(1)
Use Cookies, Encryption, JavaScript, and Redirection
313(1)
Authenticate Users
314(1)
Update Your Site Often
314(1)
Embed Text in Other Media
314(1)
Setting Traps
315(1)
Create a Spider Trap
315(1)
Fun Things to Do with Unwanted Spiders
316(1)
Final Thoughts
316(1)
31 Keeping Webbots Out of Trouble
317(10)
It's All About Respect
318(1)
Copyright
319(3)
Do Consult Resources
319(1)
Don't Be an Armchair Lawyer
319(3)
Trespass to Chattels
322(2)
Internet Law
324(1)
Final Thoughts
325(2)
A PHP/CURL REFERENCE
327(10)
Creating a Minimal PHP/CURL Session
327(1)
Initiating PHP/CURL Sessions
328(1)
Setting PHP/CURL Options
328(5)
CURLOPT_URL
329(1)
CURLOPT_RETURNTRANSFER
329(1)
CURLOPT_REFERER
329(1)
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
329(1)
CURLOPT_USERAGENT
330(1)
CURLOPT_NOBODY and CURLOPT_HEADER
330(1)
CURLOPT_TIMEOUT
331(1)
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
331(1)
CURLOPT_HTTPHEADER
331(1)
CURLOPT_SSL_VERIFYPEER
332(1)
CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
332(1)
CURLOPT_POST and CURLOPT_POSTFIELDS
332(1)
CURLOPT_VERBOSE
333(1)
CURLOPT_PORT
333(1)
Executing the PHP/CURL Command
333(2)
Retrieving PHP/CURL Session Information
334(1)
Viewing PHP/CURL Errors
334(1)
Closing PHP/CURL Sessions
335(2)
B STATUS CODES
337(4)
HTTP Codes
337(2)
NNTP Codes
339(2)
C SMS GATEWAYS
341(4)
Sending Text Messages
342(1)
Reading Text Messages
342(1)
A Sampling of Text Message Email Addresses
342(3)
Index 345