About the Author |
|
xxiii | |
About the Technical Reviewer |
|
xxiii | |
Acknowledgments |
|
xxv | |
Introduction |
|
1 | (6) |
|
Old-School Client-Server Technology |
|
|
2 | (1) |
|
The Problem with Browsers |
|
|
2 | (1) |
|
What to Expect from This Book |
|
|
2 | (1) |
|
|
3 | (1) |
|
|
3 | (1) |
|
Leverage Existing Scripts |
|
|
3 | (1) |
|
|
3 | (1) |
|
|
4 | (1) |
|
|
5 | (1) |
|
|
5 | (1) |
|
|
6 | (1) |
|
|
6 | (1) |
|
A Disclaimer (This Is Important) |
|
|
6 | (1) |
|
PART I FUNDAMENTAL CONCEPTS AND TECHNIQUES |
|
|
7 | (84) |
|
|
9 | (6) |
|
Uncovering the Internet's True Potential |
|
|
9 | (1) |
|
What's in It for Developers? |
|
|
10 | (1) |
|
Webbot Developers Are in Demand |
|
|
10 | (1) |
|
|
11 | (1) |
|
Webbots Facilitate "Constructive Hacking" |
|
|
11 | (1) |
|
What's in It for Business Leaders? |
|
|
11 | (1) |
|
Customize the Internet for Your Business |
|
|
12 | (1) |
|
Capitalize on the Public's Inexperience with Webbots |
|
|
12 | (1) |
|
Accomplish a Lot with a Small Investment |
|
|
12 | (1) |
|
|
12 | (3) |
|
2 Ideas for Webbot Projects |
|
|
15 | (8) |
|
Inspiration from Browser Limitations |
|
|
15 | (3) |
|
Webbots That Aggregate and Filter Information for Relevance |
|
|
16 | (1) |
|
Webbots That Interpret What They Find Online |
|
|
17 | (1) |
|
Webbots That Act on Your Behalf |
|
|
17 | (1) |
|
A Few Crazy Ideas to Get You Started |
|
|
18 | (4) |
|
Help Out a Busy Executive |
|
|
18 | (1) |
|
Save Money by Automating Tasks |
|
|
19 | (1) |
|
Protect Intellectual Property |
|
|
19 | (1) |
|
|
20 | (1) |
|
Verify Access Rights on a Website |
|
|
20 | (1) |
|
Create an Online Clipping Service |
|
|
20 | (1) |
|
Plot Unauthorized Wi-Fi Networks |
|
|
21 | (1) |
|
|
21 | (1) |
|
Allow Incompatible Systems to Communicate |
|
|
21 | (1) |
|
|
22 | (1) |
|
|
23 | (14) |
|
Think About Files, Not Web Pages |
|
|
24 | (1) |
|
Downloading Files with PHP's Built-in Functions |
|
|
25 | (3) |
|
Downloading Files with fopen() and fgets() |
|
|
25 | (2) |
|
Downloading Files with file() |
|
|
27 | (1) |
|
|
28 | (2) |
|
Multiple Transfer Protocols |
|
|
28 | (1) |
|
|
28 | (1) |
|
|
28 | (1) |
|
|
29 | (1) |
|
|
29 | (1) |
|
|
29 | (1) |
|
|
30 | (1) |
|
|
30 | (1) |
|
|
30 | (1) |
|
|
30 | (5) |
|
Familiarizing Yourself with the Default Values |
|
|
31 | (1) |
|
|
31 | (3) |
|
Learning More About HTTP Headers |
|
|
34 | (1) |
|
Examining LIB_http's Source Code |
|
|
35 | (1) |
|
|
35 | (2) |
|
4 Basic Parsing Techniques |
|
|
37 | (12) |
|
Content Is Mixed with Markup |
|
|
37 | (1) |
|
Parsing Poorly Written HTML |
|
|
38 | (1) |
|
|
38 | (1) |
|
|
39 | (5) |
|
Splitting a String at a Delimiter: split_string() |
|
|
39 | (1) |
|
Parsing Text Between Delimiters: return_between() |
|
|
40 | (1) |
|
Parsing a Data Set into an Array: parse_array() |
|
|
41 | (1) |
|
Parsing Attribute Values: get_attribute() |
|
|
42 | (1) |
|
Removing Unwanted Text: remove() |
|
|
43 | (1) |
|
|
44 | (2) |
|
Detecting Whether a String Is Within Another String |
|
|
44 | (1) |
|
Replacing a Portion of a String with Another String |
|
|
45 | (1) |
|
|
45 | (1) |
|
Measuring the Similarity of Strings |
|
|
46 | (1) |
|
|
46 | (3) |
|
Don't Trust a Poorly Coded Web Page |
|
|
46 | (1) |
|
|
46 | (1) |
|
Don't Render Parsed Text While Debugging |
|
|
47 | (1) |
|
Use Regular Expressions Sparingly |
|
|
47 | (2) |
|
5 Advanced Parsing with Regular Expressions |
|
|
49 | (14) |
|
Pattern Matching, the Key to Regular Expressions |
|
|
50 | (1) |
|
PHP Regular Expression Types |
|
|
50 | (2) |
|
PHP Regular Expressions Functions |
|
|
50 | (2) |
|
Resemblance to PHP Built-In Functions |
|
|
52 | (1) |
|
Learning Patterns Through Examples |
|
|
52 | (3) |
|
|
53 | (1) |
|
Detecting a Series of Characters |
|
|
53 | (1) |
|
Matching Alpha Characters |
|
|
53 | (1) |
|
|
54 | (1) |
|
Specifying Alternate Matches |
|
|
54 | (1) |
|
Regular Expressions Groupings and Ranges |
|
|
55 | (1) |
|
Regular Expressions of Particular Interest to Webbot Developers |
|
|
55 | (5) |
|
|
55 | (4) |
|
|
59 | (1) |
|
When Regular Expressions Are (or Aren't) the Right Parsing Tool |
|
|
60 | (2) |
|
Strengths of Regular Expressions |
|
|
60 | (1) |
|
Disadvantages of Pattern Matching While Parsing Web Pages |
|
|
60 | (2) |
|
Which Are Faster: Regular Expressions or PHP's Built-In Functions? |
|
|
62 | (1) |
|
|
62 | (1) |
|
6 Automating Form Submission |
|
|
63 | (14) |
|
Reverse Engineering Form Interfaces |
|
|
64 | (1) |
|
Form Handlers, Data Fields, Methods, and Event Triggers |
|
|
65 | (5) |
|
|
65 | (1) |
|
|
66 | (1) |
|
|
67 | (2) |
|
|
69 | (1) |
|
|
70 | (1) |
|
|
70 | (1) |
|
JavaScript Can Change a Form Just Before Submission |
|
|
70 | (1) |
|
Form HTML Is Often Unreadable by Humans |
|
|
70 | (1) |
|
Cookies Aren't Included in the Form, but Can Affect Operation |
|
|
70 | (1) |
|
|
71 | (3) |
|
|
74 | (3) |
|
|
74 | (1) |
|
Correctly Emulate Browsers |
|
|
75 | (1) |
|
|
75 | (2) |
|
7 Managing Large Amounts of Data |
|
|
77 | (14) |
|
|
77 | (8) |
|
|
78 | (1) |
|
Storing Data in Structured Files |
|
|
79 | (1) |
|
Storing Text in a Database |
|
|
80 | (3) |
|
Storing Images in a Database |
|
|
83 | (2) |
|
|
85 | (1) |
|
|
85 | (4) |
|
Storing References to Image Files |
|
|
85 | (1) |
|
|
86 | (2) |
|
|
88 | (1) |
|
|
89 | (1) |
|
|
90 | (1) |
|
|
91 | (80) |
|
8 Price-Monitoring Webbots |
|
|
93 | (8) |
|
|
94 | (1) |
|
Designing the Parsing Script |
|
|
95 | (1) |
|
Initialization and Downloading the Target |
|
|
95 | (5) |
|
|
100 | (1) |
|
9 Image-Capturing Webbots |
|
|
101 | (8) |
|
Example Image-Capturing Webbot |
|
|
102 | (1) |
|
Creating the Image-Capturing Webbot |
|
|
102 | (6) |
|
Binary-Safe Download Routine |
|
|
103 | (1) |
|
|
104 | (1) |
|
|
105 | (3) |
|
|
108 | (1) |
|
|
108 | (1) |
|
10 Link-Verification Webbots |
|
|
109 | (8) |
|
Creating the Link-Verification Webbot |
|
|
109 | (5) |
|
Initializing the Webbot and Downloading the Target |
|
|
109 | (1) |
|
|
110 | (1) |
|
|
111 | (1) |
|
Running a Verification Loop |
|
|
111 | (1) |
|
Generating Fully Resolved URLs |
|
|
112 | (1) |
|
Downloading the Linked Page |
|
|
113 | (1) |
|
Displaying the Page Status |
|
|
113 | (1) |
|
|
114 | (1) |
|
|
114 | (1) |
|
|
115 | (1) |
|
|
115 | (2) |
|
11 Search-Ranking Webbots |
|
|
117 | (12) |
|
Description of a Search Result Page |
|
|
118 | (2) |
|
What the Search-Ranking Webbot Does |
|
|
120 | (1) |
|
Running the Search-Ranking Webbot |
|
|
120 | (1) |
|
How the Search-Ranking Webbot Works |
|
|
120 | (1) |
|
The Search-Ranking Webbot Script |
|
|
121 | (5) |
|
|
121 | (1) |
|
|
122 | (1) |
|
Fetching the Search Results |
|
|
123 | (1) |
|
Parsing the Search Results |
|
|
123 | (3) |
|
|
126 | (1) |
|
|
126 | (1) |
|
Search Sites May Treat Webbots Differently Than Browsers |
|
|
126 | (1) |
|
Spidering Search Engines Is a Bad Idea |
|
|
126 | (1) |
|
Familiarize Yourself with the Google API |
|
|
127 | (1) |
|
|
127 | (2) |
|
|
129 | (10) |
|
Choosing Data Sources for Webbots |
|
|
130 | (1) |
|
Example Aggregation Webbot |
|
|
131 | (4) |
|
Familiarizing Yourself with RSS Feeds |
|
|
131 | (2) |
|
Writing the Aggregation Webbot |
|
|
133 | (2) |
|
Adding Filtering to Your Aggregation Webbot |
|
|
135 | (2) |
|
|
137 | (2) |
|
|
139 | (6) |
|
|
140 | (2) |
|
|
142 | (1) |
|
|
143 | (2) |
|
14 Webbots That Read Email |
|
|
145 | (8) |
|
|
146 | (3) |
|
Logging into a POP3 Mail Server |
|
|
146 | (1) |
|
Reading Mail from a POP3 Mail Server |
|
|
146 | (3) |
|
Executing POP3 Commands with a Webbot |
|
|
149 | (2) |
|
|
151 | (2) |
|
|
151 | (1) |
|
|
152 | (1) |
|
15 Webbots That Send Email |
|
|
153 | (10) |
|
|
153 | (1) |
|
Sending Mail with SMTP and PHP |
|
|
154 | (3) |
|
Configuring PHP to Send Mail |
|
|
154 | (1) |
|
Sending an Email with mail() |
|
|
155 | (2) |
|
Writing a Webbot That Sends Email Notifications |
|
|
157 | (3) |
|
Keeping Legitimate Mail out of Spam Filters |
|
|
158 | (1) |
|
Sending HTML-Formatted Email |
|
|
159 | (1) |
|
|
160 | (3) |
|
Using Returned Emails to Prune Access Lists |
|
|
160 | (1) |
|
Using Email as Notification That Your Webbot Ran |
|
|
161 | (1) |
|
Leveraging Wireless Technologies |
|
|
161 | (1) |
|
Writing Webbots That Send Text Messages |
|
|
161 | (2) |
|
16 Converting a Website into a Function |
|
|
163 | (8) |
|
Writing a Function Interface |
|
|
164 | (5) |
|
|
165 | (1) |
|
Analyzing the Target Web Page |
|
|
165 | (2) |
|
|
167 | (2) |
|
|
169 | (2) |
|
|
169 | (1) |
|
Using Standard Interfaces |
|
|
170 | (1) |
|
Designing a Custom Lightweight "Web Service" |
|
|
170 | (1) |
|
PART III ADVANCED TECHNICAL CONSIDERATIONS |
|
|
171 | (92) |
|
|
173 | (12) |
|
|
174 | (1) |
|
|
175 | (1) |
|
|
176 | (4) |
|
|
177 | (1) |
|
|
178 | (1) |
|
|
178 | (1) |
|
|
179 | (1) |
|
Experimenting with the Spider |
|
|
180 | (1) |
|
|
181 | (1) |
|
|
181 | (4) |
|
|
181 | (1) |
|
Separate the Harvest and Payload |
|
|
182 | (1) |
|
Distribute Tasks Across Multiple Computers |
|
|
182 | (1) |
|
|
183 | (2) |
|
18 Procurement Webbots and Snipers |
|
|
185 | (8) |
|
Procurement Webbot Theory |
|
|
186 | (2) |
|
|
186 | (1) |
|
|
187 | (1) |
|
|
187 | (1) |
|
Evaluate Purchase Triggers |
|
|
187 | (1) |
|
|
187 | (1) |
|
|
188 | (1) |
|
|
188 | (3) |
|
|
188 | (1) |
|
|
189 | (1) |
|
|
189 | (1) |
|
|
189 | (2) |
|
|
191 | (1) |
|
|
191 | (1) |
|
|
191 | (1) |
|
Testing Your Own Webbots and Snipers |
|
|
191 | (1) |
|
|
191 | (1) |
|
|
192 | (1) |
|
19 Webbots and Cryptography |
|
|
193 | (4) |
|
Designing Webbots That Use Encryption |
|
|
194 | (1) |
|
SSL and PHP Built-in Functions |
|
|
194 | (1) |
|
|
194 | (1) |
|
A Quick Overview of Web Encryption |
|
|
195 | (1) |
|
|
196 | (1) |
|
|
197 | (12) |
|
|
197 | (2) |
|
Types of Online Authentication |
|
|
198 | (1) |
|
Strengthening Authentication by Combining Techniques |
|
|
198 | (1) |
|
Authentication and Webbots |
|
|
199 | (1) |
|
Example Scripts and Practice Pages |
|
|
199 | (1) |
|
|
199 | (3) |
|
|
202 | (5) |
|
Authentication with Cookie Sessions |
|
|
202 | (3) |
|
Authentication with Query Sessions |
|
|
205 | (2) |
|
|
207 | (2) |
|
21 Advanced Cookie Management |
|
|
209 | (6) |
|
|
209 | (2) |
|
|
211 | (1) |
|
How Cookies Challenge Webbot Design |
|
|
212 | (2) |
|
Purging Temporary Cookies |
|
|
212 | (1) |
|
Managing Multiple Users' Cookies |
|
|
213 | (1) |
|
|
214 | (1) |
|
22 Scheduling Webbots and Spiders |
|
|
215 | (12) |
|
Preparing Your Webbots to Run as Scheduled Tasks |
|
|
216 | (1) |
|
The Windows XP Task Scheduler |
|
|
216 | (4) |
|
Scheduling a Webbot to Run Daily |
|
|
217 | (1) |
|
|
218 | (2) |
|
The Windows 7 Task Scheduler |
|
|
220 | (3) |
|
Non-calendar-based Triggers |
|
|
223 | (2) |
|
|
225 | (2) |
|
Determine the Webbot's Best Periodicity |
|
|
225 | (1) |
|
Avoid Single Points of Failure |
|
|
225 | (1) |
|
Add Variety to Your Schedule |
|
|
225 | (2) |
|
23 Scraping Difficult Websites with Browser Macros |
|
|
227 | (12) |
|
Barriers to Effective Web Scraping |
|
|
229 | (1) |
|
|
229 | (1) |
|
Bizarre JavaScript and Cookie Behavior |
|
|
229 | (1) |
|
|
229 | (1) |
|
Overcoming Webscraping Barriers with Browser Macros |
|
|
230 | (7) |
|
|
230 | (1) |
|
The Ultimate Browser-Like Webbot |
|
|
230 | (1) |
|
Installing and Using iMacros |
|
|
230 | (1) |
|
Creating Your First Macro |
|
|
231 | (6) |
|
|
237 | (2) |
|
Are Macros Really Necessary? |
|
|
237 | (1) |
|
|
237 | (2) |
|
|
239 | (10) |
|
Hacking iMacros for Added Functionality |
|
|
240 | (7) |
|
Reasons for Not Using the iMacros Scripting Engine |
|
|
240 | (1) |
|
|
241 | (4) |
|
Launching iMacros Automatically |
|
|
245 | (2) |
|
|
247 | (2) |
|
25 Deployment and Scaling |
|
|
249 | (14) |
|
|
250 | (1) |
|
|
251 | (1) |
|
|
251 | (1) |
|
|
252 | (1) |
|
Scaling and Denial-of-Service Attacks |
|
|
252 | (1) |
|
Even Simple Webbots Can Generate a Lot of Traffic |
|
|
252 | (1) |
|
Inefficiencies at the Target |
|
|
252 | (1) |
|
The Problems with Scaling Too Well |
|
|
253 | (1) |
|
Creating Multiple Instances of a Webbot |
|
|
253 | (2) |
|
|
253 | (1) |
|
Leveraging the Operating System |
|
|
254 | (1) |
|
Distributing the Task over Multiple Computers |
|
|
254 | (1) |
|
|
255 | (7) |
|
Botnet Communication Methods |
|
|
255 | (7) |
|
|
262 | (1) |
|
PART IV LARGER CONSIDERATIONS |
|
|
263 | (64) |
|
26 Designing Stealthy Webbots and Spiders |
|
|
265 | (8) |
|
Why Design a Stealthy Webbot? |
|
|
265 | (4) |
|
|
266 | (3) |
|
|
269 | (1) |
|
Stealth Means Simulating Human Patterns |
|
|
269 | (1) |
|
Be Kind to Your Resources |
|
|
269 | (1) |
|
Run Your Webbot During Busy Hours |
|
|
270 | (1) |
|
Don't Run Your Webbot at the Same Time Each Day |
|
|
270 | (1) |
|
Don't Run Your Webbot on Holidays and Weekends |
|
|
270 | (1) |
|
Use Random, Intra-fetch Delays |
|
|
270 | (1) |
|
|
270 | (3) |
|
|
273 | (12) |
|
|
273 | (1) |
|
Proxies in the Virtual World |
|
|
274 | (1) |
|
Why Webbot Developers Use Proxies |
|
|
274 | (3) |
|
Using Proxies to Become Anonymous |
|
|
274 | (3) |
|
Using a Proxy to Be Somewhere Else |
|
|
277 | (1) |
|
|
277 | (1) |
|
Using a Proxy in a Browser |
|
|
278 | (1) |
|
Using a Proxy with PHP/CURL |
|
|
278 | (1) |
|
|
278 | (5) |
|
|
279 | (2) |
|
|
281 | (1) |
|
|
282 | (1) |
|
|
283 | (2) |
|
Anonymity Is a Process, Not a Feature |
|
|
283 | (1) |
|
Creating Your Own Proxy Service |
|
|
283 | (2) |
|
28 Writing Fault-Tolerant Webbots |
|
|
285 | (12) |
|
Types of Webbot Fault Tolerance |
|
|
286 | (9) |
|
Adapting to Changes in URLs |
|
|
286 | (5) |
|
Adapting to Changes in Page Content |
|
|
291 | (1) |
|
Adapting to Changes in Forms |
|
|
292 | (2) |
|
Adapting to Changes in Cookie Management |
|
|
294 | (1) |
|
Adapting to Network Outages and Network Congestion |
|
|
294 | (1) |
|
|
295 | (1) |
|
|
296 | (1) |
|
29 Designing Webbot-Friendly Websites |
|
|
297 | (12) |
|
Optimizing Web Pages for Search Engine Spiders |
|
|
297 | (3) |
|
|
298 | (1) |
|
Google Bombs and Spam Indexing |
|
|
298 | (1) |
|
|
298 | (1) |
|
|
299 | (1) |
|
|
299 | (1) |
|
|
300 | (1) |
|
Web Design Techniques That Hinder Search Engine Spiders |
|
|
300 | (1) |
|
|
300 | (1) |
|
|
301 | (1) |
|
Designing Data-Only Interfaces |
|
|
301 | (6) |
|
|
301 | (1) |
|
Lightweight Data Exchange |
|
|
302 | (3) |
|
|
305 | (1) |
|
|
306 | (1) |
|
|
307 | (2) |
|
|
309 | (8) |
|
|
310 | (2) |
|
Create a Terms of Service Agreement |
|
|
310 | (1) |
|
|
311 | (1) |
|
|
312 | (1) |
|
|
312 | (3) |
|
Selectively Allow Access to Specific Web Agents |
|
|
312 | (1) |
|
|
313 | (1) |
|
Use Cookies, Encryption, JavaScript, and Redirection |
|
|
313 | (1) |
|
|
314 | (1) |
|
|
314 | (1) |
|
Embed Text in Other Media |
|
|
314 | (1) |
|
|
315 | (1) |
|
|
315 | (1) |
|
Fun Things to Do with Unwanted Spiders |
|
|
316 | (1) |
|
|
316 | (1) |
|
31 Keeping Webbots Out of Trouble |
|
|
317 | (10) |
|
|
318 | (1) |
|
|
319 | (3) |
|
|
319 | (1) |
|
Don't Be an Armchair Lawyer |
|
|
319 | (3) |
|
|
322 | (2) |
|
|
324 | (1) |
|
|
325 | (2) |
|
|
327 | (10) |
|
Creating a Minimal PHP/CURL Session |
|
|
327 | (1) |
|
Initiating PHP/CURL Sessions |
|
|
328 | (1) |
|
|
328 | (5) |
|
|
329 | (1) |
|
|
329 | (1) |
|
|
329 | (1) |
|
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS |
|
|
329 | (1) |
|
|
330 | (1) |
|
CURLOPT_NOBODY and CURLOPT_HEADER |
|
|
330 | (1) |
|
|
331 | (1) |
|
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR |
|
|
331 | (1) |
|
|
331 | (1) |
|
|
332 | (1) |
|
CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH |
|
|
332 | (1) |
|
CURLOPT_POST and CURLOPT_POSTFIELDS |
|
|
332 | (1) |
|
|
333 | (1) |
|
|
333 | (1) |
|
Executing the PHP/CURL Command |
|
|
333 | (2) |
|
Retrieving PHP/CURL Session Information |
|
|
334 | (1) |
|
|
334 | (1) |
|
Closing PHP/CURL Sessions |
|
|
335 | (2) |
|
|
337 | (4) |
|
|
337 | (2) |
|
|
339 | (2) |
|
|
341 | (4) |
|
|
342 | (1) |
|
|
342 | (1) |
|
A Sampling of Text Message Email Addresses |
|
|
342 | (3) |
Index |
|
345 | |