Perl & LWP: Fetching Web Pages, Parsing HTML, Writing Spiders & More

"O'Reilly Media, Inc.", Jun 20, 2002 - Computers - 262 pages

Perl soared to popularity as a language for creating and managing web content, but with LWP (Library for WWW in Perl), Perl is equally adept at consuming information on the Web. LWP is a suite of modules for fetching and processing web pages.

The Web is a vast data source that contains everything from stock prices to movie credits, and with LWP all that data is just a few lines of code away. Anything you do on the Web, whether it's buying or selling, reading or writing, uploading or downloading, from news to e-commerce, can be controlled with Perl and LWP. You can automate Web-based purchase orders as easily as you can set up a program to download MP3 files from a web site.

Perl & LWP covers:

  • Understanding LWP and its design
  • Fetching and analyzing URLs
  • Extracting information from HTML using regular expressions and tokens
  • Working with the structure of HTML documents using trees
  • Setting and inspecting HTTP headers and response codes
  • Managing cookies
  • Accessing information that requires authentication
  • Extracting links
  • Cooperating with proxy caches
  • Writing web spiders (also known as robots) in a safe fashion

Perl & LWP includes many step-by-step examples that show how to apply the various techniques. Programs to extract information from the web sites of BBC News, AltaVista, ABEBooks.com, and the Weather Underground, to name just a few, are explained in detail, so that you understand how and why they work.

Perl programmers who want to automate and mine the web can pick up this book and be immediately productive. Written by a contributor to LWP, and with a foreword by one of LWP's creators, Perl & LWP is the authoritative guide to this powerful and popular toolkit.
 

Contents

Introduction to Web Automation - 1
Web Basics - 15
The LWP Class Model - 31
URLs - 48
Forms - 58
Simple HTML Processing with Regular Expressions - 85
HTML Processing with Tokens - 100
Tokenizing Walkthrough - 119
HTML Processing with Trees - 132
Modifying HTML with Trees - 148
Cookies, Authentication, and Advanced Requests - 165
Spiders - 178
LWP Modules - 199
HTTP Status Codes - 203
Common MIME Types - 205
Language Tags - 207
Common Content Encodings - 209
ASCII Table - 211
User's View of Object-Oriented Modules - 224
Index - 235

Popular passages

Page xii - Conventions Used in This Book The following typographic conventions are used in this book: • Code lines, commands, statements, variables, and any text you type or see onscreen appears in a mono typeface.
Page 6 - Files found in blib/arch: installing files in blib/lib into architecture dependent library tree Installing /Library/Perl/5.8.1/darwin-thread-multi-2level/Data/Dumper.pm Writing ///Library/Perl/5.8.1/darwin-thread-multi-2level/auto/Data/Dumper/.packlist Appending installation info to ///System/Library/Perl/ 5.