Web Automation with LWP

 

LWP (Library for World Wide Web in Perl) is a set of Perl modules and object-oriented classes for getting data from the Web and for extracting information from HTML. Common tasks include fetching web pages, extracting information using regular expressions, and submitting forms.

A more sophisticated alternative to fetching and parsing HTML with LWP is to use web services, which emit XML rather than HTML. Protocols such as SOAP and XML-RPC can make a remote web service appear to be a set of functions called from within your program.

To test whether you already have LWP installed:

% perl -MLWP -le "print(LWP->VERSION)"
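The one-liner above checks a single module. As a rough sketch, the same check can be run over several modules at once. The helper name module_version and the list of modules checked are illustrative choices, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the installed version of a module, or undef if it cannot be loaded.
# The name is validated first because it is interpolated into a string eval.
sub module_version {
    my ($module) = @_;
    return undef unless $module =~ /\A\w+(?:::\w+)*\z/;
    return undef unless eval "require $module; 1";
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

# Illustrative list: the main modules used by the examples in this document.
for my $module (qw(LWP URI HTML::Parser HTML::TreeBuilder)) {
    my $version = module_version($module);
    printf "%-20s %s\n", $module,
        defined $version ? $version : 'not installed';
}
```

Any module reported as not installed can then be installed with one of the two methods described next.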

There are two ways to install modules: using the CPAN shell or the old-fashioned manual way.

 

Install LWP from the CPAN Shell

If you have never used the CPAN shell, you will need to configure it before you can use it. Invoke the CPAN shell...

% perl -MCPAN -eshell

If you've never run it before, you'll see this:

We have to reconfigure CPAN.pm due to following uninitialized parameters:

...followed by a number of questions, for most of which you can use the default setting.

To install LWP, run...

cpan> install Bundle::LWP

...which installs the libwww-perl distribution along with the URI and HTML-Parser distributions.

To install the HTML-Tree distribution:

cpan> install HTML::Tree

If CPAN does not work for you, you can install modules manually. Here are the modules that you will want to install...

Distribution     CPAN directory
MIME-Base64      authors/id/G/GA/GAAS
libnet           authors/id/G/GB/GBAAR
HTML-Tagset      authors/id/S/SB/SBURKE
HTML-Parser      authors/id/G/GA/GAAS
URI              authors/id/G/GA/GAAS/URI
Compress-Zlib    authors/id/P/PM/PMQS/Compress-Zlib
Digest-MD5       authors/id/G/GA/GAAS/Digest-MD5
libwww-perl      authors/id/G/GA/GAAS/libwww-perl
HTML-Tree        authors/id/S/SB/SBURKE/HTML-Tree

You can fetch these modules individually from sites listed at www.cpan.org/SITES.html.

To install a gzipped tar distribution, unpack it, then build, test, and install it:

% tar xzf distribution.tar.gz
% cd distribution
% perl Makefile.PL
% make
% make test
% make install

 


LWP Examples

 

Count words on page

Here is an example of how to fetch the skywayradio.com home page and count the number of times Unix is mentioned.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $catalog = get("http://www.skywayradio.com/index.html");
die "Couldn't fetch the page\n" unless defined $catalog;

my $count = 0;
$count++ while $catalog =~ m{unix}gi;

print "$count\n";

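The LWP::Simple module's get( ) function returns the document at a given URL, or undef if an error occurred. A regular expression match with the /g flag in a while loop advances through the page one match at a time, so each iteration counts one occurrence; the /i flag makes the match case-insensitive.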
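Here is a minimal sketch of the same counting technique that can be tried without network access. The helper name count_occurrences and the sample string are made up for illustration; in the script above, the text would come from get( ). The helper treats an undefined string, which is what get( ) returns on failure, as containing zero matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count case-insensitive occurrences of a literal word in a string.
# An undefined string (what get() returns on failure) counts as zero.
sub count_occurrences {
    my ($text, $word) = @_;
    return 0 unless defined $text;
    # Assigning the list of matches to an empty list yields the match count.
    my $count = () = $text =~ /\Q$word\E/gi;
    return $count;
}

# Hypothetical sample text standing in for a fetched page.
my $sample = "Unix tips. Learn UNIX today; unix is everywhere.";
print count_occurrences($sample, "unix"), "\n";   # prints 3
print count_occurrences(undef,   "unix"), "\n";   # prints 0
```

The \Q...\E escapes keep any regex metacharacters in the search word from being interpreted as part of the pattern.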

 

Print header information

Here is an example that fetches a page with LWP::UserAgent and prints the Server header from the response:

#!/usr/bin/perl -w
use strict;
use LWP;
  
my $browser = LWP::UserAgent->new( );
my $response = $browser->get("http://www.skywayradio.com/");
print $response->header("Server"), "\n";

 

POST submission to a batch job servlet

The LWP::UserAgent object $browser sends requests to a server and returns HTTP::Response objects such as $response that represent the server's reply. Here is an example, postbatch.pl, that uses LWP::UserAgent to post a submission to a batch job servlet.

#!/usr/bin/perl -w
###
### postbatch.pl -  Send post submission to batch job servlet
###

use strict;
use LWP;

my $jobCode = $ARGV[0] || die "Usage: $0 jobCode\n";
$jobCode = uc $jobCode;

### No all-digit jobCode
die "$jobCode is invalid.\n"
  unless $jobCode =~ m/^[A-Z0-9]{2,7}$/
     and $jobCode !~ m/^\d+$/;

my $browser = LWP::UserAgent->new;


### customRealm is WebSphere 5.x
$browser->credentials(
  'wstest2.domain.net:9080', 
  'customRealm', 
  'user' => 'password'
);

my $response = $browser->post(
  'http://wstest2.domain.net:9080/buyer-jobexec/jobExec',
  [
    'jobCode'  => $jobCode,
  ],
);

### Check for an HTTP error before printing the reply
die "Error: ", $response->status_line, "\n"
 unless $response->is_success;

print $response->content;

if($response->content =~ m/JobExecId/)
{
  print "$jobCode submitted.\n";
}
elsif($response->content =~ m/Could not load a Job Setup record/)
{
  print "Could not load a Job Setup record for $jobCode\n";
}
else
{
  print "Unable to submit $jobCode\n";
}
exit;

Usage:

% ./postbatch.pl IMP
JobExecId: "167"
IMP submitted.
% ./postbatch.pl XYZ
Could not load a Job Setup record for XYZ
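The job code check in postbatch.pl can be tried on its own. The sketch below factors it into a helper; the name normalize_job_code and the sample inputs are hypothetical. It applies the same rule as the script (2 to 7 letters or digits, not all digits), but anchors with \A and \z instead of ^ and $, so a trailing newline is also rejected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the normalized (upper-case) job code, or undef if it is invalid.
# Valid codes are 2 to 7 letters or digits and are not all digits.
sub normalize_job_code {
    my ($code) = @_;
    return undef unless defined $code;
    $code = uc $code;
    return undef unless $code =~ /\A[A-Z0-9]{2,7}\z/;
    return undef if $code =~ /\A\d+\z/;
    return $code;
}

# Hypothetical inputs covering the valid and invalid cases.
for my $candidate (qw(imp 12345 A toolongcode x9)) {
    my $code = normalize_job_code($candidate);
    print defined($code)
        ? "$candidate -> $code\n"
        : "$candidate is invalid\n";
}
```

Keeping the check in one place makes it easy to reuse if more scripts post to the same servlet.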

 

Extract image locations

Here is some code that extracts image locations using HTML::TokeParser, which reads the page as a stream of tokens:

#!/usr/bin/perl -w
  
use strict;
use LWP::Simple;
use HTML::TokeParser;
  
my $html   = get("http://www.skywayradio.com/");
die "Couldn't fetch the page\n" unless defined $html;

my $stream = HTML::TokeParser->new(\$html);
my %image  = ( );

### get_token returns an array reference for each token.
###
### If the first array element is S, the token
### represents the start of a tag.
###
### The second array element is the tag name.
### The third array element is a hash reference
### mapping attribute names to values.
### The %image hash holds the images we find.
###

while (my $token = $stream->get_token) 
{
    if ($token->[0] eq 'S' && $token->[1] eq 'img') 
    {
        # store src value in %image, skipping img tags without one
        my $src = $token->[2]{'src'};
        $image{$src}++ if defined $src;
    }
}
  
foreach my $pic (sort keys %image) {
    print "$pic\n";
}
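To see the token structure without fetching a page, here is a small sketch that runs HTML::TokeParser over an inline string. The helper name image_sources and the sample HTML are made up for illustration. It assumes the HTML-Parser distribution is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the src value of every img start tag, in document order.
# img tags without a src attribute are skipped.
sub image_sources {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML\n";
    my @sources;
    while (my $token = $stream->get_token) {
        # Start-tag tokens look like [ 'S', $tag, \%attr, ... ].
        my ($type, $tag, $attr) = @$token;
        next unless $type eq 'S' && $tag eq 'img';
        push @sources, $attr->{src} if defined $attr->{src};
    }
    return @sources;
}

# Hypothetical sample markup; tag and attribute names are
# lowercased by the parser, so <IMG SRC=...> is matched too.
my $sample = '<p><img src="logo.png" alt="Logo">'
           . '<img alt="no source"><IMG SRC="/pics/map.gif"></p>';
print "$_\n" for image_sources($sample);
```

With this sample, the sketch prints logo.png and /pics/map.gif, one per line.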

To extract image locations with a tree instead, use HTML::TreeBuilder, which parses the whole page into a tree of elements:

#!/usr/bin/perl -w
  
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
  
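my $html = get("http://www.skywayradio.com/");
die "Couldn't fetch the page\n" unless defined $html;

my $root = HTML::TreeBuilder->new_from_content($html);
my %images;
foreach my $node ($root->find_by_tag_name('img')) 
{
    # skip img elements that have no src attribute
    my $src = $node->attr('src');
    $images{$src}++ if defined $src;
}
$root->delete;   # free the tree's memory when done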
  
foreach my $pic (sort keys %images) 
{
    print "$pic\n";
}

 

See Also

  1. http://search.cpan.org
  2. http://kobesearch.cpan.org
  3. http://www.cpan.org/misc/cpan-faq.html
