Web Automation with LWP
LWP (Library for World Wide Web in Perl) is a set of Perl modules and object-oriented classes for getting data from the Web and for extracting information from HTML. Common tasks include fetching web pages, extracting information using regular expressions, and submitting forms.
A more sophisticated alternative to scraping HTML with LWP is to use web services, which return XML rather than HTML. Protocols such as SOAP and XML-RPC can make a remote web service appear to be a set of functions called from within your program.
To test whether you already have LWP installed:
% perl -MLWP -le "print(LWP->VERSION)"
There are two ways to install modules: using the CPAN shell or the old-fashioned manual way.
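To check several modules at once before choosing an installation method, a small script along these lines may help. It is only a sketch: the helper name module_version and the list of modules are illustrative assumptions, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# 'unknown' if it loads but defines no version, or undef if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    # Turn "LWP::UserAgent" into "LWP/UserAgent.pm" so require can load it by path.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return undef unless eval { require $file; 1 };
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

# Illustrative list; adjust it to the modules you actually need.
for my $module (qw(LWP LWP::Simple HTML::Parser HTML::TreeBuilder URI)) {
    my $version = module_version($module);
    printf "%-20s %s\n", $module, defined $version ? $version : 'not installed';
}
```

Any module reported as "not installed" is a candidate for one of the installation methods described next.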
Install LWP from the CPAN Shell
If you have never used the CPAN shell, you will need to configure it before you can use it. Invoke the CPAN shell...
% perl -MCPAN -e shell
If you've never run it before, you'll see this:
We have to reconfigure CPAN.pm due to following uninitialized parameters:
...followed by a number of questions, for most of which you can use the default setting.
To install LWP, run...
cpan> install Bundle::LWP
...which installs the libwww-perl distribution along with the URI and HTML-Parser distributions.
To install the HTML-Tree distribution:
cpan> install HTML::Tree
If CPAN does not work for you, you can install modules manually. Here are the modules that you will want to install...
Distribution     CPAN directory
MIME-Base64      authors/id/G/GA/GAAS
libnet           authors/id/G/GB/GBAAR
HTML-Tagset      authors/id/S/SBURKE
HTML-Parser      authors/id/G/GA/GAAS
URI              authors/id/G/GA/GAAS/URI
Compress-Zlib    authors/id/P/PM/PMQS/Compress-Zlib
Digest-MD5       authors/id/G/GA/GAAS/Digest-MD5
libwww-perl      authors/id/G/GA/GAAS/libwww-perl
HTML-Tree        authors/id/S/SB/SBURKE/HTML-Tree
You can fetch these modules individually from sites listed at www.cpan.org/SITES.html.
To install gzipped tar distributions...
% tar xzf distribution.tar.gz
% cd distribution
% perl Makefile.PL
% make
% make test
% make install
LWP Examples
Count words on page
Here is an example of how to fetch the skywayradio.com home page and count the number of times Unix is mentioned.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $catalog = get("http://www.skywayradio.com/index.html");
die "Could not fetch the page\n" unless defined $catalog;

# Count case-insensitive matches of "unix"
my $count = 0;
$count++ while $catalog =~ m{unix}gi;
print "$count\n";

The LWP::Simple module's get() function returns the document at a given URL, or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.
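The counting step can also be isolated in a function and tried without a network connection. The following is a minimal sketch; the name count_word and the sample text are assumptions made for illustration. It uses Perl's list-assignment counting idiom instead of the loop, and it passes undef through so that a failed get() is not mistaken for a count of zero.

```perl
use strict;
use warnings;

# Hypothetical helper: count case-insensitive occurrences of a literal word.
# Returns undef when the document itself is undef (for example, a failed get()).
sub count_word {
    my ($text, $word) = @_;
    return undef unless defined $text;
    # \Q...\E quotes any regex metacharacters in $word.
    # Assigning the match list to () and then to a scalar yields the match count.
    my $count = () = $text =~ /\Q$word\E/gi;
    return $count;
}

# Sample text stands in for a fetched page.
my $sample = "Unix tools, UNIX history, and unix shells.";
print count_word($sample, "unix"), "\n";   # prints 3
```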
Print header information
Here is an example that prints the header information:
#!/usr/bin/perl -w
use strict;
use LWP;

my $browser  = LWP::UserAgent->new;
my $response = $browser->get("http://www.skywayradio.com/");
print $response->header("Server"), "\n";
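A server is not obliged to send any particular header, so header() can return undef. The sketch below shows one way to guard against that. To stay runnable offline it builds an HTTP::Response by hand rather than fetching one; the header values and the helper name describe are invented for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand instead of fetching one, so the sketch runs offline.
# The header values here are made up.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Server' => 'ExampleServer/1.0', 'Content-Type' => 'text/html' ],
    '<html></html>',
);

# Hypothetical helper: summarize a response as its status line plus one header,
# with a fallback for headers the server did not send.
sub describe {
    my ($res, $name) = @_;
    my $value = $res->header($name);
    $value = '(not sent)' unless defined $value;
    return $res->status_line . " / $name: $value";
}

print describe($response, 'Server'), "\n";
print describe($response, 'X-Powered-By'), "\n";
```

The same helper should work unchanged on a response returned by LWP::UserAgent's get(), since that is also an HTTP::Response object.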
POST submission to a batch job servlet
The LWP::UserAgent object $browser makes requests of a server and creates HTTP::Response objects such as $response to represent the server's reply. Here is an example, postbatch.pl, that uses LWP::UserAgent to post a submission to a batch job servlet.
#!/usr/bin/perl -w
###
### postbatch.pl - Send post submission to batch job servlet
###
use strict;
use LWP;

my $jobCode = $ARGV[0] or die "Usage: postbatch.pl jobCode\n";
$jobCode = uc $jobCode;

### 2 to 7 letters or digits; no all-digit jobCode
die "$jobCode is invalid.\n"
    unless $jobCode =~ m/^[A-Z0-9]{2,7}$/ and $jobCode !~ m/^\d+$/;

my $browser = LWP::UserAgent->new;

### customRealm is WebSphere 5.x
$browser->credentials(
    'wstest2.domain.net:9080',
    'customRealm',
    'user' => 'password'
);

my $response = $browser->post(
    'http://wstest2.domain.net:9080/buyer-jobexec/jobExec',
    [ 'jobCode' => $jobCode ],
);

### Show the servlet's reply, then stop on an HTTP-level failure
print $response->content;
die "Error: ", $response->status_line unless $response->is_success;

if ($response->content =~ m/JobExecId/) {
    print "$jobCode submitted.\n";
}
elsif ($response->content =~ m/Could not load a Job Setup record/) {
    print "Could not load a Job Setup record for $jobCode\n";
}
else {
    print "Unable to submit $jobCode\n";
}
exit;

Usage:
% ./postbatch.pl IMP
JobExecId: "167"
IMP submitted.
% ./postbatch.pl XYZ
Could not load a Job Setup record for XYZ
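The job code check in postbatch.pl can be pulled out into a function and exercised without contacting the servlet. This is a sketch only: the name normalize_job_code and the sample inputs are assumptions. It applies the same rules (2 to 7 characters from A-Z and 0-9, not all digits) but anchors with \A and \z, which is slightly stricter than ^ and $ because it also rejects a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: upper-case a job code and return it,
# or return undef if it does not satisfy the postbatch.pl rules.
sub normalize_job_code {
    my ($code) = @_;
    return undef unless defined $code;
    $code = uc $code;
    return undef unless $code =~ /\A[A-Z0-9]{2,7}\z/;   # 2 to 7 letters or digits
    return undef if $code =~ /\A\d+\z/;                 # reject all-digit codes
    return $code;
}

# Sample inputs chosen to hit each rule.
for my $candidate (qw(imp XYZ 12345 A toolongcode)) {
    my $code = normalize_job_code($candidate);
    print "$candidate => ", (defined $code ? $code : 'invalid'), "\n";
}
```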
Extract image locations
Here is some code that will extract image locations...
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $html = get("http://www.skywayradio.com/");
die "Could not fetch the page\n" unless defined $html;

my $stream = HTML::TokeParser->new(\$html);
my %image  = ();    # holds the images we find

### get_token returns an array reference.
###
### If the first array element is S, it is a
### token representing the start of a tag.
###
### The second array element is the type of tag.
### The third array element is a hash reference
### mapping attribute to value.
while (my $token = $stream->get_token) {
    if ($token->[0] eq 'S' && $token->[1] eq 'img') {
        # store src value in %image, skipping tags without one
        my $src = $token->[2]{'src'};
        $image{$src}++ if defined $src;
    }
}

foreach my $pic (sort keys %image) {
    print "$pic\n";
}

To extract image locations with a tree:
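#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get("http://www.skywayradio.com/");
die "Could not fetch the page\n" unless defined $html;

my $root = HTML::TreeBuilder->new_from_content($html);
my %images;

# In list context, find_by_tag_name returns every matching element
foreach my $node ($root->find_by_tag_name('img')) {
    my $src = $node->attr('src');
    $images{$src}++ if defined $src;
}

foreach my $pic (sort keys %images) {
    print "$pic\n";
}

$root->delete;    # free the tree when done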
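Both scripts print src values exactly as they appear in the page, which are often relative. If absolute locations are wanted, one possible approach is to resolve each value against the page's URL with the URI module. The sketch below works on a local HTML string so it can run offline; the helper name image_urls, the sample markup, and the example.com base URL are all assumptions. It uses HTML::TokeParser's get_tag method to skip directly to img tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: return the sorted, de-duplicated absolute URLs
# of all <img> tags in $html, resolved against $base.
sub image_urls {
    my ($html, $base) = @_;
    my $stream = HTML::TokeParser->new(\$html);
    my %seen;
    # get_tag('img') skips ahead to the next <img> start tag;
    # element 1 of the returned array reference is the attribute hash.
    while (my $tag = $stream->get_tag('img')) {
        my $src = $tag->[1]{src};
        next unless defined $src && length $src;   # ignore tags without a src
        $seen{ URI->new_abs($src, $base)->as_string }++;
    }
    return sort keys %seen;
}

# Sample markup and base URL are made up for the example.
my $html = '<p><img src="/logo.gif"><img src="pics/a.jpg">'
         . '<img alt="no src"><img src="/logo.gif"></p>';
print "$_\n" for image_urls($html, 'http://www.example.com/dir/index.html');
```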