Creating a Web Crawler in 3 Steps
Issac Goldstand, Mirimar Networks
The 3 steps
- Creating the user agent
- Creating the content parser
- Tying it together
Step 1 – Creating the User Agent
- libwww-perl (LWP): an OO interface for creating user agents that interact with remote websites and web applications
- We will look at LWP::RobotUA
Creating the LWP Object
- User agent string
- Cookie jar
- Timeout
- (a minimal setup is sketched below)
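A minimal sketch of setting those three options on a plain LWP::UserAgent (LWP::RobotUA, used in the next step, is a subclass and accepts the same calls; the agent string and cookie file name are placeholders):

use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;

$ua->agent('MyBot/1.0');                       # User agent string (placeholder)
$ua->cookie_jar(HTTP::Cookies->new(            # Cookie jar, saved between runs
    file     => 'cookies.txt',                 #   placeholder filename
    autosave => 1,
));
$ua->timeout(30);                              # Give up on a request after 30 seconds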
Robot UA extras
- Robot rules (robots.txt handling)
- delay
- use_sleep
- (one possible setup is sketched below)
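One possible setup, assuming you want to hand LWP::RobotUA an explicit rules object instead of its default in-memory one (the agent name and contact address are placeholders):

use LWP::RobotUA;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('MyBot/1.0');   # robots.txt rules store
my $ua    = LWP::RobotUA->new('MyBot/1.0', 'webmaster@example.com', $rules);

$ua->delay(1);        # Wait at least 1 minute between requests to the same host
$ua->use_sleep(1);    # sleep() if asked to fetch before the delay has passed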
Implementation of Step 1

use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
# (the second argument is the contact e-mail; this one is a placeholder)
my $ua = LWP::RobotUA->new('MyBot/1.0', 'webmaster@example.com');

$ua->delay(15/60);    # 15 seconds delay
$ua->use_sleep(1);    # Sleep if delayed
Step 2 – Creating the content parser
- HTML::Parser: an event-driven parser mechanism
- OO and function-oriented interfaces
- Hooks into your functions at certain points (start tags, text, end tags, ...)
- (a handler-style sketch follows below)
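As a quick taste of the function-oriented interface (the handler name report_start is my own), a sketch that prints the target of every link it sees:

use HTML::Parser;

sub report_start {                             # Called once per start tag
    my ($tagname, $attr) = @_;
    print "link: $attr->{href}\n" if $tagname eq 'a' && exists $attr->{href};
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&report_start, 'tagname,attr'],   # hook + argument spec
);

$parser->parse('<a href="http://example.com/">example</a>');
$parser->eof;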
Subclassing HTML::Parser
- The biggest issue is non-persistence: nothing is carried over between handler calls for you
- CGI authors may be used to this, but it still makes for many caveats
- You must implement your own state-preservation mechanism
Implementation of Step 2

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple constants
use constant GOT_NAME => 1;

sub state {                        # Simple access methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {                        # Parser hook for start tags
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
Shortcut
- HTML::SimpleLinkExtor: a simple package to extract links from HTML
- It handles many kinds of links – we only want HREF-type links
- (usage sketched below)
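Roughly how that looks in use ($html stands in for page content fetched earlier):

use HTML::SimpleLinkExtor;

my $linkex = HTML::SimpleLinkExtor->new;
$linkex->parse($html);                 # $html: page content fetched earlier (assumed)

my @all_links  = $linkex->links;       # every kind of link (img src, frames, ...)
my @href_links = $linkex->href;        # only HREF-type links - what the crawler wants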
Step 3 – Tying it together
- A simple application:
- Instantiate the objects
- Enter the request loop
- Spit the data out somewhere
- Add parsed links to the queue
Implementation of Step 3

for (my $i = 0; $i < 10; $i++) {                # Parse loop
    last if $i > $#urls;                        # Stop if the queue runs dry
    my $response = $ua->get($urls[$i]);         # Get HTTP response
    if ($response->is_success) {                # If response is OK
        $p->reset;
        $p->parse($response->content);          # Parse for author
        $p->eof;
        if ($p->state == 1) {                   # If state is GOT_NAME
            $authors{$p->author}++;             #   then add author count
        } else {
            $authors{'Not Specified'}++;        #   otherwise add default count
        }
        $linkex->parse($response->content);     # Parse for links
        push @urls, $linkex->href;              # and add links to queue
    }
}
End result

#!/usr/bin/perl
use strict;

use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;                                       # List of URLs to visit
my %authors;

# First, create & setup the user agent
# (the contact e-mail is a placeholder)
my $ua = LWP::RobotUA->new('MyBot/1.0', 'webmaster@example.com');
$ua->delay(15/60);                              # 15 seconds delay
$ua->use_sleep(1);                              # Sleep if delayed

my $p      = My::LinkParser->new;               # Create parsers
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://example.com/";               # Initialize list of URLs (placeholder seed)
End result (cont)

for (my $i = 0; $i < 10; $i++) {                # Parse loop
    last if $i > $#urls;                        # Stop if the queue runs dry
    my $response = $ua->get($urls[$i]);         # Get HTTP response
    if ($response->is_success) {                # If response is OK
        $p->reset;
        $p->parse($response->content);          # Parse for author
        $p->eof;
        if ($p->state == 1) {                   # If state is GOT_NAME
            $authors{$p->author}++;             #   then add author count
        } else {
            $authors{'Not Specified'}++;        #   otherwise add default count
        }
        $linkex->parse($response->content);     # Parse for links
        push @urls, $linkex->href;              # and add links to queue
    }
}

print "Results:\n";                             # Print results
map { print "$_\t$authors{$_}\n" } keys %authors;
End result (cont)

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple constants
use constant GOT_NAME => 1;

sub state {                        # Simple access methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}
End result (cont)

sub start {                        # Parser hook for start tags
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
What's missing?
- Full URLs for relative links (see the sketch below)
- Non-HTTP links
- Queues & caches
- Persistent storage
- Link (and data) validation
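For the first item, one common approach is the URI module's new_abs; the URLs below are placeholders:

use URI;

my $base     = 'http://example.com/docs/index.html';   # page the link was found on
my $relative = '../images/logo.png';                    # raw HREF from the page

my $absolute = URI->new_abs($relative, $base);          # resolve against the base
print "$absolute\n";                                    # http://example.com/images/logo.png

# Skipping non-HTTP links is then a one-liner:
# next unless $absolute->scheme eq 'http';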
In review
- Create a robot user agent to crawl websites nicely
- Create parsers to extract data from sites, and links to the next sites
- Create a simple program to work through a queue of URLs
Thank you!
For more information: Issac Goldstand