Creating a Web Crawler in 3 Steps

Issac Goldstand
isaac@cpan.org
Mirimar Networks
http://www.mirimar.net/
The 3 steps
1. Creating the User Agent
2. Creating the content parser
3. Tying it together
Step 1 – Creating the User Agent
- libwww-perl (LWP): an OO interface for creating user agents that interact with remote websites and web applications (a basic request sketch follows below)
- We will look at LWP::RobotUA
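Before the robot-specific pieces, here is a minimal sketch of the basic LWP request cycle; the URL is a placeholder, and LWP::RobotUA (shown on the following slides) is used the same way:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;                        # plain (non-robot) user agent
my $response = $ua->get('http://www.example.com/');  # fetch a page
print $response->status_line, "\n";                  # e.g. "200 OK"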
Creating the LWP Object
Attributes to set (see the sketch below):
- User agent
- Cookie jar
- Timeout
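A minimal sketch of setting these three attributes on an LWP::UserAgent; the cookie file name and timeout value are arbitrary placeholders:

use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
$ua->agent('MyBot/1.0');              # user agent identification string
$ua->cookie_jar(HTTP::Cookies->new(   # cookie jar, persisted to disk
    file     => 'cookies.txt',
    autosave => 1,
));
$ua->timeout(30);                     # request timeout, in seconds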
Robot UA extras
What LWP::RobotUA adds on top of LWP::UserAgent (see the sketch below):
- Robot rules (robots.txt handling)
- Delay
- use_sleep
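delay and use_sleep are shown on the next slide. For robot rules, LWP::RobotUA fetches and honors each site's robots.txt automatically; as a sketch, you can also hand it a persistent rules cache (the database file name here is a placeholder):

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;   # robots.txt rules cached in a DBM file

my $rules = WWW::RobotRules::AnyDBM_File->new('MyBot/1.0', 'rules.db');
my $ua    = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org', $rules);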
Implementation of Step 1

use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
my $ua = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org');

$ua->delay(15/60);    # 15 seconds delay (delay is given in minutes)
$ua->use_sleep(1);    # Sleep if delayed
Step 2 – Creating the content parser
- HTML::Parser
- Event-driven parser mechanism
- OO and function-oriented interfaces
- Hooks that call your functions at certain points during parsing (see the sketch below)
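A minimal sketch of the function-oriented, event-driven interface (api_version 3 handler style); the handlers here just print what they see:

use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { print "start tag: $_[0]\n" }, 'tagname' ],
    text_h  => [ sub { print "text: $_[0]\n"      }, 'dtext'   ],
);
$p->parse('<p>Hello, world</p>');
$p->eof;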
Subclassing HTML::Parser
- The biggest issue is non-persistence: the parser carries no state of its own between events
- CGI authors may be used to this, but it still makes for many caveats
- You must implement your own state-preservation mechanism (shown on the next slides)
Implementation of Step 2

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple state constants
use constant GOT_NAME => 1;

sub state {                        # Simple accessor methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {                        # Parser hook, called on each start tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
Shortcut
- HTML::SimpleLinkExtor: a simple package to extract links from HTML
- It handles many link types – we only want HREF-type links (see the sketch below)
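A minimal sketch of the shortcut; assume $html holds the HTML of a fetched page:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);     # $html is assumed to hold fetched HTML
my @hrefs = $extor->a;    # just the <a href="..."> links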
Step 3 – Tying it together
A simple application:
- Instantiate the objects
- Enter the request loop
- Spit the data out somewhere
- Add parsed links to the queue
Implementation of Step 3

for (my $i = 0; $i < 10; $i++) {               # Parse loop
    my $response = $ua->get(pop @urls);        # Get HTTP response
    if ($response->is_success) {               # If response is OK
        $p->reset;
        $p->parse($response->content);         # Parse for author
        $p->eof;
        if ($p->state == 1) {                  # If state is GOT_NAME
            $authors{$p->author}++;            # then add to the author count
        } else {
            $authors{'Not Specified'}++;       # otherwise add to the default count
        }
        $linkex->parse($response->content);    # Parse for links
        unshift @urls, $linkex->a;             # and add links to the queue
    }
}
End result

#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;       # List of URLs to visit
my %authors;    # Author counts

# First, create & set up the user agent
my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);    # 15 seconds delay
$ua->use_sleep(1);    # Sleep if delayed

my $p      = My::LinkParser->new;    # Create parsers
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://www.beamartyr.net/";    # Initialize list of URLs
End result (cont)

for (my $i = 0; $i < 10; $i++) {               # Parse loop
    my $response = $ua->get(pop @urls);        # Get HTTP response
    if ($response->is_success) {               # If response is OK
        $p->reset;
        $p->parse($response->content);         # Parse for author
        $p->eof;
        if ($p->state == 1) {                  # If state is GOT_NAME
            $authors{$p->author}++;            # then add to the author count
        } else {
            $authors{'Not Specified'}++;       # otherwise add to the default count
        }
        $linkex->parse($response->content);    # Parse for links
        unshift @urls, $linkex->a;             # and add links to the queue
    }
}

print "Results:\n";                            # Print results
print "$_\t$authors{$_}\n" for keys %authors;
End result (cont)

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple state constants
use constant GOT_NAME => 1;

sub state {                        # Simple accessor methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}
End result (cont)

sub start {                        # Parser hook, called on each start tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
What’s missing?
- Full URLs for relative links (see the sketch below)
- Non-HTTP links
- Queues & caches
- Persistent storage
- Link (and data) validation
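For the first gap, a sketch of resolving a relative link against its page's URL with the core URI module; both URLs here are placeholders:

use URI;

my $base = 'http://www.example.com/dir/page.html';  # page the link came from
my $abs  = URI->new_abs('../other.html', $base);    # resolve the relative link
print "$abs\n";                                     # http://www.example.com/other.html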
In review
- Create a robot user agent to crawl websites politely
- Create parsers to extract data from sites, and links to the next sites
- Create a simple program to work through a queue of URLs
Thank you!

For more information:
Issac Goldstand
isaac@cpan.org
http://www.beamartyr.net/
http://www.mirimar.net/