
1 Creating a Web Crawler in 3 Steps Issac Goldstand isaac@cpan.org Mirimar Networks http://www.mirimar.net/

2 The 3 steps Creating the User Agent Creating the content parser Tying it together

3 Step 1 – Creating the User Agent libwww-perl (LWP) An OO interface for creating user agents that interact with remote websites and web applications We will look at LWP::RobotUA

4 Creating the LWP Object User agent Cookie jar Timeout
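Those three settings can be sketched with LWP::UserAgent directly; the agent name and timeout value below are placeholders, not from the talk:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
$ua->agent('MyBot/1.0');               # User agent string sent to servers
$ua->cookie_jar(HTTP::Cookies->new);   # In-memory cookie jar
$ua->timeout(30);                      # Give up on a request after 30 seconds

print $ua->agent, "\n";                # MyBot/1.0
```

The same options can also be passed as key/value pairs to `new` (e.g. `LWP::UserAgent->new(timeout => 30)`); the setter form is used here only to mirror the slide's three bullets.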

5 Robot UA extras Robot rules Delay use_sleep

6 Implementation of Step 1

use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
my $ua = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);     # 15 seconds delay
$ua->use_sleep(1);     # Sleep if delayed

7 Step 2 – Creating the content parser HTML::Parser Event-driven parser mechanism OO and function oriented interfaces Hooks to functions at certain points
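As a sketch of the event-driven mechanism using the function-oriented hooks (the handler body and sample HTML here are illustrative, not from the talk):

```perl
use strict;
use warnings;
use HTML::Parser;

my @authors_found;

# Register a start-tag hook: the parser calls it for every opening tag,
# passing the arguments named in the argspec string ('tagname, attr').
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub {
        my ($tagname, $attr) = @_;
        push @authors_found, $attr->{content}
            if $tagname eq 'meta' && lc($attr->{name} // '') eq 'author';
    }, 'tagname, attr' ],
);

$p->parse('<html><head><meta name="author" content="Jane Doe"></head></html>');
$p->eof;

print "@authors_found\n";   # Jane Doe
```

The subclassing approach on the next slides does the same thing by overriding the `start` method instead of registering a handler.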

8 Subclassing HTML::Parser Biggest issue is non-persistence CGI authors may be used to this, but it still introduces many caveats You must implement your own state-preservation mechanism

9 Implementation of Step 2

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple constants
use constant GOT_NAME => 1;

sub state  { return $_[0]->{STATE};  }   # Simple access methods
sub author { return $_[0]->{AUTHOR}; }

10 Implementation of Step 2 (cont)

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {                        # Parser hook
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}

11 Shortcut HTML::SimpleLinkExtor Simple package to extract links from HTML Handles many link types – we only want HREF type links
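A quick sketch of the shortcut in isolation (the sample HTML is illustrative):

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $linkex = HTML::SimpleLinkExtor->new;
$linkex->parse('<a href="http://example.com/">link</a> <img src="pic.png">');

my @hrefs = $linkex->a;     # only HREF-type (<a>) links, as the crawler wants
print "@hrefs\n";           # http://example.com/
```

The `img`, `frames`, and catch-all `links` methods exist too, which is why the slide notes that it "handles many links" while the crawler calls only `a`.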

12 Step 3 – Tying it together Simple application Instantiate objects Enter request loop Spit data to somewhere Add parsed links to queue

13 Implementation of Step 3

for (my $i = 0; $i < 10; $i++) {               # Parse loop
    my $response = $ua->get(pop @urls);        # Get HTTP response
    if ($response->is_success) {               # If response is OK
        $p->reset;
        $p->parse($response->content);         # Parse for author
        $p->eof;
        if ($p->state == 1) {                  # If state is GOT_NAME
            $authors{$p->author}++;            # then add author count
        } else {
            $authors{'Not Specified'}++;       # otherwise add default count
        }
        $linkex->parse($response->content);    # parse for links
        unshift @urls, $linkex->a;             # and add links to queue
    }
}

14 End result

#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;        # List of URLs to visit
my %authors;

# First, create & set up the user agent
my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);     # 15 seconds delay
$ua->use_sleep(1);     # Sleep if delayed

my $p      = My::LinkParser->new;              # Create parsers
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://www.beamartyr.net/";        # Initialize list of URLs

15 End result

for (my $i = 0; $i < 10; $i++) {               # Parse loop
    my $response = $ua->get(pop @urls);        # Get HTTP response
    if ($response->is_success) {               # If response is OK
        $p->reset;
        $p->parse($response->content);         # Parse for author
        $p->eof;
        if ($p->state == 1) {                  # If state is GOT_NAME
            $authors{$p->author}++;            # then add author count
        } else {
            $authors{'Not Specified'}++;       # otherwise add default count
        }
        $linkex->parse($response->content);    # parse for links
        unshift @urls, $linkex->a;             # and add links to queue
    }
}

print "Results:\n";                            # Print results
map { print "$_\t$authors{$_}\n" } keys %authors;

16 End result

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple constants
use constant GOT_NAME => 1;

sub state  { return $_[0]->{STATE};  }   # Simple access methods
sub author { return $_[0]->{AUTHOR}; }

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

17 End result

sub start {                        # Parser hook
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}

18 What’s missing? Full URLs for relative links Non-HTTP links Queues & caches Persistent storage Link (and data) validation
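The first gap is commonly closed with the URI module, which ships alongside LWP: resolve each extracted link against the URL of the page it came from before queueing it. A sketch (the base and link values here are made up for illustration):

```perl
use strict;
use warnings;
use URI;

# $base is the URL of the page just fetched.
my $base = 'http://www.beamartyr.net/docs/';

for my $link ('page.html', '../index.html', 'http://example.com/') {
    my $abs = URI->new_abs($link, $base);   # Resolve relative to $base
    print "$abs\n";
}
# http://www.beamartyr.net/docs/page.html
# http://www.beamartyr.net/index.html
# http://example.com/
```

Checking `$abs->scheme` at the same point is one way to handle the second gap (skipping non-HTTP links such as mailto: or ftp:).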

19 In review Create robot user agent to crawl websites nicely Create parsers to extract data from sites, and links to the next sites Create a simple program to parse a queue of URLs

20 Thank you! For more information: Issac Goldstand isaac@cpan.org http://www.beamartyr.net/ http://www.mirimar.net/

