Creating a Web Crawler in 3 Steps

Issac Goldstand
isaac@cpan.org
Mirimar Networks
http://www.mirimar.net/
The 3 steps
1. Creating the User Agent
2. Creating the content parser
3. Tying it together
Step 1 – Creating the User Agent
- libwww-perl (LWP): an OO interface for creating user agents that interact with remote websites and web applications (a basic request sketch follows below)
- We will look at LWP::RobotUA
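Before the robot-specific pieces, here is a minimal sketch of the basic LWP request cycle; the URL is a placeholder, and LWP::RobotUA (shown on the following slides) is used the same way:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;                        # plain (non-robot) user agent
my $response = $ua->get('http://www.example.com/');  # fetch a page
print $response->status_line, "\n";                  # e.g. "200 OK"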
Creating the LWP Object
Attributes to set (see the sketch below):
- User agent
- Cookie jar
- Timeout
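A minimal sketch of setting these three attributes on an LWP::UserAgent; the cookie file name and timeout value are arbitrary placeholders:

use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
$ua->agent('MyBot/1.0');              # user agent identification string
$ua->cookie_jar(HTTP::Cookies->new(   # cookie jar, persisted to disk
    file     => 'cookies.txt',
    autosave => 1,
));
$ua->timeout(30);                     # request timeout, in seconds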
Robot UA extras
What LWP::RobotUA adds on top of LWP::UserAgent (see the sketch below):
- Robot rules (robots.txt handling)
- Delay
- use_sleep
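delay and use_sleep are shown on the next slide. For robot rules, LWP::RobotUA fetches and honors each site's robots.txt automatically; as a sketch, you can also hand it a persistent rules cache (the database file name here is a placeholder):

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;   # robots.txt rules cached in a DBM file

my $rules = WWW::RobotRules::AnyDBM_File->new('MyBot/1.0', 'rules.db');
my $ua    = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org', $rules);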
Implementation of Step 1

use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
my $ua = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org');

$ua->delay(15/60);    # 15 seconds delay (delay is given in minutes)
$ua->use_sleep(1);    # Sleep if delayed
Step 2 – Creating the content parser
- HTML::Parser
- Event-driven parser mechanism
- OO and function-oriented interfaces
- Hooks that call your functions at certain points during parsing (see the sketch below)
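A minimal sketch of the function-oriented, event-driven interface (api_version 3 handler style); the handlers here just print what they see:

use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { print "start tag: $_[0]\n" }, 'tagname' ],
    text_h  => [ sub { print "text: $_[0]\n"      }, 'dtext'   ],
);
$p->parse('<p>Hello, world</p>');
$p->eof;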
Subclassing HTML::Parser
- The biggest issue is non-persistence: the parser carries no state of its own between events
- CGI authors may be used to this, but it still makes for many caveats
- You must implement your own state-preservation mechanism (shown on the next slides)
Implementation of Step 2

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple state constants
use constant GOT_NAME => 1;

sub state {                        # Simple accessor methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {                        # Parser hook, called on each start tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
Shortcut
- HTML::SimpleLinkExtor: a simple package to extract links from HTML
- It handles many link types – we only want HREF-type links (see the sketch below)
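A minimal sketch of the shortcut; assume $html holds the HTML of a fetched page:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);     # $html is assumed to hold fetched HTML
my @hrefs = $extor->a;    # just the <a href="..."> links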
Step 3 – Tying it together
A simple application:
- Instantiate the objects
- Enter the request loop
- Spit the data out somewhere
- Add parsed links to the queue
Implementation of Step 3

for (my $i = 0; $i < 10; $i++) {               # Parse loop
    my $response = $ua->get(pop @urls);        # Get HTTP response
    if ($response->is_success) {               # If response is OK
        $p->reset;
        $p->parse($response->content);         # Parse for author
        $p->eof;
        if ($p->state == 1) {                  # If state is GOT_NAME
            $authors{$p->author}++;            # then add to the author count
        } else {
            $authors{'Not Specified'}++;       # otherwise add to the default count
        }
        $linkex->parse($response->content);    # Parse for links
        unshift @urls, $linkex->a;             # and add links to the queue
    }
}
End result

#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;       # List of URLs to visit
my %authors;    # Author counts

# First, create & set up the user agent
my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);    # 15 seconds delay
$ua->use_sleep(1);    # Sleep if delayed

my $p      = My::LinkParser->new;    # Create parsers
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://www.beamartyr.net/";    # Initialize list of URLs
End result (cont)

for (my $i = 0; $i < 10; $i++) {               # Parse loop
    my $response = $ua->get(pop @urls);        # Get HTTP response
    if ($response->is_success) {               # If response is OK
        $p->reset;
        $p->parse($response->content);         # Parse for author
        $p->eof;
        if ($p->state == 1) {                  # If state is GOT_NAME
            $authors{$p->author}++;            # then add to the author count
        } else {
            $authors{'Not Specified'}++;       # otherwise add to the default count
        }
        $linkex->parse($response->content);    # Parse for links
        unshift @urls, $linkex->a;             # and add links to the queue
    }
}

print "Results:\n";                            # Print results
print "$_\t$authors{$_}\n" for keys %authors;
End result (cont)

package My::LinkParser;            # Parser class
use base qw(HTML::Parser);

use constant START    => 0;        # Define simple state constants
use constant GOT_NAME => 1;

sub state {                        # Simple accessor methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}

sub reset {                        # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}
End result (cont)

sub start {                        # Parser hook, called on each start tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
What’s missing?
- Full URLs for relative links (see the sketch below)
- Non-HTTP links
- Queues & caches
- Persistent storage
- Link (and data) validation
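For the first gap, a sketch of resolving a relative link against its page's URL with the core URI module; both URLs here are placeholders:

use URI;

my $base = 'http://www.example.com/dir/page.html';  # page the link came from
my $abs  = URI->new_abs('../other.html', $base);    # resolve the relative link
print "$abs\n";                                     # http://www.example.com/other.html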
In review
- Create a robot user agent to crawl websites politely
- Create parsers to extract data from sites, and links to the next sites
- Create a simple program to work through a queue of URLs
Thank you!

For more information:
Issac Goldstand
isaac@cpan.org
http://www.beamartyr.net/
http://www.mirimar.net/