An Introduction to Perl with Applications in Web Page Scraping
What is Perl?
    Practical Extraction and Report Language
    High-level, general-purpose, interpreted, dynamic programming language
    Borrows from Unix shell scripting languages
    Ideal for “small” tasks that involve text processing
What is going to be taught during this workshop?
    Most of this presentation is taken from the introduction at www.perl.com
    Perl language constructs
        Variables
        Flow control
        String processing
        File I/O
        Subroutines
    Object-oriented Perl
    Application: Web page scraping
Hello World
    > perl -e 'print "hello world\n"'
    hello world
    > perl -e 'print "hello ", "world\n"'
    hello world
    > perl -e "print 'hello ', 'world\n'"
    hello world\n>
    Single-quoted Perl strings do not interpret \n, so the last command prints it literally.
Scalars
    Single things
        Number
        String
    $fruitCount = 5;
    $fruitType = 'apples';
    $countReport = "> There are $fruitCount $fruitType";
    print $countReport;
    > There are 5 apples
Scalars continued
    $a = "8";
    $b = $a + "1";    # numeric addition
    print "> $b\n";
    > 9
    $c = $a . "1";    # string concatenation
    print "> $c\n";
    > 81
Even more scalar examples*
    $a = 5;
    $a++;        # $a is now 6; we added 1 to it.
    $a += 10;    # Now it's 16; we added 10.
    $a /= 2;     # And divided it by 2, so it's 8.
*Shamelessly taken from l1.html.
Arrays
    Lists of scalars
    @months = ("July", "August", "September");
    print $months[0];    # This prints "July".
    $months[2] = "Smarch";
    If an array doesn't exist, you create it when you assign a value to one of its elements.
    $winterMonths[0] = "December";    # This implicitly creates @winterMonths.
Arrays continued
    If you want to find the last index of an array, use $#:
    print "> $#months\n";
    > 2
    If the array is empty or doesn't exist, $#months is -1.
    You can also resize an array by assigning to $#months:
    $#months = 0;    # Now @months only contains "July"
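The indexing and resizing rules above can be tried out in one short script. This is only a sketch: the variable names $last_index, $length, and $after_push, and the extra "October" element, are illustrative additions and not part of the original slides.

```perl
use strict;
use warnings;

my @months = ("July", "August", "September");

my $last_index = $#months;          # index of the last element
my $length     = scalar @months;    # number of elements

push @months, "October";            # grow the array by one element
my $after_push = scalar @months;

$#months = 1;                       # shrink it to the first two elements

print "> last index was $last_index, length was $length\n";
print "> after push: $after_push elements\n";
print "> now: @months\n";
```

Note that the last index is always one less than the element count, which is why an empty array reports -1.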
Hashes
    Map a key to a value
    %daysInMonth = ( "July" => 31, "August" => 31, "September" => 30 );
    print "> $daysInMonth{'September'}\n";
    > 30
    To add a new key and value:
    $daysInMonth{"February"} = 28;
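Hashes continued
    Counting the keys: in scalar context, keys returns the number of keys in the hash
    print "> " . keys(%daysInMonth) . "\n";
    > 3
    (3 for the original three months; the count becomes 4 once "February" has been added)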
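A minimal sketch of how the hash operations above fit together, assuming the same %daysInMonth data. The sorted loop and the @report array are illustrative choices, not something taken from the slides.

```perl
use strict;
use warnings;

# Same data as the slides, plus the February entry added afterwards.
my %daysInMonth = (
    "July"      => 31,
    "August"    => 31,
    "September" => 30,
);
$daysInMonth{"February"} = 28;

# In list context, keys returns the keys themselves in no guaranteed order,
# so sort them before building the report.
my @report;
for my $month (sort keys %daysInMonth) {
    push @report, "$month has $daysInMonth{$month} days";
}
print "> $_\n" for @report;

# In scalar context, keys returns how many keys the hash holds.
my $count = keys %daysInMonth;
print "> $count months\n";
```

Sorting matters here because Perl makes no promise about the order in which a hash hands back its keys.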
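For loops
    print "> ";
    for ($i = 0; $i <= 5; $i++) {
        print "I can count to $i\n";
    }
    print "\n";
    > I can count to 0
    I can count to 1
    I can count to 2
    I can count to 3
    I can count to 4
    I can count to 5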
For loops
    Iterating over a list
    print "> ";
    for $i (5, 4, 3, 2, 1) {
        print "$i ";
    }
    print "\n";
    > 5 4 3 2 1
For loops
    @one_to_ten = (1 .. 10);
    $top_limit = 25;
    for $i (@one_to_ten, 15, 20 .. $top_limit) {
        print "$i\n";
    }
    # Prints 1 through 10, then 15, then 20 through 25, one number per line.
One more for loop
    for $marx ('Groucho', 'Harpo', 'Zeppo', 'Karl') {
        print "> $marx is my favorite Marx brother.\n";
    }
    > Groucho is my favorite Marx brother.
    > Harpo is my favorite Marx brother.
    > Zeppo is my favorite Marx brother.
    > Karl is my favorite Marx brother.
While loop
    my $count = 0;
    print "> ";
    while ($count != 3) {
        $count++;
        print "$count ";
    }
    print "\n";
    > 1 2 3
Until loop
    $count = 3;
    print "> ";
    until ($count == 0) {
        $count--;
        print "$count ";
    }
    print "\n";
    > 2 1 0
if/elsif/else
    if ($a == 5) {
        print "It's five!\n";
    } elsif ($a == 6) {
        print "It's six!\n";
    } else {
        print "It's something else.\n";
    }
Unless
    unless ($pie eq 'apple') {
        print "Ew, I don't like $pie flavored pie.\n";
    } else {
        print "Apple! My favorite!\n";
    }
Comparing unless and if
    These two statements are equivalent:
    print "I'm burning the 7 pm oil\n" unless $day eq 'Friday';
    print "I'm burning the 7 pm oil\n" if not ($day eq 'Friday');
String operations
    $yes_no = 'no';
    print "> affirmative\n" if $yes_no == 'yes';
    > affirmative
    Strings are automatically converted to numbers for operators like ==. Both 'no' and 'yes' convert to 0, so the test succeeds.
    Use eq instead of == to compare strings correctly.
More string comparisons
    my $five = 5;
    print "> Numeric equality!\n" if $five == " 5 ";
    print "> String equality!\n" if $five eq "5";
    > Numeric equality!
    > String equality!
    print "> No string equality!\n" if not($five eq " 5");
    > No string equality!
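The difference between == and eq can be checked side by side. The sketch below reuses the $yes_no value from the slides; the names $string_match and $numeric_match are made up for illustration. Warnings are switched off around the numeric test because Perl would otherwise warn that the strings are not numeric.

```perl
use strict;
use warnings;

my $yes_no = 'no';

# String comparison: 'no' and 'yes' differ, so this is false.
my $string_match = ($yes_no eq 'yes') ? 1 : 0;

# Numeric comparison: both strings convert to 0, so this is true.
my $numeric_match;
{
    no warnings 'numeric';    # silence "isn't numeric" for this demonstration
    $numeric_match = ($yes_no == 'yes') ? 1 : 0;
}

print "> eq says: $string_match\n";
print "> == says: $numeric_match\n";
```

Running scripts with warnings enabled is a cheap way to catch this mistake, since the warning appears exactly where == is applied to text.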
substr
    $greeting = "Welcome to Perl!\n";
    print "> " . substr($greeting, 0, 7) . "\n";
    > Welcome
    print "> ", substr($greeting, 7), "\n";
    > to Perl!
    print "> ", substr($greeting, -6, 6), ">";
    > Perl!
    >
substr continued
    my $greeting = "Welcome to Java!\n";
    substr($greeting, 11, 4) = 'Perl';      # $greeting is now "Welcome to Perl!\n";
    substr($greeting, 7, 3) = '';           # ... "Welcome Perl!\n";
    substr($greeting, 0, 0) = 'Hello. ';    # ... "Hello. Welcome Perl!\n";
split
    my $greeting = "Hello. Welcome Perl!\n";
    my @words = split(/ /, $greeting);
    # Three items: "Hello.", "Welcome", "Perl!\n"
    With a limit of 2:
    my $greeting = "Hello. Welcome Perl!\n";
    my @words = split(/ /, $greeting, 2);
    # Two items: "Hello.", "Welcome Perl!\n"
join
    my @words = ("Hello.", "Welcome", "Perl!\n");
    my $greeting = join(' ', @words);
    # "Hello. Welcome Perl!\n";
    my $andy_greeting = join(' and ', @words);
    # "Hello. and Welcome and Perl!\n";
    my $jam_greeting = join('', @words);
    # "Hello.WelcomePerl!\n";
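split and join are often used as a pair: break a string apart, rearrange the pieces, and glue them back together. A small sketch under that assumption follows. The names $line and $reversed and the word-reversal task are illustrative only.

```perl
use strict;
use warnings;

my $greeting = "Hello. Welcome Perl!\n";

# Remove the trailing newline so it does not stick to the last word.
chomp(my $line = $greeting);

# Split on single spaces, then reverse the word order and join again.
my @words    = split(/ /, $line);
my $reversed = join(' ', reverse @words);

print "> ", scalar(@words), " words\n";
print "> $reversed\n";
```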
Reading from a file
    Contents of test.txt:
    This
    is
    a
    test
Reading from a file continued
    open my $testfile, '<', 'test.txt' or die "I couldn't get at test.txt: $!";
    while ($line = <$testfile>) {
        print "> ", $line;
    }
    > This
    > is
    > a
    > test
chomp
    open my $testfile, '<', 'test.txt' or die "I couldn't get at test.txt: $!";
    print "> ";
    while ($line = <$testfile>) {
        chomp($line);    # remove the trailing newline
        print "$line ";
    }
    print "\n";
    > This is a test
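Calling chomp inside the loop body, rather than in the loop condition, matters when the last line of a file has no trailing newline: chomp returns the number of characters it removed, so using it as the condition would end the loop one line early. The sketch below shows this with a temporary stand-in for test.txt. The File::Temp usage and the @lines array are additions for the sake of a runnable example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a stand-in for test.txt so the example runs anywhere.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "This\nis\na\ntest";    # note: no newline after "test"
close $tmp_fh or die "Could not close $path: $!";

# Three-argument open with an explicit read mode.
open my $testfile, '<', $path or die "I couldn't get at $path: $!";

my @lines;
while (my $line = <$testfile>) {
    chomp $line;            # strip the trailing newline, if there is one
    push @lines, $line;
}
close $testfile;

print "> @lines\n";
```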
Writing to a file
    open my $overwrite, '>', 'overwrite.txt' or die "error trying to overwrite: $!";
    # Wave goodbye to the original contents.
    open my $append, '>>', 'append.txt' or die "error trying to append: $!";
    # Original contents still there; add to the end of the file.
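Opening a file for writing is only half the job; the data is sent with print and a filehandle. A minimal sketch of the overwrite and append modes follows, working in a temporary directory. The file name notes.txt and the File::Temp usage are assumptions made so the example is safe to run.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a throwaway directory so no real files are touched.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/notes.txt";    # hypothetical file name

# '>' creates the file, or truncates it if it already exists.
open my $overwrite, '>', $path or die "error trying to overwrite: $!";
print {$overwrite} "first line\n";
close $overwrite or die "error closing $path: $!";

# '>>' keeps the existing contents and writes at the end.
open my $append, '>>', $path or die "error trying to append: $!";
print {$append} "second line\n";
close $append or die "error closing $path: $!";

# Read the file back to confirm both lines are present.
open my $in, '<', $path or die "error trying to read: $!";
my @lines = <$in>;
close $in;

print "> $_" for @lines;
```

Checking the return value of close on a file opened for writing is worthwhile, because buffered output may only fail at that point.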
Subroutines
    sub multiply {
        my $ret = 1;
        for my $val (@_) {    # @_ holds the arguments passed to the subroutine
            $ret *= $val;
        }
        return $ret;
    }
    print "> ", multiply(2 .. 5), "\n";
    > 120
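A subroutine can also return a list, which the caller unpacks into several variables at once. The min_max helper below is hypothetical and not part of the slides. It is only meant to show arguments arriving in @_ and a list going back out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the smallest and largest of its arguments.
sub min_max {
    my @values = @_;    # copy the argument list
    my ($min, $max) = ($values[0], $values[0]);
    for my $val (@values) {
        $min = $val if $val < $min;
        $max = $val if $val > $max;
    }
    return ($min, $max);
}

my ($low, $high) = min_max(7, 2, 9, 4);
print "> lowest $low, highest $high\n";
```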
Programming with objects
    An object is a programmer-defined data structure that encapsulates
        Data
        Behavior (methods)
    A web browser object may have
        Data
            The current page
            A history of recently visited URLs
        Behavior
            Can navigate to a page
            Can display a page
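The browser object described above could be sketched in plain Perl as a blessed hash reference. Everything here is an assumption made for illustration: the Browser package name, the navigate and display method names, and the example.com placeholder URLs. Nothing is actually fetched or rendered.

```perl
use strict;
use warnings;

package Browser;    # hypothetical class name

# Constructor: the object is a blessed hash reference holding the data.
sub new {
    my ($class) = @_;
    my $self = { current_page => undef, history => [] };
    return bless $self, $class;
}

# Behavior: navigating records the URL; no network access takes place.
sub navigate {
    my ($self, $url) = @_;
    $self->{current_page} = $url;
    push @{ $self->{history} }, $url;
    return $self;
}

# Behavior: "displaying" here just returns a description of the current page.
sub display {
    my ($self) = @_;
    return "Showing $self->{current_page}";
}

package main;

my $browser = Browser->new;
$browser->navigate('http://example.com/page1');
$browser->navigate('http://example.com/page2');

print "> ", $browser->display, "\n";
print "> visited ", scalar(@{ $browser->{history} }), " pages\n";
```

The data lives in the hash and the behavior lives in the package's subroutines; bless is what ties the two together so that method calls find the right package.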
An Application: Scraping Web Pages
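As a first taste of scraping, the string-processing tools from earlier in the workshop are enough to pull links out of a small piece of HTML. This is a sketch under simplifying assumptions: the page is a hard-coded string rather than a downloaded one, the example.com URLs are placeholders, and the regular expression only copes with tidy markup. The references point to the Perl Mechanize library for real-world use.

```perl
use strict;
use warnings;

# Stand-in for a downloaded page; a real scraper would fetch this over HTTP.
my $html = <<'HTML';
<html><body>
<a href="http://example.com/one">First</a>
<p>Some text</p>
<a href="http://example.com/two">Second</a>
</body></html>
HTML

# Capture the href value and the link text of every simple <a> tag.
my %links;
while ($html =~ /<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/gi) {
    my ($url, $text) = ($1, $2);
    $links{$text} = $url;
}

for my $text (sort keys %links) {
    print "> $text => $links{$text}\n";
}
```

Storing the results in a hash keyed by link text makes it easy to look up a URL by the label a user would see on the page.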
References
    Beginner's Introduction to Perl
    Perl Mechanize Library Documentation
    Schwartz, R.L. and Phoenix, T., Learning Perl, 3rd Edition, July 2001.