Regular Expressions CISC/QCSE 810
Recognizing Matching Strings
ls *.exe translates to "any set of characters, followed by the exact string .exe"
The "*.exe" is a wildcard pattern, a simple relative of a regular expression
The shell compares every file name against "*.exe" and passes only the matching names to ls

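The same idea can be written as a Perl regular expression. The sketch below is only an illustration: the file names are made up, and `/\.exe$/` is one reasonable way to express "ends with .exe", not the only one.

```perl
use strict;
use warnings;

# Hypothetical file names, used only for illustration.
my @files = ("setup.exe", "notes.txt", "game.exe", "exe.txt");

# Regex counterpart of the wildcard *.exe:
#   \.   a literal dot (an unescaped . would match any character)
#   exe  the exact letters
#   $    end of the string
my @exe_files = grep { /\.exe$/ } @files;

print "$_\n" for @exe_files;    # prints setup.exe, then game.exe
```

Note that "exe.txt" is rejected because the pattern is anchored to the end of the name.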
In Perl
In Perl, can see if strings match using the =~ operator

    $s = "Cat In the Hat";
    if ($s =~ /Cat/) {
        print "Matches Cat";
    }
    if ($s =~ /Chat/) {
        print "Matches Chat";
    }

Common references
    \w      Characters in words        \W      Non-word character
    \s      Space, tab                 \S      Non-whitespace character
    \d      Match a digit              \D      Non-digit match
    \n      Newline                    \t      Tab
    .       Any character
    ^       Start of string            $       End of string
Modifiers
    *       0 or more occurrences      {n}     Exactly n matches
    {n,}    n or more matches          {n,m}   Match n to m matches
Character Groups
    [a-z]       [xyz]
    [0-9A-Z]    [\w_]
    [^a-z]      NOT a-z

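A few of these building blocks in use. This is a minimal sketch with made-up test strings; each pattern is anchored with ^ and/or $ so that it has to describe the whole string rather than just a part of it.

```perl
use strict;
use warnings;

# Each entry: description, test string (hypothetical), pattern.
my @checks = (
    [ "digits only",           "2024",        qr/^\d+$/      ],
    [ "exactly three a-z",     "abc",         qr/^[a-z]{3}$/ ],
    [ "two to four word chars", "abcde",      qr/^\w{2,4}$/  ],
    [ "contains whitespace",   "hello world", qr/\s/         ],
    [ "no lowercase letters",  "A1_",         qr/^[^a-z]*$/  ],
);

my @results;
for my $check (@checks) {
    my ($what, $string, $pattern) = @$check;
    my $ok = ($string =~ $pattern) ? "yes" : "no";
    push @results, $ok;
    print "$what: '$string' -> $ok\n";
}
# Only "abcde" fails: it has five word characters, and {2,4} allows at most four.
```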
Exercise 1
Write a regexp that matches only Canadian postal codes

Exercise 2
Write a regexp that matches typical intermediate files (.o, .dvi, .tmp)
    helpful if you want a systematic way to delete them

String Substitution
Found an input file (*.dat), looking for a matching output file

    @input_files = glob("*.dat");    # assumed: one way to collect the input files
    foreach $input_file (@input_files) {
        # Copy to output name
        $output_file = $input_file;
        # Replace .dat with .out (escape the dot, anchor at the end of the name)
        $output_file =~ s/\.dat$/.out/;
        if (! -f $output_file) {
            print "Need to create output for $output_file\n";
        }
    }

Translating

    $s = "Alternate Ending";
    $s =~ tr/a-z/A-Z/;

Can also use 'uc' and 'lc' (more generic for non-English languages)

Grabbing Substrings
Get root URL

    $url = "...";    # the full URL
    if ($url =~ /(www[\w.]*)/) {
        # $1 holds whatever the first pair of parentheses matched
        $short_url = $1;
        print "Full URL: $url\n";
        print "Site URL: $short_url\n";
    }

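A self-contained sketch of capturing with parentheses. The URL is a hypothetical placeholder, and the second group (the path) is an addition for illustration; checking the match result before reading $1 avoids picking up a stale value from an earlier match.

```perl
use strict;
use warnings;

# Hypothetical URL, for illustration only.
my $url = "http://www.example.com/courses/index.html";

my ($host, $path) = ("", "");
# m{...} is the same match operator with different delimiters,
# which avoids escaping the slashes in the pattern.
if ($url =~ m{(www[\w.]*)(/\S*)?}) {
    $host = $1;                      # first parenthesised group
    $path = defined $2 ? $2 : "";    # second group is optional
    print "Site URL: $host\n";       # Site URL: www.example.com
    print "Path:     $path\n";       # Path:     /courses/index.html
}
```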
End options
s/a/A/g - global; swap all matches
    changes "aaaba" to "AAAbA"
Compare with s/a/A/
    changes "aaaba" to "Aaaba"
/tmp/i - case insensitive
    recognizes "tmp", "Tmp", "tMP", "TMP"…

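The two options can be checked directly. A small sketch, using the slide's own "aaaba" string plus one extra made-up name ("temp") to show a non-match:

```perl
use strict;
use warnings;

my $text = "aaaba";

# Copy first, then substitute on the copy, so $text stays unchanged.
(my $all = $text) =~ s/a/A/g;      # every match replaced
(my $first = $text) =~ s/a/A/;     # only the first match replaced

print "With /g:    $all\n";        # AAAbA
print "Without /g: $first\n";      # Aaaba

# s///g also returns the number of substitutions it made.
my $copy  = $text;
my $count = ($copy =~ s/a/A/g);
print "Replacements: $count\n";    # 4

# /i ignores case; "temp" does not contain t-m-p in sequence, so it is rejected.
my @names = ("tmp", "Tmp", "tMP", "TMP", "temp");
my @hits  = grep { /tmp/i } @names;
print "Matched: @hits\n";          # tmp Tmp tMP TMP
```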
Exercise
Write a regexp line that returns all the integers in the text
Can it be extended to handle floating point values?

Functions with Regex
split

    split /\s+/, $line;
    split /,/, $line;
    split /\t/, $line;
    split //, $line;
    = qw( aaa bba
    = grep

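A runnable sketch of split and grep with regular expressions. All strings and list contents here are made up for illustration and are not taken from the slide.

```perl
use strict;
use warnings;

# split: the pattern describes the separator, not the data.
my @words  = split /\s+/, "alpha  beta\tgamma";   # runs of whitespace
my @fields = split /,/,   "1,2,3";                # commas
my @chars  = split //,    "abc";                  # empty pattern: single characters

print "Words:  @words\n";     # alpha beta gamma
print "Fields: @fields\n";    # 1 2 3
print "Chars:  @chars\n";     # a b c

# grep: keep only the list elements that match the pattern.
my @list   = qw(aaa bba abc ccc);     # hypothetical list
my @with_b = grep { /b/ } @list;
print "Contain b: @with_b\n";         # bba abc
```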
Longer example – Log files
Parsing log files

    [25/Mar/2003:02:22: ] "GET /gcs/new.gif HTTP/1.1"
    [25/Mar/2003:02:22: ] "GET /gcs/update.gif HTTP/1.1"
    proxy.skynet.be - - [25/Mar/2003:02:40: ] "GET /gcs/gc1hint.html HTTP/1.1"
    j3194.inktomisearch.com - - [25/Mar/2003:03:13: ] "GET /~gcs/K-12.html HTTP/1.0"
    kittyhawk.hhmi.org - - [25/Mar/2003:03:17: ] "HEAD /gcs/ HTTP/1.0"
    j3104.inktomisearch.com - - [25/Mar/2003:03:54: ] "GET /gcs/pa.html HTTP/1.0"
    crawl11-public.alexa.com - - [25/Mar/2003:04:51: ] "GET /gcs/clinical.html HTTP/1.0"
    …
    livebot search.live.com - - [24/Jul/2007:22:16: ] "GET /gcs/webstats/usage_ html HTTP/1.0"
    [24/Jul/2007:22:22: ] "GET /gcs/status/statuscheck.html HTTP/1.1"
    livebot search.live.com - - [24/Jul/2007:22:47: ] "GET /gcs/webstats/usage_ html HTTP/1.0"
    …

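One possible way to pull fields out of lines like these. This is a sketch under assumptions: the three log lines are hypothetical complete entries in the common web-server log layout (host, two dashes, bracketed timestamp, quoted request, status, size), with invented host names, times, and sizes, and the pattern only covers that layout.

```perl
use strict;
use warnings;

# Hypothetical, complete log lines (hosts, times and sizes are invented).
my @log = (
    'proxy.example.org - - [25/Mar/2003:02:40:00 -0500] "GET /gcs/gc1hint.html HTTP/1.1" 200 1234',
    'crawler.example.com - - [25/Mar/2003:04:51:00 -0500] "GET /gcs/clinical.html HTTP/1.0" 200 5678',
    'proxy.example.org - - [25/Mar/2003:05:10:00 -0500] "HEAD /gcs/ HTTP/1.0" 200 0',
);

my %hits;     # requests per host
my @pages;    # pages fetched with GET, in log order

for my $entry (@log) {
    # host, two ignored fields, [timestamp], "METHOD page protocol"
    if ($entry =~ m{^(\S+) \S+ \S+ \[([^\]]+)\] "(\w+) (\S+) [^"]*"}) {
        my ($host, $when, $method, $page) = ($1, $2, $3, $4);
        $hits{$host}++;
        push @pages, $page if $method eq 'GET';
    }
}

print "$_: $hits{$_} request(s)\n" for sort keys %hits;
print "GET pages: @pages\n";
```

Lines that do not fit the pattern are simply skipped, which is usually the safer default when a log contains truncated or unusual entries.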
Alternate uses
If you write your own program, with many print statements, can
1. make print statements meaningful
    "Time spent on loading: 23.5s"
2. can parse afterwards to process/store values

    $line =~ m/: ([\d.]+)s/;
    $time = $1;

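A sketch of the parse-afterwards step. Only the "loading" line comes from the slide; the other output lines and the stage name "solving" are hypothetical, and the pattern assumes every timing line follows the same "Time spent on NAME: VALUEs" layout.

```perl
use strict;
use warnings;

# Program output to post-process; all but the first line are invented.
my @output = (
    "Time spent on loading: 23.5s",
    "Time spent on solving: 101.25s",
    "Finished without errors",
);

my %time;         # seconds per stage
my $total = 0;

for my $line (@output) {
    # Capture the stage name and the number in front of the trailing "s".
    if ($line =~ /^Time spent on (\w+): ([\d.]+)s$/) {
        $time{$1} = $2;
        $total += $2;
    }
}

printf "%-8s %7.2fs\n", $_, $time{$_} for sort keys %time;
printf "%-8s %7.2fs\n", "total", $total;    # total 124.75s
```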
Resources
Any web search for "perl regular expression tutorial"
Perl reg exp by example
Reference card
Perl site reference