CSE S. Tanimoto Regular Expressions

Regular Expressions: Theory and Perl Implementation

Outline:
1. Theoretical Definitions and Examples
2. Acceptance by Finite Automata
3. Perl’s Syntax
4. Other pattern matching functionality in Perl
5. Program Example

Alphabets and Sets of Strings

An alphabet Σ = {a1, a2, ..., an} is a set of characters.
A string over Σ is a sequence of zero or more elements of Σ.
Example. If Σ = {0, 1, 2} then 2201 is a string over Σ.
No matter what Σ is, the empty string ε is a string over Σ.
A set of strings over Σ is a set of zero or more strings, each of which is a string over Σ.
Example. If Σ = {0, 1, 2} then {ε, 111, 121, 0} is a set of strings over Σ.
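
The definition of "a string over an alphabet" can be checked mechanically. The following is a minimal Perl sketch, assuming the alphabet {0, 1, 2} from the example; the helper name is_string_over is hypothetical and not part of the original slides.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of $s belongs to the
# alphabet, which is passed as a list of single characters.
sub is_string_over {
    my ($s, @alphabet) = @_;
    my %in_alphabet = map { $_ => 1 } @alphabet;
    for my $ch (split //, $s) {
        return 0 unless $in_alphabet{$ch};
    }
    return 1;    # also reached for the empty string
}

print is_string_over("2201", 0, 1, 2) ? "yes\n" : "no\n";
print is_string_over("",     0, 1, 2) ? "yes\n" : "no\n";
print is_string_over("2301", 0, 1, 2) ? "yes\n" : "no\n";
```

The empty string passes because the loop body never runs, which mirrors the statement that the empty string is a string over every alphabet.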

A Recursive Definition for Regular Expressions

A regular expression for an alphabet Σ is a certain kind of pattern that describes a set of strings over Σ.
Any character c in Σ is a regular expression representing {c}.
If E, E1 and E2 are regular expressions over Σ then so are:
  E1 E2    -- representing the set concatenation of E1 and E2.
  E1 | E2  -- representing alternation of E1 and E2.
  ( E )    -- representing E grouped with parentheses.
  E+       -- representing one or more instances of E concatenated.
  E*       -- representing zero or more instances of E concatenated.

Regular Expression Examples

Let Σ = {a, b, c, d}.
  a = {a}
  ab = {ab}
  a | b = {a, b}
  a+ = {a, aa, aaa, ... }
  ab* represents the set of strings having a single a followed by zero or more occurrences of b. That is, it’s {a, ab, abb, abbb, ... }
  a (b | c) = {ab, ac}
  (a | b) (c | d) = {ac, ad, bc, bd}
  aa* = a+ = {a, aa, aaa, ... }
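
These set memberships can be tried out in Perl. The sketch below is an illustration only, not part of the original slides: it anchors each pattern with \A and \z so that a match means the whole string is in the set, and the helper name in_set is hypothetical.

```perl
use strict;
use warnings;

# \A and \z anchor the pattern to the whole string, so a match means
# "the string is in the set" rather than "some substring is".
my %pattern = (
    'ab*'        => qr/\Aab*\z/,
    'a(b|c)'     => qr/\Aa(b|c)\z/,
    '(a|b)(c|d)' => qr/\A(a|b)(c|d)\z/,
    'a+'         => qr/\Aa+\z/,
);

# Hypothetical helper: 1 if $string is in the set described by $name, else 0.
sub in_set {
    my ($name, $string) = @_;
    return $string =~ $pattern{$name} ? 1 : 0;
}

print "abbb in ab*:        ", in_set('ab*', 'abbb'), "\n";
print "ba   in ab*:        ", in_set('ab*', 'ba'), "\n";
print "bd   in (a|b)(c|d): ", in_set('(a|b)(c|d)', 'bd'), "\n";
print "''   in a+:         ", in_set('a+', ''), "\n";
```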

Extended Regular Expressions

Let letters = a | b | c | d
Let digits = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Let identifiers = letters ( letters | digits )*

Thus we can use a name to represent a set of strings and use that name in a regular expression.
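
Perl offers a comparable way to name sub-patterns: a pattern compiled with qr// can be interpolated into a larger one. A minimal sketch, assuming the four-letter alphabet a to d used above; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

# Named sub-patterns, mirroring the definitions above (letters a-d only).
my $letter     = qr/[a-d]/;
my $digit      = qr/[0-9]/;
my $identifier = qr/$letter(?:$letter|$digit)*/;

for my $candidate ("ab12", "c", "7a", "") {
    my $verdict = $candidate =~ /\A$identifier\z/
        ? "identifier"
        : "not an identifier";
    print "'$candidate': $verdict\n";
}
```

The (?: ... ) group only groups; it does not create a capture variable.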

Finite Automaton

[Figure: a finite automaton with a start state and an accepting state; its edges are labeled a, b, a.]
A corresponding regular expression: ab*a
Example: process the string abba.
Now try abbb.
Finite number of states, but the number of strings is not necessarily finite.
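
Processing a string with this automaton can be sketched as a table lookup per character. The transition table below is an assumption reconstructed from the regular expression ab*a, and the state numbers 0, 1, 2 are hypothetical labels.

```perl
use strict;
use warnings;

# Assumed transition table for ab*a: state 0 is the start state and
# state 2 is the only accepting state.
my %delta = (
    0 => { a => 1 },
    1 => { b => 1, a => 2 },
    2 => {},
);
my %accepting = (2 => 1);

# Returns 1 if the automaton accepts $input, else 0.
sub accepts {
    my ($input) = @_;
    my $state = 0;
    for my $ch (split //, $input) {
        return 0 unless exists $delta{$state}{$ch};   # no edge: reject
        $state = $delta{$state}{$ch};
    }
    return $accepting{$state} ? 1 : 0;
}

print "abba: ", accepts("abba"), "\n";   # abba: 1
print "abbb: ", accepts("abbb"), "\n";   # abbb: 0
```

The string abbb is rejected because the run ends in state 1, which is not accepting.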

Equivalence of Finite Automata and Regular Expressions

[Figure: finite automata drawn beside the regular expressions ab, a | b, and a*.]

Regular Expressions in Perl

In Perl, regular expressions are used to specify patterns for pattern matching.

    $sentence = "Winter weather has arrived.";
    if ($sentence =~ /weather/) {
        print "Never bet on the weather.";
    }
    # $string =~ /Pattern/

The result of this kind of pattern matching is a true or false value.

A Perl Regular Expression for Identifier

$identifier = "[a-z][a-z0-9]*";
$sentence = "012,cse ,ABC]*";
if ($sentence =~ /$identifier/) {
    print "Seems to be an identifier here.";
}
$ident2 = "[a-zA-Z][a-zA-Z0-9]*";
$reservedWord = "begin|end";

Specifying Patterns

    /Pattern/     # Literal text;
                  # true if it occurs anywhere in the string.
    /^Pattern/    # Must occur at the beginning.
    "Pattern recognition is alive" =~ /^Pattern/
    "The end" =~ /end$/

    \s   whitespace                     \S   non-whitespace
    \w   a word char. [a-zA-Z_0-9]      \W   a non-word char.
    \d   a digit                        \D   a non-digit
    \b   word boundary                  \B   not word boundary
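
A short sketch of how the anchors and character classes behave on one sample sentence. The sentence is made up for illustration.

```perl
use strict;
use warnings;

my $text = "The end of 2 weekends";

print "starts with The\n"  if $text =~ /^The/;
print "ends with end\n"    if $text =~ /end$/;     # not printed: the string ends in "ends"
print "has the word end\n" if $text =~ /\bend\b/;
print "has a digit\n"      if $text =~ /\d/;

# In list context with the g modifier, a match returns every occurrence.
my @words = $text =~ /\w+/g;
print scalar(@words), " words\n";                  # 5 words
```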

Specifying Patterns (Cont.)

$test = "You have new mail ";
if ($test =~ /^You\s.+\d+-\d+-\d+/ ) {
    print "The mail has arrived.";
}
# The same pattern with the x modifier, which allows spaces inside it:
if ($test =~ m( ^ You \s .+ \d+ - \d+ - \d+ )x ) {
    print "The mail has arrived.";
}

Extracting Information

$test = "You have new mail ";
if ($test =~ /^You\s.+(\d+)-(\d+)-(\d+)/ ) {
    print "The mail has arrived on ";
    print "day $2 of month $1 in year $3.\n";
}
# Parentheses in the pattern establish
# variables $1, $2, $3, etc. to hold
# corresponding matched fragments.
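
One detail worth trying out: .+ is greedy, so it can swallow leading digits of the first number before the first group gets a chance to match. The sketch below assumes a message ending in the date 12-25-99, which is made up for illustration, and compares the greedy .+ with the non-greedy .+? form.

```perl
use strict;
use warnings;

# Hypothetical message; the date is an assumption for illustration.
my $test = "You have new mail 12-25-99";

# In list context a match returns the captured groups ($1, $2, $3).
my @greedy = $test =~ /^You\s.+(\d+)-(\d+)-(\d+)/;
my @lazy   = $test =~ /^You\s.+?(\d+)-(\d+)-(\d+)/;

print "greedy: @greedy\n";   # greedy: 2 25 99
print "lazy:   @lazy\n";     # lazy:   12 25 99
```

With the greedy form, .+ keeps the "1" of "12" and leaves only "2" for the first group; the non-greedy form stops as early as possible, so the first group receives "12".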

Search and Replace

$sntc = "We surfed the waves the whole day.";
$sntc =~ s/surfed/sailed/;
print $sntc;
# We sailed the waves the whole day.
$sntc =~ s/the//g;
print $sntc;
# We sailed  waves  whole day.
# (the space on each side of a removed "the" remains)
# g makes the replacement “global”.
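
A global replacement of a bare word also hits that word inside longer words. The sketch below uses a made-up sentence to compare s/the//g with a variant that requires a word boundary before "the" and removes the whitespace after it; both variable names are hypothetical.

```perl
use strict;
use warnings;

my $sntc = "We sailed the waves the whole day, then bathed.";

# s///g returns the number of substitutions it made.
my $naive   = $sntc;
my $n_naive = ($naive =~ s/the//g);

my $careful   = $sntc;
my $n_careful = ($careful =~ s/\bthe\s+//g);

print "$n_naive: $naive\n";       # 4: We sailed  waves  whole day, n bad.
print "$n_careful: $careful\n";   # 2: We sailed waves whole day, then bathed.
```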

Interpolation of Variables in Replacements

$exclamation = "yeah";
$sntc = "We had fun.";
$sntc =~ s/\w+/$exclamation/g;
print $sntc;
# yeah yeah yeah.
# The replacement, like the pattern, can contain a Perl variable.
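
When a variable is interpolated into the pattern side, any metacharacters it contains keep their special meaning. A small sketch with made-up data, showing the difference that \Q...\E (which quotes metacharacters) makes:

```perl
use strict;
use warnings;

my $text   = "abc a.c";
my $needle = "a.c";   # contains the metacharacter "."

my @as_pattern = $text =~ /$needle/g;       # "." matches any character
my @as_literal = $text =~ /\Q$needle\E/g;   # "." matches only a dot

print "as pattern: @as_pattern\n";   # as pattern: abc a.c
print "as literal: @as_literal\n";   # as literal: a.c
```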

Example of (Crude) Lexical Analysis

$ident = "[a-zA-Z][a-zA-Z0-9]*";
$int = "[\-]?[0-9]+";
$op = "[\-\+\*\/\=]|mod";
$exp = "begin x = 5; print sqrt(x); end";
$exp =~ s/$ident/ID/g;
$exp =~ s/$int/N/g;
$exp =~ s/$op/OP/g;
print $exp;
# prints: ID ID OP N; ID ID(ID); ID
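
One reason the analysis is crude: the identifier pass runs first, so the word operator mod is rewritten to ID before the operator pass sees it. The sketch below shows this on a made-up expression and contrasts it with a single left-to-right pass that tries the operator alternative first. The one-pass loop is an alternative technique offered for comparison, not the method of the slides.

```perl
use strict;
use warnings;

my $ident = "[a-zA-Z][a-zA-Z0-9]*";
my $int   = "-?[0-9]+";
my $op    = "[-+*/=]|mod";
my $exp   = "x = y mod 5";

# Three passes, identifiers first: "mod" is rewritten to ID before the
# operator pass ever sees it.
my $passes = $exp;
$passes =~ s/$ident/ID/g;
$passes =~ s/$int/N/g;
$passes =~ s/$op/OP/g;
print "$passes\n";    # ID OP ID ID N

# One pass: at each position try operator, then identifier, then integer.
# \G continues from where the previous match ended.
my @tokens;
while ($exp =~ /\G\s*(?:(mod\b|[-+*\/=])|([a-zA-Z][a-zA-Z0-9]*)|(-?[0-9]+))/gc) {
    push @tokens, defined $1 ? "OP" : defined $2 ? "ID" : "N";
}
print "@tokens\n";    # ID OP ID OP N
```

The \b after mod keeps identifiers such as model from being split into an operator and a remainder.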

Processing Assignment Submissions Using Forms and Files

1. Form file
2. Perl script to process data from form.
3. Perl script to “compile” data into an index page.

The HTML Form

Submission for CSE 341 Miniproject Topic Proposals

CSE 341 Miniproject Topic Proposal Submission Form

Write a topic-proposal web page, and then fill out this form and submit it by Thursday, February 24 at 5:00 PM. (The web page should follow these guidelines.)

<form method=post action=" student/process-topic-proposal.pl">

The HTML Form (2 of 2)

Possible name of project:
Name of Possible partner (optional):
URL of a web page that describes your proposal:
If you plan to submit another topic proposal because you are very uncertain about whether to stick with this one, check this box:

Perl Script to Process Data From Form

#! /usr/bin/perl
# Process the miniproject topic proposal form inputs
# S. Tanimoto, 20 Feb 2000
use CGI qw/:standard/;
use strict;
print header;
my $projectname = param("projectname");
my $uncertain = param("uncertain");
my $partner = param("partner");
my $proposal_url = param("proposalurl");
my $student_username = $ENV{"REMOTE_USER"};
my $now = localtime();
$projectname =~ s/[^a-zA-Z0-9\-\~]//g;
$partner =~ s/[^a-zA-Z0-9\-\~]//g;
$proposal_url =~ s/[^a-zA-Z0-9\-\~]//g;
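
The three substitutions delete every character that is not on a whitelist. The sketch below applies the same character class to a made-up URL to show its effect, and compares it with a wider class that also keeps ".", "/" and ":". The wider class is an assumption added for illustration and is not taken from the script.

```perl
use strict;
use warnings;

# Hypothetical form value, used only for illustration.
my $url = "http://example.edu/~user/proposal.html";

# The whitelist from the script: letters, digits, "-" and "~" survive.
(my $strict_copy = $url) =~ s/[^a-zA-Z0-9\-\~]//g;

# An assumed wider whitelist that also keeps ".", "/" and ":".
(my $wider_copy = $url) =~ s/[^a-zA-Z0-9\-\~.\/:]//g;

print "$strict_copy\n";   # httpexampleedu~userproposalhtml
print "$wider_copy\n";    # http://example.edu/~user/proposal.html
```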

Perl Script to Process the Data (2 of 2)

my $output_line =
  "STUDENT_USERNAME=$student_username; ".
  "PROPOSAL_URL=$proposal_url; ".
  "PROJECT_NAME=$projectname; ".
  "PARTNER=$partner; ".
  "UNCERTAIN=$uncertain; ".
  "DATE=$now; ";
if (! (open(OUT, ">>MP-topic-proposal-data.txt"))) {
  print("Error: could not open topic file for output.");
  print("Please notify instructor and/or try again later.");
  print end_html;
  exit 0;
}
print OUT $output_line, "\n";
close OUT;
print h1("Your miniproject topic proposal has been received. Thanks!");
print end_html;

Perl Script to “Compile” the Data

#!/usr/bin/perl
# make-MP-index-of-proposed-topics.pl
use strict;
use CGI qw/:standard/;
open(INFILE, "<MP-topic-proposal-data-sorted.txt") ||
  die("Could not open the file MP-topic-proposal-data-sorted.txt.\n");
print<<"EOT";
CSE 341 MP Topic Proposal Index
CSE 341 MP Topic Proposal Index
EOT
print " Student username Proposal Page Partner Certainty When \n";
my $projectname; my $uncertain; my $partner;
my $proposal_url; my $student_username; my $date;

Perl Script to “Compile” the Data (2 of 3)

while (<INFILE>) {
  if ( /STUDENT_USERNAME=([^\;]+);\s/) { $student_username = $1; }
  else { $student_username = ""; }
  if ( /PROJECT_NAME=([^\;]+);\s/) { $projectname = $1; }
  else { $projectname = ""; }
  if ( /PROPOSAL_URL=([^\;]+);\s/) { $proposal_url = $1; }
  else { $proposal_url = ""; }
  if ( /PARTNER=([^\;]+);\s/) { $partner = $1; }
  else { $partner = ""; }
  if ( /UNCERTAIN=([^\;]+);\s/) { $uncertain = $1; }
  else { $uncertain = ""; }
  if ( /DATE=([^\;]+);/) { $date = $1; }
  else { $date = ""; }
  if ($proposal_url =~ /http/ ) {}
  else { $proposal_url = " $proposal_url"; }
  if ($uncertain eq "No") { $uncertain = ""; }
  else { $uncertain = "Uncertain"; }

Perl Script to “Compile” the Data (3 of 3)

  my $link = " $projectname ";
  print " $student_username $link $partner $uncertain $date \n";
}
print " \n";
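
The per-field matches in the loop can also be written as one global match that fills a hash. The sketch below is an alternative under stated assumptions, not the script's own method: the record is made up in the NAME=value; format that the form-processing script writes, and the table-row markup is a guess at what an index page might use.

```perl
use strict;
use warnings;

# Hypothetical record in the format written by the form-processing script.
my $line = "STUDENT_USERNAME=jdoe; PROPOSAL_URL=http-example; "
         . "PROJECT_NAME=Chess; PARTNER=asmith; UNCERTAIN=No; "
         . "DATE=Sun Feb 20 10:00:00 2000; ";

# In list context with g, the match returns (name, value, name, value, ...),
# which Perl turns into hash entries.
my %field = $line =~ /([A-Z_]+)=([^;]*);/g;

# Assumed markup: one table row per record, with the project name as a link.
my $row = "<tr><td>$field{STUDENT_USERNAME}</td>"
        . "<td><a href=\"$field{PROPOSAL_URL}\">$field{PROJECT_NAME}</a></td>"
        . "<td>$field{PARTNER}</td><td>$field{DATE}</td></tr>";
print "$row\n";
```

Because the value pattern is [^;]* rather than [^;]+, a field left empty in the form still produces a hash entry, with the empty string as its value.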