Regular Expressions Software Tools
Slide 2 What is a Regular Expression?
A regular expression is a pattern to be matched against a string, for example the pattern Bill.
- Matching either succeeds or fails.
- Sometimes you may want to replace a matched pattern with another string.
Regular expressions are used by many Unix commands and programs, such as grep, sed, awk, vi, emacs, and even some shells.

Slide 3 Simple Uses of Regular Expressions
To find all the lines in a file that contain the string Shakespeare, we can use the grep command:
    $ grep Shakespeare movie > result
Here, Shakespeare is the regular expression that grep looks for in the file movie. grep prints the lines that match, and the shell redirects that output to the file result.

Slide 4 Simple Uses of Regular Expressions
In Perl, we can make Shakespeare a regular expression by enclosing it in slashes:
    if(/Shakespeare/){
        print $_;
    }
What is tested in the if-statement? Answer: $_. When a regular expression is enclosed in slashes, $_ is tested against it. The match returns true if it succeeds and false otherwise.

Slide 5 Simple Uses of Regular Expressions
    if(/Shakespeare/){
        print $_;
    }
The previous example tests only one line, and prints the line if it contains Shakespeare. To work on all lines, add a loop (print with no argument prints $_):
    while(<>){
        if(/Shakespeare/){
            print;
        }
    }

Slide 6 Simple Uses of Regular Expressions
What if we are not sure how to spell Shakespeare? The first part, Shak, is easy, and there must be an r near the end. How can we express this idea?
grep:
    grep "Shak.*r" movie > result
Perl:
    while(<>){
        if(/Shak.*r/){
            print;
        }
    }
.* means “zero or more of any character”.

Slide 7 Simple Uses of Regular Expressions
grep:
    grep "Shak.*r" movie > result
The double quotes in this grep example prevent the shell from treating * as a filename wildcard before grep sees the pattern.
Since Shakespeare ends in “e”, shouldn’t the pattern be Shak.*r.* ?
Answer: No need. A match may start and end anywhere in the line, so any characters can come before or after the pattern. Shak.*r matches the same lines as .*Shak.*r.*

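The equivalence of the short and long patterns can be checked directly. This is a minimal sketch; the sample lines (including the misspelling) are made up for illustration and are not from the movie file.

```perl
use strict;
use warnings;

# Hypothetical sample lines, one of them misspelled.
my @lines = ("Shakespeare in Love", "Shakspere in Love", "Titanic");

# An unanchored match may start and end anywhere in the string,
# so padding the pattern with .* on both sides selects the same lines.
my @short = grep { /Shak.*r/ } @lines;
my @long  = grep { /.*Shak.*r.*/ } @lines;

print "$_\n" for @short;
```

Both lists contain the two Shakespeare lines and leave out Titanic.
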
Slide 8 Substitution
- Another simple use of regular expressions is the substitute operator.
- It replaces the part of a string that matches the regular expression with another string:
    s/Shakespeare/Bill Gates/;
- $_ is matched against the regular expression (Shakespeare). If the match is successful, the part of the string that matched is discarded and replaced by the replacement string (Bill Gates).
- If the match is unsuccessful, nothing happens.

Slide 9 Substitution
The program:
    $ cat movie
    Titanic
    Saving Private Ryan
    Shakespeare in Love
    Life is Beautiful
    $ cat sub1
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/Shakespeare/){
            s/Shakespeare/Bill Gates/;
            print;
        }
    }
    $ sub1 movie
    Bill Gates in Love
    $

Slide 10 Substitution
An even shorter way to write it, using the fact that s/// returns true only when it made a replacement:
    $ cat sub2
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(s/Shakespeare/Bill Gates/){
            print;
        }
    }
    $ sub2 movie
    Bill Gates in Love
    $

Slide 11 Patterns
A regular expression is a pattern.
- Some parts of the pattern match a single character (a).
- Other parts of the pattern match multiple characters (.*).

Slide 12 Single-Character Patterns
The dot “.” matches any single character except the newline (\n). For example, the pattern /a./ matches any two-character sequence that starts with a and is not “a\n”.
Use \. if you really want to match a period.
    $ cat test
    hi
    hi bob.
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/\./){
            print;
        }
    }
    $ sub3 test
    hi bob.
    $

Slide 13 Single-Character Groups
If you want to match one character out of a group of characters, use [ ]:
    /[abcde]/
This matches a string containing any one of the first 5 lowercase letters, while:
    /[aeiouAEIOU]/
matches any of the 5 vowels in either upper or lower case.

Slide 14 Single-Character Groups
If you want ] in the group, put a backslash before it, or put it as the first character in the list:
    /[abcde]]/      # matches [abcde] followed by ]
    /[abcde\]]/     # okay
    /[]abcde]/      # also okay
Use - for ranges of characters (like a through z):
    /[0123456789]/  # any single digit
    /[0-9]/         # same
If you want - in the list, put a backslash before it, or put it at the beginning or end:
    /[X-Z]/         # matches X, Y, Z
    /[X\-Z]/        # matches X, -, Z
    /[XZ-]/         # matches X, Z, -
    /[-XZ]/         # matches -, X, Z

Slide 15 Single-Character Groups
More range examples:
    /[0-9\-]/        # match 0-9, or minus
    /[0-9a-z]/       # match any digit or lowercase letter
    /[a-zA-Z0-9_]/   # match any letter, digit, underscore
There is also a negated character group, which starts with a ^ immediately after the left bracket. This matches any single character not in the list.
    /[^0123456789]/  # match any single non-digit
    /[^0-9]/         # same
    /[^aeiouAEIOU]/  # match any single non-vowel
    /[^\^]/          # match any single character except ^

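A short sketch of how these groups behave on whole strings. The sample strings are invented for illustration. Note that each pattern only asks whether the string contains at least one character from the group.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @samples = ("abc", "123", "a-1", "^^^");

my @has_digit      = grep { /[0-9]/ }   @samples;  # contains a digit
my @has_non_digit  = grep { /[^0-9]/ }  @samples;  # contains a non-digit
my @digit_or_minus = grep { /[0-9\-]/ } @samples;  # contains a digit or a minus
my @has_non_caret  = grep { /[^\^]/ }   @samples;  # contains something other than ^

print "digit:          @has_digit\n";
print "non-digit:      @has_non_digit\n";
print "digit or minus: @digit_or_minus\n";
print "non-caret:      @has_non_caret\n";
```

The string "a-1" appears in every list because it contains a digit, a non-digit, a minus, and a non-caret character.
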
Slide 16 Single-Character Groups
For convenience, some common character groups are predefined:
    Predefined        Group           Negated          Negated Group
    \d (a digit)      [0-9]           \D (non-digit)   [^0-9]
    \w (word char)    [a-zA-Z0-9_]    \W (non-word)    [^a-zA-Z0-9_]
    \s (space char)   [ \t\n]         \S (non-space)   [^ \t\n]
\d matches any digit
\w matches any letter, digit, or underscore
\s matches any space, tab, or newline (it also matches other whitespace such as carriage return and form feed)
You can use these predefined groups in other groups:
    /[\da-fA-F]/    # match any hexadecimal digit

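As a small illustration of the predefined groups, the sketch below picks out strings that contain a hexadecimal digit and strips every non-digit from a line. The sample data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical candidates: two contain hex digits, two do not.
my @samples = ("ff", "7E", "xyz", "--");
my @has_hex = grep { /[\da-fA-F]/ } @samples;

# \D is the negation of \d, so deleting every \D leaves only the digits.
$_ = "Room 101, Floor 3";
s/\D//g;
my $digits = $_;

print "@has_hex\n";
print "$digits\n";
```
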
Slide 17 Multipliers
Multipliers allow you to say “one or more of these” or “up to four of these.”
    *  means zero or more of the immediately previous character (or character group).
    +  means one or more of the immediately previous character (or character group).
    ?  means zero or one of the immediately previous character (or character group).

Slide 18 Multipliers
Example:
    /Ga+te?s/
matches a G followed by one or more a’s, followed by t, followed by an optional e, followed by s.
*, +, and ? are greedy, and will match as many characters as possible:
    $_ = "Bill xxxxxxxxx Gates";
    s/x+/Cheap/;    # gives: Bill Cheap Gates

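Greediness interacts with where the match starts. The sketch below repeats the slide's x+ substitution and then tries x* on the same string; since x* is allowed to match zero characters, it succeeds at the very first position and never reaches the x's.

```perl
use strict;
use warnings;

$_ = "Bill xxxxxxxxx Gates";
s/x+/Cheap/;            # x+ takes all nine x's at once
my $plus = $_;          # "Bill Cheap Gates"

$_ = "Bill xxxxxxxxx Gates";
s/x*/Cheap/;            # x* matches zero x's at position 0, before the B
my $star = $_;          # "CheapBill xxxxxxxxx Gates"

print "$plus\n$star\n";
```

The leftmost possible match wins first; greediness only decides how long the match is once a starting point has been found.
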
Slide 19 General Multiplier
How do you say “five to ten x’s”?
    /xxxxxx?x?x?x?x?/   # works, but ugly
    /x{5,10}/           # nicer
How do you say “five or more x’s”?
    /x{5,}/
How do you say “exactly five x’s”?
    /x{5}/
How do you say “up to five x’s”?
    /x{0,5}/

Slide 20 General Multiplier
How do you say “c followed by any 5 characters (which can be different) and ending with d”?
    /c.{5}d/
The simple multipliers are special cases of the general one:
    *  is the same as {0,}
    +  is the same as {1,}
    ?  is the same as {0,1}

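A brief check of the general multiplier on a few invented strings. One point worth noticing: without anchors, a pattern such as /x{0,5}/ matches every string, because zero x's can be found anywhere.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @samples = ("xxxx", "xxxxx", "c12345d", "c123d");

my @five_x    = grep { /x{5}/ }   @samples;  # needs five x's in a row
my @c_5_d     = grep { /c.{5}d/ } @samples;  # c, any five characters, d
my @up_to_5_x = grep { /x{0,5}/ } @samples;  # zero x's is enough, so all match

print "x{5}:    @five_x\n";
print "c.{5}d:  @c_5_d\n";
print "x{0,5}:  @up_to_5_x\n";
```
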
Slide 21 Pattern Memory
- How would we match a pattern that starts and ends with the same letter or word?
- For this, we need to remember the pattern.
- Use ( ) around any part of the pattern to put the matching part of the string into memory (the parentheses do not change what the pattern matches).
- To recall memory, include a backslash followed by an integer:
    /Bill(.)Gates\1/

Slide 22 Pattern Memory
Example:
    /Bill(.)Gates\1/
This example matches a string containing Bill, followed by any single non-newline character, followed by Gates, followed by that same single character.
So, it matches:
    Bill!Gates!
    Bill-Gates-
but not:
    Bill?Gates!
    Bill-Gates_
(Note that /Bill.Gates./ would match all four.)

Slide 23 Pattern Memory
More examples:
    /a(.)b(.)c\2d\1/
This example matches an a, followed by any single character (#1), followed by b, another single character (#2), c, character #2 again, d, and character #1 again.
So it matches: a-b!c!d-

Slide 24 Pattern Memory
The remembered part can have more than a single character. For example:
    /a(.*)b\1c/
This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c.
So it matches aBillbBillc and abc, but not aBillbBillGatesc.

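The backreference examples from these slides can be run as a quick check. This sketch simply feeds the slides' own example strings to the two patterns.

```perl
use strict;
use warnings;

my @samples = ("Bill!Gates!", "Bill-Gates_", "aBillbBillc", "abc",
               "aBillbBillGatesc");

# \1 must repeat exactly what the first pair of parentheses matched.
my @same_char = grep { /Bill(.)Gates\1/ } @samples;
my @same_run  = grep { /a(.*)b\1c/ }      @samples;

print "same character: @same_char\n";
print "same sequence:  @same_run\n";
```

In "abc" the parentheses remember the empty string, so \1 also matches nothing and the pattern still succeeds.
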
Slide 25 Alternation
How do we pick from a set of alternatives when the alternatives are longer than one character? Separate them with |. The following example matches either Gates or Clinton or Shakespeare:
    /Gates|Clinton|Shakespeare/
For single-character alternatives, /[abc]/ is the same as /a|b|c/.

Slide 26 Anchoring Patterns
Anchors require that the pattern be at the beginning or end of the line.
^ matches the beginning of the line (only if ^ is the first character of the pattern):
    /^Bill/     # match lines that begin with Bill
    /^Gates/    # match lines that begin with Gates
    /Bill\^/    # match lines containing Bill^ somewhere
    /\^/        # match lines containing ^
$ matches the end of the line (only if $ is the last character of the pattern):
    /Bill$/     # match lines that end with Bill
    /Gates$/    # match lines that end with Gates
    /$Bill/     # match with contents of scalar $Bill
    /\$/        # match lines containing $

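The following sketch applies the anchored and escaped patterns to a few made-up lines, to show which line each one selects.

```perl
use strict;
use warnings;

# Hypothetical lines; single quotes keep $5 from being interpolated.
my @lines = ('Bill Gates', 'Gates, Bill', 'Hello Bill^', 'Cost: $5');

my @begins = grep { /^Bill/ }  @lines;  # Bill at the start of the line
my @ends   = grep { /Bill$/ }  @lines;  # Bill at the end of the line
my @caret  = grep { /Bill\^/ } @lines;  # a literal ^ after Bill
my @dollar = grep { /\$/ }     @lines;  # a literal $ anywhere

print "begins: @begins\n";
print "ends:   @ends\n";
print "caret:  @caret\n";
print "dollar: @dollar\n";
```
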
Slide 27 Precedence
So what happens with the pattern a|b* ? Is this (a|b)* or a|(b*) ?
Precedence of patterns from highest to lowest:
    Name                    Representation
    Parentheses             ( )
    Multipliers             ? + * {m,n}
    Sequence & anchoring    abc ^ $
    Alternation             |
By the table, * has higher precedence than |, so the pattern is interpreted as a|(b*).

Slide 28 Precedence
What if we want the other interpretation in the previous example? Answer: Simple, just use parentheses:
    (a|b)*
- Use parentheses in ambiguous cases to improve clarity, even if not strictly needed.
- When you use parentheses for precedence, they also go into memory (\1, \2, \3).

Slide 29 Precedence
More precedence examples:
    abc*                         # matches ab, abc, abcc, abccc,…
    (abc)*                       # matches "", abc, abcabc, abcabcabc,…
    ^a|b                         # matches a at beginning of line, or b anywhere
    ^(a|b)                       # matches either a or b at the beginning of line
    a|bc|d                       # a, or bc, or d
    (a|b)(c|d)                   # ac, ad, bc, or bd
    (Bill Gates)|(Bill Clinton)  # Bill Gates, Bill Clinton
    Bill (Gates|Clinton)         # Bill Gates, Bill Clinton
    (Mr\. Bill)|(Bill (Gates|Clinton))
                                 # Mr. Bill, Bill Gates, Bill Clinton
    (Mr\. )?Bill( Gates| Clinton)?
                                 # Bill, Mr. Bill, Bill Gates, Bill Clinton,
                                 # Mr. Bill Gates, Mr. Bill Clinton

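To see the difference between a|(b*) and (a|b)*, it helps to force each pattern to cover the whole string. In this sketch the outer parentheses in the first pattern are there only so that ^ and $ apply to both alternatives; the test strings are invented.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @samples = ("a", "bbb", "ab", "ba");

# a|b* : either a single a, or a run of b's
my @either  = grep { /^(a|b*)$/ } @samples;

# (a|b)* : any mixture of a's and b's
my @mixture = grep { /^(a|b)*$/ } @samples;

print "a|b*   : @either\n";
print "(a|b)* : @mixture\n";
```

Only the second pattern accepts "ab" and "ba", since the multiplier there applies to the whole alternation.
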
Slide 30 =~
What if you want to match against a variable other than $_ ? Answer: Use =~.
Examples:
    $name = "Bill Shakespeare";
    $name =~ /^Bill/;     # true
    $name =~ /(.)\1/;     # also true (matches ll)
    if($name =~ /(.)\1/){
        print "$name\n";
    }

Slide 31 =~
An example using =~ to match <STDIN>:
    $ cat match1
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^[yY]/){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1
    Quit (y/n)? y
    Quitting
    $

Slide 32 =~
Another example using =~ to match <STDIN>:
    $ cat match2
    #!/usr/local/bin/perl5 -w
    print "Wakeup (y/n)? ";
    while(<STDIN> =~ /^[nN]/){
        print "Sleeping\n";
        print "Wakeup (y/n)? ";
    }
    $ match2
    Wakeup (y/n)? n
    Sleeping
    Wakeup (y/n)? N
    Sleeping
    Wakeup (y/n)? y
    $

Slide 33 Ignoring Case
In the previous examples, we used [yY] and [nN] to match either upper or lower case. Perl has an “ignore case” option for pattern matching:
    /somepattern/i
    $ cat match1a
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^y/i){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1a
    Quit (y/n)? Y
    Quitting
    $

Slide 34 Slash and Backslash
If your pattern has a slash character ( / ), you must precede each one with a backslash ( \ ):
    $ cat slash1
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ /^\/usr\/local\/bin/){
        print "Path is /usr/local/bin\n";
    }
    $ slash1
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Slide 35 Different Pattern Delimiters
If your pattern has lots of slash characters ( / ), you can also use a different pattern delimiter with the form:
    m#somepattern#
The # can be any non-alphanumeric character.
    $ cat slash1a
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ m#^/usr/local/bin#){
        # another non-alphanumeric delimiter in place of # also works
        print "Path is /usr/local/bin\n";
    }
    $ slash1a
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Slide 36 Special Read-Only Variables
After a successful pattern match, the variables $1, $2, $3,… are set to the same values as \1, \2, \3,… You can use $1, $2, $3,… later in your program.
    $ cat read1
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    /(\w+)\W+(\w+)/;    # match first two words
    # $1 is now "Bill" and $2 is now "Shakespeare"
    print "The first name of $2 is $1\n";
    $ read1
    The first name of Shakespeare is Bill

Slide 37 Special Read-Only Variables
You can also get the values of $1, $2, $3,… by placing the match in a list context. The match then returns the remembered parts as a list:
    $ cat read2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    ($first, $last) = /(\w+)\W+(\w+)/;
    print "The first name of $last is $first\n";
    $ read2
    The first name of Shakespeare is Bill

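List-context matching combines naturally with =~. The sketch below assumes a hypothetical "Last, First" input format; it also shows that a failed match returns an empty list, so the variables stay undefined rather than picking up stale values.

```perl
use strict;
use warnings;

# Hypothetical input in "Last, First" form.
my $entry = "Gates, Bill";
my ($last, $first) = $entry =~ /(\w+),\s*(\w+)/;

# A failed match in list context returns the empty list.
my @none = "no comma here" =~ /(\w+),\s*(\w+)/;

print "The first name of $last is $first\n";
print "captures from failed match: ", scalar(@none), "\n";
```
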
Slide 38 Special Read-Only Variables
Other read-only variables:
    $&  is the part of the string that matched the pattern
    $`  is the part of the string before the match
    $'  is the part of the string after the match
    $ cat read3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    / in /;
    print "Before: $`\n";
    print "Match: $&\n";
    print "After: $'\n";
    $ read3
    Before: Bill Shakespeare
    Match:  in
    After: Love

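Because these variables describe the last successful match, it is safer to read them only inside a test that confirms the match succeeded. A minimal sketch, using the slide's string with a slightly more tolerant pattern (\s+ instead of a single space, which is an assumption of this example):

```perl
use strict;
use warnings;

my $title = "Bill Shakespeare in Love";
my ($before, $match, $after) = ("", "", "");

# Copy the match variables right away, and only when the match succeeded.
if ($title =~ /\s+in\s+/) {
    ($before, $match, $after) = ($`, $&, $');
}

print "[$before][$match][$after]\n";
```

The brackets in the output make the spaces around "in" visible.
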
Slide 39 More on Substitution
If you want to replace all matches instead of just the first match, use the g option for substitution:
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/Bill/William/;
    print "Sub1: $_\n";
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/Bill/William/g;
    print "Sub2: $_\n";
    $ sub3
    Sub1: William Shakespeare in love with Bill Gates
    Sub2: William Shakespeare in love with William Gates
    $

Slide 40 More on Substitution
You can use variable interpolation in substitutions:
    $ cat sub4
    #!/usr/local/bin/perl5 -w
    $find = "Bill";
    $replace = "William";
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/$find/$replace/g;
    print "$_\n";
    $ sub4
    William Shakespeare in love with William Gates
    $

Slide 41 More on Substitution
Pattern characters in the regular expression allow patterns to be matched, not just fixed characters:
    $ cat sub5
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/(\w+)/ /g;
    print "$_\n";
    $ sub5
    $

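A pattern in a substitution becomes more useful when the replacement refers back to what was matched. The sketch below is an illustration, not a copy of sub5: it assumes a replacement that wraps each remembered word in angle brackets using $1.

```perl
use strict;
use warnings;

$_ = "Bill Shakespeare in love with Bill Gates";

# (\w+) remembers each word; $1 in the replacement puts it back,
# here wrapped in angle brackets (an assumed, illustrative format).
s/(\w+)/<$1>/g;

print "$_\n";
```

With the g option the substitution is repeated for every word, and $1 is set afresh each time.
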
Slide 42 More on Substitution
Substitution also allows you to:
- ignore case
- use alternate delimiters
- use =~
    $ cat sub6
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with bill Gates";
    $line =~ s#bill#William#gi;
    $line =~ s#Shakespeare#Gates#;
    print "$line\n";
    $ sub6
    William Gates in love with William Gates
    $

Slide 43 split
The split function allows you to break a string into fields. split takes a regular expression and a string, and breaks up the string wherever the pattern occurs.
    $ cat split1
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with Bill Gates";
    @fields = split(/ /,$line);    # split $line using space as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split1
    Bill love Gates
    $

Slide 44 split
You can use $_ with split. With no arguments, split works on $_ and uses whitespace as the delimiter.
    $ cat split2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    @fields = split;    # split $_ using whitespace (default) as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split2
    Bill love Gates
    $

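The choice of delimiter pattern matters when the input has irregular spacing. In this sketch (the input strings are invented), a single-space pattern produces an empty field and leaves the tab alone, while /\s+/ and the default whitespace mode, written explicitly as ' ', treat any run of whitespace as one delimiter.

```perl
use strict;
use warnings;

# Hypothetical input with two spaces and a tab.
my $line = "Bill  Shakespeare\tin love";

my @single = split(/ /, $line);    # "Bill", "", "Shakespeare\tin", "love"
my @any_ws = split(/\s+/, $line);  # "Bill", "Shakespeare", "in", "love"

# The default mode also ignores leading whitespace.
my @default = split(' ', "  Bill  Shakespeare ");

print scalar(@single), " fields with / /\n";
print scalar(@any_ws), " fields with /\\s+/\n";
print "default: @default\n";
```
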
Slide 45 join
The join function allows you to glue the strings in a list together.
    $ cat join1
    #!/usr/local/bin/perl5 -w
    @list = qw(Bill Shakespeare dislikes Bill Gates);
    $line = join(" ", @list);
    print "$line\n";
    $ join1
    Bill Shakespeare dislikes Bill Gates
    $
Note that the glue string is not a regular expression, just a normal string.

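split and join are often used as a pair: split a record on one delimiter, then join the fields with another. The record below is a made-up example. The second join shows that a glue string such as "." is taken literally, not as a pattern.

```perl
use strict;
use warnings;

# Hypothetical colon-separated record.
my @fields = split(/:/, "bill:x:100:/home/bill");
my $dashed = join("-", @fields);

# "." here is an ordinary period, not "any character".
my $dotted = join(".", qw(a b c));

print "$dashed\n";
print "$dotted\n";
```
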