1.5 Regular Expressions (REs) Compiler Structures 242-437, Semester 2, 2018-2019 1.5 Regular Expressions (REs) Objectives what is a regular expression? give examples of REs used in the grep command
1. Regular Expressions A regular expression (RE or regex) is a pattern used to match against text when searching inside a file. Regexs are used everywhere in Linux: Editors: ed, ex, vi Utilities: grep, egrep, sed, and awk
String Regex c k s UNIX Tools rocks. UNIX Tools sucks. regex pattern UNIX Tools rocks. text: match UNIX Tools sucks. text: match text: UNIX Tools is okay. no match
Multiple Matches a p p l e Scrapple from the apple. A regex pattern can match text in more than one place. a p p l e regex pattern Scrapple from the apple. text: match 1 match 2
The . (dot) Regex o . For me to poop on. The . regex pattern can be used to match any character in the text. o . regex pattern For me to poop on. text: match 1 match 2
The Character Class Regex A character class [] can match any set of characters in the text. b [eor] a t regex pattern beat a brat on a boat text: match 1 match 2 match 3
Character Class Examples
Repetition Regex: * (star) The * defines zero or more copies of the letter before it. y a * y regex pattern I got mail, yaaaaaaaaaay! text: match
o a * o I like the zoo. h . * o Say hello Andrew. regex pattern text: match h . * o regex pattern Say hello Andrew. text: match
h . * o regex pattern Say hello to Andrew. text: match Regex are greedy – they match as much of the text as they can.
Anchors: ^ $ ^ b [eor] a t beat a brat on a boat b [eor] a t $ regex pattern ^ matches the beginning of the text line beat a brat on a boat text: match b [eor] a t $ text: regex pattern $ matches the end of the text line beat a brat on a boat match
More Anchors
The | (or) Regex
More Repetition Regexs: * + ?
More Regex Operations See the regular expressions "cheat-sheet" at the course website over 80 operators!!
2. grep “grep” uses a regex pattern to search a text file Examples: all the lines containing a match (or matches) are printed Examples: % grep "root" test1 % grep "r..t" test1 % grep "ro*t" test1 % grep "r[a-z]*t" test1 regex pattern in "..." text filename
The Grep Family grep usual version egrep extended REs | + ? don’t need backslash) fgrep only strings, i.e. is faster
Common “grep” Options -c Print a count of matched lines. -i Ignore uppercase/ lowercase -l List filenames that contain matches -n Print matched lines and line numbers -s Work silently; only display error messages. -v Print lines that do not match the pattern.
Some Simple Examples grep searches input lines, a line at a time. If the line contains a string that matches grep's RE (pattern), then the line is output. input lines (e.g. from a file) output matching lines (e.g. to a file) grep "RE" hello andy my name is andy my bye byhe continued
Examples "|" means "or" continued grep "and" grep -E "an|my" hello andy my name is andy my bye byhe hello andy my name is andy grep -E "an|my" hello andy my name is andy my bye byhe hello andy my name is andy my bye byhe "|" means "or" continued
"*" means "0 or more" grep "hel*" hello andy my name is andy my bye byhe hello andy my bye byhe "*" means "0 or more"
grep with \< \> begin and end of word Look for the word "north"
grep with a\|b a or b egrep doesn't need backslash
grep with \+ one or more egrep doesn't need backslash
grep with . any character egrep doesn't need backslash
grep with ^ and $ begin and end of line
grep with [ ] set of chars
Fun with a Linux Dictionary Find the location of the words file List all the words containing "hh"
Look for "niether" or "neither" Look for words with three "u"s Count the words with three "a"s
Complex Regex Examples Variable names in C [a-zA-Z_][a-zA-Z_0-9]* Dollar amount with optional cents \$[0-9]+(\.[0-9][0-9])? Time of day (1[012]|[1-9]):[0-5][0-9] (am|pm) HTML headers <h1> <H1> <h2> … <[hH][1-4]>
3. The RE Language A RE can be defined as a pattern language (operands and operators) which matches on text strings.
Some Possible RE Operands text characters (e.g. ‘a’, ‘1’, ‘(‘) the symbol e (means an empty string ‘’) in code just use "" variables, which can be assigned a RE variable = RE
The Basic RE Operators There are three basic operators: union ‘|’ concatenation closure *
Union S | T use S or T to match strings Example REs: a | b a | b | c
Concatenation S T Example REs: use S followed by the T to match against strings Example REs: a b matches the string "ab" w | (a b) matches the strings "w" or "ab"
Closure S* Example RE: use S 0 or more times to match against strings a* matches the strings: e, a, aa, aaa, aaaa, aaaaa, ... empty string
3.1. REs for C Identifiers We define two RE variables, letter and digit: letter = A | B | C | D ... Z | a | b | c | d .... z digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 id is defined using letter and digit: id = letter ( letter | digit )* continued
Strings matched by id include: ab345 w h5g Strings not matched: 2 $abc ****
3.2. REs for Integers and Floats We redefine digit: digit = 0|1|2|3|4|5|6|7|8|9 or digit = [1 – 9] int and float: int = {digit}+ float = {digit}+ "." {digit}+
Integers and floats with exponents: number = {digit}+ ('.' {digit}+ )? ( 'E'('+'|'-')? {digit}+ )?
4. More on REs See RE summary on the course website: regular_expressions_cheat_sheet.pdf I have the standard RE book: Mastering Regular Expressions Jeffrey E. F. Freidl O'Reilly & Associates continued
There are many websites that explain REs: http://etext.lib.virginia.edu/services/ helpsheets/unix/regex.html http://www.zytrax.com/tech/web/regex.htm http://www.regular-expressions.info