Presentation is loading. Please wait.

Presentation is loading. Please wait.

Regular Expressions: Searching strings for patterns April 24, 2008 Copyright 2004-8, Andy Packard and Trent Russi. This work is licensed under the Creative.

Similar presentations


Presentation on theme: "Regular Expressions: Searching strings for patterns April 24, 2008 Copyright 2004-8, Andy Packard and Trent Russi. This work is licensed under the Creative."— Presentation transcript:

1 Regular Expressions: Searching strings for patterns April 24, Copyright , Andy Packard and Trent Russi. This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

2 Introduction Programming tools that can search (long) strings for specific characters and patterns Unix: grep, awk, sed Perl, a language (originally) devoted to text and file manipulation Matlab: regexp, regexpi, regexprep Terminology: Regular expression: a formula for matching strings that follow a specified pattern, made up with the following constructs Character expression: formula to match a single character with prescribed properties Modifier: the number of times the character expression is applied Quantifier: greedy or lazy LookAround operator: specify patterns that must occur before/after the match Positional operators: “or” Grouping operators: Tokens:

3 Character expression patterns
ce = ’[WxYz6]’ % any one of the 5 listed chars ce = ’[m-y]’ % any one of m, n, o, ... x, y ce = ’[2-8]’ % any one of 2, 3, 4, ... 7, 8 ce = ’.’ % any single character ce = ’[^ctwH]’ % any character except c, t, w, H ce = ’\s’ % any whitespace character ce = ’\w’ % any “word” char, [a-zA-Z_0-9] ce = ’\d’ % any numeric digit ce = ’\t’ % horizontal tab character ce = ’\n’ % newline character >> ce = ’and’ % the 3 character sequence >> ce = ’f or’; % the 4 character: f-space-o-r >> ce = ’Mr\.’ % the 3 character sequence ’Mr.’ >> ce = ’g.[tp]’ % g, any single char, t or p Single character classes mulltiple character patterns

4 Character modifiers ce = ’z?’ % zero or one z
ce = ’Y*’ % zero or more Y ce = ’D+’ % one or more D ce = ’p{2}’ % two p (ie., pp) ce = ’r{3,}’ % at least 3 r (rrr, or rrrr, or...) ce = ’H{2,4}’ % HH or HHH or HHHH ce = ’r.{3}f’ % r, any 3 characters, f ce = ’Hill?ary’; % Hillary or Hilary ce = ’Mrs?\.’ % ’Mr.’ or ’Mrs.’ (includes .) ce = ’ \d{4} ’ % space, 4 digits, space ce = ’ab+’ % a followed by 1 or more b ce = ’(i[^i])+’ % one or more sequences of i not-I

5 Greedy/Lazy str = ’I wished I was living in Mississippi’;
ce1 = ’i[^i]+’ % i followed by one or more not-i regexp(str,ce1,’match’) Note: behavior is greedy. Tries to match as much of str that is possible (i, then many not-i, until it encounters an i) ce2 = ’i[^i]+?’ % ? quantifies the + as “lazy” regexp(str,ce2,’match’) Note: now behavior is lazy – each match as short as possible

6 LookAround Operators LookAhead Negative LookAhead LookBehind
ce = ’Andy(?= Packard)’ match ’Andy’ only if followed by ’space-Packard’ Negative LookAhead ce = ’UC(?!LA)’ match ’UC’ only if not followed by ’LA’ LookBehind ce = ’(?<=Hewlet?t ?)Packard’ match ’Packard’ only if preceded by ’Hewlett’ or ’Hewlet’, with or without a space at end Negative LookBehind ce = ’(?<!flexible )deadline’ match ’deadline’ only if not preceded by ’flexible ’

7 Positional Operators Beginning of the string: End of the string:
ce = ’^To be’ match ’To be’ only if it is at the beginning of the string End of the string: ce = ’(\d+)$’ match one or more digits at the end of the string Beginning of a word: ce = ’\<ard’ match ’ard’ only if it is at the beginning of a word End of a word: ce = ’ard\>’ Match ’ard’ only if it is at the end of a word Exact word: ce = ’\<part\>’ Match ’part’ only if exactly matches a word

8 Logical Operator ‘OR’ operator: ce = ’\<Jan|\<Feb’
match ’Jan’ at the beginning of a word, or ’Feb’ at the beginning a word

9 Tokens We have seen that parenthesis can be used to group patterns, and establish precedence. Parenthesis also implicitly define variables (called tokens) which can be used in the character expression. Match two 4-digit strings separated by characters ce1 = ’\d{4}.*\d{4}’ Match 4 digit string that occurs twice w/ characters in between ce2 = ’(\d{4}).*\1’ str = ’the year 1984 was unlike book 1984’; str2 = ’the year 2008 is more like 1984 book’ regexp(str,ce1,’match’), regexp(str,ce2,’match’) regexp(str2,ce1,’match’), regexp(str2,ce2,’match’) For a match to occur, the \1 character expression is whatever was found in the ()

10 Tokens Match 4 digit string that occurs twice w/ characters in between
ce = ’(\d{4})(.{3}).*\2\1’ str = ’year 1999Feb had Feb1999 stuff’; [m,t] = regexp(str,ce,’match’,’tokens’); m -> {’1999Feb had Feb1999’} t -> {’1999’ ’Feb’}


Download ppt "Regular Expressions: Searching strings for patterns April 24, 2008 Copyright 2004-8, Andy Packard and Trent Russi. This work is licensed under the Creative."

Similar presentations


Ads by Google