Presentation is loading. Please wait.

Presentation is loading. Please wait.

Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.

Similar presentations


Presentation on theme: "Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have."— Presentation transcript:

1 Regular Expressions Regular Expressions

2 Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have similar library packages for regular expressions  Use regular expressions to: Search a string ( search and match)Search a string ( search and match) Replace parts of a string (sub)Replace parts of a string (sub) Break stings into smaller piece (split)Break stings into smaller piece (split)

3 Regular Expression Python Syntax  Most characters match themselves The regular expression “test” matches the string ‘test’, and only that string  [x] matches any one of a list of characters “[abc]” matches ‘a’,‘b’, or ‘c’  [^x] matches any one character that is not included in x “[^abc]” matches any single character except ‘a’,’b’, or ‘c’

4 Regular Expressions Syntax  “.” matches any single character  Parentheses can be used for grouping “(abc)+” matches ’abc’, ‘abcabc’, ‘abcabcabc’, etc.  x|y matches x or y “this|that” matches ‘this’ or ‘that’, but not ‘thisthat’.

5 Regular Expression Syntax  x* matches zero or more x’s “a*” matches ’’, ’a’, ’aa’, etc.  x+ matches one or more x’s “a+” matches ’a’, ’aa’, ’aaa’, etc.  x? matches zero or one x’s “a?” matches ’’ or ’a’. “a?” matches ’’ or ’a’.  x{m, n} matches i x’s, where m<i< n “a{2,3}” matches ’aa’ or ’aaa’

6 Regular Expression Syntax  “\d” matches any digit; “\D” matches any non-digit  “\s” matches any whitespace character; “\S” matches any non-whitespace character  “\w” matches any alphanumeric character; “\W” matches any non-alphanumeric character  “^” matches the beginning of the string; “$” matches the end of the string  “\b” matches a word boundary; “\B” matches position that is not a word boundary

7 Search and Match  The two basic functions are re.search and re.match Search looks for a pattern anywhere in a stringSearch looks for a pattern anywhere in a string Match looks for a match staring at the beginningMatch looks for a match staring at the beginning  Both return None if the pattern is not found (logical false) and a “match object” if it is true >>> pat = "a*b" >>> import re >>> re.search(pat,"fooaaabcde") >>> re.match(pat,"fooaaabcde")

8 Python’s raw string notation  Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.  Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical: >>> re.match(r"\\", r"\\") >>> re.match("\\\\", r"\\")

9 Search example import re programming = ["Python", "Perl", "PHP", "C++"] pat = "^B|^P|i$|H$" for lang in programming: if re.search(pat,lang,re.IGNORECASE): print lang, "FOUND" else: print lang, "NOT FOUND" The output of above script will be: Python FOUND Perl FOUND PHP FOUND C++ NOT FOUND

10 Q: What’s a match object?  An instance of the match class with the details of the match result pat = "a*b" >>> r1 = re.search(pat,"fooaaabcde") >>> r1.group() # group returns string matched 'aaab' >>> r1.start() # index of the match start 3 >>> r1.end() # index of the match end 7 >>> r1.span() # tuple of (start, end) (3, 7)

11 What got matched?  Here’s a pattern to match simple email addresses \w+@(\w+\.)+(com|org|net|edu) \w+@(\w+\.)+(com|org|net|edu) >>> pat1 = "\w+@(\w+\.)+(com|org|net|edu)" >>> r1 = re.match(pat1,"finin@cs.umbc.edu") >>> r1.group() 'finin@cs.umbc.edu’  We might want to extract the pattern parts, like the email name and host

12 What got matched?  We can put parentheses around groups we want to be able to reference >>> pat2 = "(\w+)@((\w+\.)+(com|org|net|edu))" >>> r2 = re.match(pat2,"finin@cs.umbc.edu") >>> r2.groups() r2.groups() ('finin', 'cs.umbc.edu', 'umbc.', 'edu’) >>> r2.group(1) 'finin' >>> r2.group(2) 'cs.umbc.edu'  Note that the ‘groups’ are numbered in a preorder traversal of the forest

13 What got matched?  We can ‘label’ the groups as well… >>> pat3 ="(?P \w+)@(?P (\w+\.)+(com|org|net|edu))" >>> r3 = re.match(pat3,"finin@cs.umbc.edu") >>> r3.group('name') 'finin' >>> r3.group('host') 'cs.umbc.edu’  And reference the matching parts by the labels

14 Pattern object methods  There are methods defined for a pattern object that parallel the regular expression functions, e.g., Match Match  Search Search  splitsplit findallfindall subsub

15 More re functions  re.split() is like split but can use patterns >>> re.split("\W+", “This... is a test, short and sweet, of split().”) ['This', 'is', 'a', 'test', 'short’, 'and', 'sweet', 'of', 'split’, ‘’] ['This', 'is', 'a', 'test', 'short’, 'and', 'sweet', 'of', 'split’, ‘’]  >>>re.split('[a-f]+', '0a3B9‘, flags=re.IGNORECASE) ['0', '3', '9'] ['0', '3', '9']  re.sub substitutes one string for a pattern >>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes') 'black socks and black shoes’  re.findall() finds all matches >>> re.findall("\d+”,"12 dogs,11 cats, 1 egg") ['12', '11', ’1’]  findall With Files f = open('test.txt', 'r') # it returns a list of all the found strings strings = re.findall('some pattern', f.read()) f = open('test.txt', 'r') # it returns a list of all the found strings strings = re.findall('some pattern', f.read())

16 Compiling regular expressions re.compile  If you plan to use a re pattern more than once, compile it to a re object  Python produces a special data structure that speeds up matching >>> capt3 = re.compile(pat3) >>> cpat3 >>> r3 = cpat3.search("finin@cs.umbc.edu") >>> r3 >>> r3.group() 'finin@cs.umbc.edu'

17 Example: pig latin  Rules If word starts with consonant(s)If word starts with consonant(s)  Move them to the end, append “ay” Else word starts with vowel(s)Else word starts with vowel(s)  Keep as is, but add “zay” How might we do this?How might we do this?

18 The pattern ([bcdfghjklmnpqrstvwxyz]+) (\w+)

19 piglatin.py import re pat = ‘([bcdfghjklmnpqrstvwxyz]+)(\w+)’ cpat = re.compile(pat) def piglatin(string): return " ".join( [piglatin1(w) for w in string.split()] ) return " ".join( [piglatin1(w) for w in string.split()] )

20 piglatin.py def piglatin1(word): match = cpat.match(word) match = cpat.match(word) if match: if match: consonants = match.group(1) consonants = match.group(1) rest = match.group(2) rest = match.group(2) return rest + consonents + “ay” return rest + consonents + “ay” else: else: return word + "zay" return word + "zay"

21 Exercises  Write a python program using regexp to validate an ip address : (eg. 172.1.2.200)  Write a regexp to validate your USN.  Find Email Domain in Address E.g 'My name is Ram, and blahblah@gmail.com is my email.‘ and program should return @gmail. E.g 'My name is Ram, and blahblah@gmail.com is my email.‘ and program should return @gmail.  Write a program to validate name and phone number using re. It will continue to ask until you put correct data only. (eg.Phone number: (800) 555.1212 #1234. Use re.compile  Define a simple "spelling correction" function correct() that takes a string and sees to it that 1) two or more occurrences of the space character is compressed into one, and 2) inserts an extra space after a period if the period is directly followed by a letter. (use regular expression) E.g. correct("This is very funny and cool.Indeed!") E.g. correct("This is very funny and cool.Indeed!") should return "This is very funny and cool. Indeed!" should return "This is very funny and cool. Indeed!"  Find all five characters long words in a sentence


Download ppt "Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have."

Similar presentations


Ads by Google