Python regular expressions

Slides:



Advertisements
Similar presentations
Regular Expressions Pattern and Match objects Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Python regular expressions. “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”
Learning Ruby Regular Expressions Get at practice page by logging on to csilm.usu.edu and selecting PROGRAMMING LANGUAGES|Regular Expressions.
Python: Regular Expressions
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.
Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code.
Regular Expressions in Java. Namespace in XML Transparency No. 2 Regular Expressions Regular expressions are an extremely useful tool for manipulating.
Regular Expressions in Java. Regular Expressions A regular expression is a kind of pattern that can be applied to text ( String s, in Java) A regular.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.
1 Overview Regular expressions Notation Patterns Java support.
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
Regular Expression A regular expression is a template that either matches or doesn’t match a given string.
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Text Parsing in Python - Gayatri Nittala - Gayatri Nittala - Madhubala Vasireddy - Madhubala Vasireddy.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
Methods in Computational Linguistics II with reference to Matt Huenerfauth’s Language Technology material Lecture 4: Matching Things. Regular Expressions.
Faculty of Sciences and Social Sciences HOPE JavaScript Validation Regular Expression Stewart Blakeway FML
Programming for Linguists An Introduction to Python 24/11/2011.
Introduction to Computing Using Python Regular expressions Suppose we need to find all addresses in a web page How do we recognize addresses?
1 An Introduction to Python Part 3 Regular Expressions for Data Formatting Jacob Morgan Brent Frakes National Park Service Fort Collins, CO April, 2008.
Hossain Shahriar Announcement and reminder! Tentative date for final exam need to be fixed! Topics to be covered in this lecture(s)
Regular Expressions Chapter 11 Python for Informatics: Exploring Information
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
Python Regular Expressions Easy text processing. Regular Expression  A way of identifying certain String patterns  Formally, a RE is:  a letter or.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
Programming in Perl regular expressions and m,s operators Peter Verhás January 2002.
BY Sandeep Kumar Gampa.. What is Regular Expression? Regex in.NET Regex Language Elements Examples Regular Expression API How to Test regex in.NET Conclusion.
Regular Expressions – An Overview Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in.
Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.
 2002 Prentice Hall. All rights reserved. 1 Chapter 13 – String Manipulation and Regular Expressions Outline 13.1 Introduction 13.2 Fundamentals of Characters.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expression What is Regex? Meta characters Pattern matching Functions in re module Usage of regex object String substitution.
Module 6 – Generics Module 7 – Regular Expressions.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
Python for NLP Regular Expressions CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Regular Expressions The ultimate tool for textual analysis.
CIT 383: Administrative ScriptingSlide #1 CIT 383: Administrative Scripting Regular Expressions.
CompSci 6 Introduction to Computer Science November 8, 2011 Prof. Rodger.
CompSci 101 Introduction to Computer Science November 18, 2014 Prof. Rodger.
Java Script Pattern Matching Using Regular Expressions.
An Introduction to Regular Expressions Specifying a Pattern that a String must meet.
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
Python Pattern Matching and Regular Expressions Peter Wad Sackett.
CompSci 101 Introduction to Computer Science April 7, 2015 Prof. Rodger.
STATIC CODE ANALYSIS. OUTLINE  INTRODUCTION  BACKGROUND o REGULAR EXPRESSIONS o SYNTAX TREES o CONTROL FLOW GRAPHS  TOOLS AND THEIR WORKING  ERROR.
RE Tutorial.
Regular Expressions Upsorn Praphamontripong CS 1110
Strings and Serialization
CompSci 101 Introduction to Computer Science
Lecture 19 Strings and Regular Expressions
Concepts of Programming Languages
CSC 594 Topics in AI – Natural Language Processing
Regular Expressions Chapter 11 Python for Everybody
CSC 594 Topics in AI – Natural Language Processing
Strings and Lists – the split method
CSCI 431 Programming Languages Fall 2003
CS 1111 Introduction to Programming Fall 2018
CIT 383: Administrative Scripting
Regular Expressions in Java
Regular Expressions in Java
CSCE 590 Web Scraping Lecture 4
Nate Brunelle Today: Regular Expressions
Nate Brunelle Today: Regular Expressions
Regular Expression: Pattern Matching
Python regular expressions
Presentation transcript:

Python regular expressions

“Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.” -- Jamie Zawinski http://www.jwz.org/

Regular Expressions Regular expressions are a powerful string manipulation tool All modern languages have similar library packages for regular expressions Use regular expressions to: Search a string (search and match) Replace parts of a string (sub) Break strings into smaller pieces (split)

Python’s Regular Expression Syntax Most characters match themselves The regular expression “test” matches the string ‘test’, and only that string [x] matches any one of a list of characters “[abc]” matches ‘a’,‘b’,or ‘c’ [^x] matches any one character that is not included in x “[^abc]” matches any single character except ‘a’,’b’,or ‘c’

Python’s Regular Expression Syntax “.” matches any single character Parentheses can be used for grouping “(abc)+” matches ’abc’, ‘abcabc’, ‘abcabcabc’, etc. x|y matches x or y “this|that” matches ‘this’ and ‘that’, but not ‘thisthat’.

Python’sRegular Expression Syntax x* matches zero or more x’s “a*” matches ’’, ’a’, ’aa’, etc. x+ matches one or more x’s “a+” matches ’a’,’aa’,’aaa’, etc. x? matches zero or one x’s “a?” matches ’’ or ’a’ x{m, n} matches i x‘s, where m<i< n “a{2,3}” matches ’aa’ or ’aaa’

Regular Expression Syntax “\d” matches any digit; “\D” any non-digit “\s” matches any whitespace character; “\S” any non-whitespace character “\w” matches any alphanumeric character; “\W” any non-alphanumeric character “^” matches the beginning of the string; “$” the end of the string “\b” matches a word boundary; “\B” matches a character that is not a word boundary

Search and Match The two basic functions are re.search and re.match Search looks for a pattern anywhere in a string Match looks for a match staring at the beginning Both return None (logical false) if the pattern isn’t found and a “match object” instance if it is >>> import re >>> pat = "a*b” >>> re.search(pat,"fooaaabcde") <_sre.SRE_Match object at 0x809c0> >>> re.match(pat,"fooaaabcde") >>>

Q: What’s a match object? A: an instance of the match class with the details of the match result >>> r1 = re.search("a*b","fooaaabcde") >>> r1.group() # group returns string matched 'aaab' >>> r1.start() # index of the match start 3 >>> r1.end() # index of the match end 7 >>> r1.span() # tuple of (start, end) (3, 7)

What got matched? Here’s a pattern to match simple email addresses \w+@(\w+\.)+(com|org|net|edu) >>> pat1 = "\w+@(\w+\.)+(com|org|net|edu)" >>> r1 = re.match(pat,"finin@cs.umbc.edu") >>> r1.group() 'finin@cs.umbc.edu’ We might want to extract the pattern parts, like the email name and host

What got matched? We can put parentheses around groups we want to be able to reference >>> pat2 = "(\w+)@((\w+\.)+(com|org|net|edu))" >>> r2 = re.match(pat2,"finin@cs.umbc.edu") >>> r2.group(1) 'finin' >>> r2.group(2) 'cs.umbc.edu' >>> r2.groups() r2.groups() ('finin', 'cs.umbc.edu', 'umbc.', 'edu’) Note that the ‘groups’ are numbered in a preorder traversal of the forest

What got matched? We can ‘label’ the groups as well… >>> pat3 ="(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))" >>> r3 = re.match(pat3,"finin@cs.umbc.edu") >>> r3.group('name') 'finin' >>> r3.group('host') 'cs.umbc.edu’ And reference the matching parts by the labels

More re functions re.split() is like split but can use patterns >>> re.split("\W+", “This... is a test, short and sweet, of split().”) ['This', 'is', 'a', 'test', 'short’, 'and', 'sweet', 'of', 'split’, ‘’] re.sub substitutes one string for a pattern >>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes') 'black socks and black shoes’ re.findall() finds al matches >>> re.findall("\d+”,"12 dogs,11 cats, 1 egg") ['12', '11', ’1’]

Compiling regular expressions If you plan to use a re pattern more than once, compile it to a re object Python produces a special data structure that speeds up matching >>> capt3 = re.compile(pat3) >>> cpat3 <_sre.SRE_Pattern object at 0x2d9c0> >>> r3 = cpat3.search("finin@cs.umbc.edu") >>> r3 <_sre.SRE_Match object at 0x895a0> >>> r3.group() 'finin@cs.umbc.edu'

Pattern object methods Pattern objects have methods that parallel the re functions (e.g., match, search, split, findall, sub), e.g.: >>> p1 = re.compile("\w+@\w+\.+com|org|net|edu") >>> p1.match("steve@apple.com").group(0) 'steve@apple.com' >>> p1.search(”Email steve@apple.com today.").group(0) 'steve@apple.com’ >>> p1.findall("Email steve@apple.com and bill@msft.com now.") ['steve@apple.com', 'bill@msft.com’] >>> p2 = re.compile("[.?!]+\s+") >>> p2.split("Tired? Go to bed! Now!! ") ['Tired', 'Go to bed', 'Now', ’ '] email address sentence boundary

Example: pig latin Rules If word starts with consonant(s) Move them to the end, append “ay” Else word starts with vowel(s) Keep as is, but add “zay” How might we do this? http://cs.umbc.edu/courses/331/current/code/python/pig.py

([bcdfghjklmnpqrstvwxyz]+)(\w+) The pattern ([bcdfghjklmnpqrstvwxyz]+)(\w+)

piglatin.py import re pat = ‘([bcdfghjklmnpqrstvwxyz]+)(\w+)’ cpat = re.compile(pat) def piglatin(string): return " ".join( [piglatin1(w) for w in string.split()] )

piglatin.py def piglatin1(word): """Returns the pig latin form of a word. e.g.: piglatin1("dog”) => "ogday". """ match = cpat.match(word) if match: consonants = match.group(1) rest = match.group(2) return rest + consonents + “ay” else: return word + "zay"