Python regular expressions. “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”

Slides:



Advertisements
Similar presentations
Introduction to LISP Programming of Pathway Tools Queries and Updates.
Advertisements

Regular Expressions Pattern and Match objects Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Python 3 March 15, NLTK import nltk nltk.download()
Learning Ruby Regular Expressions Get at practice page by logging on to csilm.usu.edu and selecting PROGRAMMING LANGUAGES|Regular Expressions.
Regular Expression ASCII Converting. Regular Expression Regular Expression is a tool to check if a string matches some rules. It is a very complicated.
1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.
Regular Expressions in Java. Regular Expressions A regular expression is a kind of pattern that can be applied to text ( String s, in Java) A regular.
PERL Part 3 1.Subroutines 2.Pattern matching and regular expressions.
1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.
1 Overview Regular expressions Notation Patterns Java support.
Introduction to regular expression. Wéber André Objective of the training Scope of the course  We will present what are “regular expressions”
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
Regular Expression A regular expression is a template that either matches or doesn’t match a given string.
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Text Parsing in Python - Gayatri Nittala - Gayatri Nittala - Madhubala Vasireddy - Madhubala Vasireddy.
Lecture 7: Perl pattern handling features. Pattern Matching Recall =~ is the pattern matching operator A first simple match example print “An methionine.
Regular Expressions Dr. Ralph D. Westfall May, 2011.
Practical Extraction & Report Language PERL Joseph Beltran.
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
Methods in Computational Linguistics II with reference to Matt Huenerfauth’s Language Technology material Lecture 4: Matching Things. Regular Expressions.
Regular Expressions in.NET Ashraya R. Mathur CS NET Security.
Faculty of Sciences and Social Sciences HOPE JavaScript Validation Regular Expression Stewart Blakeway FML
Programming for Linguists An Introduction to Python 24/11/2011.
Introduction to Computing Using Python Regular expressions Suppose we need to find all addresses in a web page How do we recognize addresses?
H3D API Training  Part 3.1: Python – Quick overview.
ASP.NET Programming with C# and SQL Server First Edition Chapter 5 Manipulating Strings with C#
Strings The Basics. Strings can refer to a string variable as one variable or as many different components (characters) string values are delimited by.
CIS 451: Regular Expressions Dr. Ralph D. Westfall January, 2009.
Regular Expressions Chapter 11 Python for Informatics: Exploring Information
8 Copyright © 2006, Oracle. All rights reserved. Regular Expression Support.
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
Python Regular Expressions Easy text processing. Regular Expression  A way of identifying certain String patterns  Formally, a RE is:  a letter or.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
Programming in Perl regular expressions and m,s operators Peter Verhás January 2002.
BY Sandeep Kumar Gampa.. What is Regular Expression? Regex in.NET Regex Language Elements Examples Regular Expression API How to Test regex in.NET Conclusion.
Examples of comparing strings. “ABC” = “ABC”? yes “ABC” = “ ABC”? No! note the space up front “ABC” = “abc” ? No! Totally different letters “ABC” = “ABCD”?
Regular Expressions – An Overview Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in.
Simple Functions and Names Sec 9-4 Web Design. Objectives The student will: Know how to create a simple function in Python. Know how to call a function.
Chapter 9: Perl (continue) Advanced Perl Programming Some materials are taken from Sams Teach Yourself Perl 5 in 21 Days, Second Edition.
Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.
 2002 Prentice Hall. All rights reserved. 1 Chapter 13 – String Manipulation and Regular Expressions Outline 13.1 Introduction 13.2 Fundamentals of Characters.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Module 6 – Generics Module 7 – Regular Expressions.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
Python for NLP Regular Expressions CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Regular Expressions The ultimate tool for textual analysis.
Regular Expressions CS 2204 Class meeting 6 Created by Doug Bowman, 2001 Modified by Mir Farooq Ali, 2002.
CIT 383: Administrative ScriptingSlide #1 CIT 383: Administrative Scripting Regular Expressions.
CompSci 6 Introduction to Computer Science November 8, 2011 Prof. Rodger.
Python – May 16 Recap lab Simple string tokenizing Random numbers Tomorrow: –multidimensional array (list of list) –Exceptions.
Introduction to Programming the WWW I CMSC Winter 2004 Lecture 13.
Michael Kovalchik CS 265, Fall  Parenthesis group parts of expressions together  “/CS265|CS270/” => “/CS(265|270)/”  Groups can be nested  “/Perl|Pearl/”
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
STATIC CODE ANALYSIS. OUTLINE  INTRODUCTION  BACKGROUND o REGULAR EXPRESSIONS o SYNTAX TREES o CONTROL FLOW GRAPHS  TOOLS AND THEIR WORKING  ERROR.
CSE 374 Programming Concepts & Tools
Regular Expressions Upsorn Praphamontripong CS 1110
CS 330 Class 7 Comments on Exam Programming plan for today:
CSC 594 Topics in AI – Natural Language Processing
Python regular expressions
CSC 594 Topics in AI – Natural Language Processing
Strings and Lists – the split method
Regular Expressions
CS 1111 Introduction to Programming Fall 2018
CIT 383: Administrative Scripting
String Processing 1 MIS 3406 Department of MIS Fox School of Business
Regular Expressions in Java
Nate Brunelle Today: Regular Expressions
Nate Brunelle Today: Regular Expressions
Nate Brunelle Today: Regular Expressions
Python regular expressions
Presentation transcript:

Python regular expressions

“Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.” -- Jamie Zawinski

Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have similar library packages for regular expressions  Use regular expressions to: Search a string ( search and match) Replace parts of a string (sub) Break strings into smaller pieces (split)

Python’s Regular Expression Syntax  Most characters match themselves The regular expression “test” matches the string ‘test’, and only that string  [x] matches any one of a list of characters “[abc]” matches ‘a’,‘b’, or ‘c’  [^x] matches any one character that is not included in x “[^abc]” matches any single character except ‘a’,’b’, or ‘c’

Python’s Regular Expression Syntax  “.” matches any single character  Parentheses can be used for grouping “(abc)+” matches ’abc’, ‘abcabc’, ‘abcabcabc’, etc.  x|y matches x or y “this|that” matches ‘this’ and ‘that’, but not ‘thisthat’.

Python’sRegular Expression Syntax  x* matches zero or more x’s “a*” matches ’’, ’a’, ’aa’, etc.  x+ matches one or more x’s “a+” matches ’a’, ’aa’, ’aaa’, etc.  x? matches zero or one x’s “a?” matches ’’ or ’a’  x{m, n} matches i x‘s, where m<i< n “a{2,3}” matches ’aa’ or ’aaa’

Regular Expression Syntax  “\d” matches any digit; “\D” any non-digit  “\s” matches any whitespace character; “\S” any non-whitespace character  “\w” matches any alphanumeric character; “\W” any non-alphanumeric character  “^” matches the beginning of the string; “$” the end of the string  “\b” matches a word boundary; “\B” matches a character that is not a word boundary

Search and Match  The two basic functions are re.search and re.match Search looks for a pattern anywhere in a string Match looks for a match staring at the beginning  Both return None (logical false) if the pattern isn’t found and a “match object” instance if it is >>> import re >>> pat = "a*b” >>> re.search(pat,"fooaaabcde") >>> re.match(pat,"fooaaabcde") >>>

Q: What’s a match object?  A: an instance of the match class with the details of the match result >>> r1 = re.search("a*b","fooaaabcde") >>> r1.group() # group returns string matched 'aaab' >>> r1.start() # index of the match start 3 >>> r1.end() # index of the match end 7 >>> r1.span() # tuple of (start, end) (3, 7)

What got matched?  Here’s a pattern to match simple addresses >>> pat1 = >>> r1 = >>> r1.group()  We might want to extract the pattern parts, like the name and host

What got matched?  We can put parentheses around groups we want to be able to reference >>> pat2 = >>> r2 = >>> r2.group(1) 'finin' >>> r2.group(2) 'cs.umbc.edu' >>> r2.groups() r2.groups() ('finin', 'cs.umbc.edu', 'umbc.', 'edu’)  Note that the ‘groups’ are numbered in a preorder traversal of the forest

What got matched?  We can ‘label’ the groups as well… >>> pat3 ="(?P (\w+\.)+(com|org|net| edu))" >>> r3 = >>> r3.group('name') 'finin' >>> r3.group('host') 'cs.umbc.edu’  And reference the matching parts by the labels

More re functions  re.split() is like split but can use patterns >>> re.split("\W+", “This... is a test, short and sweet, of split().”) ['This', 'is', 'a', 'test', 'short’, 'and', 'sweet', 'of', 'split’, ‘’]  re.sub substitutes one string for a pattern >>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes') 'black socks and black shoes’  re.findall() finds al matches >>> re.findall("\d+”,"12 dogs,11 cats, 1 egg") ['12', '11', ’1’]

Compiling regular expressions  If you plan to use a re pattern more than once, compile it to a re object  Python produces a special data structure that speeds up matching >>> capt3 = re.compile(pat3) >>> cpat3 >>> r3 = >>> r3 >>> r3.group()

Pattern object methods Pattern objects have methods that parallel the re functions (e.g., match, search, split, findall, sub), e.g.: >>> p1 = >>> >>> p1.search(” today.").group(0) >>> p1.findall(" and now.") >>> p2 = re.compile("[.?!]+\s+") >>> p2.split("Tired? Go to bed! Now!! ") ['Tired', 'Go to bed', 'Now', ’ '] address sentence boundary

Example: pig latin  Rules If word starts with consonant(s) — Move them to the end, append “ay” Else word starts with vowel(s) — Keep as is, but add “zay” How might we do this?

The pattern ([bcdfghjklmnpqrstvwxyz]+)(\w+)

piglatin.py import re pat = ‘([bcdfghjklmnpqrstvwxyz]+)(\w+)’ cpat = re.compile(pat) def piglatin(string): return " ".join( [piglatin1(w) for w in string.split()] )

piglatin.py def piglatin1(word): """Returns the pig latin form of a word. e.g.: piglatin1("dog”) => "ogday". """ match = cpat.match(word) if match: consonants = match.group(1) rest = match.group(2) return rest + consonents + “ay” else: return word + "zay"