1 Python & Pattern Matching with Regular Expressions (REs) OPIM 101 File:PythonREs.ppt.

Presentation transcript:

1 Python & Pattern Matching with Regular Expressions (REs) OPIM 101 File:PythonREs.ppt

2 Foresight Pattern matching –Literal –With metacharacters Regular expressions (REs) Using REs in Python

3 Consider: dir by Itself
D:\athomepc\day\idt>dir
Volume in drive D has no label
Volume Serial Number is 3E4B-1609
Directory of D:\athomepc\day\idt
:16a :16a..
SPRING~1 PDF 180, :17a spring02idtfront.pdf
SPRING~2 PDF 241, :19a spring02idtpartI.pdf
SPRING~3 PDF 1,246, :20a spring02idtpartII.pdf
SPRING~4 PDF 2,517, :22a spring02idtpartIII.pdf
SPRING~5 PDF 3,469, :24a spring02idtpartIV.pdf
CASE1-~1 DOC 35, :42a case1-python.doc
LECTUR~1 PPT 78, :45a lecture01fall01.ppt
PYTHON~1 PPT 34, :46a Python_Intro.ppt
PYTHON~2 PPT 37, :46a Python_Structures.ppt
LECTUR~2 PPT 154, :51a lecture01spring02.ppt
PYTHON~3 PPT 34, :52a PythonREs.ppt
11 file(s) 8,029,393 bytes
2 dir(s) 1, MB free
D:\athomepc\day\idt>

4 Now: dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc
Volume in drive D has no label
Volume Serial Number is 3E4B-1609
Directory of D:\athomepc\day\idt
CASE1-~1 DOC 35, :42a case1-python.doc
1 file(s) 35,328 bytes
0 dir(s) 1, MB free
D:\athomepc\day\idt>

5 Now: dir with “ * ”
D:\athomepc\day\idt>dir *.doc
Volume in drive D has no label
Volume Serial Number is 3E4B-1609
Directory of D:\athomepc\day\idt
CASE1-~1 DOC 35, :42a case1-python.doc
1 file(s) 35,328 bytes
0 dir(s) 1, MB free
D:\athomepc\day\idt>

6 Literal vs. Pattern Searches dir myfile.doc –Searches literally, for an exact match with “myfile.doc” dir my*.doc –Does a pattern search. Matches to any file beginning with “ my ”, followed by 0 or more characters of any kind, followed by “.doc ”
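The same literal-versus-pattern distinction can be tried from Python. This is a minimal sketch, assuming a made-up list of file names; it uses the standard library's fnmatch module, whose shell-style wildcards only approximate the rules that dir itself applies.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["myfile.doc", "my.doc", "mynotes.doc", "yourfile.doc", "myfile.txt"]

# Literal search: only an exact match counts.
literal_hits = [n for n in names if n == "myfile.doc"]

# Pattern search: "*" stands for zero or more characters of any kind.
pattern_hits = [n for n in names if fnmatchcase(n, "my*.doc")]

print(literal_hits)
print(pattern_hits)
```

Note that "my.doc" is a pattern hit: the "*" is allowed to match zero characters.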

7 MetaCharacters dir treats “ * ” as a metacharacter: a character not taken literally, but as an instruction to match a certain kind of pattern (here: anything). The dir metacharacter scheme is very useful

8 On Beyond * ...and also very primitive and limited. A step up: grep in Unix & Linux; support for RE searches in some text editors, e.g., TextPad. Regular expressions (REs) use a richer language and a larger set of metacharacters, giving us a very powerful capability to extract information (patterns) from text

9 Python’s RE Metacharacters Here’s the complete list: . ^ $ * + ? { } [ ] \ | ( ) No use memorizing. We’ll learn by examples. A natural question: But what if I want to search for a pattern that contains what Python’s RE counts as metacharacters? –Be just a little patient

10 Load Python’s re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie', teststring)
>>> match is None
False
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'
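Because re.search returns None when nothing matches, calling span() on the result is only safe after checking for None. A minimal sketch of that check follows; the helper name find_span and the word 'ennui' are illustrative assumptions, not part of the slides.

```python
import re

teststring = "Television is public anomie number 1."

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

found = find_span('anomie', teststring)    # the word is present
missing = find_span('ennui', teststring)   # the word is absent
print(found)
print(missing)
```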

11 Now a Nonliteral Match
>>> match = re.search('Television', teststring)
>>> match is None
False
>>> match = re.search('television', teststring)
>>> match is None
True
>>> match = re.search('[tT]elevision', teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'

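12 Square Bracket Notation: [...] “ [tT] ” means “any one of the characters ‘ t ’ or ‘ T ’.” [...] is called a character class. Examples: –[abc], [a-z], [A-Z] –[^tT] not t and not T (a ^ negates the class only when it comes first; anywhere else inside the brackets it is a literal ^)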
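A few character classes can be tried against the lecture's test string. This is a sketch only; the variable names and the short string '^x' are assumptions made for illustration. The last two lines show why the position of the caret matters.

```python
import re

teststring = "Television is public anomie number 1."

first_abc = re.search('[abc]', teststring)     # first 'a', 'b', or 'c'
first_upper = re.search('[A-Z]', teststring)   # first uppercase letter
first_not_t = re.search('[^tT]', teststring)   # first character that is not t or T
first_digit = re.search('[0-9]', teststring)   # first digit

# A caret negates only in first position. In '[^t^T]' the second caret is a
# literal, so that class also excludes the '^' character itself.
caret_kept = re.search('[^tT]', '^x').group()
caret_skipped = re.search('[^t^T]', '^x').group()

print(first_abc.group(), first_abc.start())
print(first_upper.group(), first_not_t.group(), first_digit.group())
print(caret_kept, caret_skipped)
```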

13 Not Example ^
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^tT][a-z]+', teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
Note:
+ means “one or more of the previous”
* means “zero or more”
? means “zero or one”
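The three quantifiers in the note can be compared side by side. A small sketch, assuming three made-up sample strings; re.fullmatch is used so that each pattern has to account for the whole string.

```python
import re

# Hypothetical sample strings with zero, one, and two 'a' characters.
samples = ["ct", "cat", "caat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

one_or_more = matching('ca+t')    # + : one or more 'a'
zero_or_more = matching('ca*t')   # * : zero or more 'a'
zero_or_one = matching('ca?t')    # ? : zero or one 'a'

print(one_or_more)
print(zero_or_more)
print(zero_or_one)
```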

14 '\s\w+\.' and '\s(\w+)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search(r'\s\w+\.', teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search(r'\s(\w+)\.', teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
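Parentheses create a group, and the text a group captured can be read directly with group() instead of slicing by span. A minimal sketch on the same test string; the variable names are illustrative.

```python
import re

teststring = "Television is public anomie number 1."

match = re.search(r'\s(\w+)\.', teststring)

whole = match.group(0)       # the entire match, including the space and the period
inner = match.group(1)       # only what the parentheses captured
inner_span = match.span(1)   # where group 1 sits in the original string

print(repr(whole), repr(inner), inner_span)
```

Group 0 always refers to the whole match; numbered groups start at 1, counted by opening parenthesis.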

15 [.] == \.
Inside [...] most metacharacters are taken literally
–So, [.] == \.
Note (again): [...] is called a character class
>>> match = re.search(r'\s(\w+)[.]', teststring)
>>> match.span()
(34, 37)
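This also answers the earlier question about searching for text that contains metacharacters: escape them with a backslash, put them in a character class, or let re.escape do the escaping. The sketch below assumes a made-up price string; re.escape is a standard re function, but the slides themselves do not mention it.

```python
import re

# Hypothetical text containing the metacharacters '$', '.', '(' and ')'.
price = "Total: $4.50 (approx.)"
literal = "$4.50"

# Unescaped, '$' means "end of string" and '.' means "any character",
# so the pattern cannot match here.
unescaped = re.search(literal, price)

# re.escape puts a backslash in front of each metacharacter.
escaped = re.search(re.escape(literal), price)

# A backslash and a one-character class are two spellings of a literal dot.
with_backslash = re.search(r'4\.50', price)
with_class = re.search(r'4[.]50', price)

print(unescaped)
print(escaped.group(), escaped.span())
print(with_backslash.span() == with_class.span())
```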

16 Avoiding Greed ?
>>> newstring = ' '
>>> newstring = newstring+' '
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+' '
>>> newstring
' (As of 10:55 AM on 12/20/01) '
>>> match = re.search(' ', newstring)
>>> match.span()
(0, 81)
>>> match = re.search(' ', newstring)
>>> match.group()
' '
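The greedy versus non-greedy contrast can be sketched with a stand-in string. The HTML fragment and both patterns below are assumptions made for illustration, not the markup or patterns used on the slide. A quantifier such as + is greedy and takes as much as it can; adding ? after it makes it stop at the earliest point that still lets the match succeed.

```python
import re

# Hypothetical HTML fragment (not the slide's original string).
html = '<td><i>(As of 10:55 AM on 12/20/01)</i></td>'

# Greedy: '.+' runs to the last '>' in the string.
greedy = re.search('<.+>', html)

# Non-greedy: '.+?' stops at the first '>' it can.
lazy = re.search('<.+?>', html)

print(greedy.span(), len(html))
print(lazy.group())
```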

17 More on Not Being Greedy
>>> match = re.search(r' (.+)</(\1)', newstring)
>>> match.groups()
('d', ' (As of 10:55 AM on 12/20/01) ', 'd')
>>> match = re.search(r' ([^<]+)</(\1)', newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
\1 is called a backreference. It refers to group 1
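A backreference can pair an opening tag with its matching closing tag. This is a sketch under the same assumption as before: the HTML fragment and the patterns are hypothetical, not the ones from the slide. Using [^<]+ in place of .+ is another way to avoid greed, since it cannot run past the next tag.

```python
import re

# Hypothetical HTML fragment (not the slide's original string).
html = '<td><i>(As of 10:55 AM on 12/20/01)</i></td>'

# Group 1 captures the tag name; \1 requires the same name in the closing tag.
# '.+' is greedy, so the outermost pair of tags is found.
outer = re.search(r'<(\w+)>(.+)</\1>', html)

# '[^<]+' cannot cross a '<', so only the innermost pair can match.
inner = re.search(r'<(\w+)>([^<]+)</\1>', html)

print(outer.groups())
print(inner.groups())
```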

18 Concluding REs are a very powerful tool, very often very useful The language notation is compact and a bit hard to read Practice, study the examples, don’t worry about memorization.

19 Advice on Scripting Scripting, and programming in general, is a process Successful scripts don’t spring into existence whole –Scripts built in small increments Attend to: –Decomposition –Stories –Testing

20 Advice on Scripting Decomposition –Solve big problems by decomposing them into small problems and solving them Stories –Scripting/programming as a form of literature –Use comments with code to tell a clear story about what the code is or should be doing Testing –Everything, whole and part, often, varying inputs
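One way the three pieces of advice might look in practice is sketched below. The function names, patterns, and test inputs are all illustrative assumptions: the problem is decomposed into two small functions, the docstrings and comments tell the story, and the assertions test each part and the whole with varying inputs.

```python
import re

def find_time(text):
    """Return the first hh:mm time in text, or None if there is none."""
    match = re.search(r'\d{1,2}:\d{2}', text)
    return match.group() if match else None

def find_date(text):
    """Return the first mm/dd/yy date in text, or None if there is none."""
    match = re.search(r'\d{1,2}/\d{1,2}/\d{2}', text)
    return match.group() if match else None

def describe(text):
    """Combine the small parts into the answer to the bigger problem."""
    return {"time": find_time(text), "date": find_date(text)}

# Testing: each part, then the whole, with varying inputs.
stamp = "(As of 10:55 AM on 12/20/01)"
assert find_time(stamp) == "10:55"
assert find_date(stamp) == "12/20/01"
assert find_time("no time here") is None
assert describe(stamp) == {"time": "10:55", "date": "12/20/01"}
print(describe(stamp))
```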

21 Readings IDT book, chapter 8, “Text and Pattern Processing” Further information (but beyond the scope of 101) –The Python online documentation on the re module –“Regular Expression HOWTO” by A.M. Kuchling at http://py-howto.sourceforge.net/regex/regex.html