1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.

Slides:



Advertisements
Similar presentations
Regular Expressions Pattern and Match objects Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Python regular expressions. “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”
Regular Expressions (in Python). Python or Egrep We will use Python. In some scripting languages you can call the command “grep” or “egrep” egrep pattern.
1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.
7 Searching and Regular Expressions (Regex) Mauro Jaskelioff.
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 3: 8/28.
1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.
Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code.
CS 497C – Introduction to UNIX Lecture 31: - Filters Using Regular Expressions – grep and sed Chin-Chih Chang
LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.
1 A pair of sometimes useful functions Function ord returns a character’s ordinance / character code (Unicode) Function chr returns the character with.
LING 388: Language and Computers Sandiway Fong Lecture 3: 8/28.
8.1 Last time on: Pattern Matching. 8.2 Finding a sub string (match) somewhere: if ($line =~ m/he/)... remember to use slash( / ) and not back-slash Will.
Using regular expressions Search for a single occurrence of a specific string. Search for all occurrences of a string. Approximate string matching.
Introduction to regular expression. Wéber André Objective of the training Scope of the course  We will present what are “regular expressions”
Regular Expressions. String Matching The problem of finding a string that “looks kind of like …” is common  e.g. finding useful delimiters in a file,
Regular Expressions A regular expression defines a pattern of characters to be found in a string Regular expressions are made up of – Literal characters.
Last Updated March 2006 Slide 1 Regular Expressions.
Tutorial 14 Working with Forms and Regular Expressions.
Regular Expressions Dr. Ralph D. Westfall May, 2011.
Overview of the grep Command Alex Dukhovny CS 265 Spring 2011.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
System Programming Regular Expressions Regular Expressions
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
Computer Programming for Biologists Class 5 Nov 20 st, 2014 Karsten Hokamp
Introduction to Computing Using Python Regular expressions Suppose we need to find all addresses in a web page How do we recognize addresses?
Regular Expressions Regular expressions are a language for string patterns. RegEx is integral to many programming languages:  Perl  Python  Javascript.
Python Regular Expressions Easy text processing. Regular Expression  A way of identifying certain String patterns  Formally, a RE is:  a letter or.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
I/O Redirection and Regular Expressions February 9 th, 2004 Class Meeting 4.
Regular Expressions in PHP. Supported RE’s The most important set of regex functions start with preg. These functions are a PHP wrapper around the PCRE.
Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.
 2002 Prentice Hall. All rights reserved. 1 Chapter 13 – String Manipulation and Regular Expressions Outline 13.1 Introduction 13.2 Fundamentals of Characters.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
Regular Expression What is Regex? Meta characters Pattern matching Functions in re module Usage of regex object String substitution.
Module 6 – Generics Module 7 – Regular Expressions.
Python for NLP Regular Expressions CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Regular Expressions in Perl CS/BIO 271 – Introduction to Bioinformatics.
©Brooks/Cole, 2001 Chapter 9 Regular Expressions ( 정규수식 )
©Brooks/Cole, 2001 Chapter 9 Regular Expressions.
CS346 Regular Expressions1 Pattern Matching Regular Expression.
GREP. Whats Grep? Grep is a popular unix program that supports a special programming language for doing regular expressions The grammar in use for software.
20-753: Fundamentals of Web Programming 1 Lecture 10: Server-Side Scripting II Fundamentals of Web Programming Lecture 10: Server-Side Scripting II.
CS 330 Programming Languages 10 / 02 / 2007 Instructor: Michael Eckmann.
I/O Redirection & Regular Expressions CS 2204 Class meeting 4 *Notes by Doug Bowman and other members of the CS faculty at Virginia Tech. Copyright
CSC 2720 Building Web Applications PHP PERL-Compatible Regular Expressions.
Regular Expressions CS 2204 Class meeting 6 Created by Doug Bowman, 2001 Modified by Mir Farooq Ali, 2002.
Standard Types and Regular Expressions CS 480/680 – Comparative Languages.
NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. ADVANCED.
Python – May 16 Recap lab Simple string tokenizing Random numbers Tomorrow: –multidimensional array (list of list) –Exceptions.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
Pattern Matching: Simple Patterns. Introduction Programmers often need to scan a file, directory, etc. for a specific substring. –Find all files that.
CS 330 Programming Languages 09 / 30 / 2008 Instructor: Michael Eckmann.
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
Regular Expressions In Javascript cosc What Do They Do? Does pattern matching on text We use the term “string” to indicate the text that the regular.
RE Tutorial.
CSE 374 Programming Concepts & Tools
Regular Expressions Upsorn Praphamontripong CS 1110
CS 330 Class 7 Comments on Exam Programming plan for today:
Regular Expressions ICCM 2017
CSC1018F: Regular Expressions
Looking for Patterns - Finding them with Regular Expressions
CSC 594 Topics in AI – Natural Language Processing
CSC 458– Predictive Analytics I, Fall 2017, Intro. To Python
CSC 594 Topics in AI – Natural Language Processing
CSC 458– Predictive Analytics I, Fall 2018, Intro. To Python
CS 1111 Introduction to Programming Fall 2018
Regular Expressions and Grep
CSCI The UNIX System Regular Expressions
Presentation transcript:

1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search for this SRE_Pattern in text 3.Result is an SRE_Match object or precompile the expression: compiledRE = re.compile( regExp) 1.Now compiledRE is an SRE_Pattern object compiledRE.search( text ) 2.Use search method in this SRE_Pattern to search text 3.Result is same SRE_Match object

2 ^ : (caret) indicates the beginning of the string $ : indicates the end of the string # search for zero or one t, followed by two a’s # at the beginning of the string: regExp1 = “^t?aa“ # search for g followed by one or more c’s followed by a # at the end of the string: regExp1 = “gc+a$“ # whole string should match ct followed by zero or more # g’s followed by a: regExp1 = “^ctg*a$“ A few more metacharacters

3 Text1 contains the regular expression ^t?aa Text1 contains the regular expression gc+a$ Text2 contains the regular expression ^ctg*a$ This time we use re.search() to search the text for the regular expressions directly without compiling them in advance

4 {} : indicate repetition | : match either regular expression to the left or to the right () : indicate a group (a part of a regular expression) # search for four t’s followed by three c’s: regExp1 = “t{4}c{3}“ # search for g followed by 1, 2 or 3 c’s: regExp1 = “gc{1,3}$“ # search for either gg or cc: regExp1 = “gg|cc“ # search for either gg or cc followed by tt: regExp1 = “(gg|cc)tt“ Yet more metacharacters..

5 \ : used to escape a metacharacter (“to take it literally”) # search for x followed by + followed by y: regExp1 = “x\+y“ # search for ( followed by x followed by y: regExp1 = “\(xy“ # search for x followed by ? followed by y: regExp1 = “x\?y“ # search for x followed by at least one ^ followed by 3: regExp1 = “x\^+3“ Escaping metacharacters

6 Microsatellites: follow-up on exercise Microsatellites are small consecutive DNA repeats which are found throughout almost all genomes AAAAAAAAAAA would be referred to as (A) 11 GTGTGTGTGTGT would be referred to as (GT) 6 CTGCTGCTGCTG would be referred to as (CTG) 4 ACTCACTCACTCACTC would be referred to as (ACTC) 4 Microsatellites have high mutation rates and therefore may show high variation between individuals within a species.

7 Looking for microsatellites Sequence contains the pattern AA+ Sequence does not contain the pattern GT(GT)+ Sequence contains the pattern CTG(CTG)+ Sequence does not contain the pattern ACTC(ACTC)+ microsatellites.py

8 Character Classes A character class matches one of the characters in the class: [abc] matches either a or b or c. d[abc]d matches dad and dbd and dcd [ab]+c matches e.g. ac, abc, bac, bbabaabc,.. Metacharacter ^ at beginning negates character class: [^abc] matches any character other than a, b and c A class can use – to indicate a range of characters: [a-e] is the same as [abcde] Characters except ^ and – are taken literally in a class: [a+b*] matches a or + or b or *

9 Common character classes regExp1 = “\d\d:\d\d:\d\d [AP]M” # (possibly illegal) time stamps 04:23:19 PM regExp2 = # any Danish address Backslash necessary Inside character class: backslash not necessary

Regular expression functions sub, split regExpfunctions.py *a*b*c*d*e*f *a*b*c4d5e6f ['', 'a', 'b', 'c', 'd', 'e', 'f'] ['1', '2', '3', '4', '5', '6', '']

11 Recall the trypsin exercise: If you put ()’s around delimiter pattern, delimiters are returned also regExpsplit.py [‘DCQ’, ‘R’, ‘VYAPFM’, ‘K’, ‘LIHDQWGWDYNNWTSM’, ‘K’, ‘GDA’, ‘R’, ‘EILIMPFCQWTSPF’, ‘R’, ‘NMGCHV’]

12 The group method We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object: text = "But here: what a address“ regExp = # match Danish address compiledRE = re.compile( regExp) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish address:", SRE_Match.group() Text contains this Danish address:

13 The RE can be subdivided into smaller groups (parts) Each group of the matching substring can be extracted Metacharacters ( and ) denote a group text = "But here: what a address“ # Match any Danish address; define two groups: username and domain: regExp = compiledRE = re.compile( regExp ) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish address:", SRE_Match.group() print “Username:”, SRE_Match.group(1), “\nDomain:”, SRE_Match.group(2) danish_ address_groups.py Text contains this Danish address: Username: chili Domain: daimi.au.dk

14 Greedy vs. non-greedy operators + and * are greedy operators – They attempt to match as many characters as possible +? and *? are non-greedy operators – They attempt to match as few characters as possible

15 nongreedy.py ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG ATGCGACTGACTCGTAG

16 Non-greedy NB: won’t skip a match to get to a shorter match: Search string read left to right, first match is reported >>> import re >>> s = “aa---aa----bb” >>> re.search(“aa.*?bb”, s).group() ‘aa---aa----bb’

17 Extract today’s exercises from course calendar

18 Extract today’s exercises from course calendar Mon 30/10 Practical matters. Introduction to Unix and Python and the dynamic interpreter. Read: Emacs tricks, Course notes chapters 1 and 2. File organization, Pattern counting using Unix commands, Python mode, Math functions, Math functions in a file Thu 2/11 Nov 3rd > String formatting, string methods, if/else, while loops, for loops, functions,.. Read: Course notes, chapters 3 and 4. Idea: Get some date from user; then.. 1)Extract entry for date 2)Extract all exercises in this entry

19 Extract today’s exercises from course calendar 30/10 Practical matters. Introduction to Unix and Python and the dynamic interpreter. Read: Emacs tricks, Course notes chapters 1 and 2. File organization, Pattern counting using Unix commands, Python mode, Math functions, Math functions in a file Regular expressions: 1)r“\b%s\b.*?/tr>”%date ( DOTALL mode) Word boundary necessary, otherwise “4/1” is matched inside “24/11”. Hence raw string. DOTALL mode necessary to have. match across several lines. > necessary to avoid matching exercise names containing tr (e.g. the trypsin exercise) Use non-greedy version to avoid matching several date entries

20 Extract today’s exercises from course calendar 30/10 Practical matters. Introduction to Unix and Python and the dynamic interpreter. Read: Emacs tricks, Course notes chapters 1 and 2. File organization, Pattern counting using Unix commands, Python mode, Math functions, Math functions in a file Regular expressions: 1)r“\b%s\b.*?/tr>”%date ( DOTALL mode) 2)r”(Exercises/\w+\.html).*?\b([ \w]+)” Use non-greedy version in case several exercises on same line

21 regex_webpage.py Several subgroups in pattern: findall returns list of subgroup tuples

22 Trial runs

23.. on to the exercises