Recap: Two ways of using regular expressions

Presentation transcript:

1 Recap: Two ways of using regular expressions

Search directly: re.search( regExp, text )
1. Compile regExp to a special format (an SRE_Pattern object)
2. Search for this SRE_Pattern in text
3. Result is an SRE_Match object (or None if nothing matches)

or precompile the expression: compiledRE = re.compile( regExp )
1. Now compiledRE is an SRE_Pattern object
compiledRE.search( text )
2. Use the search method of this SRE_Pattern to search text
3. Result is the same SRE_Match object
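As a minimal sketch of the two approaches: the pattern and text below are hypothetical, and the code uses Python 3 syntax, where recent versions display the two object types as re.Pattern and re.Match rather than SRE_Pattern and SRE_Match.

```python
import re

reg_exp = "ca+t"     # hypothetical pattern: c, one or more a's, t
text = "gcaaatg"     # hypothetical text

# Way 1: search directly; re compiles the pattern behind the scenes.
direct_match = re.search(reg_exp, text)

# Way 2: precompile once, then reuse the pattern object.
compiled_re = re.compile(reg_exp)
precompiled_match = compiled_re.search(text)

# Both calls return a match object, or None when nothing matches.
for match in (direct_match, precompiled_match):
    if match:
        print("Found", match.group(), "at", match.span())
```

Precompiling is mostly a convenience when the same pattern is applied to many texts; for a single search the two forms behave the same.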

2 A few more metacharacters

^ : indicates placement at the beginning of the string
$ : indicates placement at the end of the string

# search for zero or one t, followed by two a's
# at the beginning of the string:
regExp1 = "^t?aa"

# search for g followed by one or more c's followed by a
# at the end of the string:
regExp1 = "gc+a$"

# whole string should match ct followed by zero or more
# g's followed by a:
regExp1 = "^ctg*a$"
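The effect of the two anchors can be checked with a short sketch. The test strings are hypothetical; each pattern is tried once on a string that fits and once on a string that contains the right letters in the wrong place.

```python
import re

# Hypothetical test strings, paired with the pattern they are tried against.
examples = [
    ("^t?aa", "taacg"),     # optional t, then aa, at the start
    ("^t?aa", "ctaacg"),    # aa is present, but not at the start
    ("gc+a$", "ttgcca"),    # g, one or more c's, a, at the end
    ("gc+a$", "gccat"),     # gcca is present, but not at the end
    ("^ctg*a$", "ctggga"),  # the whole string fits the pattern
    ("^ctg*a$", "actga"),   # extra leading a, so no match
]

results = [bool(re.search(reg_exp, text)) for reg_exp, text in examples]
for (reg_exp, text), found in zip(examples, results):
    print(reg_exp, text, "match" if found else "no match")
```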

3 This time we use re.search() to search the text for the regular expressions directly, without compiling them in advance.

Output:
Text1 contains the regular expression ^t?aa
Text1 contains the regular expression gc+a$
Text2 contains the regular expression ^ctg*a$
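The program behind this output is not part of the transcript. A sketch that produces the same three lines could look as follows; the contents of text1 and text2 and the helper function report are assumptions chosen only to reproduce the output.

```python
import re

# Hypothetical texts; the ones used on the slide are not shown.
text1 = "taagcca"
text2 = "ctgga"
patterns = ["^t?aa", "gc+a$", "^ctg*a$"]

def report(name, text):
    """Return one line for every pattern that is found in text."""
    lines = []
    for reg_exp in patterns:
        # Search directly, without compiling the pattern in advance.
        if re.search(reg_exp, text):
            lines.append(name + " contains the regular expression " + reg_exp)
    return lines

output = report("Text1", text1) + report("Text2", text2)
print("\n".join(output))
```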

4 Yet more metacharacters

{} : indicate repetition
| : match either the regular expression to the left or the one to the right
() : indicate a group (a part of a regular expression)

# search for four t's followed by three c's:
regExp1 = "t{4}c{3}"

# search for g followed by 1, 2 or 3 c's at the end of the string:
regExp1 = "gc{1,3}$"

# search for either gg or cc:
regExp1 = "gg|cc"

# search for either gg or cc, followed by tt:
regExp1 = "(gg|cc)tt"
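A small sketch of these metacharacters in action, using hypothetical test strings. The last two searches illustrate why the parentheses matter: without them, | splits the whole pattern into gg on one side and cctt on the other.

```python
import re

# Hypothetical test strings.
four_t_three_c = re.search("t{4}c{3}", "attttccca")   # four t's, three c's
up_to_three_c = re.search("gc{1,3}$", "agcc")         # g plus 1 to 3 c's at the end
too_many_c = re.search("gc{1,3}$", "agcccc")          # four c's before the end
either = re.search("gg|cc", "acca")                   # either gg or cc

# Parentheses limit the reach of |.
grouped = re.search("(gg|cc)tt", "aggtt")
ungrouped = re.search("gg|cctt", "aggtt")

print(four_t_three_c.group(), up_to_three_c.group(), too_many_c, either.group())
print(grouped.group(), grouped.group(1), ungrouped.group())
```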

5 Microsatellites: follow-up on exercise

Microsatellites are small consecutive DNA repeats which are found throughout the genome of organisms ranging from yeasts through to mammals.

AAAAAAAAAAA would be referred to as (A)11
GTGTGTGTGTGT would be referred to as (GT)6
CTGCTGCTGCTG would be referred to as (CTG)4
ACTCACTCACTCACTC would be referred to as (ACTC)4

Microsatellites have high mutation rates and therefore may show high variation between individuals within a species.

Source:

6 Looking for microsatellites

microsatellites.py

Output:
Sequence contains the pattern AA+
Sequence does not contain the pattern GT(GT)+
Sequence contains the pattern CTG(CTG)+
Sequence does not contain the pattern ACTC(ACTC)+
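The source of microsatellites.py is not included in the transcript. The following is only a sketch of how such a script might work: the sequence is hypothetical and was picked so that the four patterns give the same answers as the output above.

```python
import re

# Hypothetical sequence; the one used in microsatellites.py is not shown.
sequence = "GCAAAATCTGCTGCTGACTCGTAC"

# Each pattern asks for at least two consecutive copies of the repeat unit.
patterns = ["AA+", "GT(GT)+", "CTG(CTG)+", "ACTC(ACTC)+"]

report = []
for reg_exp in patterns:
    if re.search(reg_exp, sequence):
        report.append("Sequence contains the pattern " + reg_exp)
    else:
        report.append("Sequence does not contain the pattern " + reg_exp)

print("\n".join(report))
```

A single GT or ACTC in the sequence is not enough, because the group followed by + requires the unit to occur again right after the first copy.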

7 Escaping metacharacters

\ : used to escape a metacharacter ("to take it literally")

# search for x followed by + followed by y:
regExp1 = "x\+y"

# search for ( followed by x followed by y:
regExp1 = "\(xy"

# search for x followed by ? followed by y:
regExp1 = "x\?y"

# search for x followed by at least one ^ followed by 3:
regExp1 = "x\^+3"
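The escaped patterns can be tried out as sketched below, with hypothetical test strings. Two things here are additions that the slide does not mention: the patterns are written as raw strings (r"..."), since newer Python 3 versions warn about sequences such as "\+" in ordinary strings, and re.escape() is shown as a way to add the backslashes automatically.

```python
import re

# Raw strings pass the backslash through to re unchanged.
plus = re.search(r"x\+y", "ax+yb")
paren = re.search(r"\(xy", "f(xy)")
question = re.search(r"x\?y", "x?y")
carets = re.search(r"x\^+3", "x^^^3")
print(plus.group(), paren.group(), question.group(), carets.group())

# Without the backslash, + means "one or more x", so the literal text is not found.
unescaped_literal = re.search("x+y", "x+y")
unescaped_repeat = re.search("x+y", "xxy")
print(unescaped_literal, unescaped_repeat.group())

# re.escape() inserts the backslashes needed to take a string literally.
escaped = re.escape("x+y")
print(escaped, re.search(escaped, "ax+yb").group())
```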

8 Character Classes

A character class matches one of the characters in the class:
[abc] matches either a or b or c.
d[abc]d matches dad and dbd and dcd
[ab]+c matches e.g. ac, abc, bac, bbabaabc, ...

Metacharacter ^ at the beginning negates the character class:
[^abc] matches any character other than a, b and c

A class can use - to indicate a range of characters:
[a-e] is the same as [abcde]

Characters other than ^ and - (and the special characters \ and ]) are taken literally in a class:
[a+b*] matches a or + or b or *
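The character class rules above can be verified with a few one-liners. The test strings are hypothetical.

```python
import re

# d[abc]d accepts dad, dbd and dcd, but not ded.
words = ["dad", "dbd", "dcd", "ded"]
in_class = [w for w in words if re.search("d[abc]d", w)]

repeated = re.search("[ab]+c", "bbabaabc")   # any mix of a's and b's, then c
negated = re.search("[^abc]", "abcx")        # first character that is not a, b or c
in_range = re.findall("[a-e]", "gate")       # letters between a and e
literal = re.findall("[a+b*]", "a+b*c")      # + and * are ordinary characters here

print(in_class, repeated.group(), negated.group(), in_range, literal)
```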

9 Special Sequences

Special sequence: shortcut for a common character class

regExp1 = "\d\d:\d\d:\d\d [AP]M"   # (possibly illegal) time stamps, e.g. 04:23:19 PM
regExp2 =                          # any Danish address
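A short sketch of the time stamp pattern, with hypothetical test strings. It also shows what "possibly illegal" means: the pattern only checks the shape of the text, not whether the numbers form a valid time. The sequences \w (word character) and \s (whitespace) are standard in Python's re module but do not appear in the transcript, so they are included here as an assumption about what the slide covered.

```python
import re

time_stamp = r"\d\d:\d\d:\d\d [AP]M"   # \d stands for a digit

legal = re.search(time_stamp, "logged at 04:23:19 PM")
# Only the shape is checked, so an impossible time is accepted too.
illegal = re.search(time_stamp, "99:99:99 AM")
print(legal.group(), "/", illegal.group())

# \w matches a word character, \s a whitespace character.
words = re.findall(r"\w+", "gc content")
fields = re.split(r"\s+", "a  b\tc")
print(words, fields)
```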

10 Regular expression functions: sub, split, match

11 Output:
*a*b*c*d*e*f
*a*b*c4d5e6f
['', 'a', 'b', 'c', 'd', 'e', 'f']
['1', '2', '3', '4', '5', '6', '']
method search found \db
method match found \da
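The code that produced this output is missing from the transcript. The sketch below reproduces it under the assumption that the input text was "1a2b3c4d5e6f", which is consistent with every output line; the exact calls used on the slide may have differed.

```python
import re

text = "1a2b3c4d5e6f"   # assumed input; the slide's code is not in the transcript

all_replaced = re.sub(r"\d", "*", text)             # replace every digit
first_three = re.sub(r"\d", "*", text, count=3)     # replace only the first three
split_on_digits = re.split(r"\d", text)             # cut the text at each digit
split_on_letters = re.split("[a-f]", text)          # cut the text at each letter
print(all_replaced, first_three, split_on_digits, split_on_letters)

# search looks anywhere in the text; match only looks at the beginning.
found_by_search = re.search(r"\db", text)   # finds 2b in the middle
missed_by_match = re.match(r"\db", text)    # None, the text starts with 1a
found_by_match = re.match(r"\da", text)     # finds 1a at the start
if found_by_search:
    print("method search found \\db")
if found_by_match:
    print("method match found \\da")
```

The empty strings in the split results appear because the text starts with a digit and ends with a letter, so there is nothing before the first cut or after the last one.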

12 Groups

We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object:

text = "But here: what a address"
regExp =     # match Danish address
compiledRE = re.compile( regExp )
SRE_Match = compiledRE.search( text )
if SRE_Match:
    print "Text contains this Danish address:", SRE_Match.group()

Output:
Text contains this Danish address:
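Since the pattern on this slide is missing from the transcript, the sketch below illustrates group() with the time stamp pattern from the Special Sequences slide instead. The text is hypothetical and the code uses Python 3 syntax (print as a function).

```python
import re

text = "meeting at 04:23:19 PM today"   # hypothetical text
compiled_re = re.compile(r"\d\d:\d\d:\d\d [AP]M")

match = compiled_re.search(text)
if match:
    # group() with no argument returns the whole matching substring.
    print("Text contains this time stamp:", match.group())
    # start() and end() tell where in the text the match was found.
    print("Found at positions", match.start(), "to", match.end())
```

The if test matters: search returns None when the pattern is not found, and calling group() on None raises an AttributeError.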

13 The substring that matches the whole RE is called a group
The RE can be subdivided into smaller groups (parts)
Each group of the matching substring can be extracted
Metacharacters ( and ) denote a group

text = "But here: what a address"
# Match any Danish address; define two groups: username and domain:
regExp =
compiledRE = re.compile( regExp )
SRE_Match = compiledRE.search( text )
if SRE_Match:
    print "Text contains this Danish address:", SRE_Match.group()
    print "Username:", SRE_Match.group(1), "\nDomain:", SRE_Match.group(2)

Output:
Text contains this Danish address:
Username: chili
Domain: daimi.au.dk
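The pattern and the address inside the text are missing from the transcript. Judging from the output (username chili, domain daimi.au.dk), the example appears to match an e-mail address ending in .dk. The sketch below rests on that assumption: both the address placed in the text and the pattern are reconstructions for illustration, not the original ones.

```python
import re

# Assumed text and pattern; the originals are missing from the transcript.
text = "But here: chili@daimi.au.dk what a address"
reg_exp = r"([\w.]+)@([\w.]+\.dk)"   # group 1: username, group 2: domain ending in .dk

match = re.compile(reg_exp).search(text)
if match:
    print("Text contains this Danish address:", match.group())   # whole match
    print("Username:", match.group(1))                            # first ( ) pair
    print("Domain:", match.group(2))                              # second ( ) pair
```

Groups are numbered by the position of their opening parenthesis, starting at 1; group() without an argument, or group(0), is the whole match.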

14 Greedy vs. non-greedy operators

+ and * are greedy operators: they attempt to match as many characters as possible
+? and *? are non-greedy operators: they attempt to match as few characters as possible
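The difference shows up as soon as a text offers more than one way to match. A minimal sketch with hypothetical DNA-like strings:

```python
import re

# * takes as many g's as it can; *? takes as few as it can (here: none).
greedy_star = re.search("tg*", "tggg")
lazy_star = re.search("tg*?", "tggg")

# .+ runs on to the last a in the text; .+? stops at the first a it can.
greedy_plus = re.search("t.+a", "ctgactgac")
lazy_plus = re.search("t.+?a", "ctgactgac")

print(greedy_star.group(), lazy_star.group())
print(greedy_plus.group(), lazy_plus.group())
```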

15 # Task: Find a space-separated list of digits, report the first digit.

import re

text = " blah blah"

# use greedy operator +
regExp = "(\d )+"
print "Greedy operator:", re.match( regExp, text ).group()

# use non-greedy version instead (by putting a ? after the +)
regExp = "(\d )+?"
print "Non-greedy operator:", re.match( regExp, text ).group()

Output:
Greedy operator:
Non-greedy operator: 1
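The digits in the text and in the greedy output are missing from the transcript. A runnable version in Python 3 syntax might look like the sketch below, where the input "1 2 3 4 blah blah" is an assumption.

```python
import re

text = "1 2 3 4 blah blah"   # assumed input; the digits are missing in the transcript

# Greedy: repeat the group (digit plus space) as often as possible.
greedy = re.match(r"(\d )+", text)
# Non-greedy: stop after the first repetition.
non_greedy = re.match(r"(\d )+?", text)

print("Greedy operator:", greedy.group())
print("Non-greedy operator:", non_greedy.group())
print("First digit:", non_greedy.group(1).strip())
```

With the greedy operator, group(1) holds only the last repetition of the group, so the non-greedy version is the one that reports the first digit.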