1 Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search.


1 Recap: Two ways of using regular expressions
Search directly: re.search( regExp, text )
  1. Compiles regExp to a special format (an SRE_Pattern object)
  2. Searches for this SRE_Pattern in text
  3. Result is an SRE_Match object (or None if nothing matches)
or precompile the expression: compiledRE = re.compile( regExp )
  1. Now compiledRE is an SRE_Pattern object
  2. compiledRE.search( text ) uses the search method of this SRE_Pattern to search text
  3. Result is the same SRE_Match object
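The two routes in the recap can be compared side by side. The following is a minimal sketch written for Python 3; the pattern and the text are hypothetical values chosen for illustration. In recent Python 3 versions the objects print as re.Pattern and re.Match rather than SRE_Pattern and SRE_Match, but they play the same roles.

```python
import re

reg_exp = r"t?aa"   # hypothetical pattern, borrowed from the next slide
text = "ctaagt"     # hypothetical text

# Way 1: search directly; re.search compiles the pattern behind the scenes.
direct_match = re.search(reg_exp, text)

# Way 2: compile once, then call search on the compiled pattern object.
compiled_re = re.compile(reg_exp)
compiled_match = compiled_re.search(text)

# Both routes report the same matched substring and position.
print(direct_match.group(), direct_match.span())
print(compiled_match.group(), compiled_match.span())
```

Precompiling mainly pays off when the same expression is applied to many texts, since the compiled object can be reused.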

2 A few more metacharacters
^ : indicates placement at the beginning of the string
$ : indicates placement at the end of the string
  # search for zero or one t, followed by two a's
  # at the beginning of the string:
  regExp1 = "^t?aa"
  # search for g followed by one or more c's followed by a
  # at the end of the string:
  regExp1 = "gc+a$"
  # whole string should match ct followed by zero or more
  # g's followed by a:
  regExp1 = "^ctg*a$"

3 This time we use re.search() to search the text for the regular expressions directly, without compiling them in advance
  Text1 contains the regular expression ^t?aa
  Text1 contains the regular expression gc+a$
  Text2 contains the regular expression ^ctg*a$
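The transcript does not show the program or the two texts behind this output, so the sketch below (Python 3) uses hypothetical sequences text1 and text2 that happen to give the same three hits. It only illustrates how ^ and $ restrict where a match may occur.

```python
import re

# Hypothetical sequences; the slides do not show Text1 and Text2.
text1 = "taagtcgcca"
text2 = "ctggga"

patterns = [r"^t?aa", r"gc+a$", r"^ctg*a$"]

# results[(name, pattern)] is True when the pattern is found in that text.
results = {}
for name, text in [("Text1", text1), ("Text2", text2)]:
    for pattern in patterns:
        found = re.search(pattern, text) is not None
        results[(name, pattern)] = found
        if found:
            print(name, "contains the regular expression", pattern)
```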

4 Yet more metacharacters..
{} : indicate repetition
| : match either the regular expression to the left or the one to the right
() : indicate a group (a part of a regular expression)
  # search for four t's followed by three c's:
  regExp1 = "t{4}c{3}"
  # search for g followed by 1, 2 or 3 c's at the end of the string:
  regExp1 = "gc{1,3}$"
  # search for either gg or cc:
  regExp1 = "gg|cc"
  # search for either gg or cc followed by tt:
  regExp1 = "(gg|cc)tt"
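A short sketch (Python 3) of the four expressions above applied to hypothetical test strings. The last two rows are an added illustration of why the parentheses matter: without them, | splits the whole expression into gg on one side and cctt on the other.

```python
import re

# (pattern, hypothetical text, expected outcome)
checks = [
    (r"t{4}c{3}", "aattttcccg", True),
    (r"gc{1,3}$", "aagccc", True),
    (r"gc{1,3}$", "aagcccc", False),  # four c's before the end: no match
    (r"gg|cc", "atcca", True),
    (r"(gg|cc)tt", "aggtta", True),
    (r"(gg|cc)tt", "agga", False),    # gg is present, but tt does not follow
    (r"gg|cctt", "agga", True),       # without (), gg alone is enough
]

outcomes = []
for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    outcomes.append(found == expected)
    print(pattern, "in", text, "->", found)
```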

5 Microsatellites: follow-up on exercise
Microsatellites are small consecutive DNA repeats which are found throughout the genome of organisms ranging from yeasts through to mammals.
  AAAAAAAAAAA would be referred to as (A)11
  GTGTGTGTGTGT would be referred to as (GT)6
  CTGCTGCTGCTG would be referred to as (CTG)4
  ACTCACTCACTCACTC would be referred to as (ACTC)4
Microsatellites have high mutation rates and therefore may show high variation between individuals within a species.
Source: http://www.amonline.net.au/evolutionary_biology/tour/microsatellites.htm

6 Looking for microsatellites (microsatellites.py)
  Sequence contains the pattern AA+
  Sequence does not contain the pattern GT(GT)+
  Sequence contains the pattern CTG(CTG)+
  Sequence does not contain the pattern ACTC(ACTC)+
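The file microsatellites.py itself is not part of the transcript. The sketch below (Python 3) is one possible way to produce output of this shape; the sequence and the helper find_repeat are hypothetical, and the repeat count is an extra that follows the (A)11 notation from the previous slide.

```python
import re

# Hypothetical sequence; the one used on the slide is not shown.
sequence = "TTAAAACGCTGCTGCTGACTCA"

# Repeat unit -> pattern requiring at least two consecutive copies.
patterns = {
    "A": r"AA+",
    "GT": r"GT(GT)+",
    "CTG": r"CTG(CTG)+",
    "ACTC": r"ACTC(ACTC)+",
}

def find_repeat(unit, pattern, seq):
    """Return (matched text, number of repeats), or None if there is no match."""
    match = re.search(pattern, seq)
    if match is None:
        return None
    return match.group(), len(match.group()) // len(unit)

results = {}
for unit, pattern in patterns.items():
    results[unit] = find_repeat(unit, pattern, sequence)
    if results[unit]:
        print("Sequence contains the pattern", pattern)
    else:
        print("Sequence does not contain the pattern", pattern)
```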

7 Escaping metacharacters
\ : used to escape a metacharacter ("to take it literally")
  # search for x followed by + followed by y:
  regExp1 = "x\+y"
  # search for ( followed by x followed by y:
  regExp1 = "\(xy"
  # search for x followed by ? followed by y:
  regExp1 = "x\?y"
  # search for x followed by at least one ^ followed by 3:
  regExp1 = "x\^+3"
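A small sketch (Python 3) of the escaped patterns above on hypothetical strings. It uses raw strings (the r prefix) so that the backslashes reach the re module unchanged, and it adds re.escape, a standard-library helper not mentioned on the slide, which inserts the backslashes automatically.

```python
import re

# (pattern, hypothetical text, expected outcome)
checks = [
    (r"x\+y", "ax+yb", True),
    (r"x+y", "ax+yb", False),  # unescaped +: one or more x directly followed by y
    (r"\(xy", "f(xy)", True),
    (r"x\?y", "x?y", True),
    (r"x\^+3", "x^^3", True),  # escaped ^, then + meaning one or more of it
]

outcomes = []
for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    outcomes.append(found == expected)
    print(pattern, "in", text, "->", found)

# re.escape builds a pattern that matches the given string literally.
literal_match = re.search(re.escape("x+y"), "ax+yb")
print(literal_match.group())
```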

8 Character Classes
A character class matches one of the characters in the class:
  [abc] matches either a or b or c
  d[abc]d matches dad and dbd and dcd
  [ab]+c matches e.g. ac, abc, bac, bbabaabc, ...
Metacharacter ^ at the beginning negates the character class:
  [^abc] matches any character other than a, b and c
A class can use - to indicate a range of characters:
  [a-e] is the same as [abcde]
Characters except ^ and - (and the backslash and ]) are taken literally in a class:
  [a+b*] matches a or + or b or *
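The class rules above can be checked directly. This sketch (Python 3) uses re.fullmatch, which requires the whole string to match; the test strings beyond those named on the slide are hypothetical.

```python
import re

# (pattern, text, expected outcome when the WHOLE text must match)
checks = [
    (r"d[abc]d", "dad", True),
    (r"d[abc]d", "dcd", True),
    (r"d[abc]d", "ded", False),   # e is not in the class
    (r"[ab]+c", "bbabaabc", True),
    (r"[ab]+c", "c", False),      # + needs at least one a or b
    (r"[^abc]", "x", True),
    (r"[^abc]", "a", False),      # negated class excludes a
    (r"[a-e]", "e", True),
    (r"[a-e]", "f", False),       # outside the range
    (r"[a+b*]", "+", True),       # + is literal inside a class
    (r"[a+b*]", "*", True),       # so is *
]

outcomes = []
for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    outcomes.append(matched == expected)
    print(pattern, "matches", text, "->", matched)
```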

9 Special Sequences
Special sequence: shortcut for a common character class
  regExp1 = "\d\d:\d\d:\d\d [AP]M"  # (possibly illegal) time stamps, e.g. 04:23:19 PM
  regExp2 = "\w+@[\w.]+\.dk"        # any Danish email address
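A brief sketch (Python 3) of the two expressions above. The input strings are hypothetical. The second time stamp shows what "possibly illegal" means: \d accepts any digit, so the pattern checks the shape of the stamp, not whether the time exists.

```python
import re

time_re = re.compile(r"\d\d:\d\d:\d\d [AP]M")
email_re = re.compile(r"\w+@[\w.]+\.dk")

# A well-formed time stamp inside a longer, hypothetical line of text.
legal = time_re.search("Logged at 04:23:19 PM today")
print(legal.group())

# Same shape, impossible time: the pattern still matches.
illegal = time_re.search("99:99:99 AM")
print(illegal.group())

# The email pattern only accepts addresses ending in .dk.
danish = email_re.search("mail chili@daimi.au.dk now")
other = email_re.search("mail user@example.com now")
print(danish.group(), other)
```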

10 Regular expression functions: sub, split, match (regExpfunctions.py)

11
  *a*b*c*d*e*f
  *a*b*c4d5e6f
  ['', 'a', 'b', 'c', 'd', 'e', 'f']
  ['1', '2', '3', '4', '5', '6', '']
  method search found \db
  method match found \da
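The source of regExpfunctions.py is not included in the transcript. The sketch below (Python 3) is a reconstruction under assumptions: the input string 1a2b3c4d5e6f and the individual calls are inferred from the output above, not taken from the original file. It shows sub replacing matches, split cutting the text at matches, and the difference between search (anywhere in the text) and match (only at the start).

```python
import re

text = "1a2b3c4d5e6f"  # assumed input, inferred from the output on the slide

# sub: replace every digit, then only the first three digits.
all_replaced = re.sub(r"\d", "*", text)
first_three = re.sub(r"\d", "*", text, count=3)
print(all_replaced)
print(first_three)

# split: cut the text at digits, then at the letters a-f.
split_on_digits = re.split(r"\d", text)
split_on_letters = re.split(r"[a-f]", text)
print(split_on_digits)
print(split_on_letters)

# search looks anywhere in the text; match only at the beginning.
search_found = re.search(r"\db", text) is not None   # finds 2b in the middle
match_db = re.match(r"\db", text) is not None        # text starts with 1a, not a digit + b
match_da = re.match(r"\da", text) is not None        # 1a is at the start
if search_found:
    print("method search found \\db")
if match_da:
    print("method match found \\da")
```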

12 Recall the trypsin exercise: if you put ()'s around the delimiter pattern, the delimiters are returned as well (regExpsplit.py)
  ['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K', 'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
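The file regExpsplit.py is not shown, so the following sketch (Python 3) rests on two assumptions: the protein string is obtained by joining the pieces of the list above, and the delimiter pattern is the simple class [KR]. It contrasts splitting without and with parentheses around the delimiter.

```python
import re

# Assumed input: the pieces from the slide's output joined back together.
protein = "DCQRVYAPFMKLIHDQWGWDYNNWTSMKGDAREILIMPFCQWTSPFRNMGCHV"

# Without a group, the delimiters K and R are dropped.
without_delims = re.split(r"[KR]", protein)

# With a group around the delimiter pattern, they are kept in the result.
with_delims = re.split(r"([KR])", protein)

print(without_delims)
print(with_delims)
```

Because the delimiters are kept, joining the second list with an empty separator gives back the original sequence.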

13 The group method
We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object:
  import re
  text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address"
  regExp = r"\w+@[\w.]+\.dk"  # match Danish email address
  compiledRE = re.compile( regExp )
  SRE_Match = compiledRE.search( text )
  if SRE_Match:  # None (false) if nothing matched
      print( "Text contains this Danish email address:", SRE_Match.group() )
Output:
  Text contains this Danish email address: chili@daimi.au.dk

14 Groups (danish_emailaddress_groups.py)
The substring that matches the whole RE is called a group
The RE can be subdivided into smaller groups (parts)
Each group of the matching substring can be extracted
Metacharacters ( and ) denote a group
  import re
  text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address"
  # Match any Danish email address; define two groups: username and domain:
  regExp = r"(\w+)@([\w.]+\.dk)"
  compiledRE = re.compile( regExp )
  SRE_Match = compiledRE.search( text )
  if SRE_Match:
      # group() is the whole match; group(1) and group(2) are the two subgroups
      print( "Text contains this Danish email address:", SRE_Match.group() )
      print( "Username:", SRE_Match.group(1) )
      print( "Domain:", SRE_Match.group(2) )
Output:
  Text contains this Danish email address: chili@daimi.au.dk
  Username: chili
  Domain: daimi.au.dk
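As a small extension of the slide, the sketch below (Python 3) uses the same text and expression and adds two match-object methods that the slide does not mention: groups(), which returns all subgroups as a tuple, and span(n), which gives the position of a subgroup in the text.

```python
import re

text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address"
email_re = re.compile(r"(\w+)@([\w.]+\.dk)")

match = email_re.search(text)
if match:
    whole = match.group()         # same as match.group(0)
    username, domain = match.groups()
    start, end = match.span(2)    # where the domain group sits in text
    print(whole)
    print(username, domain)
    print(text[start:end])
```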

15 Greedy vs. non-greedy operators
+ and * are greedy operators: they attempt to match as many characters as possible
+? and *? are non-greedy operators: they attempt to match as few characters as possible

16 nongreedy.py
  ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG
  ATGCGACTGACTCGTAG
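The file nongreedy.py is not part of the transcript. One plausible reading of the two output lines, given here as an assumption, is a search from ATG to TAG in the longer sequence, first with the greedy .* and then with the non-greedy .*?. The sketch is written for Python 3.

```python
import re

# The longer line from the slide, taken here as the input sequence.
sequence = "ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG"

# Assumed patterns: from ATG to TAG with anything in between.
greedy = re.search(r"ATG.*TAG", sequence).group()       # runs to the LAST TAG
non_greedy = re.search(r"ATG.*?TAG", sequence).group()  # stops at the FIRST TAG

print(greedy)
print(non_greedy)
```

Both searches start at the same ATG; only the point where the match stops differs.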

17 ... on to the exercises

