1
1 Recap: Two ways of using regular expressions
Search directly:
    re.search( regExp, text )
    1. Compile regExp to a special format (an SRE_Pattern object)
    2. Search for this SRE_Pattern in text
    3. Result is an SRE_Match object (or None if nothing matches)
or precompile the expression:
    compiledRE = re.compile( regExp )
    1. Now compiledRE is an SRE_Pattern object
    compiledRE.search( text )
    2. Use the search method of this SRE_Pattern to search text
    3. Result is the same SRE_Match object
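The two ways can be put side by side in a minimal sketch. It assumes Python 3, where recent versions call the two object types re.Pattern and re.Match instead of SRE_Pattern and SRE_Match; the text and pattern are made-up examples.

```python
import re

text = "ctaagcca"   # hypothetical example text
reg_exp = "gc+a"    # g, then one or more c's, then a

# Way 1: search directly; re compiles the pattern behind the scenes.
direct_match = re.search(reg_exp, text)

# Way 2: compile once, then call search on the pattern object.
compiled_re = re.compile(reg_exp)
compiled_match = compiled_re.search(text)

# Both ways report the same matched substring at the same position.
print(direct_match.group(), direct_match.span())
print(compiled_match.group(), compiled_match.span())
```

Precompiling mainly pays off when the same expression is used to search many texts.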
2
2 A few more metacharacters
^ : (caret) indicates the beginning of the string
$ : indicates the end of the string
    # search for zero or one t, followed by two a's,
    # at the beginning of the string:
    regExp1 = "^t?aa"
    # search for g followed by one or more c's followed by a
    # at the end of the string:
    regExp1 = "gc+a$"
    # whole string should match ct followed by zero or more
    # g's followed by a:
    regExp1 = "^ctg*a$"
3
3 This time we use re.search() to search the text for the regular expressions directly, without compiling them in advance. Output:
    Text1 contains the regular expression ^t?aa
    Text1 contains the regular expression gc+a$
    Text2 contains the regular expression ^ctg*a$
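The script behind this output is not shown, so the following is only a sketch of how such a report could be produced. The two texts are hypothetical, chosen so that the same three lines come out.

```python
import re

# Hypothetical texts; the ones used on the slide are not shown.
texts = {"Text1": "taagctgcca", "Text2": "ctggga"}
patterns = ["^t?aa", "gc+a$", "^ctg*a$"]

def report(texts, patterns):
    """Return one line per (text, pattern) pair that matches."""
    lines = []
    for name, text in texts.items():
        for pattern in patterns:
            # re.search compiles the pattern on the fly.
            if re.search(pattern, text):
                lines.append("%s contains the regular expression %s" % (name, pattern))
    return lines

for line in report(texts, patterns):
    print(line)
```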
4
4 Yet more metacharacters
{} : indicate repetition
| : match either the regular expression to the left or the one to the right
() : indicate a group (a part of a regular expression)
    # search for four t's followed by three c's:
    regExp1 = "t{4}c{3}"
    # search for g followed by 1, 2 or 3 c's at the end of the string:
    regExp1 = "gc{1,3}$"
    # search for either gg or cc:
    regExp1 = "gg|cc"
    # search for either gg or cc, followed by tt:
    regExp1 = "(gg|cc)tt"
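A small sketch of these four expressions at work, assuming Python 3. The example texts are invented; the last two lines show why the parentheses matter when | is combined with something that follows.

```python
import re

# (pattern, hypothetical text) pairs
examples = [
    ("t{4}c{3}", "aattttcccg"),
    ("gc{1,3}$", "atgcc"),
    ("gg|cc", "atccta"),
    ("(gg|cc)tt", "aggtta"),
]

# The substring each pattern matches in its text.
found = [re.search(pattern, text).group() for pattern, text in examples]
print(found)

# group(1) is the part matched inside the parentheses.
inner = re.search("(gg|cc)tt", "aggtta").group(1)

# Without parentheses, | splits the whole expression into "gg" or "cctt".
without_group = re.search("gg|cctt", "agga")    # matches gg alone
with_group = re.search("(gg|cc)tt", "agga")     # None: no tt after gg
```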
5
5 Escaping metacharacters
\ : used to escape a metacharacter ("to take it literally")
    # search for x followed by + followed by y:
    regExp1 = r"x\+y"
    # search for ( followed by x followed by y:
    regExp1 = r"\(xy"
    # search for x followed by ? followed by y:
    regExp1 = r"x\?y"
    # search for x followed by at least one ^ followed by 3:
    regExp1 = r"x\^+3"
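A short sketch of the difference escaping makes, assuming Python 3 and made-up texts. The standard function re.escape, which the slides do not use, is shown as one way to add the backslashes automatically.

```python
import re

# Unescaped, + means "one or more x"; escaped, it is a literal plus sign.
unescaped = re.search(r"x+y", "x+y")    # None: no y directly after the x
escaped = re.search(r"x\+y", "x+y")     # matches the literal text x+y

# An escaped ^ followed by +: at least one literal caret.
carets = re.search(r"x\^+3", "x^^^3")

# re.escape puts a backslash in front of metacharacters for you.
auto_paren = re.escape("(xy")
auto_question = re.escape("x?y")
paren = re.search(auto_paren, "f(xy)")

print(escaped.group(), carets.group(), auto_paren, auto_question)
```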
6
6 Microsatellites: follow-up on exercise
Microsatellites are small consecutive DNA repeats which are found throughout almost all genomes.
    AAAAAAAAAAA would be referred to as (A)11
    GTGTGTGTGTGT would be referred to as (GT)6
    CTGCTGCTGCTG would be referred to as (CTG)4
    ACTCACTCACTCACTC would be referred to as (ACTC)4
Microsatellites have high mutation rates and therefore may show high variation between individuals within a species.
7
7 Looking for microsatellites
Output of microsatellites.py:
    Sequence contains the pattern AA+
    Sequence does not contain the pattern GT(GT)+
    Sequence contains the pattern CTG(CTG)+
    Sequence does not contain the pattern ACTC(ACTC)+
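The source of microsatellites.py is not reproduced here, so this is only a sketch of a script that could print such a report. The sequence is hypothetical, picked so that the (A) and (CTG) repeats are present and the (GT) and (ACTC) repeats are not.

```python
import re

# Hypothetical sequence; the one used in microsatellites.py is not shown.
sequence = "TCAAAAGCTGCTGCTGACTCGA"

# Each pattern demands at least two consecutive copies of the repeat unit.
patterns = ["AA+", "GT(GT)+", "CTG(CTG)+", "ACTC(ACTC)+"]

def report(sequence, patterns):
    """Return one report line per pattern."""
    lines = []
    for pattern in patterns:
        if re.search(pattern, sequence):
            lines.append("Sequence contains the pattern " + pattern)
        else:
            lines.append("Sequence does not contain the pattern " + pattern)
    return lines

for line in report(sequence, patterns):
    print(line)
```

Note that a single ACTC, as in this sequence, does not count: ACTC(ACTC)+ needs the unit at least twice in a row.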
8
8 Character Classes
A character class matches one of the characters in the class:
    [abc] matches either a or b or c
    d[abc]d matches dad and dbd and dcd
    [ab]+c matches e.g. ac, abc, bac, bbabaabc, ..
Metacharacter ^ at the beginning negates the character class:
    [^abc] matches any character other than a, b and c
A class can use - to indicate a range of characters:
    [a-e] is the same as [abcde]
Metacharacters such as + and * are taken literally in a class (only ^, -, ] and \ keep a special meaning):
    [a+b*] matches a or + or b or *
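The rules above can be checked with a few lines of Python 3. This is a sketch with made-up test strings; re.fullmatch (available since Python 3.4) requires the whole string to match.

```python
import re

# d[abc]d: exactly one of a, b, c between two d's.
words = ["dad", "dbd", "dcd", "ded"]
class_hits = [w for w in words if re.fullmatch("d[abc]d", w)]

# [ab]+c: any mix of a's and b's, then c.
mixed = re.fullmatch("[ab]+c", "bbabaabc") is not None

# [^abc]: every character that is not a, b or c.
negated = re.findall("[^abc]", "abcde")

# [a-e]: a range, the same as [abcde].
in_range = re.findall("[a-e]", "hello bead")

# Inside a class, + and * are ordinary characters.
literal = re.findall("[a+b*]", "a+b*c")

print(class_hits, mixed, negated, in_range, literal)
```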
9
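9 Common character classes
\d matches any digit, \w matches any word character (letter, digit or underscore).
    regExp1 = r"\d\d:\d\d:\d\d [AP]M"   # (possibly illegal) time stamps, e.g. 04:23:19 PM
    regExp2 = r"\w+@[\w.]+\.dk"         # any Danish email address
In regExp2 the backslash in \.dk is necessary to match a literal dot. Inside the character class [\w.] the backslash before the dot is not necessary.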
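A sketch of both expressions in use, assuming Python 3. All texts except the time stamp 04:23:19 PM and the address chili@daimi.au.dk are invented for illustration.

```python
import re

time_re = r"\d\d:\d\d:\d\d [AP]M"
email_re = r"\w+@[\w.]+\.dk"

# A legal time stamp, and an illegal one that the pattern accepts anyway.
legal = re.search(time_re, "Meeting at 04:23:19 PM today").group()
illegal = re.search(time_re, "99:99:99 AM").group()

# A Danish address matches; an address without .dk does not.
danish = re.search(email_re, "chili@daimi.au.dk").group()
other = re.search(email_re, "someone@example.com")

# Why the backslash before the last dot matters: an unescaped dot
# matches any character, here the hyphen.
sloppy_re = r"\w+@[\w.]+.dk"
sloppy = re.search(sloppy_re, "chili@daimi-dk")
strict = re.search(email_re, "chili@daimi-dk")

print(legal, illegal, danish, other, sloppy.group(), strict)
```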
10
10 Regular expression functions sub, split
Output of regExpfunctions.py:
    *a*b*c*d*e*f
    *a*b*c4d5e6f
    ['', 'a', 'b', 'c', 'd', 'e', 'f']
    ['1', '2', '3', '4', '5', '6', '']
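The code of regExpfunctions.py is not included here. The sketch below, written for Python 3, is one guess that reproduces the four output lines; the input text and the patterns are assumptions.

```python
import re

text = "1a2b3c4d5e6f"   # assumed input

# sub: replace every digit by *.
all_replaced = re.sub(r"\d", "*", text)

# sub with count: replace only the first three digits.
first_three = re.sub(r"\d", "*", text, count=3)

# split: cut the text at every digit, or at every lowercase letter.
# A delimiter at the very start or end yields an empty string.
split_on_digits = re.split(r"\d", text)
split_on_letters = re.split(r"[a-z]", text)

print(all_replaced)
print(first_three)
print(split_on_digits)
print(split_on_letters)
```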
11
11 Recall the trypsin exercise:
If you put ()'s around the delimiter pattern, the delimiters are returned as well.
Output of regExpsplit.py:
    ['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K', 'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
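A sketch of the effect in Python 3. The protein string is simply the pieces of the output list glued together, and the delimiter pattern [KR] is an assumption; the pattern used in regExpsplit.py is not shown and may be stricter.

```python
import re

# The output pieces joined back into one string.
protein = "DCQRVYAPFMKLIHDQWGWDYNNWTSMKGDAREILIMPFCQWTSPFRNMGCHV"

# Without parentheses the delimiters K and R are thrown away.
without_delimiters = re.split("[KR]", protein)

# With parentheses around the delimiter pattern they are kept.
with_delimiters = re.split("([KR])", protein)

print(without_delimiters)
print(with_delimiters)
```

Because the delimiters are kept, joining the second list gives back the original string.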
12
12 The group method
We can extract the actual substring that matched the regular expression by calling the method group() of the SRE_Match object:
    import re
    text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address"
    regExp = r"\w+@[\w.]+\.dk"   # match Danish email address
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )
    if SRE_Match:   # None if there was no match
        print("Text contains this Danish email address:", SRE_Match.group())
Output:
    Text contains this Danish email address: chili@daimi.au.dk
13
13 The RE can be subdivided into smaller groups (parts)
Each group of the matching substring can be extracted
Metacharacters ( and ) denote a group
danish_emailaddress_groups.py:
    import re
    text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address"
    # Match any Danish email address; define two groups: username and domain:
    regExp = r"(\w+)@([\w.]+\.dk)"
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )
    if SRE_Match:
        # group() is the whole match, group(1) and group(2) the two groups
        print("Text contains this Danish email address:", SRE_Match.group())
        print("Username:", SRE_Match.group(1), "\nDomain:", SRE_Match.group(2))
Output:
    Text contains this Danish email address: chili@daimi.au.dk
    Username: chili
    Domain: daimi.au.dk
14
14 Greedy vs. non-greedy operators
+ and * are greedy operators
    - They attempt to match as many characters as possible
+? and *? are non-greedy operators
    - They attempt to match as few characters as possible
15
15 Output of nongreedy.py:
    ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG
    ATGCGACTGACTCGTAG
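The source of nongreedy.py is not shown. The following Python 3 sketch reproduces the two output lines under the assumption that the script looks for a stretch from ATG to TAG, once with the greedy .* and once with the non-greedy .*?, in the sequence printed first.

```python
import re

# Assumed input: the longer of the two output lines.
sequence = "ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG"

# Greedy: .* runs as far as it can, so the match ends at the last TAG.
greedy = re.search("ATG.*TAG", sequence).group()

# Non-greedy: .*? stops as early as it can, at the first TAG.
non_greedy = re.search("ATG.*?TAG", sequence).group()

print(greedy)
print(non_greedy)
```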
16
16 Non-greedy NB: a non-greedy operator won't skip a match to get to a shorter match.
The search string is read left to right, and the first match is reported:
    >>> import re
    >>> s = "aa---aa----bb"
    >>> re.search("aa.*?bb", s).group()
    'aa---aa----bb'
17
17 Extract today’s exercises from course calendar
18
18 Extract today's exercises from course calendar
Mon 30/10
    Practical matters. Introduction to Unix and Python and the dynamic interpreter.
    Read: Emacs tricks, Course notes chapters 1 and 2.
    File organization, Pattern counting using Unix commands, Python mode, Math functions, Math functions in a file
Thu 2/11
    Nov 3rd 14-16 -->
    String formatting, string methods, if/else, while loops, for loops, functions, ..
    Read: Course notes, chapters 3 and 4.
Idea: Get some date from user; then..
    1) Extract entry for date
    2) Extract all exercises in this entry
19
19 Extract today's exercises from course calendar
Regular expressions:
    1) r"\b%s\b.*?/tr>" % date   (DOTALL mode)
Word boundary necessary, otherwise "4/1" is matched inside "24/11". Hence raw string.
DOTALL mode necessary to have . match across several lines.
> necessary to avoid matching exercise names containing tr (e.g. the trypsin exercise).
Use non-greedy version to avoid matching several date entries.
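The four remarks can be tried out on a toy page. This Python 3 sketch uses a made-up, heavily simplified piece of HTML; the markup of the real calendar page and the function name extract_entry are assumptions.

```python
import re

# Hypothetical calendar markup, two table rows.
calendar_html = """<table>
<tr><td>24/11</td>
<td><a href="Exercises/trypsin.html">Trypsin</a></td>
</tr>
<tr><td>4/1</td>
<td><a href="Exercises/regexp.html">Regular expressions</a></td>
</tr>
</table>"""

def extract_entry(date, html):
    """Return the text from the date up to the end of its table row."""
    pattern = r"\b%s\b.*?/tr>" % date
    match = re.search(pattern, html, re.DOTALL)   # . also matches newlines
    return match.group() if match else None

entry = extract_entry("4/1", calendar_html)
print(entry)

# Without the word boundaries, 4/1 is found inside 24/11.
no_boundary = re.search(r"4/1.*?/tr>", calendar_html, re.DOTALL).group()

# Without the >, the match stops inside the link to the trypsin exercise.
no_bracket = re.search(r"\b24/11\b.*?/tr", calendar_html, re.DOTALL).group()
```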
20
20 Extract today's exercises from course calendar
Regular expressions:
    1) r"\b%s\b.*?/tr>" % date   (DOTALL mode)
    2) r"(Exercises/\w+\.html).*?\b([ \w]+)"
Use non-greedy version in case there are several exercises on the same line.
21
21 regex_webpage.py
Several subgroups in pattern: findall returns a list of subgroup tuples
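The script regex_webpage.py itself is not reproduced here. As a sketch in Python 3, here is findall with a two-group pattern applied to one made-up line of calendar markup; the markup and the file names fileorg.html and patcount.html are assumptions.

```python
import re

# Hypothetical line with two exercise links.
entry_line = ('<td><a href="Exercises/fileorg.html">File organization</a>, '
              '<a href="Exercises/patcount.html">Pattern counting using Unix commands</a></td>')

# Group 1: the link target. Group 2: the exercise name after it.
exercise_re = r"(Exercises/\w+\.html).*?\b([ \w]+)"

# With two groups, findall returns one (group 1, group 2) tuple per match.
exercises = re.findall(exercise_re, entry_line)
for link, name in exercises:
    print(name, "->", link)

# With a greedy .* the first match swallows the rest of the line,
# so only one tuple comes back.
greedy_exercises = re.findall(r"(Exercises/\w+\.html).*\b([ \w]+)", entry_line)
```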
22
22 Trial runs
23
23 .. on to the exercises