Fundamentals of Characters and Strings

- Characters: fundamental building blocks of Python programs
- Function ord returns a character's character code
- Function chr returns the character with the given character code

    >>> ord('ff')
    Traceback (most recent call last):
      File " ", line 1, in ?
    TypeError: ord() expected a character, but string of length 2 found
    >>> ord('f')
    102
    >>> ord('.')
    46
    >>> chr(46)
    '.'
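As a small sketch of how ord and chr undo each other, the snippet below converts a character to its code and back, then shifts each letter of a word one place forward. The names code, char, word and shifted are illustrative only, and print is called with parentheses so the snippet also runs under Python 3.

```python
# ord and chr are inverses for a single character.
code = ord('f')      # character -> character code
char = chr(code)     # character code -> character
print(code)
print(char)

# Shift every letter of a lowercase word one place forward (a -> b, ...).
word = "gacat"
shifted = "".join([chr(ord(c) + 1) for c in word])
print(shifted)
```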
Characters and Strings

Because characters and strings are fundamental in Python, there are many useful methods for dealing with them (fig. 13.2).
fig13_03.py

    # Fig. 13.3: fig13_03.py
    # Simple output formatting example.

    string1 = "Now I am here."

    print string1.center( 50 )   # centers calling string in a new string of 50 characters
    print string1.rjust( 50 )    # right-aligns calling string in a new string of 50 characters
    print string1.ljust( 50 )    # left-aligns calling string in a new string of 50 characters

Output:

                      Now I am here.
                                        Now I am here.
    Now I am here.

Remember: strings are immutable; a string-manipulating method returns a new string

    >>> aString = 'gacataggt'
    >>>
    >>> aString.upper()
    'GACATAGGT'
    >>>
    >>> aString
    'gacataggt'
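One possible use of ljust and rjust is lining up columns of text. The sketch below uses made-up rows (the names and widths are assumptions, not part of the original example), and its last lines show that the calling string itself is never changed.

```python
# Hypothetical rows: a name column and a size column.
rows = [("gene", "length"), ("abc1", "1203"), ("xy", "87")]

# Left-align the first column to width 6, right-align the second to width 8.
lines = [name.ljust(6) + size.rjust(8) for name, size in rows]
for line in lines:
    print(line)

# Strings are immutable: upper() returns a new string.
original = "gacataggt"
upper = original.upper()
print(original)
print(upper)
```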
fig13_04.py

    # Fig. 13.4: fig13_04.py
    # Stripping whitespace from a string.

    string1 = "\t \n This is a test string. \t\t \n"

    print 'Original string: "%s"\n' % string1
    print 'Using strip: "%s"\n' % string1.strip()          # removes leading and trailing whitespace
    print 'Using left strip: "%s"\n' % string1.lstrip()    # removes leading whitespace
    print "Using right strip: \"%s\"\n" % string1.rstrip() # removes trailing whitespace

Output:

    Original string: " This is a test string. "
    Using strip: "This is a test string."
    Using left strip: "This is a test string. "
    Using right strip: " This is a test string."
Searching Strings

- Methods find, index, rfind and rindex search for substrings in a calling string
- Methods startswith and endswith return 1 (true) if a calling string begins with or ends with a given string, respectively
- Method count returns the number of occurrences of a substring in a calling string
- Method replace substitutes its second argument for its first argument in a calling string
Searching Strings

    s = "actgccgacgatcgcgcatcagcg"
    index_string= " "   # length 24

    print s
    print index_string, "\n"

    print "gc occurs %d times" % s.count( "gc" )
    print "(%d times from index 13)\n" % s.count( "gc", 13, len(s) )
    # same result as s[13:len(s)].count("gc")

    print "first occurrence of gc: index %d" % s.find( "gc" )
    print "first occurrence of x: index %d\n" % s.find( "x" )
    # -1 is a number, so the program may break down later if the string is not found
    # index(): as find() but raises an exception if the string is not found

    if s.startswith( "AC" ):
        print "sequence starts with AC"
    else:
        print "sequence doesn't start with AC"   # case sensitive!

    print "last occurrence of gc: index %d\n" % s.rfind( "gc" )

    print "replacing gc with GC:\n%s\n" % s.replace( "gc", "GC" )
    print "replace 2 occurrences max:\n%s" % s.replace( "gc", "GC", 2 )

Output:

    actgccgacgatcgcgcatcagcg

    gc occurs 4 times
    (3 times from index 13)

    first occurrence of gc: index 3
    first occurrence of x: index -1

    sequence doesn't start with AC
    last occurrence of gc: index 21

    replacing gc with GC:
    actGCcgacgatcGCGCatcaGCg

    replace 2 occurrences max:
    actGCcgacgatcGCgcatcagcg
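The difference between find and index noted in the comments above can be sketched as follows. The helper first_index is a hypothetical name introduced for illustration; it wraps index so that a missing substring yields None instead of the easily overlooked -1.

```python
s = "actgccgacgatcgcgcatcagcg"

def first_index(text, pattern):
    """Return the index of pattern in text, or None if it is absent."""
    try:
        return text.index(pattern)   # index() raises ValueError when not found
    except ValueError:
        return None

found = first_index(s, "gc")
missing = first_index(s, "x")
print(found)
print(missing)

# find() signals "not found" with -1, which is also a usable (negative) index.
not_found = s.find("x")
print(not_found)
```

Returning None makes the "not found" case harder to misuse, since s[None] fails immediately whereas s[-1] silently returns the last character.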
Splitting and Joining Strings

- Tokenization breaks statements into individual components (or tokens)
- Delimiters, typically whitespace characters, separate tokens
fig13_06.py

    # Fig. 13.6: fig13_06.py
    # Token splitting and delimiter joining.

    # splitting strings
    string1 = "A, B, C, D, E, F"

    print "String is:", string1
    print "Split string by spaces:", string1.split()                # splits calling string by whitespace characters
    print "Split string by commas:", string1.split( "," )           # splits calling string by specified character
    print "Split string by commas, max 2:", string1.split( ",", 2 ) # returns list of tokens split by first two comma delimiters
    print

    # joining strings
    list1 = [ "A", "B", "C", "D", "E", "F" ]
    string2 = "___"

    print "List is:", list1
    print 'Joining with ___ : %s' % ( string2.join ( list1 ) )      # joins list elements with calling string as delimiter
    print 'Joining with -.- :', "-.-".join( list1 )                 # joins list elements with calling quoted string as delimiter

Output:

    String is: A, B, C, D, E, F
    Split string by spaces: ['A,', 'B,', 'C,', 'D,', 'E,', 'F']
    Split string by commas: ['A', ' B', ' C', ' D', ' E', ' F']
    Split string by commas, max 2: ['A', ' B', ' C, D, E, F']

    List is: ['A', 'B', 'C', 'D', 'E', 'F']
    Joining with ___ : A___B___C___D___E___F
    Joining with -.- : A-.-B-.-C-.-D-.-E-.-F
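Splitting by "," alone leaves a leading space on most tokens, as the output above shows. One way to get clean tokens, sketched here with illustrative variable names, is to strip each piece after splitting; the pieces can then be rejoined with any delimiter.

```python
string1 = "A, B, C, D, E, F"

# Split on commas, then strip the whitespace left around each token.
tokens = [piece.strip() for piece in string1.split(",")]
print(tokens)

# Join the clean tokens with a different delimiter.
rejoined = " | ".join(tokens)
print(rejoined)
```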
Intermezzo 1

1. Copy and run this program: /users/chili/CSS.E03/ExamplePrograms/random_text.py
   What does it do?
2. Extend the program: search the text string it produces and print out the index of the first occurrence of 11 (you might look at Figure 13.2 on page 438ff to find a suitable string method). Tell the user if there is no '11'.
3. Split the text into a list of substrings using '11' as a delimiter, and print out the list.
Solution

    from random import randrange

    text = ""
    for i in range(150):
        next_char = chr( randrange(48, 58) )   # random digit character '0'..'9'
        text = "".join( [text, next_char] )
    print text

    i = text.find( "11" )
    if i >= 0:
        print "'11' found at index", i
    else:
        print "no '11' found"

    splittext = text.split( "11" )
    print "text split in %d pieces" % len(splittext)
    for piece in splittext:
        print piece

Sample output:

    '11' found at index 4
    text split in 4 pieces
Regular Expressions – Motivation

Problem: search a text for any Danish e-mail address

    import re

    text1 = "No Danish address here fj3a"
    text2 = "But here: what a address"

    regularExpression =
    compiledRE = re.compile( regularExpression )

    SRE_Match1 = compiledRE.search( text1 )
    SRE_Match2 = compiledRE.search( text2 )

    if SRE_Match1:
        print "Text1 contains this Danish address:", SRE_Match1.group()
    else:
        print "Text1 contains no Danish address"

    if SRE_Match2:
        print "Text2 contains this Danish address:", SRE_Match2.group()
    else:
        print "Text2 contains no Danish address"

Output:

    Text1 contains no Danish address
    Text2 contains this Danish address:
Regular Expressions

- Provide a more efficient and powerful alternative to string search methods
- Instead of searching for a specific string we can search for a text pattern
  - We don't have to search explicitly for 'Monday', 'Tuesday', 'Wednesday', ...: there is a pattern in these search strings.
  - A regular expression is a text pattern
- In Python, regular expression processing capabilities are provided by module re
Example

- Simple regular expression: regExp = "football"
  - matches only the string "football"
- To search a text for regExp, we can use re.search( regExp, text )
Compiling Regular Expressions

re.search( regExp, text )
1. Compiles regExp to a special format (an SRE_Pattern object)
2. Searches for this SRE_Pattern in text
3. Result is an SRE_Match object

If we need to search for regExp several times, it is more efficient to compile it once and for all:

compiledRE = re.compile( regExp )
1. Now compiledRE is an SRE_Pattern object

compiledRE.search( text )
2. Uses the search method of this SRE_Pattern to search text
3. Result is the same SRE_Match object
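The two routes described above can be sketched side by side. The list texts is made up for illustration; both approaches are expected to report the same matches, the only difference being that the compiled pattern object is created once and reused.

```python
import re

texts = ["football tonight", "no match here", "more football"]

# Route 1: hand the pattern string to re.search on every call.
direct_hits = [t for t in texts if re.search("football", t)]

# Route 2: compile once, then reuse the pattern object for every text.
compiled = re.compile("football")
compiled_hits = [t for t in texts if compiled.search(t)]

print(direct_hits)
print(compiled_hits)
```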
Searching for 'football'

    import re

    text1 = "Here are the football results: Bosnia - Denmark 0-7"
    text2 = "We will now give a complete list of python keywords."

    # Compile regular expression and get the SRE_Pattern object
    regularExpression = "football"
    compiledRE = re.compile( regularExpression )

    # Use the same SRE_Pattern object to search both texts and get two
    # SRE_Match objects (or None if the search was unsuccessful)
    SRE_Match1 = compiledRE.search( text1 )
    SRE_Match2 = compiledRE.search( text2 )

    if SRE_Match1:
        print "Text1 contains the substring 'football'"
    if SRE_Match2:
        print "Text2 contains the substring 'football'"

Output:

    Text1 contains the substring 'football'
Building more sophisticated patterns

Metacharacters: regular-expression syntax elements
- ? : matches zero or one occurrences of the expression it follows
- + : matches one or more occurrences of the expression it follows
- * : matches zero or more occurrences of the expression it follows

    # search for zero or one t, followed by two a's:
    regExp1 = "t?aa"

    # search for g followed by one or more c's followed by a:
    regExp2 = "gc+a"

    # search for ct followed by zero or more g's followed by a:
    regExp3 = "ctg*a"
Metacharacter example

    import re

    text = "gaaagccactgggggggggggggga"

    # Compile all three regular expressions into SRE_Pattern objects
    regExp1 = "t?aa"
    compiledRE1 = re.compile( regExp1 )
    regExp2 = "gc+a"
    compiledRE2 = re.compile( regExp2 )
    regExp3 = "ctg*a"
    compiledRE3 = re.compile( regExp3 )

    # Use the three SRE_Pattern objects to search the text and get three SRE_Match objects
    SRE_Match1 = compiledRE1.search( text )
    SRE_Match2 = compiledRE2.search( text )
    SRE_Match3 = compiledRE3.search( text )

    if SRE_Match1:
        print "Text contains the regular expression", regExp1
    if SRE_Match2:
        print "Text contains the regular expression", regExp2
    if SRE_Match3:
        print "Text contains the regular expression", regExp3

Output:

    Text contains the regular expression t?aa
    Text contains the regular expression gc+a
    Text contains the regular expression ctg*a
A few more metacharacters

- ^ : indicates placement at the beginning of the string
- $ : indicates placement at the end of the string

    # search for zero or one t, followed by two a's
    # at the beginning of the string:
    regExp1 = "^t?aa"

    # search for g followed by one or more c's followed by a
    # at the end of the string:
    regExp2 = "gc+a$"

    # whole string should match ct followed by zero or more
    # g's followed by a:
    regExp3 = "^ctg*a$"
Metacharacter example

This time we use re.search() to search the text for the regular expressions directly, without compiling them in advance.

    import re

    text1 = "aactggagcccca"
    text2 = "ctgga"

    regExp1 = "^t?aa"
    regExp2 = "gc+a$"
    regExp3 = "^ctg*a$"

    if re.search( regExp1, text1 ):
        print "Text1 contains the regular expression", regExp1
    if re.search( regExp2, text1 ):
        print "Text1 contains the regular expression", regExp2
    if re.search( regExp3, text1 ):
        print "Text1 contains the regular expression", regExp3
    if re.search( regExp3, text2 ):
        print "Text2 contains the regular expression", regExp3

Output:

    Text1 contains the regular expression ^t?aa
    Text1 contains the regular expression gc+a$
    Text2 contains the regular expression ^ctg*a$
Yet more metacharacters

- {} : indicate repetition
- | : match either the regular expression to the left or the one to the right
- () : indicate a group (a part of a regular expression)

    # search for four t's followed by three c's:
    regExp1 = "t{4}c{3}"

    # search for g followed by 1 to 3 c's at the end of the string:
    regExp2 = "gc{1,3}$"

    # search for either gg or cc:
    regExp3 = "gg|cc"

    # search for either gg or cc followed by tt:
    regExp4 = "(gg|cc)tt"
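A minimal sketch of how these three metacharacters behave on short, made-up sequences (the test strings and variable names are assumptions chosen for illustration):

```python
import re

# {n}: exact repetition counts.
exact = re.search("t{4}c{3}", "aattttccca").group()

# {m,n}: between m and n repetitions; matching is greedy, so 3 c's are taken.
ranged = re.search("gc{1,3}", "agccccct").group()

# |: either alternative; the leftmost match in the text wins.
either = re.search("gg|cc", "atccgg").group()

# (): the group limits the alternation to gg or cc, and can be read back.
grouped = re.search("(gg|cc)tt", "aggttcctt")

print(exact)
print(ranged)
print(either)
print(grouped.group())
print(grouped.group(1))
```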
Escaping metacharacters

- \ : used to escape a metacharacter (to 'keep' it as a literal character)

    # search for x followed by + followed by y:
    regExp1 = "x\+y"

    # search for ( followed by x followed by y:
    regExp2 = "\(xy"

    # search for x followed by ? followed by y:
    regExp3 = "x\?y"

    # search for x followed by at least one ^ followed by 3:
    regExp4 = "x\^+3"
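When writing escaped patterns it is common to use raw strings (the r"..." prefix), so that Python itself leaves the backslashes alone. The sketch below uses that convention, which is a choice made here and not something the slides do, together with re.escape, which builds an escaped pattern from a literal string. The test strings are made up.

```python
import re

# Raw strings pass the backslash through to the re module unchanged.
plus = re.search(r"x\+y", "solve x+y=3").group()
carets = re.search(r"x\^+3", "x^^3").group()

# re.escape escapes the metacharacters in a literal string for us.
literal = "(xy"
paren = re.search(re.escape(literal), "f(xy)").group()

print(plus)
print(carets)
print(paren)
```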
Intermezzo 2

Copy and run this program: /users/chili/CSS.E03/ExamplePrograms/sequence_searching.py
What does it do?

Put more regular expressions in the list to search for these patterns:
1. 6 c's followed by 3 g's
2. cc, followed by at least one g, followed by cc
3. double triplets (e.g. aaa followed by ccc)
4. any number of a's, followed by either cc or gg, followed by c at the end of the string
Solution

    import re

    # this is a dna sequence in fasta format:
    seq = """>U03518 Aspergillus awamori\naacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgt
    ctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccccg
    ggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgattgaatg
    caatcagttaaaactttcaacaatggatctcttggttccggc"""

    regular_expressions = [ "a{4}", "c+(t|g)tt", "g*c$", "(gt){2}",
                            "c{6}g{3}",               # 1. 6 c's followed by 3 g's
                            "ccg+cc",                 # 2. cc, followed by at least one g, followed by cc
                            "(aaa|ccc|ggg|ttt){2}",   # 3. double triplets (e.g. aaa followed by ccc)
                            "a*(cc|gg)c$" ]           # 4. any number of a's, followed by either cc or gg,
                                                      #    followed by c at the end of the string

    for regExp in regular_expressions:
        if re.search( regExp, seq ):
            print "found", regExp
Character Classes

- A character class matches one of the characters in the class:
  - [abc] matches either a or b or c
  - d[abc]d matches dad and dbd and dcd
  - [ab]+c matches e.g. ac, abc, bac, bbabaabc, ...
- Metacharacter ^ at the beginning negates the character class:
  - [^abc] matches any character other than a, b and c
- A class can use - to indicate a range of characters:
  - [a-e] is the same as [abcde]
- Most metacharacters are taken literally inside a class (the exceptions are ^, -, ] and \):
  - [a+b*] matches a or + or b or *
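Each of the rules above can be checked with re.findall or re.search on short, made-up strings (the strings and variable names below are illustrative assumptions):

```python
import re

# One character out of the class between two d's.
words = re.findall("d[abc]d", "dad dbd ded dcd")

# One or more a's or b's followed by c.
run = re.search("[ab]+c", "xxbbabaabcxx").group()

# Negated class: anything except a, b and c.
others = re.findall("[^abc]", "abcxyz")

# Range: a through e.
in_range = re.findall("[a-e]", "hello decade")

# + and * are ordinary characters inside a class.
literal = re.findall("[a+b*]", "a+b*c")

print(words)
print(run)
print(others)
print(in_range)
print(literal)
```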
Special Sequences

Special sequence: shortcut for a common character class (e.g. \d matches any digit, like [0-9])

    regExp1 = "\d\d:\d\d:\d\d [AP]M"
    # (possibly illegal) time stamps like 04:23:19 PM

    regExp2 =
    # any Danish address
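The time stamp pattern above can be tried on a few candidate strings. The list stamps is made up for illustration, and the pattern is written as a raw string. Note that the pattern checks the shape only, which is why an impossible time such as 99:99:99 AM is still accepted.

```python
import re

time_re = re.compile(r"\d\d:\d\d:\d\d [AP]M")

stamps = ["04:23:19 PM",   # well-formed
          "99:99:99 AM",   # right shape, impossible time
          "4:23:19 PM",    # only one digit for the hour
          "04:23:19 pm"]   # lower-case pm is not in [AP]M

matches = [bool(time_re.search(stamp)) for stamp in stamps]
print(matches)
```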
Other regular expression functions

    import re

    text = "1a2b3c4d5e6f"

    print re.sub( "\d", "*", text )
    # substitute * for any digit (i.e. replace digit with *)
    print

    print re.sub( "\d", "*", text, 3 )
    # substitute * for any digit, max 3 times
    print

    print re.split( "\d", text )      # delimiter: any digit
    print

    print re.split( "[a-z]", text )   # delimiter: any lower-case letter
    print

    if re.search( "\db", text ):      # the RE of search() can appear anywhere in text
        print "method search found \db"

    if re.match( "\db", text ):       # the RE of match() must appear at the beginning of text
        print "method match found \db"

    if re.match( "\da", text ):
        print "method match found \da"

Output:

    *a*b*c*d*e*f

    *a*b*c4d5e6f

    ['', 'a', 'b', 'c', 'd', 'e', 'f']

    ['1', '2', '3', '4', '5', '6', '']

    method search found \db
    method match found \da
Groups

We can extract the actual substring that matched the regular expression by calling method group() of the SRE_Match object:

    text = "But here: what a address"

    regExp =     # match Danish address
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )

    if SRE_Match:
        print "Text contains this Danish address:", SRE_Match.group()
Grouping

- The substring that matches the whole RE is called a group
- An RE can be subdivided into smaller groups (parts)
- Each group of the matching substring can be extracted
- Metacharacters ( and ) denote a group

    text = "But here: what a address"

    # Match any Danish address; define two groups: username and domain:
    regExp =
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )

    if SRE_Match:
        print "Text contains this Danish address:", SRE_Match.group()
        print "Username:", SRE_Match.group(1), "\nDomain:", SRE_Match.group(2)

Output:

    Text contains this Danish address:
    Username: chili
    Domain: daimi.au.dk
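To illustrate group(1) and group(2), here is a minimal sketch under stated assumptions. The pattern below is a simple guess at what "a Danish address" could mean (a username, an @, and a domain ending in .dk); it is not the pattern used in the original slides. The position of the address in the text is likewise assumed; only the username and domain come from the output shown above.

```python
import re

# Assumed text: the address is placed after "But here:" for illustration.
text = "But here: chili@daimi.au.dk what a address"

# Hypothetical pattern: group 1 = username, group 2 = domain ending in .dk
pattern = re.compile(r"(\w+)@([\w.]+\.dk)\b")

match = pattern.search(text)
if match:
    whole = match.group()       # the entire matching substring
    username = match.group(1)   # first parenthesised group
    domain = match.group(2)     # second parenthesised group
    print(whole)
    print(username)
    print(domain)
```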
Greedy vs. non-greedy operators

- + and * are greedy operators
  - They attempt to match as many characters as possible, even if this is not the desired behavior
- +? and *? are non-greedy operators
  - They attempt to match as few characters as possible
Greedy vs. non-greedy operators

    # Task: Find a space-separated list of digits, extract the first number.
    import re

    text = " blah blah"

    # use greedy operator +
    regExp = "(\d )+"
    print "Greedy operator:", re.match( regExp, text ).group()

    # use non-greedy version instead (by putting a ? after the +)
    regExp = "(\d )+?"
    print "Non-greedy operator:", re.match( regExp, text ).group()

Output:

    Greedy operator:
    Non-greedy operator: 1
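The same comparison can be run on a hypothetical text string, "1 2 3 blah blah", chosen here purely for illustration. The greedy version consumes every digit-plus-space pair it can, while the non-greedy version stops after the first.

```python
import re

text = "1 2 3 blah blah"   # hypothetical input

# Greedy: repeat "digit + space" as many times as possible.
greedy = re.match(r"(\d )+", text).group()

# Non-greedy: repeat "digit + space" as few times as possible (once).
non_greedy = re.match(r"(\d )+?", text).group()

print(repr(greedy))
print(repr(non_greedy))
```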