Python Pattern Matching and Regular Expressions
Peter Wad Sackett
DTU Systems Biology, Technical University of Denmark

Simple matching with string methods

To just check for the presence of a substring in a string, use in:
    mystr = 'I am here'
    if 'am' in mystr:
        print('present')
    if 'are' not in mystr:
        print('absent')
The in operator also works with lists, tuples, sets and dicts.

To find the position of a substring, use find, which returns -1 if the substring is not present:
    mystr.find('am')
    mystr.find('am', startpos, endpos)
Method rfind does the same from the other direction.
index is similar to find, but raises ValueError if the substring is not present:
    mystr.index('are')

Methods startswith and endswith can be considered special cases of find. They give a True/False value.
    mystr.startswith('I')
    mystr.endswith('ere', -3)
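The string methods above can be tried out together in a minimal sketch. It reuses the sample string from the slide; the result variable names are only illustrative.

```python
# Sample string taken from the slide; the variable names are illustrative.
mystr = 'I am here'

has_am = 'am' in mystr            # substring is present
has_are = 'are' in mystr          # substring is absent

first_pos = mystr.find('am')      # index of the first occurrence
missing_pos = mystr.find('are')   # -1 because 'are' is not present
last_e = mystr.rfind('e')         # same search, starting from the right

# index() raises ValueError instead of returning -1.
try:
    mystr.index('are')
    index_failed = False
except ValueError:
    index_failed = True

starts = mystr.startswith('I')
ends = mystr.endswith('ere', -3)  # only look at the last three chars

print(has_am, has_are, first_pos, missing_pos, last_e, index_failed, starts, ends)
```

Whether to use find or index is mostly a question of style: find suits an if test on the returned position, index suits code that treats a missing substring as an error.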
Simple checks of strings

The following methods return True if the string contains only characters of the appropriate type, False otherwise. The string needs at least one char for the method to return True.
    isalpha()     alphabetic chars
    isdigit()     digits
    isdecimal()   decimal digits only; a sign or decimal point, as in '1.5', gives False
    isnumeric()   similar, but also covers special numeric chars like ½
    isalnum()     any of the above
    islower()     all cased chars are lowercase (at least one cased char is required)
    isupper()     all cased chars are uppercase (at least one cased char is required)
    isspace()     contains only whitespace
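The differences between these checks are easiest to see side by side. The following is a small sketch; the helper string_checks and the sample strings are hypothetical and only serve as illustration.

```python
def string_checks(s):
    """Return a dict with the result of several str.is*() checks for s."""
    return {
        'isalpha': s.isalpha(),
        'isdigit': s.isdigit(),
        'isdecimal': s.isdecimal(),
        'isnumeric': s.isnumeric(),
        'isalnum': s.isalnum(),
        'isspace': s.isspace(),
    }

# '\u00bd' is the single character "one half" (vulgar fraction).
samples = ['abc', '123', '1.5', '\u00bd', 'abc123', ' \t', '']
results = {s: string_checks(s) for s in samples}

# The decimal point makes '1.5' fail all three numeric checks.
float_like = results['1.5']

# The empty string fails every check: at least one char is needed.
empty = results['']

# islower() ignores uncased chars such as digits and spaces.
lower_with_digits = 'abc 123'.islower()
no_cased_chars = '123'.islower()
```

A string such as '1.5' is therefore better validated by attempting float(s) inside a try/except than by any of these methods.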
Replacement and removal

replace returns a string with all occurrences of a substring replaced:
    mystr = 'Fie Fye Foe'
    mystr.replace('F', 'L')
Result: 'Lie Lye Loe'
You can replace something with nothing. Where is that useful?

Stripping strings, by default of whitespace:
    rightStrippedString = mystr.rstrip()
    leftStrippedString = mystr.lstrip()
    bothSidesStrippedString = mystr.strip()
You can specify which chars should be stripped. Chars are stripped from the ends until one is encountered that should not be removed.
    mystr.strip('ieF')
Result: ' Fye Fo'
Notice the leading space.
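As a sketch of the replacement and stripping methods in use: the gapped sequence and the padded string below are hypothetical examples, chosen to show one possible answer to the question of when replacing with nothing is useful.

```python
mystr = 'Fie Fye Foe'
replaced = mystr.replace('F', 'L')

# Replacing with the empty string removes chars, here alignment gaps
# from a hypothetical aligned sequence.
aligned = 'ATG--CGT-A'
ungapped = aligned.replace('-', '')

# Default stripping removes whitespace, including the newline that
# lines read from a file usually end with.
padded = '  some text \n'
right_stripped = padded.rstrip()
left_stripped = padded.lstrip()
both_stripped = padded.strip()

# The argument is a set of chars, not a substring: stripping stops at
# the first char (from either end) that is not in the set.
custom_stripped = mystr.strip('ieF')

print(repr(replaced), repr(ungapped), repr(custom_stripped))
```

Strings are immutable, so every one of these methods returns a new string and leaves the original unchanged.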
Translation

Translation is an efficient method to replace chars with other chars.
First make a char-to-char translation table:
    translationTable = str.maketrans('ATCG', 'TAGC')
Then use the table:
    dna = 'ATGATGATCGATCGATCGATGCAT'
    complementdna = dna.translate(translationTable)
The dna has now been complemented. Chars not mentioned in the translation table are left untouched.
This method has a use-case close to our hearts.

    Old   New
    A     T
    T     A
    C     G
    G     C
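A short sketch of the translation table in action. The sequence with the N is a hypothetical example added to show that unmentioned chars pass through unchanged.

```python
# Build the char-to-char table once; it can be reused for any string.
translationTable = str.maketrans('ATCG', 'TAGC')

dna = 'ATGATGATCGATCGATCGATGCAT'
complementdna = dna.translate(translationTable)

# Complementing twice gives back the original sequence.
roundtrip = complementdna.translate(translationTable)

# Chars not in the table, such as N or lowercase letters, are untouched.
with_unknown = 'ATGNatg'.translate(translationTable)

print(complementdna)
print(with_unknown)
```

If lowercase input should be complemented too, one option is to extend the table, for example str.maketrans('ATCGatcg', 'TAGCtagc').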
Regular Expressions - regex

Regular expressions are a very powerful pattern matching tool.
Python unfortunately made them cumbersome.
They are used through the re library.
Full and complex documentation at
The library supports both precompiled regex (more efficient when a pattern is reused) and simple regex. The general forms are
    regex = re.compile(pattern)
    result = regex.method(string)
versus
    result = re.method(pattern, string)
You will have a hard time understanding the following without an explanation.
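The two general forms can be compared directly. In this minimal sketch the pattern and the text are hypothetical, and search stands in for the generic "method".

```python
import re

pattern = r'\d+'             # hypothetical pattern: one or more digits
text = 'Sample 42 of 100'    # hypothetical input string

# Form 1: compile once, then call a method on the regex object.
regex = re.compile(pattern)
compiled_result = regex.search(text)

# Form 2: pass the pattern and the string to the module-level function.
direct_result = re.search(pattern, text)

# Both forms find the same first match.
print(compiled_result.group(0), direct_result.group(0))
```

The pattern is written as a raw string (r'...') so that backslashes reach the regex engine unchanged.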
Regex Patterns - Classes

Any simple char just matches itself.
Built-in character classes
    \s   matches a whitespace
    \S   matches a non-whitespace
    \d   matches a digit
    \D   matches a non-digit
    \w   matches a wordchar, which is a-zA-Z0-9_ (for str patterns also other Unicode letters and digits, unless re.ASCII is set)
    \W   matches a non-wordchar
    \n   matches newline
    .    matches anything but newline
Make your own classes with []
    [aB4-6]   matches exactly one of the chars aB456
    [^xY]     matches any one char except x and Y
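The character classes can be explored with findall, which returns every match as a list. The input strings in this sketch are hypothetical.

```python
import re

text = 'Gene_7 has 3 exons\n'   # hypothetical input

digits = re.findall(r'\d', text)        # every single digit
nonword = re.findall(r'\W', text)       # spaces and the newline
wordchars = re.findall(r'\w', 'a_1 -')  # letters, digits and underscore

# A custom class matches exactly one char from the set per match.
custom = re.findall(r'[aB4-6]', 'aBcd4567')

# A negated class matches any one char that is not listed.
negated = re.findall(r'[^xY]', 'xYz1')

# The dot skips the newline.
dot = re.findall(r'.', 'a\nb')

print(digits, nonword, custom, negated, dot)
```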
Regex Patterns - Quantifiers

A single simple char just matches itself, but a quantifier can be added to determine how many times.
    ?   Zero or one time
    +   One or more times
    *   Zero or more times
The {} can be used to make a specific quantification
    {4}     Exactly four times
    {,3}    At most three times
    {5,}    At least five times
    {3,5}   Between three and five times
Quantifiers are greedy; they can be made non-greedy with an extra ?
A few examples:
    A{3,4}C?     Matches AAAA AAA AAAAC AAAC
    \s\w{4}\s    Matches any four-char word surrounded by whitespace in a sentence
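A sketch of the quantifier examples, including the difference between greedy and non-greedy matching. The candidate strings, the tag-like text and the sentence are hypothetical.

```python
import re

# fullmatch() requires the whole string to fit the pattern.
candidates = ['AAA', 'AAAA', 'AAAC', 'AAAAC', 'AA', 'AAAAA']
accepted = [s for s in candidates if re.fullmatch(r'A{3,4}C?', s)]

# Greedy: .+ grabs as much as possible. Non-greedy: .+? as little as possible.
greedy = re.search(r'<.+>', '<a><b>').group(0)
lazy = re.search(r'<.+?>', '<a><b>').group(0)

# \s\w{4}\s consumes the surrounding whitespace, so two neighbouring
# four-char words cannot both match: they would share one space.
sentence = 'I have some data here '
with_spaces = re.findall(r'\s(\w{4})\s', sentence)

# Word boundaries consume nothing, so every four-char word is found.
with_boundaries = re.findall(r'\b\w{4}\b', sentence)

print(accepted, greedy, lazy, with_spaces, with_boundaries)
```

The last two lines show why a word-boundary pattern is often the safer choice when the goal is to find all words of a given length.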
Regex Patterns - Groups

Parentheses denote a group. A group belongs together.
Example: ABC(xyz)?DEF matches both ABCDEF and ABCxyzDEF
Either the entire group xyz is matched once or not at all (?)
The content of the group can be captured, see later.
Non-capturing group (?: )
The pipe sign | means or
    A(BC|DEF)G   matches either ABCG or ADEFG
Other special chars
    ^    Must be first; binds a match to the start of the string (start of each line with re.MULTILINE)
    $    Must be last; binds a match to the end of the string (end of each line with re.MULTILINE)
    \b   Word boundary: the position between a wordchar and a non-wordchar, e.g. next to whitespace, a comma, BoL or EoL
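The group, alternation and anchor constructs can be checked with a few short calls. This is a sketch; the test strings other than those from the slide are hypothetical.

```python
import re

# Optional group: matched as a whole, or not at all.
optional = re.compile(r'ABC(xyz)?DEF')
without_group = optional.fullmatch('ABCDEF')
with_group = optional.fullmatch('ABCxyzDEF')
partial_group = optional.fullmatch('ABCxyDEF')   # 'xy' alone does not fit

# Alternation inside a group.
alternation = re.compile(r'A(BC|DEF)G')
alt_matches = [s for s in ['ABCG', 'ADEFG', 'ABCDEFG'] if alternation.fullmatch(s)]

# A non-capturing group groups without storing its content.
capturing_count = alternation.groups
noncapturing_count = re.compile(r'A(?:BC|DEF)G').groups

# Anchors and word boundaries.
starts_with_atg = re.search(r'^ATG', 'ATGCCCTAA') is not None
ends_with_taa = re.search(r'TAA$', 'ATGCCCTAA') is not None
whole_words = re.findall(r'\bcat\b', 'cat, concat cat')

print(with_group.group(1), without_group.group(1), alt_matches, whole_words)
```

When an optional group does not take part in the match, group(1) returns None rather than an empty string.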
General flow of a regex

Regular expressions are often used in loops.
Static regexes in loops benefit from compiling.
Compile the regex to generate a regex object:
    myregexobj = re.compile(pattern)
Use a method on the regex object to generate a match object:
    mymatchobj = myregexobj.search(string)
These two steps can be combined:
    mymatchobj = re.search(pattern, string)
The match object can be investigated for matches:
    mymatchobj.group(0)   # Entire match
    mymatchobj.group(1)   # First group
    mymatchobj.start(1)   # Start of first group in string
    mymatchobj.end(1)     # End of first group in string
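The flow from pattern to regex object to match object can be followed in a small sketch. The pattern (two numbers separated by a hyphen) and the input string are hypothetical.

```python
import re

# Hypothetical pattern: two groups of digits separated by a hyphen.
myregexobj = re.compile(r'(\d+)-(\d+)')
mymatchobj = myregexobj.search('exon 120-245 forward')

entire = mymatchobj.group(0)        # entire match
first = mymatchobj.group(1)         # first group
second = mymatchobj.group(2)        # second group
first_start = mymatchobj.start(1)   # start of first group in string
first_end = mymatchobj.end(1)       # end of first group (exclusive)

# search() returns None when nothing matches, so test before using it.
no_match = myregexobj.search('no numbers here')

print(entire, first, second, first_start, first_end, no_match)
```

The start and end positions follow Python slicing conventions, so string[first_start:first_end] gives back the content of the group.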
Using regex - example

Testing if there is a match
    import re
    mystr = 'In this string is an accession AB somewhere'
    accregex = re.compile(r"\b[A-Z]{1,2}\d{6,8}\b")
    if accregex.search(mystr) is not None:
        print('Yeah, there is an accession number somewhere')

Capturing a match, notice the parentheses
    mystr = 'In this string is an accession AB somewhere'
    accregex = re.compile(r"\b([A-Z]{1,2}\d{6,8})\b")
    result = accregex.search(mystr)
    if result is None:
        print('No match')
    else:
        print("I got one", result.group(1))
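The sample string above shows no digits after AB, so the regex, which requires six to eight digits, would not match it as written. The sketch below therefore assumes hypothetical accession numbers such as AB123456 and applies the compiled regex in a loop over several made-up lines.

```python
import re

# Same pattern as on the slide: one or two capital letters followed by
# six to eight digits, delimited by word boundaries.
accregex = re.compile(r"\b([A-Z]{1,2}\d{6,8})\b")

# Hypothetical input lines; the accession numbers are invented.
lines = [
    'In this string is an accession AB123456 somewhere',
    'No accession here',
    'Too many letters: ABC123456',
    'Single letter prefix X1234567 also works',
]

found = []
for line in lines:
    result = accregex.search(line)   # the regex is compiled only once
    if result is None:
        print('No match:', line)
    else:
        print('I got one', result.group(1))
        found.append(result.group(1))
```

The third line is rejected because of the leading \b: there is no word boundary inside ABC, so the match cannot start at B or C.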
Methods of re library

Compile a regex, base for the rest of the methods, which are called on the compiled regex object
    compile(pattern)
Find a match anywhere in the string
    search(string)
Find a match only at the beginning of the string
    match(string)
Split string on a pattern
    split(string)
Return all matches as a list of strings
    findall(string)
Return a string where matches are replaced with the replacement string. count=0 means all occurrences
    sub(replacement, string, count=0)
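Each of the listed methods can be tried on a compiled regex object. In this sketch all patterns and input strings are hypothetical.

```python
import re

number = re.compile(r'\d+')

# match() only looks at the beginning; search() looks anywhere.
match_at_start = number.match('100 bp')
match_not_at_start = number.match('bp 100')
search_anywhere = number.search('bp 100')

# split() cuts the string wherever the pattern matches.
separator = re.compile(r'\s*[,;]\s*')
fields = separator.split('alpha , beta;gamma ;  delta')

# findall() returns all non-overlapping matches as a list of strings.
numbers = number.findall('chr1:100-250')

# sub() replaces matches; count limits the number of replacements.
lowercase_base = re.compile(r'[acgt]')
masked_all = lowercase_base.sub('N', 'ATGcatGGa')
masked_first = lowercase_base.sub('N', 'ATGcatGGa', count=1)

print(fields, numbers, masked_all, masked_first)
```

If the pattern contains capturing groups, findall returns the group contents instead of the entire matches.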