13.2 Fundamentals of Characters and Strings


Fundamentals of Characters and Strings

Characters are the fundamental building blocks of Python programs.
Function ord returns a character’s character code.
Function chr returns the character with the given character code.

    >>> ord('ff')
    Traceback (most recent call last):
      File " ", line 1, in ?
    TypeError: ord() expected a character, but string of length 2 found
    >>> ord('f')
    102
    >>> ord('.')
    46
    >>> chr(46)
    '.'
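To illustrate how ord and chr complement each other, here is a minimal sketch that shifts lowercase letters through the alphabet. The helper name shift_letter is hypothetical and not part of the slides, and the code assumes plain ASCII lowercase input.

```python
def shift_letter(ch, offset):
    """Shift a lowercase ASCII letter by offset places, wrapping around after 'z'."""
    base = ord('a')
    return chr(base + (ord(ch) - base + offset) % 26)

round_trip = chr(ord('f'))                              # chr undoes ord
shifted = "".join(shift_letter(c, 1) for c in "xyz")   # 'x'->'y', 'y'->'z', 'z'->'a'

print(round_trip)
print(shifted)
```

Working on character codes rather than characters is what makes the wrap-around arithmetic possible.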

2 Characters and Strings

Because characters and strings are fundamental in Python, there are many useful methods for dealing with them (fig. 13.2).


6 fig13_03.py

    # Fig. 13.3: fig13_03.py
    # Simple output formatting example.

    string1 = "Now I am here."

    print string1.center( 50 )   # centers calling string in a new string of 50 characters
    print string1.rjust( 50 )    # right-aligns calling string in a new string of 50 characters
    print string1.ljust( 50 )    # left-aligns calling string in a new string of 50 characters

Output:

                      Now I am here.
                                        Now I am here.
    Now I am here.

Remember: strings are immutable; a string-manipulating function returns a new string.

    >>> aString = 'gacataggt'
    >>> aString.upper()
    'GACATAGGT'
    >>> aString
    'gacataggt'
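The alignment methods and the immutability rule can be checked together. The sketch below is an illustration only; the variable names are made up for this example, and print is called with parentheses so it also runs under Python 3.

```python
label = "Now I am here."

centered = label.center(50)   # new string, padded on both sides
right = label.rjust(50)       # new string, padded on the left
left = label.ljust(50)        # new string, padded on the right

dna = 'gacataggt'
upper_dna = dna.upper()       # a new string; dna itself is not modified

print(repr(centered))
print(repr(right))
print(repr(left))
print(dna + " -> " + upper_dna)
```

Using repr makes the padding visible, which a plain print would hide for the left-aligned case.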

7 fig13_04.py

    # Fig. 13.4: fig13_04.py
    # Stripping whitespace from a string.

    string1 = "\t \n This is a test string. \t\t \n"

    print 'Original string: "%s"\n' % string1
    print 'Using strip: "%s"\n' % string1.strip()            # removes leading and trailing whitespace
    print 'Using left strip: "%s"\n' % string1.lstrip()      # removes leading whitespace
    print "Using right strip: \"%s\"\n" % string1.rstrip()   # removes trailing whitespace

Output:

    Original string: " This is a test string. "
    Using strip: "This is a test string."
    Using left strip: "This is a test string. "
    Using right strip: " This is a test string."
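Because printed whitespace is hard to see, the effect of the three strip methods is easier to verify by comparing the resulting strings directly. This is a small sketch with illustrative variable names; the last line also shows that strip accepts an optional set of characters to remove instead of whitespace.

```python
raw = "\t \n This is a test string. \t\t \n"

both = raw.strip()         # leading and trailing whitespace removed
left_only = raw.lstrip()   # only leading whitespace removed
right_only = raw.rstrip()  # only trailing whitespace removed

trimmed = "xxhixx".strip("x")   # strip the given characters instead of whitespace

print(repr(both))
print(repr(left_only))
print(repr(right_only))
print(repr(trimmed))
```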

Searching Strings

Methods find, index, rfind and rindex search for substrings in a calling string.
Methods startswith and endswith return true (1 in old Python versions, True in current ones) if a calling string begins with or ends with a given string, respectively.
Method count returns the number of occurrences of a substring in a calling string.
Method replace substitutes its second argument for its first argument in a calling string.

9 Searching Strings

    s = "actgccgacgatcgcgcatcagcg"
    index_string= " " # length 24

    print s
    print index_string, "\n"

    print "gc occurs %d times" % s.count( "gc" )
    print "(%d times from index 13)\n" % s.count( "gc", 13, len(s) )
    # same result as s[13:len(s)].count("gc")

    print "first occurrence of gc: index %d" % s.find( "gc" )
    print "first occurrence of x: index %d\n" % s.find( "x" )
    # -1 is a number, program breaks down later if string not found?
    # index(): as find() but raises exception if string is not found

    if s.startswith( "AC" ):
        print "sequence starts with AC"
    else:
        print "sequence doesn't start with AC"   # case sensitive!

    print "last occurrence of gc: index %d\n" % s.rfind( "gc" )

    print "replacing gc with GC:\n%s\n" %s.replace( "gc", "GC" )
    print "replace 2 occurrences max:\n%s" %s.replace( "gc", "GC", 2 )

Output:

    actgccgacgatcgcgcatcagcg

    gc occurs 4 times
    (3 times from index 13)

    first occurrence of gc: index 3
    first occurrence of x: index -1

    sequence doesn't start with AC
    last occurrence of gc: index 21

    replacing gc with GC:
    actGCcgacgatcGCGCatcaGCg

    replace 2 occurrences max:
    actGCcgacgatcGCgcatcagcg
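The difference between find (returns -1 when nothing is found) and index (raises an exception) matters as soon as the result is used further. The sketch below is one possible way to collect every occurrence of a substring by calling find repeatedly; find_all is a hypothetical helper name, not a built-in string method.

```python
s = "actgccgacgatcgcgcatcagcg"

def find_all(text, sub):
    """Collect the start index of every (possibly overlapping) occurrence of sub."""
    positions = []
    start = text.find(sub)
    while start != -1:                      # find signals "not found" with -1
        positions.append(start)
        start = text.find(sub, start + 1)   # continue just after the last hit
    return positions

positions = find_all(s, "gc")

try:
    s.index("x")          # index raises ValueError instead of returning -1
    index_failed = False
except ValueError:
    index_failed = True

print(positions)
print(index_failed)
```

Checking the return value of find against -1 before using it avoids the silent failure mentioned in the slide's comment.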

Splitting and Joining Strings

Tokenization breaks statements into individual components (or tokens).
Delimiters, typically whitespace characters, separate tokens.

11 fig13_06.py

    # Fig. 13.6: fig13_06.py
    # Token splitting and delimiter joining.

    # splitting strings
    string1 = "A, B, C, D, E, F"

    print "String is:", string1
    # splits calling string by whitespace characters
    print "Split string by spaces:", string1.split()
    # splits calling string by specified character
    print "Split string by commas:", string1.split( "," )
    # returns list of tokens split by first two comma delimiters
    print "Split string by commas, max 2:", string1.split( ",", 2 )
    print

    # joining strings
    list1 = [ "A", "B", "C", "D", "E", "F" ]
    string2 = "___"

    print "List is:", list1
    # joins list elements with calling string as a delimiter to create new string
    print 'Joining with ___ : %s' % ( string2.join ( list1 ) )
    # joins list elements with calling quoted string as delimiter to create new string
    print 'Joining with -.- :', "-.-".join( list1 )

Output:

    String is: A, B, C, D, E, F
    Split string by spaces: ['A,', 'B,', 'C,', 'D,', 'E,', 'F']
    Split string by commas: ['A', ' B', ' C', ' D', ' E', ' F']
    Split string by commas, max 2: ['A', ' B', ' C, D, E, F']

    List is: ['A', 'B', 'C', 'D', 'E', 'F']
    Joining with ___ : A___B___C___D___E___F
    Joining with -.- : A-.-B-.-C-.-D-.-E-.-F
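Splitting on a comma leaves the spaces attached to the tokens, as the figure's output shows. A common follow-up, sketched here with illustrative variable names, is to strip each token after splitting and then join the cleaned tokens with a new delimiter.

```python
line = "A, B, C, D, E, F"

# split on commas, then remove the whitespace left around each token
tokens = [token.strip() for token in line.split(",")]

rejoined = "-".join(tokens)       # the delimiter string is the one calling join
first_two = line.split(",", 2)    # at most two splits, the rest stays together

print(tokens)
print(rejoined)
print(first_two)
```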

12 Intermezzo 1 / html

1. Copy and run this program: /users/chili/CSS.E03/ExamplePrograms/random_text.py
   What does it do?
2. Extend the program: search the text string it produces and print out the index of the first occurrence of 11 (you might look at Figure 13.2 on page 438ff to find a suitable string method). Tell the user if there is no '11'.
3. Split the text into a list of substrings using '11' as a delimiter, and print out the list.

13 Solution

    from random import randrange

    text = ""
    for i in range(150):
        next_char = chr( randrange(48, 58) )
        text = "".join( [text, next_char] )
    print text

    i = text.find( "11" )
    if i>=0:
        print "'11' found at index", i

    splittext = text.split( "11" )
    print "text split in %d pieces" %len(splittext)
    for piece in splittext:
        print piece

Output:

    '11' found at index 4
    text split in 4 pieces

14 Regular Expressions – Motivation

Problem: search a text for any Danish

    import re

    text1 = "No Danish address here fj3a"
    text2 = "But here: what a address"

    regularExpression =
    compiledRE = re.compile( regularExpression )

    SRE_Match1 = compiledRE.search( text1 )
    SRE_Match2 = compiledRE.search( text2 )

    if SRE_Match1:
        print "Text1 contains this Danish address:", SRE_Match1.group()
    else:
        print "Text1 contains no Danish address"

    if SRE_Match2:
        print "Text2 contains this Danish address:", SRE_Match2.group()
    else:
        print "Text2 contains no Danish address"

Output:

    Text1 contains no Danish address
    Text2 contains this Danish address:

Regular Expressions

Regular expressions provide a more efficient and powerful alternative to string search methods.
Instead of searching for a specific string we can search for a text pattern:
–We don’t have to search explicitly for ‘Monday’, ‘Tuesday’, ‘Wednesday’, ...: there is a pattern in these search strings.
–A regular expression is a text pattern.
In Python, regular expression processing capabilities are provided by module re.
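The weekday remark can be made concrete. The sketch below compares a search for each day name with string methods against a single pattern; the sample sentence and the pattern "[A-Z][a-z]+day" are assumptions chosen for illustration (the pattern would also accept other capitalized words ending in "day", such as "Holiday").

```python
import re

text = "The lab meets on Tuesday and again on Friday."

# string methods: every day name has to be searched for explicitly
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
found_by_methods = [day for day in day_names if text.find(day) != -1]

# regular expression: one pattern describes all of the names
found_by_pattern = re.findall("[A-Z][a-z]+day", text)

print(found_by_methods)
print(found_by_pattern)
```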

16 Example

Simple regular expression:

    regExp = "football"

This matches only the string "football".
To search a text for regExp, we can use

    re.search( regExp, text )

17 Compiling Regular Expressions

    re.search( regExp, text )

1. Compile regExp to a special format (an SRE_Pattern object)
2. Search for this SRE_Pattern in text
3. Result is an SRE_Match object

If we need to search for regExp several times, it is more efficient to compile it once and for all:

    compiledRE = re.compile( regExp )

1. Now compiledRE is an SRE_Pattern object

    compiledRE.search( text )

2. Use the search method of this SRE_Pattern to search text
3. Result is the same SRE_Match object
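A short sketch can confirm that both routes give the same result. The variable names are illustrative; the sample text is taken from the football example in these slides.

```python
import re

text = "Here are the football results: Bosnia - Denmark 0-7"
reg_exp = "football"

# route 1: let re.search compile and search in one call
direct = re.search(reg_exp, text)

# route 2: compile once, then reuse the pattern object for every search
compiled = re.compile(reg_exp)
via_compiled = compiled.search(text)
miss = compiled.search("no match in this text")   # an unsuccessful search gives None

span = direct.span()   # (start, end) indices of the match in text

print(direct.group())
print(span)
print(miss)
```

Testing the result with a plain if works because a match object counts as true and None counts as false.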

18 Searching for 'football'

    import re

    text1 = "Here are the football results: Bosnia - Denmark 0-7"
    text2 = "We will now give a complete list of python keywords."

    regularExpression = "football"

    # compile regular expression and get the SRE_Pattern object
    compiledRE = re.compile( regularExpression )

    # use the same SRE_Pattern object to search both texts and get
    # two SRE_Match objects (or None if the search was unsuccessful)
    SRE_Match1 = compiledRE.search( text1 )
    SRE_Match2 = compiledRE.search( text2 )

    if SRE_Match1:
        print "Text1 contains the substring 'football'"
    if SRE_Match2:
        print "Text2 contains the substring 'football'"

Output:

    Text1 contains the substring 'football'

19 Building more sophisticated patterns

Metacharacters: regular-expression syntax elements
? : matches zero or one occurrences of the expression it follows
+ : matches one or more occurrences of the expression it follows
* : matches zero or more occurrences of the expression it follows

    # search for zero or one t, followed by two a's:
    regExp1 = "t?aa"

    # search for g followed by one or more c's followed by a:
    regExp2 = "gc+a"

    # search for ct followed by zero or more g's followed by a:
    regExp3 = "ctg*a"

20 Metacharacter example

    import re

    text = "gaaagccactgggggggggggggga"

    # compile all three regular expressions into SRE_Pattern objects
    regExp1 = "t?aa"
    compiledRE1 = re.compile( regExp1 )
    regExp2 = "gc+a"
    compiledRE2 = re.compile( regExp2 )
    regExp3 = "ctg*a"
    compiledRE3 = re.compile( regExp3 )

    # use the three SRE_Pattern objects to search the text and get three SRE_Match objects
    SRE_Match1 = compiledRE1.search( text )
    SRE_Match2 = compiledRE2.search( text )
    SRE_Match3 = compiledRE3.search( text )

    if SRE_Match1:
        print "Text contains the regular expression", regExp1
    if SRE_Match2:
        print "Text contains the regular expression", regExp2
    if SRE_Match3:
        print "Text contains the regular expression", regExp3

Output:

    Text contains the regular expression t?aa
    Text contains the regular expression gc+a
    Text contains the regular expression ctg*a
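The example reports only that each pattern was found. As a small extension (an illustrative sketch, not part of the slides), calling group() on each match object shows which substring the quantifiers ?, + and * actually consumed.

```python
import re

text = "gaaagccactgggggggggggggga"

matches = {}
for reg_exp in ["t?aa", "gc+a", "ctg*a"]:
    match = re.search(reg_exp, text)
    # group() returns the substring that matched; None marks a failed search
    matches[reg_exp] = match.group() if match else None

for reg_exp in ["t?aa", "gc+a", "ctg*a"]:
    print("%s matched %s" % (reg_exp, matches[reg_exp]))
```

Here "t?aa" matches without any t, "gc+a" consumes two c's, and "ctg*a" consumes the whole run of g's up to the final a.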

21 A few more metacharacters

^ : indicates placement at the beginning of the string
$ : indicates placement at the end of the string

    # search for zero or one t, followed by two a's
    # at the beginning of the string:
    regExp1 = "^t?aa"

    # search for g followed by one or more c's followed by a
    # at the end of the string:
    regExp2 = "gc+a$"

    # whole string should match ct followed by zero or more
    # g's followed by a:
    regExp3 = "^ctg*a$"

22 Metacharacter example

    import re

    text1 = "aactggagcccca"
    text2 = "ctgga"

    regExp1 = "^t?aa"
    regExp2 = "gc+a$"
    regExp3 = "^ctg*a$"

    # this time we use re.search() to search the text for the regular
    # expressions directly without compiling them in advance
    if re.search( regExp1, text1 ):
        print "Text1 contains the regular expression", regExp1
    if re.search( regExp2, text1 ):
        print "Text1 contains the regular expression", regExp2
    if re.search( regExp3, text1 ):
        print "Text1 contains the regular expression", regExp3
    if re.search( regExp3, text2 ):
        print "Text2 contains the regular expression", regExp3

Output:

    Text1 contains the regular expression ^t?aa
    Text1 contains the regular expression gc+a$
    Text2 contains the regular expression ^ctg*a$
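Anchoring a pattern at both ends turns "contains" into "is exactly". The sketch below wraps that idea in a helper; matches_whole is a hypothetical name, and the non-capturing group (?:...) is added as a precaution so that the anchors apply to the whole pattern even when it contains |.

```python
import re

def matches_whole(reg_exp, text):
    """Return True if reg_exp describes all of text, not just a part of it."""
    # (?:...) groups the pattern without capturing, so ^ and $ bind to all of it
    return re.search("^(?:" + reg_exp + ")$", text) is not None

anywhere = re.search("gc+a", "ttgccatt") is not None    # pattern may appear anywhere
at_start = re.search("^gc+a", "ttgccatt") is not None   # must begin the string
at_end = re.search("gc+a$", "ttgcca") is not None       # must end the string

whole_ok = matches_whole("ctg*a", "ctgga")
whole_bad = matches_whole("ctg*a", "actgga")

print(anywhere)
print(at_start)
print(at_end)
print(whole_ok)
print(whole_bad)
```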

23 Yet more metacharacters..

{} : indicate repetition
| : match either regular expression to the left or to the right
() : indicate a group (a part of a regular expression)

    # search for four t's followed by three c's:
    regExp1 = "t{4}c{3}"

    # search for g followed by 1 to 3 c's at the end of the string:
    regExp2 = "gc{1,3}$"

    # search for either gg or cc:
    regExp3 = "gg|cc"

    # search for either gg or cc followed by tt:
    regExp4 = "(gg|cc)tt"
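To see what each of these constructs matches, the following sketch applies four patterns to one made-up sequence (the sequence is an assumption chosen so that every pattern matches) and records the matched substring.

```python
import re

seq = "acttttcccgggcctt"   # hypothetical test sequence

found = {}
for reg_exp in ["t{4}c{3}", "gc{1,3}", "gg|cc", "(gg|cc)tt"]:
    found[reg_exp] = re.search(reg_exp, seq).group()

# without the parentheses, "gg|cctt" would mean: gg, or cc followed by tt
grouped = re.search("(gg|cc)tt", seq).group(1)   # the part matched inside ( )

for reg_exp in ["t{4}c{3}", "gc{1,3}", "gg|cc", "(gg|cc)tt"]:
    print("%s matched %s" % (reg_exp, found[reg_exp]))
print(grouped)
```

The search always reports the leftmost match, which is why "gg|cc" finds the first cc rather than the later gg.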

24 Escaping metacharacters

\ : used to escape (to ‘keep’) a metacharacter

    # search for x followed by + followed by y:
    regExp1 = "x\+y"

    # search for ( followed by x followed by y:
    regExp2 = "\(xy"

    # search for x followed by ? followed by y:
    regExp3 = "x\?y"

    # search for x followed by at least one ^ followed by 3:
    regExp4 = "x\^+3"
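The effect of escaping is easiest to see by searching for the literal text x+y with and without the backslash. The sketch also uses two things the slide does not mention: a raw string (r"...") so that Python itself leaves the backslash alone, and re.escape, a function of the re module that escapes a literal string automatically.

```python
import re

formula = "x+y"

# unescaped: + is a quantifier, so this means "one or more x, then y"
unescaped = re.search("x+y", formula)

# escaped by hand: \+ stands for a literal plus sign
escaped = re.search(r"x\+y", formula)

# escaped automatically from the literal text
automatic = re.search(re.escape(formula), formula)

print(unescaped)
print(escaped.group())
print(automatic.group())
```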

25 Intermezzo 2 Intermezzi/ html

Copy and run this program: /users/chili/CSS.E03/ExamplePrograms/sequence_searching.py
What does it do? Put more regular expressions in the list to search for these patterns:
1. 6 c's followed by 3 g's
2. cc, followed by at least one g, followed by cc
3. double triplets (e.g. aaa followed by ccc)
4. any number of a's, followed by either cc or gg, followed by c at the end of the string

26 Solution

    import re

    # this is a dna sequence in fasta format:
    seq = """>U03518 Aspergillus awamori\naacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgt
    ctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccccg
    ggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgattgaatg
    caatcagttaaaactttcaacaatggatctcttggttccggc"""

    regular_expressions = [ "a{4}", "c+(t|g)tt", "g*c$", "(gt){2}",
                            "c{6}g{3}",               # 1. 6 c's followed by 3 g's
                            "ccg+cc",                 # 2. cc, followed by at least one g, followed by cc
                            "(aaa|ccc|ggg|ttt){2}",   # 3. double triplets (e.g. aaa followed by ccc)
                            "a*(cc|gg)c$" ]           # 4. any number of a's, followed by either cc or gg,
                                                      #    followed by c at the end of the string

    for regExp in regular_expressions:
        if re.search( regExp, seq ):
            print "found", regExp

27 Character Classes

A character class matches one of the characters in the class:
[abc] matches either a or b or c.
d[abc]d matches dad and dbd and dcd.
[ab]+c matches e.g. ac, abc, bac, bbabaabc, ..
Metacharacter ^ at the beginning negates a character class:
[^abc] matches any character other than a, b and c.
A class can use - to indicate a range of characters:
[a-e] is the same as [abcde].
Metacharacters such as + and * are taken literally in a class (only ^, -, ] and \ keep a special meaning):
[a+b*] matches a or + or b or *.
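Each of the class examples above can be tried out with re.findall, a function of the re module that returns all non-overlapping matches as a list. The test strings in this sketch are made up for illustration.

```python
import re

one_of = re.findall("d[abc]d", "dad dbd dcd ddd")           # ddd is not matched
repeated = re.findall("[ab]+c", "ac abc bac bbabaabc c")    # a lone c is not matched
negated = re.findall("[^abc]", "abcde")                     # anything except a, b, c
ranged = re.findall("[a-e]", "hello")                       # same as [abcde]
literal = re.findall("[a+b*]", "a+b*c")                     # + and * are literal here

print(one_of)
print(repeated)
print(negated)
print(ranged)
print(literal)
```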

28 Special Sequences

Special sequence: shortcut for a common character class

    regExp1 = "\d\d:\d\d:\d\d [AP]M"
    # (possibly illegal) time stamps like 04:23:19 PM

    regExp2 =
    # any Danish address
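The remark "possibly illegal" is worth a sketch: \d accepts any digit, so the pattern also accepts a time stamp such as 99:99:99 AM. One way to handle this, assuming a 12-hour clock, is to let the pattern find candidates and check the numeric ranges afterwards; is_legal is a hypothetical helper and the sample text is made up.

```python
import re

time_stamp = re.compile(r"\d\d:\d\d:\d\d [AP]M")   # raw string keeps the backslashes

text = "Login at 04:23:19 PM, logout at 99:99:99 AM."
stamps = time_stamp.findall(text)   # every substring that has the right shape

def is_legal(stamp):
    """Check the numeric ranges that the pattern alone cannot express."""
    clock = stamp.split(" ")[0]
    hours, minutes, seconds = [int(part) for part in clock.split(":")]
    return 1 <= hours <= 12 and minutes < 60 and seconds < 60

legal = [stamp for stamp in stamps if is_legal(stamp)]

print(stamps)
print(legal)
```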

29 Other regular expression functions

    import re

    text = "1a2b3c4d5e6f"

    print re.sub( "\d", "*", text )      # substitute * for any digit (i.e. replace digit with *)
    print
    print re.sub( "\d", "*", text, 3 )   # substitute * for any digit, max 3 times
    print
    print re.split( "\d", text )         # delimiter: any digit
    print
    print re.split( "[a-z]", text )      # delimiter: any lower-case letter
    print

    if re.search( "\db", text ):   # the RE of search() can appear anywhere in text
        print "method search found \db"
    if re.match( "\db", text ):    # the RE of match() must appear in beginning of text
        print "method match found \db"
    if re.match( "\da", text ):
        print "method match found \da"

Output:

    *a*b*c*d*e*f

    *a*b*c4d5e6f

    ['', 'a', 'b', 'c', 'd', 'e', 'f']

    ['1', '2', '3', '4', '5', '6', '']

    method search found \db
    method match found \da

30 Groups

We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object:

    text = "But here: what a address"
    regExp =    # match Danish address
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )
    if SRE_Match:
        print "Text contains this Danish address:", SRE_Match.group()

Grouping

The substring that matches the whole RE is called a group.
An RE can be subdivided into smaller groups (parts).
Each group of the matching substring can be extracted.
Metacharacters ( and ) denote a group.

    text = "But here: what a address"

    # Match any Danish address; define two groups: username and domain:
    regExp =
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )
    if SRE_Match:
        print "Text contains this Danish address:", SRE_Match.group()
        print "Username:", SRE_Match.group(1), "\nDomain:", SRE_Match.group(2)

Output:

    Text contains this Danish address:
    Username: chili
    Domain: daimi.au.dk
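Since the pattern itself is missing from this transcript, here is a self-contained sketch of the same grouping idea. Both the sample address and the pattern are assumptions made up for illustration (the pattern is deliberately simplified and is not the one used on the slide).

```python
import re

# hypothetical text and address, invented for this example
text = "Write to someone@example.dk for details"

# assumed, simplified pattern: group 1 = username, group 2 = domain ending in .dk
reg_exp = r"(\w+)@([\w.]+\.dk)"

match = re.search(reg_exp, text)
whole = match.group()       # the substring matched by the whole pattern
username = match.group(1)   # what the first ( ) matched
domain = match.group(2)     # what the second ( ) matched

print(whole)
print(username)
print(domain)
```

group() without an argument is the same as group(0), and groups() returns all numbered groups as a tuple.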

32 Greedy vs. non-greedy operators

+ and * are greedy operators
–They attempt to match as many characters as possible even if this is not the desired behavior
+? and *? are non-greedy operators
–They attempt to match as few characters as possible

33 Greedy vs. non-greedy operators

    # Task: Find a space-separated list of digits, extract the first number.
    import re

    text = " blah blah"

    # use greedy operator +
    regExp = "(\d )+"
    print "Greedy operator:", re.match( regExp, text ).group()

    # use non-greedy version instead (by putting a ? after the +)
    regExp = "(\d )+?"
    print "Non-greedy operator:", re.match( regExp, text ).group()

Output:

    Greedy operator:
    Non-greedy operator: 1
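The digits of the input text did not survive in this transcript, so the sketch below uses a made-up input ("1 2 3 blah blah") to show the same contrast. The second half applies the idea to a tag-like string, which is an added illustration and not from the slides.

```python
import re

text = "1 2 3 blah blah"   # hypothetical input; the slide's own digits are missing

greedy = re.match(r"(\d )+", text).group()        # takes as many "digit space" pairs as possible
non_greedy = re.match(r"(\d )+?", text).group()   # stops after the first pair
first_number = int(non_greedy)                    # int() ignores the trailing space

markup = "<b>bold</b>"
greedy_tag = re.search("<.+>", markup).group()    # runs to the last >
lazy_tag = re.search("<.+?>", markup).group()     # stops at the first >

print(repr(greedy))
print(repr(non_greedy))
print(first_number)
print(greedy_tag)
print(lazy_tag)
```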