Published by Rolf Jenkins. Modified over 9 years ago.
1
REGULAR EXPRESSIONS 4
DAY 9 - 9/15/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
2
Course organization
15-Sept-2014 NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
3
Review
The quiz was the review.
4
4.3.4. Summary table
meta-character  matches              name          notes
a|b             a or b               disjunction
(ab)            a and b              grouping      only outputs what is in (); use (?:ab) to output the rest of the pattern too
[ab]            a or b               range         [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]            all but a            negation
a{m,n}          from m to n of a     repetition    a{n} a number n of a
^a              a at start of S
a$              a at end of S
a+              one or more of a                   a+? lazy +
a*              zero or more of a    Kleene star   a*? lazy *
a?              with or without a    optionality   a?? lazy ?
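A few rows of the table can be checked directly with re.findall(). This is only an illustrative sketch: the short test strings are made up for the purpose and do not come from the course materials.

```python
import re

# Disjunction and negation
either = re.findall(r'a|b', 'abc')            # ['a', 'b']
not_a = re.findall(r'[^a]', 'abca')           # ['b', 'c']

# Repetition: two or three of a (no space inside the braces)
runs = re.findall(r'a{2,3}', 'a aa aaaa')     # ['aa', 'aaa']

# Greedy versus lazy +
greedy = re.findall(r'a+', 'aaa')             # ['aaa']
lazy = re.findall(r'a+?', 'aaa')              # ['a', 'a', 'a']

# Capturing group versus non-capturing group
captured = re.findall(r'(ab)+', 'ababab')     # ['ab']
whole = re.findall(r'(?:ab)+', 'ababab')      # ['ababab']

print(either, not_a, runs, greedy, lazy, captured, whole)
```

The last pair shows the note on grouping: with (ab), re.findall() returns only the captured part, while (?:ab) lets the whole match come through.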
5
§4. Regular expressions 4
There is a bit more to say.
6
Open Spyder
7
Sample string
>>> import re
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
8
4.4. Character classes
class  abbreviates       name                     notes
\w     [a-zA-Z0-9_]      alphanumeric             it’s really alphanumeric and underscore, but we are lazy
\W     [^a-zA-Z0-9_]     not alphanumeric
\d     [0-9]             digit
\D     [^0-9]            not a digit
\s     [ \t\v\n\r\f]     whitespace
\S     [^ \t\v\n\r\f]    not whitespace
\t                       horizontal tab
\v                       vertical tab
\n                       newline
\r                       carriage return
\f                       form-feed
\b                       word boundary
\B                       not a word boundary
\A     ^
\Z     $
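The classes in the table can be tried on a short string. The sketch below uses a made-up test string (not from the course materials) that contains a tab and a newline. One caveat: in Python 3 the classes \w, \d and \s also match non-ASCII letters, digits and spaces unless the re.ASCII flag is set, so the bracket expressions in the table are best read as an approximation for English text.

```python
import re

# A made-up string with a space, a tab and a newline in it
text = 'Day 9\t9/15/14\nLING 3820 & 6820'

digits = re.findall(r'\d+', text)          # ['9', '9', '15', '14', '3820', '6820']
spaces = re.findall(r'\s', text)           # [' ', '\t', '\n', ' ', ' ', ' ']
words = re.findall(r'\w+', text)           # ['Day', '9', '9', '15', '14', 'LING', '3820', '6820']
four_digit = re.findall(r'\b\d{4}\b', text)  # ['3820', '6820']
first = re.findall(r'\A\w+', text)         # ['Day']
last = re.findall(r'\w+\Z', text)          # ['6820']

print(digits, spaces, words, four_digit, first, last)
```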
9
4.4.2. Raw string notation with r''
Python interprets a regular expression just like any other string. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is also used by Python for its own escape sequences. For instance, we just met the class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python’s own escape sequence for the backspace character, \b.
10
Raw text
The way to resolve this ambiguity is to prefix an r to a regular expression. The r marks the regular expression as raw text, so Python does not process it for special characters. The previous example is augmented with the raw text notation below:
>>> re.findall(r'\b\w\w\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(r'\b\w{2}\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
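The collision between the two meanings of \b, and the effect of the r prefix, can be seen by comparing the same pattern written both ways. This is a minimal sketch; the variable names plain and raw are chosen here for illustration.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

plain = '\bto\b'    # Python turns each \b into a single backspace character
raw = r'\bto\b'     # the r prefix keeps each backslash and b as two characters

print(len(plain))              # 4
print(len(raw))                # 6
print(plain[0] == '\x08')      # True: the first character is a backspace
print(re.findall(plain, S))    # []: there are no backspaces in S
print(re.findall(raw, S))      # ['to', 'to']
```

Without the r, the regular expression engine never sees a word boundary at all; it is asked to find the word 'to' between two backspace characters.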
11
More raw text
As a further illustration, what do you think are the non-alphanumeric characters in the Shakespeare text?
>>> re.findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']
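Most of that output is whitespace. As a possible follow-up (not part of the original slides), the matches can be tallied with collections.Counter, and the punctuation can be isolated by excluding both \w and \s inside a negated range.

```python
import re
from collections import Counter

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Count each kind of non-alphanumeric character
tally = Counter(re.findall(r'\W', S))
print(tally[' '], tally['\n'], tally[','])   # 24 2 3

# Neither alphanumeric nor whitespace: punctuation only
punctuation = re.findall(r'[^\w\s]', S)
print(punctuation)                           # [':', ',', ',', ',', '.']
```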
12
Practice
4.3.5. Further practice of variable-length matching
4.6. Further practice
Practice with answers on a different page
13
§5. Lists 1
There is a bit more to say.
14
Introduction
In working with re.findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
>>> re.findall(r'\b[a-zA-Z]{4}\b', S)
['This', 'self', 'true', 'must', 'Thou', 'then']
15
Definition of list
A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare’s A Midsummer Night’s Dream represented as a list:
>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])
L is a list of strings. You may think that a string is also a list of characters, and you would be correct for ordinary English, but in pythonic English, the word ‘list’ refers exclusively to a sequence of objects delimited by square brackets.
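The distinction between a list and a string can be checked in the interpreter. The sketch below assumes Python 3, where type() prints as <class 'list'>; under Python 2 the same call prints <type 'list'>.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(type(L) is list)            # True: L is a list
print(type(L[0]) is str)          # True: its first member is a string
print(isinstance('Love', list))   # False: a string is not a list in Python's sense

# A string can be converted into a list of its characters explicitly
chars = list('Love')
print(chars)                      # ['L', 'o', 'v', 'e']
```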
16
An example with numerical objects
>>> i = 2
>>> type(i)
>>> I = [0, 1, i, 3]
>>> type(I)
>>> type(I[0])
>>> n = 2.3
>>> type(n)
>>> N = [2.0, 2.1, 2.2, n]
>>> type(N)
>>> type(N[0])
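Run as a script, the same statements give the results below. This is a small sketch that prints only the type names, so that it reads the same whichever Python version is used.

```python
i = 2
I = [0, 1, i, 3]
n = 2.3
N = [2.0, 2.1, 2.2, n]

# The container is a list in both cases; only the members differ in type
int_types = (type(i).__name__, type(I).__name__, type(I[0]).__name__)
float_types = (type(n).__name__, type(N).__name__, type(N[0]).__name__)

print(int_types)     # ('int', 'list', 'int')
print(float_types)   # ('float', 'list', 'float')
print(I[2] == i)     # True: the variable's value was stored in the list
```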
17
Most of the string methods work just as well on lists
>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+'!'
>>> len(L+'!')
>>> L*2
>>> len(L*2)
>>> L.count('the')
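Note that "most" is not "all". In the sketch below, which assumes the list L from the definition slide, the expression L+'!' raises a TypeError, because + joins a list only to another list; wrapping the string in square brackets makes it work.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(len(L))            # 12
print(sorted(L)[:3])     # [',', '.', 'Love']: punctuation and capitals sort first
print(len(set(L)))       # 10: 'with' and 'the' each occur twice
print(len(L * 2))        # 24
print(L.count('the'))    # 2

# Concatenation needs a list on both sides of +
try:
    L + '!'
    plus_string_ok = True
except TypeError:
    plus_string_ok = False
print(plus_string_ok)    # False
print(len(L + ['!']))    # 13
```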
18
String methods work on lists, cont.
>>> L.count('Love')
>>> L.count('love')
>>> L.index('with')
>>> L.rindex('with')
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[-2:2]
>>> L[:]
>>> L[:-1]+['!']
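A few of these deserve a closer look. In the sketch below, which again assumes the list L from the definition slide, the rindex() call is replaced by a check, because lists have no rindex() method; the reversed-list workaround for finding the last occurrence is one possible approach and is not taken from the course.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(L.count('Love'), L.count('love'))   # 1 0: matching is case-sensitive
print(L.index('with'))                    # 3: position of the first 'with'
print(hasattr(L, 'rindex'))               # False: rindex() exists for strings only

# One way to find the last occurrence: search the reversed list
last_with = len(L) - 1 - L[::-1].index('with')
print(last_with)                          # 8

print(L[:2])             # ['Love', 'looks']
print(L[-2:])            # ['mind', '.']
print(L[-2:2])           # []: the start lies after the stop
print(L[:] == L)         # True: a full slice is a copy with the same members
print(L[:] is L)         # False: but it is a different object

exclaimed = L[:-1] + ['!']   # swap the final period for an exclamation mark
print(exclaimed[-2:])        # ['mind', '!']
```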
19
Q1
MIN 5.0  AVG 9.5  MAX 10.0
20
Next time
More on lists