Published by Eustacia Benson. Modified over 8 years ago.
1
Regular expressions
Day 11
LING 681.02 Computational Linguistics
Harry Howard
Tulane University
2
18-Sept-2009, LING 681.02, Prof. Howard, Tulane University
Course organization
http://www.tulane.edu/~ling/NLP/
NLTK is installed on the computers in this room!
How would you like to use the Provost's $150?
3
NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level
4
NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats
5
Notation in Python (Table 3.3)

Operator     Behavior
.            Wildcard, matches any character
^abc         Matches some pattern abc at the start of a string
abc$         Matches some pattern abc at the end of a string
[abc]        Matches one of a set of characters
[A-Z0-9]     Matches one of a range of characters
ed|ing|s     Matches one of the specified strings (disjunction)
*            Zero or more of previous item, e.g. a*, [a-z]* (aka Kleene closure/star)
+            One or more of previous item, e.g. a+, [a-z]+
?            Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}          Exactly n repeats where n is a non-negative integer
{n,}         At least n repeats
{,n}         No more than n repeats
{m,n}        At least m and no more than n repeats
a(b|c)+      Parentheses that indicate the scope of the operators
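The operators in Table 3.3 can be tried out with Python's standard re module. The following is a minimal sketch: the word list is hypothetical and was chosen only so that each pattern has something to match.

```python
import re

# Hypothetical word list, chosen only to exercise the operators in Table 3.3.
words = ["abandoned", "walking", "cats", "run", "a1", "coool", "cool", "banana"]

# ed|ing|s is a disjunction of suffixes; the parentheses limit the scope
# of | so that the $ anchor applies to all three alternatives.
suffixed = [w for w in words if re.search(r"(ed|ing|s)$", w)]

# [a-z] and [0-9] are ranges; ^ and $ force the whole word to match.
letter_digit = [w for w in words if re.search(r"^[a-z][0-9]$", w)]

# {m,n}: at least 3 and no more than 5 repeats of "o".
stretched = [w for w in words if re.search(r"^co{3,5}l$", w)]

# + applies to the whole parenthesised group "an".
repeated = [w for w in words if re.search(r"^b(an)+a$", w)]

print(suffixed)      # ['abandoned', 'walking', 'cats']
print(letter_digit)  # ['a1']
print(stretched)     # ['coool']
print(repeated)      # ['banana']
```

Note that re.search looks for the pattern anywhere in the string, so the ^ and $ anchors are what restrict a match to the start, the end, or the whole word.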
6
Raw strings

To the Python interpreter, a regex is just like any other string. If the string contains a backslash followed by particular characters, the interpreter treats the sequence specially. For example, \b normally denotes the backspace character, but in re it denotes a word boundary.

In general, when using regexes containing a backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string. For example, the raw string r'\band\b' contains two \b symbols that are interpreted by re as matching word boundaries instead of backspaces.

If you get into the habit of using r'...' for regular expressions, as we will do from now on, you will avoid having to think about these complications.
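The difference between the plain and the raw version of the same pattern can be checked directly. This is a small sketch using only the standard re module; the sample sentences are made up for illustration.

```python
import re

text = "salt and pepper"  # made-up sample sentence

# Without the r prefix, Python turns \b into a backspace character (\x08)
# before re ever sees the pattern, so re looks for literal backspaces.
plain = '\band\b'

# With the r prefix, the backslashes reach re unchanged,
# and re reads \b as a word boundary.
raw = r'\band\b'

print(len(plain), len(raw))           # 5 7
print(re.findall(plain, text))        # []
print(re.findall(raw, text))          # ['and']
print(re.findall(raw, "sandy band"))  # []
```

The last line shows the word-boundary behaviour: "and" inside "sandy" or "band" is not a standalone word, so it is not matched.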
7
NLPP §3 Processing raw text §3.5 Useful Applications of Regular Expressions
8
Some applications
Extracting word pieces
Doing more with word pieces
Finding word stems
Searching tokenized text
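Two of these applications, extracting word pieces and finding word stems, can be sketched with the standard re module alone. The example word, the suffix list, and the helper name stem are illustrative assumptions, not a definitive stemmer.

```python
import re

word = "supercalifragilisticexpialidocious"

# Extracting word pieces: findall returns every non-overlapping match.
vowels = re.findall(r"[aeiou]", word)
vowel_runs = re.findall(r"[aeiou]{2,}", word)

def stem(token):
    """Return the token with one suffix from a toy list removed (hypothetical helper)."""
    # (.*?) is non-greedy, so the stem is kept as short as possible and the
    # optional suffix group gets the chance to match before the end anchor $.
    match = re.match(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", token)
    return match.group(1)

print(len(vowels))         # 16
print(vowel_runs)          # ['ia', 'iou']
print(stem("processing"))  # process
print(stem("processes"))   # process
print(stem("language"))    # language
```

A purely pattern-based approach like this also strips the final s of short words such as "is", which is one reason rule lists of this kind need care.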
9
NLPP §3 Processing raw text §3.6 Normalizing Text
10
Examples
Stemming
Lemmatization
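The contrast between the two can be illustrated with a toy sketch: a stemmer strips suffixes whether or not the result is a word, while a lemmatizer only returns forms it can find in a dictionary. Everything below (the mini-lexicon, the rules, and the function names) is a hypothetical illustration, not the stemmers or lemmatizer that NLTK provides.

```python
import re

# Hypothetical mini-lexicon of dictionary forms; a real lemmatizer
# would consult a full lexical resource instead.
LEXICON = {"woman", "lying", "city", "basis"}
IRREGULAR = {"women": "woman"}
RULES = [(r"ies$", "y"), (r"s$", "")]

def toy_stem(token):
    """Strip one common suffix, whether or not the result is a real word."""
    return re.sub(r"(ing|ies|es|s|ed)$", "", token)

def toy_lemmatize(token):
    """Return a dictionary form if one can be found, else the token unchanged."""
    if token in LEXICON:
        return token
    if token in IRREGULAR:
        return IRREGULAR[token]
    for pattern, replacement in RULES:
        candidate, count = re.subn(pattern, replacement, token)
        if count and candidate in LEXICON:
            return candidate
    return token

for w in ["cities", "basis", "women", "lying"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

With these toy definitions, "cities" stems to "cit" but lemmatizes to "city", and "lying" stems to "ly" but is left alone by the lemmatizer because it is already a listed form.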
11
NLPP §3 Processing raw text §3.7 Regular Expressions for Tokenizing Text
12
Regex character class symbols (Table 3.4)

Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character
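The symbols in Table 3.4 give several simple ways to cut a string into tokens. The sketch below uses only the standard re module on a made-up sample string. The equivalences in the table hold for ASCII text; in Python 3, \w and \d also match non-ASCII letters and digits by default.

```python
import re

# Made-up sample; "\t" and "\n" are a real tab and newline here because
# this string is data, not a pattern, and so is deliberately not raw.
text = "Costs $12.40,\tor 3 for $30\n"

# \s+ splits on runs of whitespace (spaces, tabs, newlines).
by_space = re.split(r"\s+", text.strip())

# \w+ keeps runs of letters, digits and underscore; punctuation is dropped.
word_tokens = re.findall(r"\w+", text)

# \d+ picks out only the digit runs.
numbers = re.findall(r"\d+", text)

# \S+ is the complement of \s: maximal runs of non-whitespace.
chunks = re.findall(r"\S+", text)

print(by_space)     # ['Costs', '$12.40,', 'or', '3', 'for', '$30']
print(word_tokens)  # ['Costs', '12', '40', 'or', '3', 'for', '30']
print(numbers)      # ['12', '40', '3', '30']
print(chunks)       # ['Costs', '$12.40,', 'or', '3', 'for', '$30']
```

Neither result is ideal: splitting on whitespace leaves the comma attached to "$12.40,", while \w+ breaks the price into "12" and "40". This is why tokenizing patterns usually combine several of these classes.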
13
NLPP §3 Processing raw text §3.8 Segmentation
14
Next time
P3: Do #6 & #7 of Exercises 3.12
SLP §2.2
Maybe NLPP §4