Regular expressions Day 11 LING 681.02 Computational Linguistics Harry Howard Tulane University.


1 Regular expressions Day 11 LING 681.02 Computational Linguistics Harry Howard Tulane University

2 Course organization
18-Sept-2009, LING 681.02, Prof. Howard, Tulane University
• http://www.tulane.edu/~ling/NLP/
• NLTK is installed on the computers in this room!
• How would you like to use the Provost's $150?

3 NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level

4 NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats

5 Notation in Python
Table 3.3
Operator     Behavior
.            Wildcard, matches any character
^abc         Matches some pattern abc at the start of a string
abc$         Matches some pattern abc at the end of a string
[abc]        Matches one of a set of characters
[A-Z0-9]     Matches one of a range of characters
ed|ing|s     Matches one of the specified strings (disjunction)
*            Zero or more of previous item, e.g. a*, [a-z]* (aka Kleene closure/star)
+            One or more of previous item, e.g. a+, [a-z]+
?            Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}          Exactly n repeats where n is a non-negative integer
{n,}         At least n repeats
{,n}         No more than n repeats
{m,n}        At least m and no more than n repeats
a(b|c)+      Parentheses that indicate the scope of the operators
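A minimal sketch of a few operators from Table 3.3, using Python's standard re module. The word list and the helper name matching are hypothetical, chosen only to exercise the patterns; they are not from the slides.

```python
import re

# Hypothetical word list, chosen only to exercise the operators.
words = ['walked', 'walking', 'walks', 'abc', 'cab', 'aaa', 'A9', 'book', 'abcbc']

def matching(pattern, items):
    """Return the items in which re.search finds the pattern."""
    return [w for w in items if re.search(pattern, w)]

print(matching(r'(ed|ing|s)$', words))   # disjunction anchored at the end of the string
print(matching(r'^ab', words))           # pattern anchored at the start of the string
print(matching(r'^[A-Z0-9]+$', words))   # one or more characters from a range
print(matching(r'^a{3}$', words))        # exactly three repeats of a
print(matching(r'^a(b|c)+$', words))     # parentheses set the scope of + and |
print(matching(r'^.oo.$', words))        # wildcard on either side of oo
```

Anchoring with ^ and $ matters here: without them, re.search would accept the pattern anywhere inside the word.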

6 Raw strings
• To the Python interpreter, a regex is just like any other string.
• If the string contains a backslash followed by particular characters, the interpreter treats the pair specially.
• For example, \b is normally the backspace character, but in re it is a word boundary.
• In general, when using regexes containing a backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing.
• We do this by prefixing the string with the letter r, to indicate that it is a raw string.
• For example, the raw string r'\band\b' contains two \b symbols that are interpreted by re as matching word boundaries instead of backspaces.
• If you get into the habit of using r'...' for regular expressions, as we will do from now on, you will avoid having to think about these complications.
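The difference between a plain and a raw string can be checked directly. This is a small sketch; the sample sentence and the variable names are assumptions made up for illustration.

```python
import re

text = 'sand and land'   # hypothetical sample text

plain = '\band\b'    # the interpreter turns each \b into a backspace character
raw = r'\band\b'     # the backslashes survive, so re sees two word boundaries

print(len(plain), len(raw))        # the plain string is shorter: 5 versus 7 characters
print(re.findall(raw, text))       # only the free-standing word matches
print(re.findall(plain, text))     # no backspace characters in the text, so nothing matches
```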

7 NLPP §3 Processing raw text §3.5 Useful Applications of Regular Expressions

8 Some applications
• Extracting word pieces
• Doing more with word pieces
• Finding word stems
• Searching tokenized text
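As one illustration of extracting word pieces, re.findall can pull every run of two or more vowels out of a word. This is a sketch only; the function name vowel_sequences and the test words are hypothetical, not taken from the slides.

```python
import re

def vowel_sequences(word):
    """Return all runs of two or more vowels in a lowercase word."""
    return re.findall(r'[aeiou]{2,}', word)

# Hypothetical example words.
for w in ['supercalifragilisticexpialidocious', 'queue', 'beautiful']:
    print(w, vowel_sequences(w))
```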

9 NLPP §3 Processing raw text §3.6 Normalizing Text

10 Examples
• Stemming
• Lemmatization
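Stemming can be approximated with a single regular expression that strips one common English suffix. The sketch below is a crude heuristic under assumed suffixes, not a real stemmer; the function name stem is hypothetical. Lemmatization is not attempted here, since mapping a word to its dictionary form needs a lexicon rather than a pattern.

```python
import re

def stem(word):
    """Strip one common suffix, if present (a crude heuristic, not a real stemmer)."""
    # The non-greedy (.*?) keeps the stem as short as possible,
    # which lets the optional suffix group claim the ending.
    match = re.match(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', word)
    return match.group(1)

for w in ['processing', 'processes', 'walked', 'language']:
    print(w, stem(w))
```

The heuristic overstrips short words (for example, it reduces is to i), which is one reason dedicated stemmers add extra conditions.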

11 NLPP §3 Processing raw text §3.7 Regular Expressions for Tokenizing Text

12 Regex character class symbols
Table 3.4
Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character
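The symbols in Table 3.4 are enough for some simple tokenizers. The sketch below compares three of them on a made-up sentence; the variable names and the sample text are assumptions for illustration only.

```python
import re

text = "It cost 12 dollars,\tnot 15!\n"   # hypothetical sample with a tab and a newline

on_whitespace = re.split(r'\s+', text.strip())      # split on runs of whitespace
word_chars = re.findall(r'\w+', text)               # keep runs of alphanumerics only
with_punct = re.findall(r'\w+|[^\w\s]+', text)      # words, or runs of punctuation
numbers = re.findall(r'\d+', text)                  # runs of decimal digits

print(on_whitespace)
print(word_chars)
print(with_punct)
print(numbers)
```

Splitting on whitespace leaves punctuation glued to words, \w+ drops it entirely, and the two-way alternation keeps it as separate tokens.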

13 NLPP §3 Processing raw text §3.8 Segmentation

14 Next time
• P3: Do #6 & #7 of Exercises 3.12
• SLP §2.2
• Maybe NLPP §4

