Published by Wilfred Collins. Modified over 9 years ago.
1
Strings and regular expressions Day 10 LING 681.02 Computational Linguistics Harry Howard Tulane University
2
16-Sept-2009 LING 681.02, Prof. Howard, Tulane University
Course organization
http://www.tulane.edu/~ling/NLP/
NLTK is installed on the computers in this room!
How would you like to use the Provost's $150?
Please become a fan of Tulane Linguistics on Facebook.
3
NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level
4
Syntax of single-line strings
Strings are specified with single quotes. If a single quote is one of the characters, either use double quotes or escape the single quote with a backslash:
    'Monty Python'
    "Monty Python's Flying Circus"
    'Monty Python\'s Flying Circus'
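As a minimal sketch of the quoting options on this slide (the variable names are illustrative, not from the slides, and the code assumes Python 3), the double-quoted and backslash-escaped spellings can be compared directly:

```python
# Three ways of writing string literals (variable names are hypothetical).
plain = 'Monty Python'
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'

# The two spellings with an apostrophe denote the same string value.
print(double_quoted == escaped)
print(len(plain))
```

The choice between the two is purely a matter of readability; Python stores the same characters either way.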
5
Syntax of multi-line strings
A sequence of strings can be joined into a single one with:
a backslash at the end of each line:
    'first half'\
    'second half'
    = 'first halfsecond half'
parentheses to open and close the sequence:
    ('first half'
    'second half')
    = 'first halfsecond half'
triple double quotes to open and close the sequence and maintain line breaks:
    """first half
    second half"""
    = 'first half\nsecond half'
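The three multi-line forms can be checked side by side. This is a small sketch assuming Python 3; the variable names are made up for illustration. Using repr() makes the newline in the triple-quoted string visible.

```python
# Backslash continuation: adjacent literals are joined with no separator.
joined_backslash = 'first half'\
                   'second half'

# Parentheses allow the same implicit joining across lines.
joined_parens = ('first half'
                 'second half')

# Triple quotes keep the line break as a newline character.
triple = """first half
second half"""

print(repr(joined_backslash))
print(repr(joined_parens))
print(repr(triple))
```

Note that the first two forms insert nothing between the pieces, so a space has to be written inside one of the literals if it is wanted.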
6
Basic operations
Concatenation (+)
    >>> 'really' + 'really'
    'reallyreally'
Repetition (*)
    >>> 'really' * 4
    'reallyreallyreallyreally'
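Neither operator inserts spaces, which is easy to overlook. The sketch below (Python 3, hypothetical variable names) repeats the slide's two operations and then shows one possible way to get spaces between the copies; the use of join here is an addition for illustration, not something the slide covers.

```python
word = 'really'

# Concatenation and repetition glue the copies together directly.
doubled = word + word
repeated = word * 4

# One way to separate the copies: repeat a one-element list, then join.
separated = ' '.join([word] * 4)

print(doubled)
print(repeated)
print(separated)
```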
7
Your Turn p. 88 !!!
8
Printing strings
Make a couple of string assignments:
    harry = 'Harry Potter'
    prince = 'Half-Blood Prince'
Inspection of a variable produces Python's representation of its value:
    >>> harry
    'Harry Potter'
Printing a variable produces its value:
    >>> print harry
    Harry Potter
What do you expect?
    >>> print harry + prince
    >>> print harry, prince
    >>> print harry, 'and the', prince
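The slide uses the Python 2 print statement. As a hedged sketch for readers on Python 3, where print is a function, the same experiment can be run as follows; repr() stands in for inspecting a variable at the interactive prompt, and the name concatenated is introduced only for this example.

```python
harry = 'Harry Potter'
prince = 'Half-Blood Prince'

# repr() gives the quoted representation that the prompt would display.
print(repr(harry))

# print() writes the value itself, without quotes.
print(harry)

# "+" joins the strings with nothing in between,
# while multiple arguments to print() are separated by a space.
concatenated = harry + prince
print(concatenated)
print(harry, prince)
print(harry, 'and the', prince)
```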
9
Using indices
Every character of a string is indexed from 0 (counting from the left) and from -1 (counting from the right):
    >>> harry[0]
    'H'
    >>> harry[-1]
    'r'
    >>> harry[:3]
    'Har'
    >>> harry[-12:-9]
    'Har'
    >>> for char in prince:
    ...     print char,
    ...
    H a l f - B l o o d   P r i n c e
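A short Python 3 sketch of positive and negative indexing on the same two strings. The result variable names are illustrative. A slice stops just before its end index, so a three-character prefix needs an end index of 3 (or of -9 in a 12-character string).

```python
harry = 'Harry Potter'
prince = 'Half-Blood Prince'

first = harry[0]            # leftmost character
last = harry[-1]            # rightmost character
prefix = harry[:3]          # characters at positions 0, 1, 2
same_prefix = harry[-12:-9]  # the same span, counted from the right

print(first, last, prefix, same_prefix)

# Iterating over a string yields one character at a time.
for char in prince:
    print(char, end=' ')
print()
```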
10
More string operations
See Table 3-2
11
Strings vs. lists
Both are sequences and so support joining by concatenation and separation by slicing.
But they are different types, so a string cannot be concatenated with a list.
Granularity
Strings have a single level of resolution, the individual character, which makes them good for writing to screen or file.
Lists can have any level of resolution we want (character, morpheme, word, phrase, sentence, paragraph), which makes them good for NLP.
So the second step in the NLP pipeline is to tokenize a string into a list.
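The contrast can be made concrete with a small sketch in Python 3. The sample sentence and variable names are made up for illustration, and str.split() is used only as the simplest possible whitespace tokenizer, not as the tokenization method the course goes on to use.

```python
sentence = 'the quick brown fox'

# Tokenize: the string (characters) becomes a list (words).
tokens = sentence.split()
print(tokens)

# Mixing the two sequence types in a concatenation raises TypeError.
mixed_error = None
try:
    sentence + tokens
except TypeError as err:
    mixed_error = str(err)
    print('TypeError:', err)

# Going back from a list of words to a single string.
rejoined = ' '.join(tokens)
print(rejoined)
```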
12
NLPP §3 Processing raw text §3.3 Text processing with Unicode
13
Unicode
The format for representing special characters that go beyond ASCII.
Let's skip this until we really need it.
14
NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats
15
Getting started
To use regular expressions in Python, we need to import the re library.
We also need a list of words to search.
We'll use the Words Corpus again (Section 2.4).
We will preprocess it to remove any proper names.
    >>> import re
    >>> import nltk
    >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
16
Different terminologies
In textbook, regex = «ed$»
In re, regex = 'ed$' (i.e. a string)
17
Searching
re.search(p, s)
p is a pattern (what we are looking for), and s is a candidate string for matching the pattern.
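re.search returns a match object when the pattern is found and None otherwise, which is why it can be used directly as a condition. A minimal Python 3 sketch, with two made-up candidate strings and illustrative variable names:

```python
import re

# The pattern 'ed$' means: the letters "ed" at the end of the string.
match = re.search('ed$', 'walked')
no_match = re.search('ed$', 'walking')

print(match is not None)
print(no_match)

# A match object records what was matched and where.
if match:
    print(match.group(), match.start())
```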
18
Some examples
Find words ending in -ed:
    >>> [w for w in wordlist if re.search('ed$', w)]
Find a word that fits a certain group of blanks in a crossword puzzle that is 8 letters long, with j as the 3rd letter and t as the 6th letter:
    >>> [w for w in wordlist if re.search('^..j..t..$', w)]
Find the strings email or e-mail:
    >>> [w for w in wordlist if re.search('^e-?mail$', w)]
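The three patterns can be tried without the Words Corpus. The sketch below assumes Python 3 and substitutes a tiny hand-made list (sample_words is hypothetical and stands in for wordlist), so the results are easy to verify by eye.

```python
import re

# A small stand-in for the corpus word list (made up for this example).
sample_words = ['walked', 'abjectly', 'majestic', 'email',
                'e-mail', 'emailed', 'object', 'red']

# "ed" at the end of the word.
ends_in_ed = [w for w in sample_words if re.search('ed$', w)]

# Exactly 8 characters: "." is any single character, "^" and "$"
# anchor the pattern to the start and end of the word.
crossword = [w for w in sample_words if re.search('^..j..t..$', w)]

# "?" makes the preceding hyphen optional.
email_forms = [w for w in sample_words if re.search('^e-?mail$', w)]

print(ends_in_ed)
print(crossword)
print(email_forms)
```

Without the anchors, the third pattern would also accept longer words that merely contain email, such as emailed.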
19
Next time More on RegEx