LING 388: Computers and Language


1 LING 388: Computers and Language
Lecture 24

2 nltk book: chapter 3 Last time:

3 nltk book: chapter 3 Download: raw file = 1 (long) string
Text number 2554 is an English translation of Crime and Punishment
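A minimal sketch of the download step, along the lines of the NLTK book's urllib recipe (the Gutenberg URL below is the book's and may have moved since):
>>> from urllib import request
>>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"   # Crime and Punishment, plain-text file
>>> response = request.urlopen(url)
>>> raw = response.read().decode('utf8')    # the whole file as one long Python string
>>> type(raw)
<class 'str'>
>>> len(raw)                                # length depends on the current file version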

4 nltk book: chapter 3 Tokenize: list of words
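A sketch of the tokenization step, assuming raw holds the downloaded string from the previous slide:
>>> import nltk
>>> from nltk import word_tokenize
>>> tokens = word_tokenize(raw)     # list of word and punctuation strings
>>> type(tokens)
<class 'list'>
>>> tokens[:8]                      # first few tokens; exact values depend on the file header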

5 nltk book: chapter 3 HTML: get_text() from BeautifulSoup
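A sketch of stripping markup from a web page with BeautifulSoup; the URL is the BBC story used in the NLTK book and stands in for any page you fetch:
>>> from urllib import request
>>> from bs4 import BeautifulSoup
>>> from nltk import word_tokenize
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    # the book's example page
>>> html = request.urlopen(url).read().decode('utf8')
>>> raw = BeautifulSoup(html, 'html.parser').get_text()      # drop the tags, keep the text
>>> tokens = word_tokenize(raw)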

6 nltk book: chapter 3 nltk Text: .collocations() and .concordance(word)
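To use .collocations() and .concordance(word), wrap the token list in an nltk.Text object; a sketch, assuming tokens from the Crime and Punishment download (the search word 'Raskolnikov' is just an illustrative choice):
>>> text = nltk.Text(tokens)
>>> text.collocations()              # word pairs that co-occur more often than chance
>>> text.concordance('Raskolnikov')  # every occurrence, shown with a window of context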

7 nltk book: chapter 3 Adjusting start and end:
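Adjusting the start and end trims the Project Gutenberg header and trailing license text; a sketch using find()/rfind(), where the marker strings are approximate and should be checked against the actual file:
>>> start = raw.find("PART I")                   # first line of the novel proper
>>> end = raw.rfind("End of Project Gutenberg")  # beginning of the trailing license text
>>> raw = raw[start:end]                         # keep only the body of the book
>>> raw.find("PART I")                           # the trimmed string now starts at the novel
0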

8 nltk book: chapter 3 3.5 Useful Applications of Regular Expressions
Extracting Word Pieces: relative frequency of sequences of two or more vowels
>>> import nltk
>>> import re
>>> from nltk.corpus import ptb
>>> t = ptb.words(categories='news')
>>> w = set(t)
>>> len(w)
49817
>>> len(t)
>>> fd = nltk.FreqDist(vvs for word in w for vvs in re.findall(r'[aeiou]{2,}', word))
>>> fd.most_common(10)
[('io', 2090), ('ea', 1882), ('ou', 1421), ('ie', 1418), ('ia', 1083), ('ai', 970), ('ee', 874), ('oo', 783), ('au', 448), ('ei', 442)]
>>> fd.plot()
Note: ptb is the full Penn Treebank, which you don't have access to in this course; use the treebank sample corpus instead.
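A sketch of the "use treebank instead" suggestion: the same vowel-sequence count run over the 10% Penn Treebank sample that ships with nltk_data (the resulting counts will be smaller than the ptb numbers above and are not claimed here):
>>> from nltk.corpus import treebank
>>> w = set(treebank.words())        # word types in the treebank sample
>>> fd = nltk.FreqDist(vvs for word in w
...                    for vvs in re.findall(r'[aeiou]{2,}', word))
>>> fd.most_common(10)               # same recipe, smaller corpus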

9 nltk book: chapter 3 iou eau

10 nltk book: chapter 3 Readability: leave out word-internal vowels
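The book's readability trick keeps initial and final vowels and all consonants, dropping only word-internal vowels; a sketch along those lines, run over the UDHR sample as in the book:
>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)   # leading vowels, trailing vowels, consonants
...     return ''.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))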

11 nltk book: chapter 3 Word stemming
book example from Monty Python and the Holy Grail (/watch?v=eXmwK2-R2dY)
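The book's stemming example runs the Porter and Lancaster stemmers over the DENNIS speech (the same raw string that is tokenized on slide 15); a sketch, assuming tokens holds that word_tokenize output:
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]     # e.g. 'lying' -> 'lie', 'distributing' -> 'distribut'
>>> [lancaster.stem(t) for t in tokens]  # more aggressive in places, e.g. 'women' -> 'wom'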

12 nltk book: chapter 3

13 nltk book: chapter 3 Class IndexedText defined next slide:

14 nltk book: chapter 3 Advanced: generates (index, word) tuples
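The IndexedText class referred to here, roughly as given in NLTK book section 3.6, builds an nltk.Index from (stem, position) pairs; enumerate(text) is what generates the (index, word) tuples mentioned above. A sketch:
>>> class IndexedText(object):
...     def __init__(self, stemmer, text):
...         self._text = text
...         self._stemmer = stemmer
...         # enumerate(text) yields (index, word) tuples; positions are indexed by stem
...         self._index = nltk.Index((self._stem(word), i)
...                                  for (i, word) in enumerate(text))
...     def concordance(self, word, width=40):
...         key = self._stem(word)
...         wc = int(width/4)                # words of context on each side
...         for i in self._index[key]:
...             lcontext = ' '.join(self._text[i-wc:i])
...             rcontext = ' '.join(self._text[i:i+wc])
...             print(lcontext.rjust(width), rcontext.ljust(width))
...     def _stem(self, word):
...         return self._stemmer.stem(word).lower()
...
>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')              # matches lie, lying, lies, ... via their shared stem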

15 nltk book: chapter 3 WordNet lemmatization:
>>> import nltk
>>> from nltk import word_tokenize
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = word_tokenize(raw)
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

16 nltk book: chapter 3 3.7 Regular Expressions for Tokenizing Text
We take word_tokenize() for granted. Simple tokenization:
>>> raw
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> import re
>>> re.split(r' ', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'\s', raw)           # split on any whitespace character
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

17 nltk book: chapter 3 Split on non-word characters:
>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']
>>> re.findall(r'\w+|\S\w*', raw)  # keep punctuation as separate tokens
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

18 nltk book: chapter 3
>>> re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
Reading the pattern: (?: ... ) is a non-capturing group; \w+(?:[-']\w+)* matches a word (possibly hyphenated or with an internal apostrophe); ' matches a lone apostrophe; [-.(]+ matches punctuation runs such as -- or ... or (; \S\w* catches any remaining non-whitespace character plus trailing word characters.

19 nltk book: chapter 3 Remind ourselves what word_tokenize() does with the apostrophe ('):
>>> word_tokenize(raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'wo', "n't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

20 nltk book: chapter 3
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> word_tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']
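For comparison, section 3.7 of the book builds a verbose regexp and tokenizes the same sentence with nltk.regexp_tokenize(), keeping '$12.40' together as one token; a sketch of that pattern (comments follow the book; minor details may differ):
>>> pattern = r'''(?x)            # set flag to allow verbose regexps
...     (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*              # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                    # ellipsis
...   | [][.,;"'?():_`-]          # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']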

