LING 388: Computers and Language
Lecture 25
nltk book: chapter 3

Last time, we discussed the problem of word tokenization:

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> word_tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']
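A rough regex alternative (adapted from the `regexp_tokenize` pattern shown in section 3.7 of the nltk book) illustrates what a tokenizer has to handle. This is only a sketch, not NLTK's actual `word_tokenize` algorithm; note it keeps '$12.40' as a single token, unlike the Treebank-style output above:

```python
import re

# Crude regex tokenizer: handles abbreviations, hyphenated words,
# prices, and ellipses. Illustrative only.
PATTERN = r"""(?x)                # verbose regex
      (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?          # currency and percentages, e.g. $12.40
    | \w+(?:-\w+)*                # words, with optional internal hyphens
    | \.\.\.                      # ellipsis
    | [\[\].,;"'?():\-_`]         # single-character punctuation tokens
"""

def rough_tokenize(text):
    return re.findall(PATTERN, text)

print(rough_tokenize('That U.S.A. poster-print costs $12.40...'))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```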
nltk book: chapter 3, section 3.8 Segmentation: sentence segmentation
Brown corpus (pre-segmented):

>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
(average sentence length in terms of number of words)

>>> raw = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> nltk.sent_tokenize(raw)
["'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL.", "Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."]
>>> nltk.sent_tokenize(raw)[0]
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL."
>>> nltk.sent_tokenize(raw)[1]
"Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
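For contrast, a deliberately naive splitter shows why sentence segmentation is hard. This toy regex rule is an illustration only, not what NLTK's Punkt tokenizer (used by `nltk.sent_tokenize`) does:

```python
import re

def naive_sent_split(text):
    # Split after '.', '!', or '?' when followed by whitespace and a
    # capital letter. No model of abbreviations, so "Mrs. Dalloway"
    # would be wrongly split.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

sents = naive_sent_split("It was morning. The rooks were rising. What a lark!")
print(sents)
# ['It was morning.', 'The rooks were rising.', 'What a lark!']

print(naive_sent_split("Mrs. Dalloway bought flowers."))
# wrongly split into two pieces at "Mrs."
```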
nltk book: chapter 3 Mrs. Dalloway revisited:
>>> from urllib import request
>>> url = "
>>> response = request.urlopen(url)
>>> raw = response.read().decode('latin-1')
>>> raw = raw[431:]
>>> raw = raw[1217:]
>>> raw = raw[:368976]
>>> raw[:100]
'Title: Mrs. Dalloway\r\nAuthor: Virginia Woolf\r\n\r\n\r\n\r\n\r\nMrs. Dalloway said she would buy the '
>>> raw[-100:]
's me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END'
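The `urlopen` call needs network access, but the decode-and-slice steps can be sketched offline on a bytes literal standing in for `response.read()` (the literal below is an assumption, not the full download):

```python
# response.read() returns bytes; .decode('latin-1') maps each byte to
# exactly one character, so it never raises a decoding error.
data = b'Title: Mrs. Dalloway\r\nAuthor: Virginia Woolf\r\n'
raw = data.decode('latin-1')
print(raw[:20])     # prints: Title: Mrs. Dalloway
print(raw[7:20])    # prints: Mrs. Dalloway
```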
nltk book: chapter 3

>>> sents = nltk.sent_tokenize(raw)

The first sentences:

0. 'Title: Mrs. Dalloway\r\nAuthor: Virginia Woolf\r\n\r\n\r\n \r\n\r\nMrs. Dalloway said she would buy the flowers herself.'
1. 'For Lucy had her work cut out for her.'
2. "The doors would be taken\r\noff their hinges; Rumpelmayer's men were coming."
3. 'And then, thought\r\nClarissa Dalloway, what a morning--fresh as if issued to children\r\non a beach.'
4. 'What a lark!'
5. 'What a plunge!'
6. 'For so it had always seemed to her,\r\nwhen, with a little squeak of the hinges, which she could hear now,\r\nshe had burst open the French windows and plunged at Bourton into\r\nthe open air.'
nltk book: chapter 3

Famous for her stream-of-consciousness style of writing:

>>> sents[7]
'How fresh, how calm, stiller than this of course,\r\nthe air was in the early morning; like the flap of a wave; the kiss\r\nof a wave; chill and sharp and yet (for a girl of eighteen as she\r\nthen was) solemn, feeling as she did, standing there at the open\r\nwindow, that something awful was about to happen; looking at the\r\nflowers, at the trees with the smoke winding off them and the rooks\r\nrising, falling; standing and looking until Peter Walsh said,\r\n"Musing among the vegetables?"'
>>> s7 = word_tokenize(sents[7])
>>> len(s7)
107

cf. the Brown corpus average of about 20 words per sentence.
Python save/restore corpus
json can also be used (and is more standard across programming languages), but pickle is the Python library for this purpose:

>>> import pickle
>>> f = open('dalloway.pickle', 'wb')    # 'wb' = write binary
>>> pickle.dump(raw, f)
>>> f.close()
>>> f = open('dalloway.pickle', 'rb')    # 'rb' = read binary
>>> raw2 = pickle.load(f)
>>> raw == raw2
True
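The same round trip can be sketched in memory with `pickle.dumps`/`pickle.loads`, avoiding the temporary file (sample string made up for illustration):

```python
import pickle

raw = "Mrs. Dalloway said she would buy the flowers herself."
blob = pickle.dumps(raw)        # serialize to a bytes object, no file needed
restored = pickle.loads(blob)   # deserialize back to the original string
print(restored == raw)          # True
```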
nltk book: chapter 3 Python formatted output
(printed here as just one column, not three as on the slide…)

>>> import nltk
>>> import pickle
>>> f = open('dalloway.pickle', 'rb')
>>> raw = pickle.load(f)
>>> f.close()
>>> fd = nltk.FreqDist(nltk.word_tokenize(raw))
>>> t50 = fd.most_common(50)
>>> m = max(len(t[0]) for t in t50)
>>> m
8
>>> for x, y in t50:
...     print('{:{width}} {}'.format(x, y, width=m))
...
,        6098
.        3017
the      3015
and      1625
of       1525
;        1473
to       1447
a        1328
was      1254
her      1227
she      1157
in       1107
had      928
he       908
it       712
that     622
with     565
--       545
his      490
''       458
for      446
on       441
at       427
him      421
said     410
not      403
as       396
``       388
She      372
?        361
!        346
one      317
's       306
all      305
they     305
(        290
)        290
would    278
were     276
But      270
He       269
so       266
which    266
could    264
Clarissa 263
this     254
thought  252
be       245
up       232
like     232
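The same width-formatting pattern can be tried without the NLTK corpus, using `collections.Counter` as a stand-in for `nltk.FreqDist` on a made-up line of text:

```python
from collections import Counter

# Counter stands in for nltk.FreqDist; the sample text is made up.
text = "the kiss of a wave the flap of a wave"
fd = Counter(text.split())
top = fd.most_common(3)
width = max(len(word) for word, _ in top)
for word, count in top:
    # '{:{width}}' left-aligns each word in a column 'width' wide
    print('{:{width}} {}'.format(word, count, width=width))
```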
nltk book: chapter 3
nltk book: chapter 3

At the terminal, line wrap is quite arbitrary…
>>> from textwrap import fill
>>> pieces = ['{} ({})'.format(t[0], t[1]) for t in t50]
>>> s50 = ' '.join(pieces)
>>> print(fill(s50))
, (6098) . (3017) the (3015) and (1625) of (1525) ; (1473) to (1447) a (1328) was (1254) her (1227) she (1157) in (1107) had (928) he (908) it (712) that (622) with (565) -- (545) his (490) '' (458) for (446) on (441) at (427) him (421) said (410) not (403) as (396) `` (388) She (372) ? (361) ! (346) one (317) 's (306) all (305) they (305) ( (290) ) (290) would (278) were (276) But (270) He (269) so (266) which (266) could (264) Clarissa (263) this (254) thought (252) be (245) up (232) like (232)

fill(): text is preferably wrapped on whitespace and right after the hyphens in hyphenated words; only then will long words be broken if necessary.
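The `fill()` behavior quoted above can be checked on a small made-up sentence (the width and sample text here are assumptions for illustration):

```python
from textwrap import fill

# fill() rewraps a long string at word boundaries (default width 70);
# every output line stays within the requested width.
s = "stream-of-consciousness writing produces very long sentences indeed"
wrapped = fill(s, width=24)
print(wrapped)
```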
nltk book: chapter 3

More fancy formatting: fill() can break a line between a word and its count, so each pair is glued together with '_' before wrapping, and the underscores are replaced with spaces afterwards:

>>> pieces = ['{}_({})'.format(t[0], t[1]) for t in t50]
>>> s50 = ' '.join(pieces)
>>> print(fill(s50).replace('_', ' '))
, (6098) . (3017) the (3015) and (1625) of (1525) ; (1473) to (1447) a (1328) was (1254) her (1227) she (1157) in (1107) had (928) he (908) it (712) that (622) with (565) -- (545) his (490) '' (458) for (446) on (441) at (427) him (421) said (410) not (403) as (396) `` (388) She (372) ? (361) ! (346) one (317) 's (306) all (305) they (305) ( (290) ) (290) would (278) were (276) But (270) He (269) so (266) which (266) could (264) Clarissa (263) this (254) thought (252) be (245) up (232) like (232)
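The underscore trick matters because fill() treats the space between a word and its count as a legal wrap point. A small self-contained sketch, using a few (word, count) pairs taken from the slide's frequency list and an assumed width of 20:

```python
from textwrap import fill

# Sample pairs from the slide's top-50 list.
pairs = [('the', 3015), ('and', 1625), ('of', 1525), ('was', 1254)]
# Glue each word to its count with '_' so fill() cannot wrap between
# them, then restore the spaces after wrapping.
glued = ' '.join('{}_({})'.format(w, c) for w, c in pairs)
wrapped = fill(glued, width=20).replace('_', ' ')
print(wrapped)
```

Every word stays on the same line as its count, which the plain space-separated version cannot guarantee.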
nltk book: chapter 4, Writing Structured Programs
Chapter 4 teaches Python with many cool examples relevant to text processing.

Example: find the longest words in Milton's Paradise Lost:

>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']
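The same max-plus-list-comprehension pattern runs on any word list; here is a sketch over a small stand-in list (the slide runs it over the full Gutenberg text of Paradise Lost):

```python
# Stand-in word list for illustration.
text = ['unextinguishable', 'light', 'transubstantiate', 'dark',
        'incomprehensible']
# First pass finds the maximum length; second pass collects all words
# that attain it (there may be ties).
maxlen = max(len(word) for word in text)
longest = [word for word in text if len(word) == maxlen]
print(maxlen, longest)
# 16 ['unextinguishable', 'transubstantiate', 'incomprehensible']
```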
nltk book: chapter 4 Advanced topic: Generators (functions with yield instead of return)
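A minimal sketch of a generator; `word_lengths` is a made-up example, not from the slides:

```python
def word_lengths(words):
    # A generator function: 'yield' hands back one value at a time and
    # suspends, instead of building a whole list and returning it.
    for w in words:
        yield len(w)

gen = word_lengths(['What', 'a', 'lark'])
print(next(gen))    # 4  -- computed lazily, one word at a time
print(list(gen))    # [1, 4]  -- the rest of the stream
```

Generators are useful for corpus work because a stream of tokens can be processed without holding the whole corpus in memory.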
nltk book: chapter 4 Brown corpus
nltk book: chapter 4
nltk book: chapter 4 WordNet
nltk book: chapter 4