LING 388: Computers and Language

1 LING 388: Computers and Language
Lecture 25

2 nltk book: chapter 3
Last time, we discussed the problem of word tokenization:
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> word_tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']

3 nltk book: chapter 3
3.8 Segmentation: sentence segmentation
Brown corpus (pre-segmented):
>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922
(average sentence length in terms of number of words)
>>> raw = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> nltk.sent_tokenize(raw)
["'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL.", "Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."]
>>> nltk.sent_tokenize(raw)[0]
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL."
>>> nltk.sent_tokenize(raw)[1]
"Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."

4 nltk book: chapter 3
Mrs. Dalloway revisited:
>>> from urllib import request
>>> url = "..."
>>> response = request.urlopen(url)
>>> raw = response.read().decode('latin-1')
>>> raw = raw[431:]
>>> raw = raw[1217:]
>>> raw = raw[:368976]
>>> raw[:100]
'Title:      Mrs. Dalloway\r\nAuthor:     Virginia Woolf\r\n\r\n\r\n\r\n\r\nMrs. Dalloway said she would buy the '
>>> raw[-100:]
's me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END'

5 nltk book: chapter 3
>>> sents = nltk.sent_tokenize(raw)
The first seven sentences:
0. 'Title:      Mrs. Dalloway\r\nAuthor:     Virginia Woolf\r\n\r\n\r\n \r\n\r\nMrs. Dalloway said she would buy the flowers herself.'
1. 'For Lucy had her work cut out for her.'
2. "The doors would be taken\r\noff their hinges; Rumpelmayer's men were coming."
3. 'And then, thought\r\nClarissa Dalloway, what a morning--fresh as if issued to children\r\non a beach.'
4. 'What a lark!'
5. 'What a plunge!'
6. 'For so it had always seemed to her,\r\nwhen, with a little squeak of the hinges, which she could hear now,\r\nshe had burst open the French windows and plunged at Bourton into\r\nthe open air.'

6 nltk book: chapter 3
Woolf is famous for her stream-of-consciousness style of writing:
>>> sents[7]
'How fresh, how calm, stiller than this of course,\r\nthe air was in the early morning; like the flap of a wave; the kiss\r\nof a wave; chill and sharp and yet (for a girl of eighteen as she\r\nthen was) solemn, feeling as she did, standing there at the open\r\nwindow, that something awful was about to happen; looking at the\r\nflowers, at the trees with the smoke winding off them and the rooks\r\nrising, falling; standing and looking until Peter Walsh said,\r\n"Musing among the vegetables?"'
>>> s7 = word_tokenize(sents[7])
>>> len(s7)
107
cf. the Brown corpus average of 20 words/sentence

7 Python save/restore corpus
json can be used (and is more standard across programming languages), but pickle is a Python library for this purpose:
>>> import pickle
>>> f = open('dalloway.pickle', 'wb')    # wb = write binary
>>> pickle.dump(raw, f)
>>> f.close()
>>> f = open('dalloway.pickle', 'rb')    # rb = read binary
>>> raw2 = pickle.load(f)
>>> raw == raw2
True
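Since json is mentioned as the more portable alternative, here is a minimal sketch of the same save/restore round-trip using json instead of pickle (the file name dalloway.json and the short stand-in text are illustrative, not from the slides):

```python
import json

# Illustrative text standing in for the full Mrs. Dalloway corpus.
raw = "Mrs. Dalloway said she would buy the flowers herself."

# json writes plain text files, so no 'wb'/'rb' binary modes are needed.
with open('dalloway.json', 'w', encoding='utf-8') as f:
    json.dump(raw, f)

with open('dalloway.json', 'r', encoding='utf-8') as f:
    raw2 = json.load(f)

print(raw == raw2)  # True
```

Unlike pickle, the resulting file is readable by any language with a JSON parser, at the cost of only supporting JSON-serializable values.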

8 nltk book: chapter 3
Python formatted output (printed here as one column, not three):
>>> import nltk
>>> import pickle
>>> f = open('dalloway.pickle', 'rb')
>>> raw = pickle.load(f)
>>> f.close()
>>> fd = nltk.FreqDist(nltk.word_tokenize(raw))
>>> t50 = fd.most_common(50)
>>> m = max(len(t[0]) for t in t50)
>>> m
8
>>> for x, y in t50:
...     print('{:{width}} {}'.format(x, y, width=m))
...
,        6098
.        3017
the      3015
and      1625
of       1525
;        1473
to       1447
a        1328
was      1254
her      1227
she      1157
in       1107
had      928
he       908
it       712
that     622
with     565
--       545
his      490
''       458
for      446
on       441
at       427
him      421
said     410
not      403
as       396
``       388
She      372
?        361
!        346
one      317
's       306
all      305
they     305
(        290
)        290
would    278
were     276
But      270
He       269
so       266
which    266
could    264
Clarissa 263
this     254
thought  252
be       245
up       232
like     232

9 nltk book: chapter 3

10 nltk book: chapter 3
At the terminal, line wrap is quite arbitrary…
>>> from textwrap import fill
>>> pieces = ['{} ({})'.format(t[0], t[1]) for t in t50]
>>> s50 = ' '.join(pieces)
>>> print(fill(s50))
, (6098) . (3017) the (3015) and (1625) of (1525) ; (1473) to (1447) a (1328) was (1254) her (1227) she (1157) in (1107) had (928) he (908) it (712) that (622) with (565) -- (545) his (490) '' (458) for (446) on (441) at (427) him (421) said (410) not (403) as (396) `` (388) She (372) ? (361) ! (346) one (317) 's (306) all (305) they (305) ( (290) ) (290) would (278) were (276) But (270) He (269) so (266) which (266) could (264) Clarissa (263) this (254) thought (252) be (245) up (232) like (232)
fill(): text is preferably wrapped on whitespace and right after the hyphens in hyphenated words; only then will long words be broken if necessary.

11 nltk book: chapter 3
More fancy formatting: joining each word to its count with '_' makes the pair a single "word" for fill(), so a word and its count are never split across lines; the underscores are replaced with spaces after wrapping:
>>> pieces = ['{}_({})'.format(t[0], t[1]) for t in t50]
>>> s50 = ' '.join(pieces)
>>> print(fill(s50).replace('_', ' '))
, (6098) . (3017) the (3015) and (1625) of (1525) ; (1473) to (1447) a (1328) was (1254) her (1227) she (1157) in (1107) had (928) he (908) it (712) that (622) with (565) -- (545) his (490) '' (458) for (446) on (441) at (427) him (421) said (410) not (403) as (396) `` (388) She (372) ? (361) ! (346) one (317) 's (306) all (305) they (305) ( (290) ) (290) would (278) were (276) But (270) He (269) so (266) which (266) could (264) Clarissa (263) this (254) thought (252) be (245) up (232) like (232)

12 nltk book: chapter 4
4 Writing Structured Programs: teaches Python with many cool examples relevant to text processing.
Example: find the longest words in Milton's Paradise Lost
>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']

13 nltk book: chapter 4 Advanced topic: Generators (functions with yield instead of return)
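As a minimal sketch of the idea (the function and example text here are illustrative, not from the book): a generator produces values lazily, one at a time, instead of building an entire list in memory.

```python
# A generator function: yield hands back one value at a time and
# suspends; the next request resumes where it left off.
def tokens(text):
    for word in text.split():
        yield word.strip('.,;!?')

g = tokens("What a lark! What a plunge!")
print(next(g))   # 'What'
print(next(g))   # 'a'

# A generator can feed any consumer of iterables, e.g. max():
print(max(len(w) for w in tokens("What a lark! What a plunge!")))  # 6
```

This matters for corpus work: a generator can stream tokens from a large file without ever holding the whole token list in memory.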

14 nltk book: chapter 4 Brown corpus

15 nltk book: chapter 4

16 nltk book: chapter 4 WordNet

17 nltk book: chapter 4

