LING 388: Computers and Language

LING 388: Computers and Language Lecture 25

nltk book: chapter 3

Last time, we discussed the problem of word tokenization:

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> word_tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']
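
Why not just split on whitespace? A quick comparison (reusing text from above):

>>> text.split()    # naive split: punctuation stays glued to the tokens
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

word_tokenize(), by contrast, keeps 'U.S.A.' and 'poster-print' intact while splitting off '$' and '...' (it assumes the Punkt tokenizer models have been installed via nltk.download()).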

nltk book: chapter 3

3.8 Segmentation: sentence segmentation.

The Brown corpus is pre-segmented, so the average sentence length (in words) can be read off directly:

>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922

For raw text, nltk.sent_tokenize() does the segmentation:

>>> raw = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> nltk.sent_tokenize(raw)
["'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL.", "Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."]
>>> nltk.sent_tokenize(raw)[0]
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL."
>>> nltk.sent_tokenize(raw)[1]
"Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
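
The same words-per-sentence calculation works for any pre-segmented NLTK corpus; a minimal sketch (the choice of 'austen-emma.txt' from the Gutenberg corpus is just illustrative):

>>> import nltk
>>> emma_words = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> emma_sents = nltk.corpus.gutenberg.sents('austen-emma.txt')
>>> len(emma_words) / len(emma_sents)   # average sentence length in words (exact value depends on the corpus version)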

nltk book: chapter 3

Mrs. Dalloway revisited:

>>> from urllib import request
>>> url = "http://gutenberg.net.au/ebooks02/0200991.txt"
>>> response = request.urlopen(url)
>>> raw = response.read().decode('latin-1')
>>> raw = raw[431:]
>>> raw = raw[1217:]
>>> raw = raw[:368976]
>>> raw[:100]
'Title:      Mrs. Dalloway\r\nAuthor:     Virginia Woolf\r\n\r\n\r\n\r\n\r\nMrs. Dalloway said she would buy the '
>>> raw[-100:]
's me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END'
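
The hard-coded slice offsets above are brittle (they break if the file changes at all); a sketch of a more robust alternative, applied to the freshly decoded raw and assuming the file still contains the opening sentence and the closing 'THE END' marker:

>>> first = 'Mrs. Dalloway said she would buy the flowers herself.'
>>> start = raw.find(first)                       # first occurrence of the opening sentence
>>> end = raw.rfind('THE END') + len('THE END')   # last occurrence of the end marker
>>> body = raw[start:end]
>>> body.startswith(first), body.endswith('THE END')
(True, True)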

nltk book: chapter 3

>>> sents = nltk.sent_tokenize(raw)

The first few sentences (sents[0] through sents[6]):

0. 'Title:      Mrs. Dalloway\r\nAuthor:     Virginia Woolf\r\n\r\n\r\n \r\n\r\nMrs. Dalloway said she would buy the flowers herself.'
1. 'For Lucy had her work cut out for her.'
2. "The doors would be taken\r\noff their hinges; Rumpelmayer's men were coming."
3. 'And then, thought\r\nClarissa Dalloway, what a morning--fresh as if issued to children\r\non a beach.'
4. 'What a lark!'
5. 'What a plunge!'
6. 'For so it had always seemed to her,\r\nwhen, with a little squeak of the hinges, which she could hear now,\r\nshe had burst open the French windows and plunged at Bourton into\r\nthe open air.'

nltk book: chapter 3

Woolf is famous for her stream-of-consciousness style of writing:

>>> sents[7]
'How fresh, how calm, stiller than this of course,\r\nthe air was in the early morning; like the flap of a wave; the kiss\r\nof a wave; chill and sharp and yet (for a girl of eighteen as she\r\nthen was) solemn, feeling as she did, standing there at the open\r\nwindow, that something awful was about to happen; looking at the\r\nflowers, at the trees with the smoke winding off them and the rooks\r\nrising, falling; standing and looking until Peter Walsh said,\r\n"Musing among the vegetables?"'
>>> s7 = word_tokenize(sents[7])
>>> len(s7)
107

cf. the Brown corpus average of about 20 words per sentence.
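
A small follow-up sketch that extends the same comparison to the whole novel (reusing sents and word_tokenize from above; no outputs shown, since they depend on the exact text slice):

>>> lengths = [len(word_tokenize(s)) for s in sents]
>>> max(lengths)                 # length of the longest sentence, in tokens
>>> sum(lengths) / len(lengths)  # the novel's own average, for comparison with Brown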

Python save/restore corpus

json can be used (and is more standard across programming languages), but pickle is a Python library for exactly this purpose:

>>> import pickle
>>> f = open('dalloway.pickle','wb')   # 'wb' = write binary
>>> pickle.dump(raw,f)
>>> f.close()
>>> f = open('dalloway.pickle','rb')   # 'rb' = read binary
>>> raw2 = pickle.load(f)
>>> raw == raw2
True
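
For comparison, a minimal sketch of the json version (fine for a plain string like raw; the file name 'dalloway.json' is just illustrative, and the resulting file is plain text, readable from other languages):

>>> import json
>>> with open('dalloway.json', 'w') as f:
...     json.dump(raw, f)
...
>>> with open('dalloway.json') as f:
...     raw2 = json.load(f)
...
>>> raw == raw2
True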

nltk book: chapter 3

Python formatted output (printed here as just one column, not three):

>>> import nltk
>>> import pickle
>>> f = open('dalloway.pickle','rb')
>>> raw = pickle.load(f)
>>> f.close()
>>> fd = nltk.FreqDist(nltk.word_tokenize(raw))
>>> t50 = fd.most_common(50)
>>> m = max(len(t[0]) for t in t50)
>>> m
8
>>> for x,y in t50:
...     print('{:{width}} {}'.format(x,y,width=m))
...
,        6098
.        3017
the      3015
and      1625
of       1525
;        1473
to       1447
a        1328
was      1254
her      1227
she      1157
in       1107
had      928
he       908
it       712
that     622
with     565
--       545
his      490
''       458
for      446
on       441
at       427
him      421
said     410
not      403
as       396
``       388
She      372
?        361
!        346
one      317
's       306
all      305
they     305
(        290
)        290
would    278
were     276
But      270
He       269
so       266
which    266
could    264
Clarissa 263
this     254
thought  252
be       245
up       232
like     232
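
One way to get the three-column layout is to chunk t50 into rows of three and format each row; a sketch (row-major order, reusing t50 and m from above):

>>> for row in (t50[i:i+3] for i in range(0, len(t50), 3)):
...     print('   '.join('{:{w}} {:>4}'.format(x, y, w=m) for x, y in row))
...

Each cell is a word padded to width m plus a right-aligned count, so the columns line up.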

nltk book: chapter 3

At the terminal, line wrapping is quite arbitrary; textwrap.fill() gives controlled wrapping:

>>> from textwrap import fill
>>> pieces = ['{} ({})'.format(t[0],t[1]) for t in t50]
>>> s50 = ' '.join(pieces)
>>> print(fill(s50))
, (6098) . (3017) the (3015) and (1625) of (1525) ; (1473) to (1447) a (1328) was (1254) her (1227) she (1157) in (1107) had (928) he (908) it (712) that (622) with (565) -- (545) his (490) '' (458) for (446) on (441) at (427) him (421) said (410) not (403) as (396) `` (388) She (372) ? (361) ! (346) one (317) 's (306) all (305) they (305) ( (290) ) (290) would (278) were (276) But (270) He (269) so (266) which (266) could (264) Clarissa (263) this (254) thought (252) be (245) up (232) like (232)

fill(): text is preferably wrapped at whitespace and right after the hyphens in hyphenated words; only then will long words be broken, if necessary.
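
fill() also takes keyword arguments that control this behaviour; a small sketch (the values below are just illustrative):

>>> print(fill(s50, width=60, break_on_hyphens=False, break_long_words=False))

This wraps at 60 columns and never breaks at hyphens or inside long words.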

nltk book: chapter 3

More fancy formatting: join each word to its count with an underscore, so that fill() never splits a word from its count across lines, then swap the underscores back for spaces after wrapping:

>>> pieces = ['{}_({})'.format(t[0],t[1]) for t in t50]
>>> s50 = ' '.join(pieces)
>>> print(fill(s50).replace('_',' '))
, (6098) . (3017) the (3015) and (1625) of (1525) ; (1473) to (1447) a (1328) was (1254) her (1227) she (1157) in (1107) had (928) he (908) it (712) that (622) with (565) -- (545) his (490) '' (458) for (446) on (441) at (427) him (421) said (410) not (403) as (396) `` (388) She (372) ? (361) ! (346) one (317) 's (306) all (305) they (305) ( (290) ) (290) would (278) were (276) But (270) He (269) so (266) which (266) could (264) Clarissa (263) this (254) thought (252) be (245) up (232) like (232)

nltk book: chapter 4

Chapter 4, "Writing Structured Programs", teaches Python with many cool examples relevant to text processing.

Example: find the longest words in Milton's Paradise Lost:

>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']
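
A related one-liner sketch, using sorted() with a key function to pull out the ten longest distinct word types rather than only those of maximal length (the cutoff of ten is arbitrary):

>>> sorted(set(text), key=len, reverse=True)[:10]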

nltk book: chapter 4 Advanced topic: Generators (functions with yield instead of return)
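
A minimal toy illustration (not the book's example; it reuses sents and word_tokenize from the Mrs. Dalloway slides): a function containing yield returns a lazy generator object that produces one value at a time, on demand:

>>> def sentence_lengths(sentences):
...     for s in sentences:
...         yield len(word_tokenize(s))   # computed lazily, one sentence at a time
...
>>> gen = sentence_lengths(sents)
>>> next(gen)                             # tokenizes only sents[0]
>>> max(sentence_lengths(sents))          # consumers like max() iterate over the generator directly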

nltk book: chapter 4 Brown corpus

nltk book: chapter 4 WordNet
