LING/C SC 581: Advanced Computational Linguistics


LING/C SC 581: Advanced Computational Linguistics Lecture 2 Jan 15th

From last time
Did everyone install Python 3 and nltk/nltk_data?
We'll do Homework 2 on this today…

Importing your own corpus
Learning to import your own texts: plain text OR Beautiful Soup (html)
Read nltk book chapter 3
Assume:
import nltk, re, pprint
from nltk import word_tokenize
Reading local files

Project Gutenberg http://www.gutenberg.org/catalog/

Step 1: download
Download: raw file = 1 (long) string
Text number 2554 is an English translation of Crime and Punishment

Step 1: word_tokenize() Tokenize: list of words

Step 1: Beautiful Soup
html: get_text() from BeautifulSoup

nltk Text object: methods .collocations() and .concordance(word)

Step 1: getting rid of extraneous start/end text Adjusting start and end:

Mrs. Dalloway
Project Gutenberg Australia (not indexed by www.gutenberg.org)
http://gutenberg.net.au/ebooks02/0200991.txt
Mrs. Dalloway by Virginia Woolf (1925)
Code to read plaintext:
from urllib import request
url = "http://gutenberg.net.au/ebooks02/0200991.txt"
response = request.urlopen(url)
raw = response.read().decode('latin-1') # utf-8 common
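The `# utf-8 common` comment suggests the right codec differs from site to site. One way to avoid hard-coding it is to ask the response headers first. This is a minimal sketch, assuming the server declares a charset in its Content-Type header; `pick_encoding` is a hypothetical helper (not part of urllib or nltk), and the `Message` object below only stands in for `response.headers`.

```python
from email.message import Message


def pick_encoding(headers, default="latin-1"):
    """Return the charset declared in the headers, or `default` if none is declared."""
    charset = headers.get_content_charset()
    return charset if charset else default


# Stand-in for response.headers as returned by urllib.request.urlopen(url);
# the real object is a subclass of email.message.Message.
headers = Message()
headers["Content-Type"] = "text/plain; charset=utf-8"
print(pick_encoding(headers))

# No declared charset: fall back to the default used on the slide.
print(pick_encoding(Message()))
```

With a live response, the call would look like `raw = response.read().decode(pick_encoding(response.headers))`.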

Mrs. Dalloway
Dealing with html
>>> html = request.urlopen(url).read().decode('latin-1')
>>> html[:60]
'\r\n\r\n<table width="45%" border ="0">\r\n<tr>\r\n<td bgcolor="#'
>>> from bs4 import BeautifulSoup
>>> raw = BeautifulSoup(html).get_text()
>>> tokens = word_tokenize(raw)
>>> len(tokens)
77977
>>> tokens[:100]
['ï', '»', '¿', 'Project', 'Gutenberg', 'Australia', 'a', 'treasure-trove', 'of', 'literature', 'treasure', 'found', 'hidden', 'with', 'no', 'evidence', 'of', 'ownership', 'Title', ':', 'Mrs.', 'Dalloway', '(', '1925', ')', 'Author', ':', 'Virginia', 'Woolf', '*', 'A', 'Project', 'Gutenberg', 'of', 'Australia', 'eBook', '*', 'eBook', 'No', '.', ':', '0200991.txt', 'Edition', ':', '1', 'Language', ':', 'English', 'Character', 'set', 'encoding', ':', 'Latin-1', '(', 'ISO-8859-1', ')', '--', '8', 'bit', 'Date', 'first', 'posted', ':', 'November', '2002', 'Date', 'most', 'recently', 'updated', ':', 'November', '2002', 'This', 'eBook', 'was', 'produced', 'by', ':', 'Don', 'Lainson', 'dlainson', '@', 'sympatico.ca', 'Project', 'Gutenberg', 'of', 'Australia', 'eBooks', 'are', 'created', 'from', 'printed', 'editions', 'which', 'are', 'in', 'the', 'public', 'domain', 'in']
>>> tokens[-100:]
['shall', 'go', 'and', 'talk', 'to', 'him', '.', 'I', 'shall', 'say', 'goodnight', '.', 'What', 'does', 'the', 'brain', 'matter', ',', "''", 'said', 'Lady', 'Rosseter', ',', 'getting', 'up', ',', '``', 'compared', 'with', 'the', 'heart', '?', "''", '``', 'I', 'will', 'come', ',', "''", 'said', 'Peter', ',', 'but', 'he', 'sat', 'on', 'for', 'a', 'moment', '.', 'What', 'is', 'this', 'terror', '?', 'what', 'is', 'this', 'ecstasy', '?', 'he', 'thought', 'to', 'himself', '.', 'What', 'is', 'it', 'that', 'fills', 'me', 'with', 'extraordinary', 'excitement', '?', 'It', 'is', 'Clarissa', ',', 'he', 'said', '.', 'For', 'there', 'she', 'was', '.', 'THE', 'END', 'This', 'site', 'is', 'full', 'of', 'FREE', 'ebooks', '-', 'Project', 'Gutenberg', 'Australia']
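The first three tokens above ('ï', '»', '¿') are what the UTF-8 byte order mark (bytes EF BB BF) looks like when decoded as Latin-1. A minimal sketch of removing it before decoding follows; `decode_without_bom` is a hypothetical helper, and it assumes that this byte sequence never occurs as genuine Latin-1 text in the page.

```python
import codecs


def decode_without_bom(data, encoding="latin-1"):
    """Drop every UTF-8 byte order mark (EF BB BF) from `data`, then decode."""
    cleaned = data.replace(codecs.BOM_UTF8, b"")
    return cleaned.decode(encoding)


# Made-up bytes imitating a page whose text is preceded by a BOM.
sample = b"<pre>\r\n" + codecs.BOM_UTF8 + b"Project Gutenberg Australia"

print(repr(sample.decode("latin-1")))    # the BOM survives as three stray characters
print(repr(decode_without_bom(sample)))  # the BOM is gone
```

The result can be passed to BeautifulSoup and word_tokenize exactly as on the slide.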

Mrs. Dalloway
>>> raw[:150]
'\r\n\r\n<table width="45%" border ="0">\r\n<tr>\r\n<td bgcolor="#FFE4E1"><font color="#800000" size="5"><p style="text-align:center"><b><a href="http://gut'

Mrs. Dalloway
>>> response = request.urlopen(url)
>>> raw = response.read().decode('latin-1')
>>> m = re.search('Title',raw)
>>> m
<_sre.SRE_Match object; span=(426, 431), match='Title'>
>>> raw = raw[431:]
>>> m = re.search('Title',raw)
>>> m
<_sre.SRE_Match object; span=(1217, 1222), match='Title'>
>>> raw = raw[1217:]
>>> raw[:200]
'Title: Mrs. Dalloway\r\nAuthor: Virginia Woolf\r\n\r\n\r\n\r\n\r\nMrs. Dalloway said she would buy the flowers herself.\r\n\r\nFor Lucy had her work cut out for her. The doors would be taken\r\noff their hing'

Mrs. Dalloway
>>> raw[-400:]
'What is\r\nit that fills me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END\r\n\r\n\r\n\r\n\r\n\r\n</pre>\r\n<p style="margin- left:10%"><img src="/pga-australia.jpg" width="80" height="75" alt=""> </p>\r\n\r\n<p><b>This site is full of FREE ebooks - <a href="http://gutenberg.net.au" target="_blank">Project Gutenberg Australia</a></b></p>\r\n<!-- ad goes here -- >\r\n\r\n\r\n\r\n\r\n'
>>> m = re.search('THE END',raw)
>>> m
<_sre.SRE_Match object; span=(368969, 368976), match='THE END'>
>>> raw = raw[:368976]
>>> raw[-400:]
'Sally.  "I shall go\r\nand talk to him.  I shall say goodnight.  What does the brain\r\nmatter," said Lady Rosseter, getting up, "compared with the heart?"\r\n\r\n"I will come," said Peter, but he sat on for a moment.  What is\r\nthis terror? what is this ecstasy? he thought to himself.  What is\r\nit that fills me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END'
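The trimming on the last two slides copies offsets by hand from the match objects. The same idea can be sketched with `m.start()` and `m.end()` so that no numbers are hard-coded. `trim_to_body` and its parameters are hypothetical names, and the sample string is made up to imitate a page whose start marker appears twice.

```python
import re


def trim_to_body(raw, start_pattern, end_pattern, start_occurrence=1):
    """Keep the text from the chosen match of start_pattern through the
    last match of end_pattern that follows it."""
    starts = list(re.finditer(start_pattern, raw))
    if len(starts) < start_occurrence:
        raise ValueError("start pattern not found often enough")
    begin = starts[start_occurrence - 1].start()

    ends = [m for m in re.finditer(end_pattern, raw) if m.start() >= begin]
    if not ends:
        raise ValueError("end pattern not found after the start")
    return raw[begin:ends[-1].end()]


sample = (
    "Title: header junk\n"
    "Title: Mrs. Dalloway\n"
    "For there she was.\n"
    "THE END\n"
    "<p>footer</p>"
)
# The second 'Title' is the one that starts the book text.
print(trim_to_body(sample, "Title", "THE END", start_occurrence=2))
```

Choosing the occurrence explicitly mirrors the two successive `re.search('Title', raw)` calls in the session above.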

Mrs. Dalloway
>>> tokens = word_tokenize(raw)
>>> type(tokens)
<class 'list'>
>>> len(tokens)
77718
>>> tokens[:100]
['Title', ':', 'Mrs.', 'Dalloway', 'Author', ':', 'Virginia', 'Woolf', 'Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the', 'flowers', 'herself', '.', 'For', 'Lucy', 'had', 'her', 'work', 'cut', 'out', 'for', 'her', '.', 'The', 'doors', 'would', 'be', 'taken', 'off', 'their', 'hinges', ';', 'Rumpelmayer', "'s", 'men', 'were', 'coming', '.', 'And', 'then', ',', 'thought', 'Clarissa', 'Dalloway', ',', 'what', 'a', 'morning', '--', 'fresh', 'as', 'if', 'issued', 'to', 'children', 'on', 'a', 'beach', '.', 'What', 'a', 'lark', '!', 'What', 'a', 'plunge', '!', 'For', 'so', 'it', 'had', 'always', 'seemed', 'to', 'her', ',', 'when', ',', 'with', 'a', 'little', 'squeak', 'of', 'the', 'hinges', ',', 'which', 'she', 'could', 'hear', 'now', ',', 'she', 'had', 'burst']

Mrs. Dalloway
>>> text = nltk.Text(tokens)
>>> len(text)
77718
>>> len(set(text))
7623
>>> len(set(text)) / len(text)
0.09808538562495175
>>> text.count('Dalloway')
104
>>> text.count('Mrs.')
118
>>> fd = nltk.FreqDist(text)
>>> fd
FreqDist({',': 6098, '.': 3017, 'the': 3015, 'and': 1625, 'of': 1525, ';': 1473, 'to': 1447, 'a': 1328, 'was': 1254, 'her': 1227, ...})
>>> print(fd)
<FreqDist with 7623 samples and 77718 outcomes>
>>> fd['Dalloway']
104
>>> text.collocations()
Peter Walsh; Sir William; Lady Bruton; Miss Kilman; Dr. Holmes; Prime Minister; Ellie Henderson; Mrs. Filmer; Mrs. Dalloway; Hugh Whitbread; Warren Smith; Sally Seton; Aunt Helena; Big Ben; Richard Dalloway; motor car; Miss Parry; motor cars; years ago; Bond Street
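The ratio `len(set(text)) / len(text)` above is the type/token ratio (lexical diversity). A minimal sketch of the same arithmetic on a tiny token list, using `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist`; the helper name `lexical_diversity` and the sample tokens are assumptions for illustration.

```python
from collections import Counter


def lexical_diversity(tokens):
    """Distinct tokens (types) divided by total tokens."""
    return len(set(tokens)) / len(tokens)


# Eight tokens taken from the opening of the novel.
tokens = ['What', 'a', 'lark', '!', 'What', 'a', 'plunge', '!']

counts = Counter(tokens)          # plays the role of nltk.FreqDist here
print(len(tokens), len(counts))   # tokens vs. types
print(lexical_diversity(tokens))  # 5 types / 8 tokens
print(counts['What'])             # same lookup style as fd['Dalloway']
```

The same function applied to the full `text` object should reproduce the 7623 / 77718 figure shown on the slide.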

Mrs. Dalloway
>>> fd2 = nltk.FreqDist(len(word) for word in text)
>>> fd2.most_common()
[(3, 17056), (1, 13433), (4, 11541), (2, 11452), (5, 7915), (6, 5236), (7, 4496), (8, 2980), (9, 1684), (10, 966), (11, 446), (12, 253), (13, 158), (14, 51), (15, 37), (16, 4), (18, 4), (17, 3), (19, 1), (20, 1), (21, 1)]
>>> fd2.plot()

Mrs. Dalloway

Mrs. Dalloway
Searching Tokenized Text in nltk
angle brackets <…> mark token boundaries
>>> text[:20]
['Title', ':', 'Mrs.', 'Dalloway', 'Author', ':', 'Virginia', 'Woolf', 'Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the', 'flowers', 'herself', '.', 'For', 'Lucy']
>>> text.findall(r"<Mrs\.> (<\w+>)")
Dalloway; Dalloway; Foxcroft; Dalloway; Asquith; Dalloway; Richard; Dalloway; Dalloway; Dalloway; Coates; Coates; Bletchley; Bletchley; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dalloway; Walker; Dalloway; Walker; Dalloway; Dalloway; Dalloway; Dalloway; Turner; Filmer; Hugh; Septimus; Filmer; Filmer; Warren; Smith; Filmer; Smith; Warren; Dalloway; Whitbread; Marsham; Marsham; Marsham; Marsham; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Marsham; Marsham; Dalloway; Dalloway; Gorham; Dalloway; Filmer; Peters; Peters; Filmer; Peters; Peters; Filmer; Peters; Peters; Peters; Peters; Filmer; Peters; Peters; Peters; Filmer; Filmer; Filmer; Williams; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Burgess; Burgess; Burgess; Morris; Morris; Walker; Walker; Dalloway; Walker; Walker; Walker; Parkinson; Barnet; Barnet; Barnet; Barnet; Barnet; Garrod; Hilbery; Mount; Dakers; Durrant; Hilbery; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Hilbery; Hilbery
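`text.findall()` prints its matches rather than returning them, so tallying the names needs a different route. A minimal sketch, assuming a plain token list: pair each token with its successor and keep the word that follows 'Mrs.'. The helper `words_after` and the sample tokens are hypothetical; the `\w+` check mirrors the `<\w+>` part of the pattern above.

```python
import re
from collections import Counter


def words_after(tokens, target):
    """Return each token that directly follows `target` and consists of word characters."""
    return [
        nxt
        for cur, nxt in zip(tokens, tokens[1:])
        if cur == target and re.fullmatch(r"\w+", nxt)
    ]


# Made-up token list; with the real data, pass the `tokens` list from word_tokenize.
tokens = ['Mrs.', 'Dalloway', 'said', 'Mrs.', 'Filmer', 'and',
          'Mrs.', 'Dalloway', '.', 'Mrs.', ',']

names = words_after(tokens, 'Mrs.')
print(names)
print(Counter(names).most_common())
```

Counting the result makes it easy to see which characters are most often introduced with the title.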

nltk: .sent_tokenize()
3.8 Segmentation
Sentence segmentation
Brown corpus (pre-segmented):
>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922
(average sentence length in terms of number of words)
>>> raw = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> nltk.sent_tokenize(raw)
["'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL.", "Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."]
>>> nltk.sent_tokenize(raw)[0]
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL."
>>> nltk.sent_tokenize(raw)[1]
"Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
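The Brown figure above is total words divided by total sentences. For raw text that is not pre-segmented, the same measure needs a sentence splitter and a word tokenizer. The sketch below takes both as parameters; `average_sentence_length` is a hypothetical helper, and the two stand-in tokenizers are deliberately crude substitutes used only so the example runs without nltk data.

```python
import re


def average_sentence_length(raw, sent_tokenize, word_tokenize):
    """Mean number of word tokens per sentence, using the supplied tokenizers."""
    sentences = sent_tokenize(raw)
    if not sentences:
        return 0.0
    total_words = sum(len(word_tokenize(sentence)) for sentence in sentences)
    return total_words / len(sentences)


def naive_sentences(text):
    # Crude stand-in: split after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def naive_words(sentence):
    # Crude stand-in: whitespace split, punctuation stays attached.
    return sentence.split()


sample = "What a lark! What a plunge! For there she was."
print(average_sentence_length(sample, naive_sentences, naive_words))
```

With nltk installed, passing `nltk.sent_tokenize` and `nltk.word_tokenize` as the two arguments applies the tokenizers used elsewhere in these slides; note that word_tokenize counts punctuation marks as tokens, as do the Brown corpus word lists.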

Homework 2
Virginia Woolf was famous for her stream-of-consciousness style of writing:
How fresh, how calm, stiller than this of course, the air was in the early morning; like the flap of a wave; the kiss of a wave; chill and sharp and yet (for a girl of eighteen as she then was) solemn, feeling as she did, standing there at the open window, that something awful was about to happen; looking at the flowers, at the trees with the smoke winding off them and the rooks rising, falling; standing and looking until Peter Walsh said, "Musing among the vegetables?"--was that it?--"I prefer men to cauliflowers"--was that it?
Dumbledore's death in the style… https://www.theguardian.com/books/2005/jul/13/harrypotter.jkjoannekathleenrowling4
Download Mrs. Dalloway: http://gutenberg.net.au/ebooks02/0200991.txt

Homework 2
Compute the average sentence length of Mrs. Dalloway
Compare with the average sentence length of the Brown Corpus
Is it true that stream-of-consciousness writing leads to (significantly) longer sentences?
Submit homework by Friday evening
One PDF file: show your workings (Python interpreter)