LING 388: Computers and Language

Lecture 23

nltk book: chapter 3 Last time:

nltk book: chapter 3 Download: raw file = 1 (long) string
Text number 2554 is an English translation of Crime and Punishment

nltk book: chapter 3 Adjusting start and end:
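A minimal sketch of trimming the downloaded string to the body of the text. The marker strings and the toy sample below are assumptions for illustration; a real Project Gutenberg file may use different header and footer wording, so inspect the raw string first.

```python
def trim_raw(raw, start_marker, end_marker):
    """Return the slice of raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)    # first occurrence, or -1 if absent
    end = raw.rfind(end_marker)       # last occurrence, or -1 if absent
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]

# Toy stand-in for the downloaded file (hypothetical content).
sample = (
    "Header: licence and catalogue information\n"
    "PART I\n"
    "Body of the novel goes here.\n"
    "End of Project Gutenberg's Crime and Punishment\n"
    "Footer: more licence text\n"
)

trimmed = trim_raw(sample, "PART I", "End of Project Gutenberg's Crime")
print(trimmed)
```

Using find() for the start and rfind() for the end keeps the slice as wide as possible if a marker happens to occur more than once.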

nltk book: chapter 3 Searching Tokenized Text in nltk
Angle brackets <…> mark token boundaries:
>>> text[:20]
['Title', ':', 'Mrs.', 'Dalloway', 'Author', ':', 'Virginia', 'Woolf', 'Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the', 'flowers', 'herself', '.', 'For', 'Lucy']
>>> text.findall(r"<Mrs\.> (<\w+>)")
Dalloway; Dalloway; Foxcroft; Dalloway; Asquith; Dalloway; Richard; Dalloway; Dalloway; Dalloway; Coates; Coates; Bletchley; Bletchley; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dalloway; Walker; Dalloway; Walker; Dalloway; Dalloway; Dalloway; Dalloway; Turner; Filmer; Hugh; Septimus; Filmer; Filmer; Warren; Smith; Filmer; Smith; Warren; Dalloway; Whitbread; Marsham; Marsham; Marsham; Marsham; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Marsham; Marsham; Dalloway; Dalloway; Gorham; Dalloway; Filmer; Peters; Peters; Filmer; Peters; Peters; Filmer; Peters; Peters; Peters; Peters; Filmer; Peters; Peters; Peters; Filmer; Filmer; Filmer; Williams; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Burgess; Burgess; Burgess; Morris; Morris; Walker; Walker; Dalloway; Walker; Walker; Walker; Parkinson; Barnet; Barnet; Barnet; Barnet; Barnet; Garrod; Hilbery; Mount; Dakers; Durrant; Hilbery; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Hilbery; Hilbery
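To see why angle brackets can act as token boundaries, here is a simplified sketch of one way such a search could be carried out with the plain re module. The function token_findall is a hypothetical helper written for this illustration, not NLTK's own code, and it assumes no token contains < or >.

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with a pattern that uses <...> to delimit tokens."""
    # Join the tokens into one string with explicit boundaries: <Mrs.><Dalloway>...
    raw = "".join("<" + tok + ">" for tok in tokens)
    # Translate the token pattern into an ordinary regular expression.
    regexp = re.sub(r"\s", "", pattern)             # spaces in the pattern are ignored
    regexp = regexp.replace("<", "(?:<(?:")         # opening bracket starts a token
    regexp = regexp.replace(">", ")>)")             # closing bracket ends a token
    regexp = re.sub(r"(?<!\\)\.", "[^>]", regexp)   # a bare . may not cross a boundary
    hits = re.findall(regexp, raw)
    # Strip the outer brackets and split multi-token hits back into token lists.
    return [hit[1:-1].split("><") for hit in hits]

tokens = ["Mrs.", "Dalloway", "said", "Mrs.", "Walker", "sold",
          "iron", "and", "other", "metals"]

after_mrs = token_findall(tokens, r"<Mrs\.> (<\w+>)")
and_other = token_findall(tokens, r"<\w*> <and> <other> <\w*s>")
print(after_mrs)
print(and_other)
```

With parentheses in the pattern, only the captured token is returned; without them, the whole matched token sequence comes back.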

nltk book: chapter 3 nltk Text: .collocations() and .concordance(word)
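As a rough illustration of what a concordance shows (every occurrence of a word together with its surrounding context), here is a small pure-Python sketch. The function name, the bracket formatting and the width parameter are assumptions made for this example; NLTK's own .concordance() formats its output differently.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of word with up to width tokens of context per side."""
    lines = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:                      # case-insensitive comparison
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = "Mrs. Dalloway said she would buy the flowers herself .".split()
lines = concordance(tokens, "she", width=2)
for line in lines:
    print(line)
```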

nltk book: chapter 3

nltk book: chapter 3
>>> import nltk
>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other landmarks; Statues and other monuments; pearls and other jewels; charts and other items; roads and other features; figures and other objects; military and other areas; demands and other factors; abstracts and other compilations; iron and other metals

nltk book: chapter 3 3.5 Useful Applications of Regular Expressions
Extracting Word Pieces
Relative frequency of sequences of two or more vowels:
>>> import nltk
>>> import re
>>> from nltk.corpus import ptb
>>> t = ptb.words(categories='news')
>>> w = set(t)
>>> len(w)
49817
>>> len(t)
>>> fd = nltk.FreqDist(vvs for word in w for vvs in re.findall(r'[aeiou]{2,}', word))
>>> fd.most_common(10)
[('io', 2090), ('ea', 1882), ('ou', 1421), ('ie', 1418), ('ia', 1083), ('ai', 970), ('ee', 874), ('oo', 783), ('au', 448), ('ei', 442)]
>>> fd.plot()
ptb: you don't have access to the full treebank (in this course)
treebank: use this instead
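The same counting idea can be tried without any corpus installed. The sketch below swaps in collections.Counter for nltk.FreqDist and uses a tiny made-up word list, so the numbers are only illustrative.

```python
import re
from collections import Counter

# Toy vocabulary standing in for the set of corpus word types (hypothetical data).
words = {"audio", "radio", "beautiful", "book", "cat"}

# Count every run of two or more vowels found inside each word type.
counts = Counter(
    vvs
    for word in sorted(words)
    for vvs in re.findall(r"[aeiou]{2,}", word)
)

print(counts.most_common())
```

Because {2,} is greedy, a run such as "eau" in "beautiful" is counted once as a three-vowel sequence rather than as "ea" plus "u".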

nltk book: chapter 3 iou eau

nltk book: chapter 3 Readability: leave out word-internal vowels
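A short sketch of the idea, along the lines of the book's example: keep vowel runs at the start or end of a word, keep every consonant, and drop vowels in between. The sample words are chosen for illustration.

```python
import re

# Leading vowels, trailing vowels, or any single non-vowel character.
PIECES = r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]"

def compress(word):
    """Drop word-internal vowels, keeping initial and final vowel runs."""
    return "".join(re.findall(PIECES, word))

samples = ["universal", "declaration", "area"]
compressed = [compress(w) for w in samples]
print(compressed)
```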

nltk book: chapter 3 Word stemming
book example from Monty Python And The Holy Grail /watch?v=eXmwK2-R2dY
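A naive suffix-stripping stemmer can be sketched with a single regular expression. The suffix list here is a small illustrative assumption, and the results are deliberately crude; a real stemmer applies many more rules.

```python
import re

# Non-greedy stem followed by an optional suffix from a short, hand-picked list.
STEM_RE = r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$"

def stem(word):
    """Return the word with at most one listed suffix removed."""
    stem_part, suffix = re.findall(STEM_RE, word)[0]
    return stem_part

tokens = ["women", "lying", "ponds", "distributing", "swords", "is", "basis"]
stems = [stem(t) for t in tokens]
print(stems)
```

The non-greedy (.*?) makes the stem as short as possible, which is why "lying" becomes "ly" and "is" becomes "i": useful reminders that regex stemming overgenerates.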

nltk book: chapter 3

nltk book: chapter 3 Class IndexedText defined next slide:

nltk book: chapter 3 Advanced: generates (index, word) tuples
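The indexing step can be sketched in plain Python: enumerate() yields (index, word) tuples, and each word's stem is mapped to the positions where it occurs. Both toy_stem and build_index are hypothetical helpers for this illustration and are not the IndexedText class from the slides.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercase, then strip one final 's' from longer words."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def build_index(tokens, stem=toy_stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for i, word in enumerate(tokens):    # enumerate yields (index, word) tuples
        index[stem(word)].append(i)
    return index

tokens = ["Swords", "and", "ponds", ",", "a", "sword", "in", "a", "pond"]
index = build_index(tokens)
print(index["sword"], index["pond"])
```

Looking up a stem then gives every position of any inflected form, which is what makes a stem-based concordance possible.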

nltk book: chapter 3
>>> import nltk
>>> from nltk import word_tokenize
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = word_tokenize(raw)
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

nltk book: chapter 3 3.7 Regular Expressions for Tokenizing Text
We take word_tokenize() for granted. Simple tokenization:
>>> raw
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> import re
>>> re.split(r' ', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
Split on any whitespace (\s):
>>> re.split(r'\s', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

nltk book: chapter 3 Split on non-word characters:
>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']
Keep punctuation:
>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

nltk book: chapter 3
>>> re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
(?:...) means non-capturing group
\w+(?:[-']\w+)*   word (possibly hyphenated)
'                 apostrophe
[-.(]+            -- or …
\S\w*             punctuation
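The same four-way alternation is easier to read when written with re.VERBOSE, which allows whitespace and comments inside the pattern. This is a restatement of the pattern above on a shortened sample sentence; the name TOKEN_RE and the sample string are choices made for this sketch.

```python
import re

TOKEN_RE = re.compile(r"""
    \w+(?:[-']\w+)*   # word, possibly with internal hyphens or apostrophes
  | '                 # a lone apostrophe used as a quotation mark
  | [-.(]+            # runs of hyphens, periods or open parentheses
  | \S\w*             # any other non-space character plus trailing word characters
""", re.VERBOSE)

sample = "'I won't have any pepper--Maybe it's hot-tempered,'..."
tokens = TOKEN_RE.findall(sample)
print(tokens)
```

The order of the alternatives matters: the word pattern is tried first, so "won't" and "hot-tempered" stay whole, while the catch-all \S\w* only fires when nothing more specific matches.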

nltk book: chapter 3 Remind ourselves what word_tokenize() does for ':
>>> word_tokenize(raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'wo', "n't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

nltk book: chapter 3
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> word_tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']
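For comparison, abbreviations, prices and ellipses can also be handled by a hand-written verbose pattern. The sketch below is one possible pattern, not what word_tokenize() does internally; the sample sentence is reconstructed from the token output above, and unlike word_tokenize() this pattern keeps the dollar sign attached to the amount.

```python
import re

PATTERN = re.compile(r"""
    (?:[A-Z]\.)+          # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?%?    # numbers, with optional currency sign, decimals, percent
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \.\.\.                # ellipsis
  | [^\w\s]               # any other single punctuation character
""", re.VERBOSE)

text = "That U.S.A. poster-print costs $12.40..."
tokens = PATTERN.findall(text)
print(tokens)
```

Putting the abbreviation and number alternatives before the generic word pattern is a design choice: otherwise \w+ would grab "U" or "12" first and split those tokens apart.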
