LING 388: Computers and Language


LING 388: Computers and Language Lecture 23

nltk book: chapter 3 Last time:

nltk book: chapter 3 Download: the raw file is one (long) string. Text number 2554 is an English translation of Crime and Punishment.

nltk book: chapter 3 Adjusting start and end:
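The "adjusting start and end" step can be sketched offline with string slicing, as the nltk book does for text 2554: find where the novel proper begins and where the Project Gutenberg footer starts, then slice. The string below is a toy stand-in for the downloaded file; with the real download, raw holds the whole book.

```python
# Trim Project Gutenberg boilerplate by slicing the raw string.
# 'raw' is a toy stand-in here; the markers mirror the real text 2554.
raw = ("Project Gutenberg header material\n"
       "PART I\n"
       "On an exceptionally hot evening early in July a young man came out\n"
       "End of Project Gutenberg's Crime and Punishment")
start = raw.find("PART I")                              # where the novel begins
end = raw.rfind("End of Project Gutenberg's Crime")     # where the footer starts
raw = raw[start:end]                                    # keep only the novel
```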

nltk book: chapter 3 Searching Tokenized Text in nltk: angle brackets <…> mark token boundaries
>>> text[:20]
['Title', ':', 'Mrs.', 'Dalloway', 'Author', ':', 'Virginia', 'Woolf', 'Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the', 'flowers', 'herself', '.', 'For', 'Lucy']
>>> text.findall(r"<Mrs\.> (<\w+>)")
Dalloway; Dalloway; Foxcroft; Dalloway; Asquith; Dalloway; Richard; Dalloway; Dalloway; Dalloway; Coates; Coates; Bletchley; Bletchley; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dalloway; Walker; Dalloway; Walker; Dalloway; Dalloway; Dalloway; Dalloway; Turner; Filmer; Hugh; Septimus; Filmer; Filmer; Warren; Smith; Filmer; Smith; Warren; Dalloway; Whitbread; Marsham; Marsham; Marsham; Marsham; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Marsham; Marsham; Dalloway; Dalloway; Gorham; Dalloway; Filmer; Peters; Peters; Filmer; Peters; Peters; Filmer; Peters; Peters; Peters; Peters; Filmer; Peters; Peters; Peters; Filmer; Filmer; Filmer; Williams; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Burgess; Burgess; Burgess; Morris; Morris; Walker; Walker; Dalloway; Walker; Walker; Walker; Parkinson; Barnet; Barnet; Barnet; Barnet; Barnet; Garrod; Hilbery; Mount; Dakers; Durrant; Hilbery; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Hilbery; Hilbery

nltk book: chapter 3 nltk Text: .collocations() and .concordance(word)
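The core idea behind nltk's Text.concordance(word) can be sketched in plain Python: print each occurrence of a word with a small window of surrounding context. This is an illustrative sketch, not nltk's implementation (which also pads and aligns the display); the tokens below are a toy sample.

```python
# Minimal sketch of what nltk's Text.concordance(word) does:
# show each occurrence of a word with a little surrounding context.
tokens = ['Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the',
          'flowers', 'herself', '.', 'For', 'Lucy', 'had', 'her',
          'work', 'cut', 'out', 'for', 'her', '.']

def concordance(tokens, word, window=3):
    """Return a context window around each occurrence of word."""
    hits = []
    for i, t in enumerate(tokens):
        if t.lower() == word.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{t}] {right}")
    return hits

lines = concordance(tokens, 'her')
```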

nltk book: chapter 3

nltk book: chapter 3
>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other landmarks; Statues and other monuments; pearls and other jewels; charts and other items; roads and other features; figures and other objects; military and other areas; demands and other factors; abstracts and other compilations; iron and other metals
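In nltk's findall syntax the angle brackets range over tokens, not characters. A rough plain-string analogue can be written with re directly, if the text is a single string with single spaces between words; the sentence below is a made-up sample, not Brown corpus data.

```python
import re

# Plain-string analogue of nltk's token pattern
# "<\w*> <and> <other> <\w*s>": here word boundaries are just spaces.
text = ("Exports of iron and other metals rose, and roads and other "
        "features appeared on the new charts and other items.")
hits = re.findall(r'\w+ and other \w+s', text)
```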

nltk book: chapter 3 3.5 Useful Applications of Regular Expressions. Extracting Word Pieces: relative frequency of sequences of two or more vowels
>>> import nltk
>>> import re
>>> from nltk.corpus import ptb
>>> t = ptb.words(categories='news')
>>> w = set(t)
>>> len(w)
49817
>>> len(t)
1253013
>>> fd = nltk.FreqDist(vvs for word in w for vvs in re.findall(r'[aeiou]{2,}', word))
>>> fd.most_common(10)
[('io', 2090), ('ea', 1882), ('ou', 1421), ('ie', 1418), ('ia', 1083), ('ai', 970), ('ee', 874), ('oo', 783), ('au', 448), ('ei', 442)]
>>> fd.plot()
Note: you don't have access to the full treebank (ptb) in this course; use nltk.corpus.treebank instead.
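Since the full Penn Treebank (ptb) isn't available, the slide suggests the free nltk.corpus.treebank sample instead; the counting itself needs nothing beyond re and collections.Counter. The sketch below uses a toy vocabulary so it runs without nltk; the counts are illustrative, not the treebank's.

```python
import re
from collections import Counter

# Same vowel-sequence tally as the slide, with collections.Counter
# standing in for nltk.FreqDist.  With nltk installed, the vocabulary
# would come from: w = set(nltk.corpus.treebank.words())
w = {'various', 'piece', 'beautiful', 'been', 'said', 'nation', 'group'}
fd = Counter(vvs for word in w
                 for vvs in re.findall(r'[aeiou]{2,}', word))
```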

nltk book: chapter 3 (plot of the vowel-sequence distribution; longer sequences such as iou, eau)

nltk book: chapter 3 Readability: leave out word-internal vowels
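The nltk book's "compress" trick keeps word-initial vowel sequences, word-final vowel sequences, and all consonants, dropping only word-internal vowels. A sketch using the book's regexp:

```python
import re

# Keep word-initial vowels, word-final vowels, and every consonant;
# drop word-internal vowels (the nltk book's "compress" idea).
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

out = ' '.join(compress(w) for w in
               ['Universal', 'Declaration', 'of', 'Human', 'Rights'])
```

The result stays surprisingly readable: short function words like 'of' survive intact because their vowels are word-initial.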

nltk book: chapter 3 Word stemming: book example from Monty Python and the Holy Grail https://www.youtube.com/watch?v=eXmwK2-R2dY
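Before introducing the real stemmers (Porter, Lancaster), the book builds a simple regexp-based stem(): a non-greedy stem followed by an optional known suffix, anchored to the whole word. It is crude (note what happens to 'lying'), but it needs only the re module:

```python
import re

# The nltk book's simple regexp stemmer: non-greedy stem plus an
# optional suffix from a fixed list, anchored to the whole word.
def stem(word):
    regexp = r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$"
    st, suffix = re.findall(regexp, word)[0]
    return st

tokens = ['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lying',
          'in', 'ponds', 'distributing', 'swords']
stems = [stem(t) for t in tokens]
```

Because the stem group is non-greedy, 'lying' is wrongly split as 'ly' + 'ing'; the book uses exactly this flaw to motivate the off-the-shelf stemmers.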

nltk book: chapter 3

nltk book: chapter 3 Class IndexedText defined next slide:

nltk book: chapter 3 Advanced: generates (index, word) tuples
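The "(index, word) tuples" come from enumerate. The book's IndexedText class feeds them into nltk.Index; the same core idea can be sketched with a defaultdict (this is an illustrative stand-in, not the book's class, and str.lower stands in for a real stemmer):

```python
from collections import defaultdict

# Core of IndexedText: map each (normalized) word to the list of
# positions where it occurs, using enumerate for (index, word) tuples.
text = ['strange', 'women', 'lying', 'in', 'ponds', 'Strange', 'women']
index = defaultdict(list)
for i, word in enumerate(text):
    index[word.lower()].append(i)
```

Looking up a word then returns every position at once, which is what makes fast concordance display possible.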

nltk book: chapter 3
>>> from nltk import word_tokenize
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = word_tokenize(raw)
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

nltk book: chapter 3 3.7 Regular Expressions for Tokenizing Text. We take word_tokenize() for granted. Simple tokenization:
>>> raw
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--Maybe it's always pepper that makes people hot-tempered,'..."
>>> import re
>>> re.split(r' ', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'\s', raw)  # any whitespace
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
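One subtlety worth making explicit: r'\s' splits at every single whitespace character, so a run of two or more spaces produces empty strings in the result; r'\s+' treats the whole run as one delimiter. A tiny example (toy string, not the Alice text):

```python
import re

# r'\s' splits at each whitespace character, so the double space
# below yields an empty string; r'\s+' consumes the whole run.
s = 'one two  three'
a = re.split(r'\s', s)
b = re.split(r'\s+', s)
```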

nltk book: chapter 3 Split on non-word characters:
>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']
Keeping punctuation:
>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

nltk book: chapter 3
>>> re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
Reading the pattern: (?: …) is a non-capturing group; \w+(?:[-']\w+)* matches a word (possibly hyphenated, or with an internal apostrophe); ' matches an apostrophe; [-.(]+ matches punctuation such as -- or …

nltk book: chapter 3 Remind ourselves what word_tokenize() does for ':
>>> word_tokenize(raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'wo', "n't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

nltk book: chapter 3
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> word_tokenize(text)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']