Week 8 The Natural Language Toolkit (NLTK)‏ Except where otherwise noted, this work is licensed under:

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Machine Learning PoS-Taggers COMP3310 Natural Language Processing Eric.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
1 I256: Applied Natural Language Processing Marti Hearst Aug 30, 2006.
Dr. Radhika Mamidi Corpus. What is a Corpus? a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically.
LING 388 Language and Computers Lecture 22 11/25/03 Sandiway FONG.
For Monday Read Chapter 23, sections 3-4 Homework –Chapter 23, exercises 1, 6, 14, 19 –Do them in order. Do NOT read ahead.
LING 581: Advanced Computational Linguistics Lecture Notes May 5th.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 22, 2004.
1 I256: Applied Natural Language Processing Marti Hearst Sept 25, 2006.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
1 Python Chapter 3 Reading strings and printing. © Samuel Marateck.
CALL: Computer-Assisted Language Learning. 2/14 Computer-Assisted (Language) Learning “Little” programs Purpose-built learning programs (courseware) Using.
Part-of-Speech Tagging & Sequence Labeling
2/28/2008. >>> Overview Arrays in Python – a.k.a. Lists Ranges are Lists Strings vs. Lists Tuples vs. Lists Map-Reduce Lambda Review: Printing to a file.
(Some issues in) Text Ranking. Recall General Framework Crawl – Use XML structure – Follow links to get new pages Retrieve relevant documents – Today.
Computer Literacy PowerPoint Dustin Llanes Comm. 165.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
1 Statistical NLP: Lecture 6 Corpus-Based Work. 2 4 Text Corpora are usually big. They also need to be representative samples of the population of interest.
ELN – Natural Language Processing Giuseppe Attardi
An Introduction to Textual Programming
April 2005CSA2050:NLTK1 CSA2050: Introduction to Computational Linguistics NLTK.
For Friday Finish chapter 23 Homework: –Chapter 22, exercise 9.
Programming Languages V Deena Engel’s class.
Distributional Part-of-Speech Tagging Hinrich Schütze CSLI, Ventura Hall Stanford, CA , USA NLP Applications.
MOM! Phineas and Ferb are … Aims:
Sir Tim Berners-Lee (1955-) British computer scientist Inventor of the World Wide Web in 1989 (developed the first HTML protocol and sent the first messages.
October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2 Shallow Parsing and Chunking Python and NLTK NLTK Exercises.
Data Collections: Dictionaries CSC 161: The Art of Programming Prof. Henry Kautz 11/4/2009.
Python for Informatics: Exploring Information
Computer Programming TCP1224 Chapter 3 Completing the Problem-Solving Process and Getting Started with C++
BY TSHISHONGA AW /04/081 Co-Supervisor : Mr Reg Dodds Supervisor :Professor I.M Venter APPLYING VENDA TEXT TOWARDS THE DEVELOPMENT OF AN INTELLIGENT.
Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová
Python From the book “Think Python”
S1: Chapter 1 Mathematical Models Dr J Frost Last modified: 6 th September 2015.
For Wednesday No reading Homework –Chapter 23, exercise 15 –Process: 1.Create 5 sentences 2.Select a language 3.Translate each sentence into that language.
Tokenization & POS-Tagging
TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.
Grammars Grammars can get quite complex, but are essential. Syntax: the form of the text that is valid Semantics: the meaning of the form – Sometimes semantics.
Using Semantic Relations to Improve Passage Retrieval for Question Answering Tom Morton.
GRAMMARS & PARSING Lecture 8 CS2110 – Spring If you are going to form a group for A2, please do it before tomorrow (Friday) noon.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Statistical Machine Translation Raghav Bashyal. Statistical Machine Translation Uses pre-translated text (copora) Compare translated text to original.
File I/O CMSC 201. Overview Today we’ll be going over: String methods File I/O.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Notes on Python Regular Expressions and parser generators (by D. Parson) These are the Python supplements to the author’s slides for Chapter 1 and Section.
Tools for Linguistic Analysis. Overview of Linguistic Tools  Dictionaries  Linguistic Inquiry and Word Count (LIWC) Linguistic Inquiry and Word Count.
Hybrid Method for Tagging Arabic Text Written By: Yamina Tlili-Guiassa University Badji Mokhtar Annaba, Algeria Presented By: Ahmed Bukhamsin.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
CSC 1010 Programming for All Lecture 2 Introduction to Python Some material based on material from Marty Stepp, Instructor, University of Washington.
POS Tagger and Chunker for Tamil
Python - 2 Jim Eng Overview Lists Dictionaries Try... except Methods and Functions Classes and Objects Midterm Review.
October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2 Python and NLTK Shallow Parsing and Chunking NLTK Lite.
Part-of-Speech Tagging & Sequence Labeling Hongning Wang
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Winter 2016CISC101 - Prof. McLeod1 CISC101 Reminders Quiz 3 this week – last section on Friday. Assignment 4 is posted. Data mining: –Designing functions.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
Problem Solving with NLTK MSE 2400 EaLiCaRA Dr. Tom Way.
Python for NLP and the Natural Language Toolkit
Notes on Python Regular Expressions and parser generators (by D
Natural Language Processing (NLP)
For -G7 programing language Teacher / Shamsa Hassan Alhassouni.
LING/C SC 581: Advanced Computational Linguistics
Python - Strings.
LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong.
Natural Language Processing (NLP)
CSA2050: Introduction to Computational Linguistics
CMPT 120 Lecture 3 - Introduction to Computing Science – Programming language, Variables, Strings, Lists and Modules.
STRING MANUPILATION.
Natural Language Processing (NLP)
Presentation transcript:

Week 8 The Natural Language Toolkit (NLTK)‏ Except where otherwise noted, this work is licensed under:

2 List methods Getting information about a list –list.index(item)‏ –list.count(item)‏ These modify the list in-place, unlike str operations –list.append(item)‏ –list.insert(index, item)‏ –list.remove(item)‏ –list.extend(list2)‏ same as list += list2 –list.sort()‏ –list.reverse()‏

3 List exercise Write a script to print the most frequent token in a text file.

4 And now for something completely different

5 So far, we've studied programming syntax and techniques What about tasks for programming? –Homework –Mathematics, statistics –Biology –Animation –Website development –Game development –Natural language processing Programming tasks? (Sage)‏ (Biopython)‏ (Blender)‏ (Django)‏ (PyGame)‏ (NLTK)‏

6 Natural Language Processing (NLP)‏ How can we make a computer understand language? –Can a human write/talk to the computer? Or can the computer guess/predict the input? –Can the computer talk back? –Based on language rules, patterns, or statistics For now, statistics are more accurate and popular

7 Some areas of NLP shallow processing – the surface level –tokenization –part-of-speech tagging –forms of words deep processing – the underlying structures of language –word order (syntax)‏ –meaning –translation natural language generation

8 The NLTK A collection of: –Python functions and objects for accomplishing NLP tasks –sample texts (corpora)‏ Available at: –Requires Python 2.4 or higher –Click 'Download' and follow instructions for your OS

9 Tokenization Say we want to know the words in Marty's vocabulary – "You know what I hate? Anybody who drives an S.U.V. I'd really like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him square in the teeth. Booyah. Be like, I'm Marty Stepp, the best ever. Booyah!" How do we split his speech into tokens?

10 Tokenization (cont.)‏ How do we split his speech into tokens? >>> martysSpeech.split()‏ ['You', 'know', 'what', 'I', 'hate?', 'Anybody', 'who', 'drives', 'an', 'S.U.V.', "I'd", 'really', 'like', 'to', 'find', 'Mr.', 'It-Costs-Me-100- Dollars-To-Gas-Up', 'and', 'kick', 'him', 'square', 'in', 'the', 'teeth.', 'Booyah.', 'Be', 'like,', "I'm", 'Marty', 'Stepp,', 'the', 'best', 'ever.', 'Booyah!'] Now, how often does he use the word "booyah"? >>> martysSpeech.split().count("booyah")‏ 0 >>> # What the!

11 Tokenization (cont.)‏ We could lowercase the speech We could write our own method to split on "." split on ",", split on "-", etc. The NLTK already has several tokenizer options Try: nltk.tokenize.WordPunctTokenizer –tokenizes on all punctuation nltk.tokenize.PunktWordTokenizer –trained algorithm to statistically split on words

12 Part-of-speech (POS) tagging If you know a token's POS you know: –is it the subject? –is it the verb? –is it introducing a grammatical structure? –is it a proper name?

13 Part-of-speech (POS) tagging Exercise: most frequent proper noun in the Penn Treebank? –Try: nltk.corpus.treebank Python's dir() to list attributes of an object –Example: >>> dir("hello world!")‏ [..., 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',...]

14 Tuples‏ tagged_words() gives us a list of tuples –tuple: the same thing as a list, but you can't change it –in this case, the tuples are a (word, tag) pairs >>> # Get the (word, tag) pair at list index 0... >>> pair = nltk.corpus.treebank.tagged_words()[0] >>> pair ('Pierre', 'NNP')‏ >>> word = pair[0] >>> tag = pair[1] >>> print word, tag Pierre NNP >>> word, tag = pair # or unpack in 1 line! >>> print word, tag Pierre NNP

15 POS tagging (cont.)‏ How do we tag plain sentences? –A NLTK tagger needs a list of tagged sentences to train on We'll use nltk.corpus.treebank.tagged_sents()‏ –Then it is ready to tag any input! (but how well?)‏ –Try these tagger objects: nltk.UnigramTagger(tagged_sentences)‏ nltk.TrigramTagger(tagged_sentences)‏ –Call the tagger's tag(tokens) method >>> tagger = nltk.UnigramTagger(tagged_sentences)‏ >>> result = tagger.tag(tokens)‏ >>> result [('You', 'PRP'), ('know', 'VB'), ('what', 'WP'), ('I', 'PRP'), ('hate', None), ('?', '.'),...]

16 POS tagging (cont.)‏ Exercise: Mad Libs –I have a passage I want filled with the right parts of speech –Let's use random picks from our own data! –This code will print it out: print properNoun1, "has always been a", adjective1, \ singularNoun, "unlike the", adjective2, \ properNoun2, "who I", pastVerb, "as he was", \ ingVerb, "yesterday."

17 Eliza (NLG)‏ Eliza simulates a Rogerian psychotherapist With while loops and tokenization, you can make a chat bot! –Try: nltk.chat.eliza.eliza_chat()‏

18 Parsing Syntax is as important for a compiler as it is for natural language Realizing the hidden structure of a sentence is useful for: –translation –meaning analysis –relationship analysis –a cool demo! Try: –nltk.draw.rdparser.demo()‏

19 Conclusion NLTK: NLP made easy with Python –Functions and objects for: tokenization, tagging, generation, parsing,... and much more! –Even armed with these tools, NLP has a lot of difficult problems! Also saw: –List methods –dir()‏ –Tuples