1 I256: Applied Natural Language Processing Marti Hearst August 28, 2006.

Presentation transcript:

1 I256: Applied Natural Language Processing Marti Hearst August 28, 2006

2 Today Motivation: SIMS student projects Course Goals Why NLP is difficult How to solve it? Corpus-based statistical approaches What we’ll do in this course

3 ANLP Motivation: SIMS Masters Projects HomeSkim (2005) Chan, Lib, Mittal, Poon Apartment search mashup Extracted fields from Craigslist listings Orpheus (2004) Maury, Viswanathan, Yang Tool for discovering new and independent recording artists Extracted artists, links, reviews from music websites Breaking Story (2002) Reffell, Fitzpatrick, Aydelott Summarize trends in news feeds Categories and entities assigned to all news articles

6 HomeSkim Craigslist Analysis

11 Goals of this Course Learn about the problems and possibilities of natural language analysis: What are the major issues? What are the major solutions? –How well do they work? –How do they work (but to a lesser extent than CS )? At the end you should: Agree that language is subtle and interesting! Feel some ownership over the algorithms Be able to assess NLP problems –Know which solutions to apply when, and how Be able to read papers in the field

12 Today Motivation: SIMS student projects Course Goals Why NLP is difficult. How to solve it? Corpus-based statistical approaches. What we’ll do in this course.

13 We’ve passed the year 2001, but we are not close to realizing the dream (or nightmare …)

14 Dave Bowman: “Open the pod bay doors, HAL.” HAL 9000: “I’m sorry, Dave. I’m afraid I can’t do that.”

15 Why is NLP difficult? Computers are not brains There is evidence that much of language understanding is built into the human brain Computers do not socialize Much of language is about communicating with people Key problems: Representation of meaning Language presupposes knowledge about the world Language only reflects the surface of meaning Language presupposes communication between people

16 Adapted from Robert Berwick's 6.863J Hidden Structure English plural pronunciation Toy + s → toyz; add z Book + s → books; add s Church + s → churchiz; add iz Box + s → boxiz; add iz Sheep + s → sheep; add nothing What about new words? Bach + ’s → boxs; why not boxiz?
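
The plural pattern on this slide can be sketched as a rule keyed on the final sound of the noun rather than its spelling. The following is a minimal Python sketch, not code from the lecture: the function name `plural_suffix`, the rough sound labels, and the tiny exception table are all assumptions made for illustration. A real system would work from a phonemic transcription.

```python
# Hypothetical sketch: choose the spoken plural ending from the noun's final sound.
# Sounds are rough phoneme labels supplied by the caller, not spellings (an assumption).

SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}   # hissing sounds: add "iz"
VOICELESS = {"p", "t", "k", "f", "th", "x"}     # voiceless non-sibilants: add "s"
                                                # "x" stands for the final sound of "Bach"
IRREGULAR = {"sheep": ""}                       # memorized forms: add nothing


def plural_suffix(word, final_sound):
    """Return the pronounced plural ending for a noun."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if final_sound in SIBILANTS:
        return "iz"
    if final_sound in VOICELESS:
        return "s"
    return "z"  # vowels and voiced consonants


for word, sound in [("toy", "oy"), ("book", "k"), ("church", "ch"),
                    ("box", "s"), ("sheep", "p"), ("Bach", "x")]:
    print(word, "+", repr(plural_suffix(word, sound)))
```

Under these assumptions, "Bach" gets "s" rather than "iz" because its final sound is not a sibilant, even though it is spelled with "ch" like "church".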

17 Language subtleties Adjective order and placement A big black dog A big black scary dog A big scary dog A scary big dog A black big dog Antonyms Which sizes go together? –Big and little –Big and small –Large and small –Large and little

18 Adapted from Robert Berwick's 6.863J World Knowledge is subtle He arrived at the lecture. He chuckled at the lecture. He arrived drunk. He chuckled drunk. He chuckled his way through the lecture. He arrived his way through the lecture.

19 Adapted from Robert Berwick's 6.863J Words are ambiguous (have multiple meanings) I know that. I know that block. I know that blocks the sun. I know that block blocks the sun.

20 How can a machine understand these differences? Get the cat with the gloves.

21 How can a machine understand these differences? Get the sock from the cat with the gloves. Get the glove from the cat with the socks.

22 How can a machine understand these differences? Decorate the cake with the frosting. Decorate the cake with the kids. Throw out the cake with the frosting. Throw out the cake with the kids.

23 Adapted from Robert Berwick's 6.863J Headline Ambiguity Iraqi Head Seeks Arms Juvenile Court to Try Shooting Defendant Teacher Strikes Idle Kids Kids Make Nutritious Snacks British Left Waffles on Falkland Islands Red Tape Holds Up New Bridges Bush Wins on Budget, but More Lies Ahead Hospitals are Sued by 7 Foot Doctors

24 The Role of Memorization Children learn words quickly Around age two they learn about 1 word every 2 hours. (Or 9 words/day) Often only need one exposure to associate meaning with word –Can make mistakes, e.g., overgeneralization “I goed to the store.” Exactly how they do this is still under study Adult vocabulary Typical adult: about 60,000 words Literate adults: about twice that.

25 The Role of Memorization Dogs can do word association too! Rico, a border collie in Germany Knows the names of each of 100 toys Can retrieve items called out to him with over 90% accuracy. Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child.

26 Adapted from Robert Berwick's 6.863J But there is too much to memorize! establish; establishment (the Church of England as the official state church); disestablishment; antidisestablishment; antidisestablishmentarian; antidisestablishmentarianism (a political philosophy that is opposed to the separation of church and state).

27 Rules and Memorization Current thinking in psycholinguistics is that we use a combination of rules and memorization However, this is very controversial Mechanism: If there is an applicable rule, apply it However, if there is a memorized version, that takes precedence. (Important for irregular words.) –Artists paint “still lifes”, not “still lives” –Past tense of think → thought, but blink → blinked This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.
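
The mechanism on this slide (a memorized form takes precedence, otherwise the rule applies) can be sketched in a few lines of Python. This is an illustrative sketch only: the names `regular_past`, `past_tense`, and `IRREGULAR_PAST` are hypothetical, the lexicon is a two-entry toy, and the spelling rule ignores cases such as consonant doubling.

```python
# Hypothetical sketch of "memorized form wins, otherwise apply the rule".

IRREGULAR_PAST = {"think": "thought", "go": "went"}  # tiny illustrative lexicon


def regular_past(verb):
    """Apply the default -ed rule (simplified: no consonant doubling, no y -> ied)."""
    if verb.endswith("e"):
        return verb + "d"       # bake -> baked
    return verb + "ed"          # blink -> blinked


def past_tense(verb, lexicon=IRREGULAR_PAST):
    """Return the memorized form if there is one, else fall back to the rule."""
    if verb in lexicon:         # memorization takes precedence
        return lexicon[verb]
    return regular_past(verb)


print(past_tense("think"))            # memorized form
print(past_tense("blink"))            # rule applies
print(past_tense("go", lexicon={}))   # no memorized form available yet
```

With an empty lexicon the rule alone produces "goed", the same overgeneralization children make in "I goed to the store."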

28 Representation of Meaning I know that block blocks the sun. How do we represent the meanings of “block”? How do we represent “I know”? How does that differ from “I know that.”? Who is “I”? How do we indicate that we are talking about earth’s sun vs. some other planet’s sun? When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?

29 How to tackle these problems? The field was stuck for quite some time. A new approach started around 1990 Well, not really new, but the first time around, in the 50’s, they didn’t have the text, disk space, or GHz Main idea: combine memorizing and rules How to do it: Get large text collections (corpora) Compute statistics over the words in those collections Surprisingly effective Even better now with the Web
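
As a rough illustration of "compute statistics over the words in those collections", the sketch below counts single words and adjacent word pairs. It is a minimal example under stated assumptions: the three-sentence corpus is borrowed from earlier slides purely as a stand-in for a large text collection, and the regular-expression tokenizer is a crude simplification.

```python
import re
from collections import Counter

# Toy "corpus": a few sentences from these slides. A real corpus would be
# millions of words read from files.
corpus = [
    "I know that block blocks the sun.",
    "Decorate the cake with the frosting.",
    "Throw out the cake with the kids.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


unigrams = Counter()   # how often each word occurs
bigrams = Counter()    # how often each pair of adjacent words occurs
for sentence in corpus:
    tokens = tokenize(sentence)
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

print(unigrams.most_common(3))
print(bigrams[("the", "cake")])
```

Counts like these are the raw material for the corpus-based methods on the following slides; with only three sentences they mean little, which is why collection size matters.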

30 Example Problem Grammar checker example: Which word to use, “principal” or “principle”? Solution: look at which words surround each use: I am in my third year as the principal of Anamosa High School. School-principal transfers caused some upset. This is a simple formulation of the quantum mechanical uncertainty principle. Power without principle is barren, but principle without power is futile. (Tony Blair)

31 Using Very, Very Large Corpora Keep track of which words are the neighbors of each spelling in well-edited text, e.g.: Principal: “high school” Principle: “rule” At grammar-check time, choose the spelling best predicted by the surrounding words. Surprising results: Log-linear improvement even to a billion words! Getting more data is better than fine-tuning algorithms!
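
The neighbor-tracking idea can be sketched as follows. This is a minimal Python illustration, not the method evaluated by Banko & Brill: the four "principal"/"principle" example sentences from the previous slide stand in for well-edited text, the scoring is a plain overlap count, and the names `neighbors`, `neighbor_counts`, and `choose_spelling` are hypothetical.

```python
import re
from collections import Counter, defaultdict

CONFUSION_SET = ("principal", "principle")

# Stand-in for "well-edited text": the example sentences from the previous slide.
TRAINING = [
    "I am in my third year as the principal of Anamosa High School.",
    "School-principal transfers caused some upset.",
    "This is a simple formulation of the quantum mechanical uncertainty principle.",
    "Power without principle is barren, but principle without power is futile.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def neighbors(tokens, i, window=4):
    """Words within `window` positions of index i, excluding position i itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


# neighbor_counts[spelling][word] = how often `word` appeared near `spelling`
neighbor_counts = defaultdict(Counter)
for sentence in TRAINING:
    tokens = tokenize(sentence)
    for i, tok in enumerate(tokens):
        if tok in CONFUSION_SET:
            neighbor_counts[tok].update(neighbors(tokens, i))


def choose_spelling(sentence, window=4):
    """Pick the spelling whose training neighbors best match the context.

    Assumes the sentence contains one word from CONFUSION_SET; only the first
    such word is checked. Ties go to the first entry in CONFUSION_SET.
    """
    tokens = tokenize(sentence)
    i = next(k for k, t in enumerate(tokens) if t in CONFUSION_SET)
    context = neighbors(tokens, i, window)
    scores = {w: sum(neighbor_counts[w][c] for c in context) for w in CONFUSION_SET}
    return max(scores, key=scores.get)


print(choose_spelling("The principle of the high school retired."))
print(choose_spelling("She would not compromise on a principal without power."))
```

With so little training text, common words such as "the" carry as much weight as informative ones such as "school". Much larger corpora, as the slide argues, are what make simple counts like these reliable.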

32 The Effects of LARGE Datasets From Banko & Brill ‘01

33 Adapted from Robert Berwick's 6.863J Real-World Applications of NLP Spelling Suggestions/Corrections Grammar Checking Synonym Generation Information Extraction Text Categorization Automated Customer Service Speech Recognition (limited) Machine Translation In the (near?) future: Question Answering Improving Web Search Engine results Automated Metadata Assignment Online Dialogs

34 Automatic Help Desk Translation at Microsoft

35 Synonym Generation

36 Synonym Generation

37 What We’ll Do in this Course Read research papers and tutorials Use NLTK-lite (Natural Language ToolKit) to try out various algorithms Some homeworks will be to do some NLTK exercises We’ll do some of this in class Adopt a large text collection Use a wide range of NLP techniques to process it Work together to build a useful resource. Final project Either extend work on the collection we’ve been using, or choose from some suggestions I provide. Your own idea only if I think it is very likely to work well.

38 Assignment for Thursday Load Python and NLTK-lite onto your computers Read Chapter 1 of Jurafsky & Martin Read NLTK-lite tutorial sections

39 Python A terrific programming language Interpreted Object-oriented Easy to interface to other things (web, DBMS, Tk) Good stuff from: Java, Lisp, Tcl, Perl Easy to learn –I learned it this summer by reading Learning Python FUN! Assignment for Thursday: Load Python and NLTK-lite onto your computers Read Chapter 1 of Jurafsky & Martin Read NLTK-lite tutorial Chapter 2 sections

40 Questions?