CS188 Guest Lecture: Statistical Natural Language Processing. Prof. Marti Hearst, School of Information Management & Systems, www.sims.berkeley.edu/~hearst


1 CS188 Guest Lecture: Statistical Natural Language Processing Prof. Marti Hearst School of Information Management & Systems

2 School of Information Management & Systems

3 SIMS: Information economics and policy, Sociology of information, Human-computer interaction, Information assurance, Information design and architecture

4 How do we Automatically Analyze Human Language? The answer is … forget all that logic and inference stuff you’ve been learning all semester! Instead, we do something entirely different. Gather HUGE collections of text, and compute statistics over them. This allows us to make predictions. A VERY simple algorithm with a VERY large text collection nearly always does better than a smart algorithm built by knowledge engineering.

5 Statistical Natural Language Processing Chapter 23 of the textbook Prof. Russell said it won’t be on the final Today: 3 Applications Author Identification Speech Recognition (language models) Spelling Correction

6 Slide adapted from Fred S. Roberts Author Identification: Problem Variations 1. Disputed authorship (choose among k known authors) 2. Document pair analysis: Were two documents written by the same author? 3. Odd-person-out: Were these documents written by one of this set of authors or by someone else? 4. Clustering of “putative” authors (e.g., internet handles: termin8r, heyr, KaMaKaZie)

7 Slide adapted from Glenn Fung The Federalist Papers Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the Constitution. The papers were short essays, 900 to 3500 words in length. The authorship of 12 of the papers has been in dispute (Madison or Hamilton); these are referred to as the disputed Federalist papers.

8 Stylometry The use of metrics of literary style to analyze texts: sentence length, paragraph length, punctuation, density of parts of speech, vocabulary. Mosteller & Wallace, 1964: the Federalist papers problem. Used Naïve Bayes and 30 “marker” words more typical of one or the other author. Concluded the disputed documents were written by Madison.
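To make the Mosteller & Wallace idea concrete, here is a minimal sketch (not their actual model or marker-word list) of Naïve Bayes authorship scoring over a handful of hypothetical marker words; the training texts and counts below are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical marker words; Mosteller & Wallace used about 30 function words.
MARKERS = ["upon", "while", "whilst", "enough", "on"]

def marker_counts(text):
    """Count occurrences of each marker word in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in MARKERS), len(tokens)

def log_likelihood(text, author_texts):
    """Naive Bayes log-likelihood of `text` under an author's marker-word rates
    (add-one smoothing so unseen markers don't zero out the score)."""
    counts, _ = marker_counts(text)
    author_counts, total = Counter(), 0
    for t in author_texts:
        c, n = marker_counts(t)
        author_counts += c
        total += n
    score = 0.0
    for w in MARKERS:
        rate = (author_counts[w] + 1) / (total + len(MARKERS))
        score += counts[w] * math.log(rate)
    return score

# Toy usage: attribute a disputed essay to whichever author scores higher.
hamilton_essays = ["there is no liberty ... upon the whole ..."]
madison_essays = ["whilst the people retain their authority ..."]
disputed = "upon reflection, whilst the states ..."
guess = ("Hamilton" if log_likelihood(disputed, hamilton_essays)
         > log_likelihood(disputed, madison_essays) else "Madison")
print(guess)
```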

9 Slide adapted from Glenn Fung An Alternative Method (Fung): Find a separating hyperplane based on just 3 words: to, upon, would. All disputed papers end up on the Madison side of the plane.

10 Slide adapted from Glenn Fung

11 Slide adapted from Fred S. Roberts Features for Author ID Typically seek a small number of textual characteristics that distinguish the texts of authors (Burrows, Holmes, Binongo, Hoover, Mosteller & Wallace, McMenamin, Tweedie, etc.) Typically use “function words” (a, with, as, were, all, would, etc.) followed by analysis Function words are “topic-independent” However, Hoover (2003) shows that using all high-frequency words does a better job than function words alone.

12 Slide adapted from Fred S. Roberts Idiosyncratic Features Idiosyncratic usage (misspellings, repeated neologisms, etc.) is apparently also useful. For example, Foster’s unmasking of Klein as the author of “Primary Colors”: “Klein and Anonymous loved unusual adjectives ending in -y and -inous: cartoony, chunky, crackly, dorky, snarly,…, slimetudinous, vertiginous, …” “Both Klein and Anonymous added letters to their interjections: ahh, aww, naww.” “Both Klein and Anonymous loved to coin words beginning in hyper-, mega-, post-, quasi-, and semi- more than all others put together” “Klein and Anonymous use “riffle” to mean rifle or rustle, a usage for which the OED provides no instance in the past thousand years”

13 Language Modeling A fundamental concept in NLP Main idea: For a given language, some words are more likely than others to follow each other; that is, you can predict (with some degree of accuracy) the probability that, given a word, a particular other word will follow it.

14 Adapted from slide by Bonnie Dorr Next Word Prediction From a NY Times story... Stocks... Stocks plunged this …. Stocks plunged this morning, despite a cut in interest rates Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall... Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began

15 Adapted from slide by Bonnie Dorr Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last … Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.

16 Adapted from slide by Bonnie Dorr Next Word Prediction Clearly, we have the ability to predict future words in an utterance to some degree of accuracy. How? Domain knowledge Syntactic knowledge Lexical knowledge Claim: A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)

17 Adapted from slide by Bonnie Dorr Applications of Language Models Why do we want to predict a word, given some preceding words? Rank the likelihood of sequences containing various alternative hypotheses, –e.g. for spoken language recognition Theatre owners say unicorn sales have doubled... Theatre owners say popcorn sales have doubled... Assess the likelihood/goodness of a sentence –for text generation or machine translation. The doctor recommended a cat scan. El doctor recomendó una exploración del gato.

18 Adapted from slide by Bonnie Dorr N-Gram Models of Language Use the previous N-1 words in a sequence to predict the next word Language Model (LM) unigrams, bigrams, trigrams,… How do we train these models? Very large corpora
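A minimal sketch of the idea, assuming a toy corpus in place of a very large one: train a bigram model by counting which words follow which, then predict the most frequent follower.

```python
from collections import defaultdict, Counter

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training corpus."""
    following = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<start>"] + sent.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent word observed after `word`, or None if unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

# Toy corpus standing in for a very large one.
corpus = ["stocks plunged this morning",
          "stocks plunged this afternoon",
          "stocks rose this morning"]
model = train_bigrams(corpus)
print(predict_next(model, "plunged"))   # -> 'this'
print(predict_next(model, "this"))      # -> 'morning'
```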

19 Notation P(unicorn) Read this as “The probability of seeing the token unicorn” P(unicorn|mythical) Called the Conditional Probability. Read this as “The probability of seeing the token unicorn given that you’ve seen the token mythical”

20 Adapted from slide by Bonnie Dorr Speech Recognition Example From BeRP: The Berkeley Restaurant Project (Jurafsky et al.) A testbed for a Speech Recognition project System prompts user for information in order to fill in slots in a restaurant database. –Type of food, hours open, how expensive After getting lots of input, can compute how likely it is that someone will say X given that they already said Y. P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)

21 Adapted from slide by Bonnie Dorr A Bigram Grammar Fragment from BeRP
Eat on .16 | Eat some .06 | Eat lunch .06 | Eat dinner .05
Eat at .04 | Eat a .04 | Eat Indian .04 | Eat today .03
Eat breakfast .03 | Eat Thai .03 | Eat Mexican .02 | Eat Chinese .02
Eat in .02 | Eat tomorrow .01 | Eat dessert .007 | Eat British .001

22 Adapted from slide by Bonnie Dorr A Bigram Grammar Fragment from BeRP (continued)
<start> I .25 | <start> I’d .06 | <start> Tell .04 | <start> I’m .02
I want .32 | I would .29 | I don’t .08 | I have .04
Want to .65 | Want a .05 | Want some .04 | Want Thai .01
To eat .26 | To have .14 | To spend .09 | To be .02
British food .60 | British restaurant .15 | British cuisine .01 | British lunch .01

23 Adapted from slide by Bonnie Dorr P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 * .32 * .65 * .26 * .001 * .60 ≈ .0000081, vs. P(I want to eat Chinese food), which works out much higher (P(Chinese|eat) = .02 vs. P(British|eat) = .001). Probabilities seem to capture “syntactic” facts and “world knowledge”: eat is often followed by an NP; British food is not too popular. N-gram models can be trained by counting and normalization.
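Here is a rough sketch of “counting and normalization” and of scoring a sentence as a product of bigram probabilities. It uses unsmoothed maximum-likelihood estimates on a toy corpus invented for illustration; real language models smooth the counts and work in log space.

```python
from collections import defaultdict, Counter
import math

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram probabilities: count(w1 w2) / count(w1)."""
    pair_counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<start>"] + sent.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[w1][w2] += 1
    probs = {}
    for w1, followers in pair_counts.items():
        total = sum(followers.values())
        for w2, c in followers.items():
            probs[(w1, w2)] = c / total          # normalization step
    return probs

def sentence_logprob(probs, sentence):
    """Sum of log bigram probabilities; -inf if any bigram was never seen."""
    tokens = ["<start>"] + sentence.lower().split()
    logp = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p = probs.get((w1, w2), 0.0)
        if p == 0.0:
            return float("-inf")
        logp += math.log(p)
    return logp

# Toy usage
corpus = ["i want to eat chinese food",
          "i want to eat thai food",
          "i want to eat british food"]
lm = train_bigram_lm(corpus)
print(sentence_logprob(lm, "i want to eat chinese food"))
```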

24 Spelling Correction How to do it? Standard approach Rely on a dictionary for comparison Assume a single “point change” –Insertion, deletion, transposition, substitution –Don’t handle word substitution Problems Might guess the wrong correction Dictionary not comprehensive –Shrek, Britney Spears, nsync, p53, ground zero May spell the word right but use it in the wrong place –principal, principle –read, red
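The standard dictionary approach can be sketched roughly as follows; the tiny word list stands in for a real dictionary, and only single point changes are generated, as described above.

```python
import string

DICTIONARY = {"donald", "duck", "food", "wood", "principal", "principle"}  # toy stand-in

def one_edit_variants(word):
    """All strings one point change away: insertion, deletion, substitution, transposition."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

def correct(word):
    """Return the word itself if known, else a dictionary word one edit away."""
    if word in DICTIONARY:
        return word
    candidates = one_edit_variants(word) & DICTIONARY
    return min(candidates) if candidates else word   # arbitrary tie-break for the sketch

print(correct("fod"))       # -> 'food'
print(correct("pricipal"))  # -> 'principal'
```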

25 New Approach: Use Search Engine Query Logs! Leverage the mistakes and corrections that millions of other people have already made!

26 Spelling Correction via Query Logs Cucerzan and Brill ‘04 Main idea: Iteratively transform the query into other strings that correspond to more likely queries. Use statistics from query logs to determine likelihood. –Despite the fact that many of these are misspelled –Assume that the less wrong a misspelling is, the more frequent it is, and correct > incorrect Example: ditroitigers -> detroittigers -> detroit tigers
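A heavily simplified sketch of the iterative idea (not Cucerzan and Brill’s actual system): repeatedly move to a nearby string that is more frequent in the query logs until no better neighbor exists. The log counts and the candidate generator below are hypothetical stand-ins.

```python
def iterative_correct(query, query_log_counts, candidates_fn, max_rounds=5):
    """Greedily move toward strings that are close to the current query
    but more frequent in the query logs (toy version of the iterative idea)."""
    current = query
    for _ in range(max_rounds):
        neighbors = candidates_fn(current)               # strings a small edit away
        best = max(neighbors | {current},
                   key=lambda q: query_log_counts.get(q, 0))
        if best == current:                              # no more-frequent neighbor: stop
            break
        current = best
    return current

# Toy usage with hypothetical log counts and a hand-written candidate generator.
log_counts = {"ditroitigers": 2, "detroittigers": 40, "detroit tigers": 5000}
neighbors = {
    "ditroitigers": {"detroittigers"},
    "detroittigers": {"detroit tigers"},
    "detroit tigers": set(),
}
print(iterative_correct("ditroitigers", log_counts, lambda q: neighbors.get(q, set())))
# -> 'detroit tigers'
```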

27 Spelling Correction via Query Logs (Cucerzan and Brill ’04)

28 Spelling Correction Algorithm Algorithm: Compute the set of all possible alternatives for each word in the query –Look at word unigrams and bigrams from the logs –This handles concatenation and splitting of words Find the best possible alternative string to the input –Do this efficiently with a modified Viterbi algorithm Constraints: No 2 adjacent in-vocabulary words can change simultaneously Short queries have further (unstated) restrictions In-vocabulary words can’t be changed in the first round of iteration
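For intuition, here is a plain (unconstrained) Viterbi search over per-word alternatives scored by hypothetical bigram log counts; the paper’s modified Viterbi adds the constraints listed above and other changes not shown here.

```python
import math

def best_alternative(word_alternatives, bigram_logprob):
    """Pick the sequence of per-word alternatives with the highest total
    bigram score, using standard Viterbi dynamic programming."""
    prev_scores = {"<start>": 0.0}
    backpointers = []
    for candidates in word_alternatives:
        scores, back = {}, {}
        for cand in candidates:
            best_prev = max(prev_scores,
                            key=lambda p: prev_scores[p] + bigram_logprob(p, cand))
            scores[cand] = prev_scores[best_prev] + bigram_logprob(best_prev, cand)
            back[cand] = best_prev
        backpointers.append(back)
        prev_scores = scores
    word = max(prev_scores, key=prev_scores.get)   # best final word
    path = [word]
    for back in reversed(backpointers[1:]):        # follow backpointers to the start
        word = back[word]
        path.append(word)
    return list(reversed(path))

# Toy usage with hypothetical log-derived bigram counts.
def toy_bigram_logprob(prev, word):
    counts = {("<start>", "britney"): 50, ("<start>", "brittany"): 5,
              ("britney", "spears"): 40, ("brittany", "spears"): 1}
    return math.log(counts.get((prev, word), 0) + 1)   # add-one to avoid log(0)

alts = [["britney", "brittany"], ["spears", "spars"]]
print(best_alternative(alts, toy_bigram_logprob))      # -> ['britney', 'spears']
```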

29 Spelling Correction Algorithm Comparing string similarity Damerau-Levenshtein edit distance: –The minimum number of point changes required to transform a string into another Trading off distance function leniency: A rule that allows only one letter change can’t fix: –dondal duck -> donald duck A too permissive rule makes too many errors: –log wood -> dog food Actual measure: “A modified context-dependent weighted Damerau-Levenshtein edit function” –Point changes: insertion, deletion, substitution, immediate transpositions, long-distance movement of letters –“Weights interactively refined using statistics from query logs”
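For reference, the plain, unweighted Damerau-Levenshtein distance with adjacent transpositions can be computed as below; the paper’s actual measure is weighted, context-dependent, and also allows long-distance movement of letters, which this sketch does not.

```python
def damerau_levenshtein(a, b):
    """Minimum number of insertions, deletions, substitutions, and adjacent
    transpositions needed to turn string a into string b (restricted variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("teh", "the"))          # 1 (one adjacent transposition)
print(damerau_levenshtein("dondal", "donald"))    # 2 (more than one point change)
```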

30 Spelling Correction Evaluation Emphasizing coverage 1044 randomly chosen queries Annotated by two people (91.3% agreement) 180 misspelled; annotators provided corrections 81.1% system agreement with annotators –131 false positives, e.g. 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja zx6r –156 suggestions for the misspelled queries 2 iterations were sufficient for most corrections Problem: annotators were guessing user intent

31 Spell Checking: Summary Can use the collective knowledge stored in query logs Works pretty well despite the noisiness of the data Exploits the errors made by people Might be further improved to incorporate text from other domains

32 Other Search Engine Applications Many other applications of these techniques involve search engines and related topics. One more example … automatic synonym and related word generation.

33 Synonym Generation

34 Synonym Generation

35 Synonym Generation

36 Speaking of Search Engines … Introducing a New Course! Search Engines: Technology, Society, and Business IS141 (2 units) Mondays 4-6pm + 1hr section CCN No prerequisites

37 A Great Line-up of World-Class Experts!

38 A Great Line-up of World-Class Experts!

39 Thank you! Prof. Marti Hearst School of Information Management & Systems