I256 Applied Natural Language Processing Fall 2009

I256 Applied Natural Language Processing, Fall 2009
Lecture 5: Word Sense Disambiguation (WSD), Intro to Probability Theory, Graphical Models, Naïve Bayes, Naïve Bayes for WSD
Barbara Rosario

Word Senses
Words have multiple distinct meanings, or senses:
Plant: living plant, manufacturing plant, …
Title: name of a work, ownership document, form of address, material at the start of a film, …
There are many levels of sense distinctions:
Homonymy: totally unrelated meanings (river bank, money bank)
Polysemy: related meanings (star in sky, star on TV, title)
Systematic polysemy: productive meaning extensions (metonymy, such as organizations standing for their buildings) or metaphor
Sense distinctions can be extremely subtle (or not); the granularity of senses needed depends a lot on the task.
Taken from Dan Klein's CS 288 slides

Word Sense Disambiguation
Determine which of the senses of an ambiguous word is invoked in a particular use of the word.
Example: living plant vs. manufacturing plant
How do we tell these senses apart? "Context": The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike.
Maybe it's just text categorization, where each word sense represents a topic.
Why is it important to model and disambiguate word senses?
Translation: bank → banca or riva
Parsing: for PP attachment, for example
Information retrieval: to return documents with the right sense of bank
Adapted from Dan Klein's CS 288 slides

Resources
WordNet: a hand-built (but large) hierarchy of word senses; basically a hierarchical thesaurus.
SensEval: a WSD competition with training / test sets for a wide range of words, difficulties, and parts of speech; a bake-off where lots of labs tried lots of competing approaches.
SemCor: a big chunk of the Brown corpus annotated with WordNet senses.
Other resources: the Open Mind Word Expert, parallel texts.
Taken from Dan Klein's CS 288 slides

Features
Bag-of-words (use the surrounding words, with no order): The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike.
Bag of words = {after, manufacturing, which, labor, …}
Bag-of-words classification works OK for noun senses:
90% on classic, shockingly easy examples (line, interest, star)
80% on Senseval-1 nouns
70% on Senseval-1 verbs
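As an illustration of the bag-of-words feature extraction described above, here is a minimal Python sketch (not from the original slides; the tokenization and the ten-word window are arbitrary choices):

from collections import Counter

def bag_of_words(tokens, target, window=10):
    """Collect the words around `target`, ignoring order and position."""
    i = tokens.index(target)
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return Counter(w.lower() for w in left + right)

sentence = ("The manufacturing plant which had previously sustained the "
            "town's economy shut down after an extended labor strike").split()
print(bag_of_words(sentence, "plant"))
# Counter({'the': 2, 'manufacturing': 1, 'which': 1, 'had': 1, ...})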

Verb WSD
Why are verbs harder? Verbal senses are less topical and more sensitive to structure and argument choice; they are better disambiguated by their arguments (subject, object), so local information matters. For nouns, a wider context is more likely to be useful.
Verb example: "serve"
[function] The tree stump serves as a table
[enable] The scandal served to increase his popularity
[dish] We serve meals for the homeless
[enlist] She served her country
[jail] He served six years for embezzlement
[tennis] It was Agassi's turn to serve
[legal] He was served by the sheriff
Different types of information may be appropriate for different parts of speech.
Adapted from Dan Klein's CS 288 slides

Better features
There are smarter features:
Argument selectional preference: serve NP[meals] vs. serve NP[papers] vs. serve NP[country]
Subcategorization: [function] serve PP[as]; [enable] serve VP[to]; [tennis] serve <intransitive>; [food] serve NP {PP[to]}
These can be captured poorly (but robustly) with local windows… but we can also use a parser and get these features explicitly.
Other constraints (Yarowsky 95):
One sense per discourse
One sense per collocation (pretty reliable when it kicks in: manufacturing plant, flowering plant)
Taken from Dan Klein's CS 288 slides

Various Approaches to WSD
Unsupervised learning: we don't know/have the labels, so the task is discrimination more than disambiguation: cluster the uses into groups and discriminate between these groups without giving them labels. Clustering examples: EM (expectation-maximization), bootstrapping (seeded with some labeled data).
Indirect supervision (see Section 7.3 of the Stat NLP book): from thesauri, from WordNet, from parallel corpora.
Supervised learning
Adapted from Dan Klein's CS 288 slides

Supervised learning
When we know the truth (the true senses), which is not always the case or easy to obtain, WSD is a classification task.
Most systems do some kind of supervised learning.
Many competing classification technologies perform about the same (it's all about the knowledge sources you tap).
Problem: training data is available for only a few words.
Example: Bayesian classification with Naïve Bayes (the simplest example of a graphical model).
(We'll talk more about supervised learning/classification during the course.)
Adapted from Dan Klein's CS 288 slides

Today
Introduction to probability theory
Introduction to graphical models (probability theory plus graph theory)
Naïve Bayes (a simple graphical model)
Naïve Bayes for WSD (a classification task)

Why Probability? Statistical NLP aims to do statistical inference for the field of NLP Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.

Why Probability? Examples of statistical inference are WSD, language modeling (e.g., how to predict the next word given the previous words), topic classification, etc. In order to do this, we need a model of the language. Probability theory helps us find such a model.

Probability Theory
Probability measures how likely it is that something will happen.
The sample space Ω is the set of all possible outcomes of an experiment; it can be continuous or discrete. For language applications it's discrete (e.g., words).
An event A is a subset of Ω.
Probability function (or distribution)
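The formula on the original slide is not reproduced in this transcript; the standard definition is that a probability function P assigns a number P(A) in [0, 1] to every event A, with P(Ω) = 1. A concrete example (not on the slide): for a fair die, Ω = {1, 2, 3, 4, 5, 6}, the event A = "the roll is even" = {2, 4, 6}, and P(A) = 3/6 = 1/2.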

(Several background slides follow in the original deck, shown only as images; source: http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf)

Prior Probability Prior probability: the probability before we consider any additional knowledge

Conditional probability Sometimes we have partial knowledge about the outcome of an experiment Conditional (or Posterior) Probability Suppose we know that event B is true The probability that A is true given the knowledge about B is expressed by
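The defining formula, presumably shown on the original slide, is the standard one: P(A | B) = P(A ∩ B) / P(B), for P(B) > 0.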

Conditional probability (cont.)
Note: P(A,B) = P(A ∩ B)
Chain Rule:
P(A, B) = P(A|B) P(B): the probability that A and B both happen is the probability that B happens times the probability that A happens given that B has occurred.
P(A, B) = P(B|A) P(A): the probability that A and B both happen is the probability that A happens times the probability that B happens given that A has occurred.
The joint distribution can be thought of as a multi-dimensional table with a value in every cell giving the probability of that specific state occurring.

Chain Rule
P(A,B) = P(A|B) P(B) = P(B|A) P(A)
P(A,B,C,D,…) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) …

Chain Rule → Bayes' rule
P(A,B) = P(A|B) P(B) = P(B|A) P(A)
Bayes' rule is useful when one quantity is easier to calculate than the other; it is a trivial consequence of the definitions we just saw, but it's extremely useful.
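The rule itself (shown as a formula on the original slide, not reproduced in the transcript) is the standard one: P(A | B) = P(B | A) P(A) / P(B), obtained by equating the two factorizations of P(A,B) above and dividing by P(B).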

Bayes' rule Bayes' rule translates causal knowledge into diagnostic knowledge. For example, if A is the event that a patient has a disease, and B is the event that she displays a symptom, then P(B | A) describes a causal relationship, and P(A | B) describes a diagnostic one (that is usually hard to assess). If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B) for free. The Chain Rule is used in many places in Stat NLP such as Markov Model

Example
S: stiff neck, M: meningitis
P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
I have a stiff neck; should I worry?
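A worked answer using Bayes' rule (the numbers are from the slide; the calculation is the standard one for this textbook example): P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002. So even with a stiff neck, the probability of meningitis is only about 1 in 5,000: probably not worth worrying about.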

(Conditional) independence
Two events A and B are independent of each other if P(A) = P(A|B)
Two events A and B are conditionally independent of each other given C if P(A|C) = P(A|B,C)
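A quick illustration (not on the slide): if A = "the word bank occurs in a document" and B = "the word river occurs in the same document", independence would require P(A) = P(A | B); in real text these events are clearly dependent, which is exactly the kind of structure later models either capture or deliberately ignore.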

Back to language
Statistical NLP aims to do statistical inference for the field of NLP:
Topic classification: P( topic | document )
Language models: P( word | previous word(s) )
WSD: P( sense | word )
Two main problems:
Estimation: P is unknown; estimate P
Inference: we have estimated P; now we want to find (infer) the topic of a document, or the sense of a word

Language Models (Estimation)
In general, for language events, P is unknown.
We need to estimate P (or a model M of the language).
We'll do this by looking at evidence about what P must be, based on a sample of data.

Estimation of P
Two different approaches, two different philosophies:
Frequentist statistics: parametric or non-parametric (distribution-free).
Bayesian statistics: measures degrees of belief, which are calculated by starting with prior beliefs and updating them in the face of the evidence, using Bayes' theorem.

Inference
The central problem of computational probability theory is the inference problem: given a set of random variables X1, …, Xk and their joint density P(X1, …, Xk), compute one or more conditional densities given observations. For example, compute P(X1 | X2, …, Xk), P(X3 | X1), P(X1, X2 | X3, X4), etc. Many problems can be formulated in these terms.

Bayes decision rule
w: an ambiguous word
S = {s1, s2, …, sn}: the senses of w
C = {c1, c2, …, cn}: the contexts of w in a corpus
V = {v1, v2, …, vj}: the words used as contextual features for disambiguation
Bayes decision rule: decide sj if P(sj | c) > P(sk | c) for all sk ≠ sj
That is, we want to assign w to the sense s' where s' = argmaxsk P(sk | c)

Bayes classification for WSD We want to assign w to the sense s’ where s’ = argmaxsk P(sk | c) We usually do not know P(sk | c) but we can compute it using Bayes rule
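The application of Bayes' rule (presumably shown as a formula on the slide) is: P(sk | c) = P(c | sk) P(sk) / P(c). Since P(c) is the same for every sense, it can be dropped from the maximization, giving s' = argmaxsk P(c | sk) P(sk).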

Naïve Bayes classifier
The Naïve Bayes classifier is widely used in machine learning.
Estimate P(c | sk) and P(sk).

Naïve Bayes classifier Estimate P(c | sk) and P(sk) w: ambiguous word S = {s1, s2, …, sn } senses for w C = {c1, c2, …, cn } context of w in a corpus V = {v1, v2, …, vj } words used as contextual features for disambiguation Naïve Bayes assumption:
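The assumption, written out (a standard formulation; the slide presumably showed it as a formula): P(c | sk) = P({vj : vj ∈ c} | sk) ≈ ∏j P(vj | sk), i.e. each contextual feature vj is treated as conditionally independent of the others given the sense sk.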

Naïve Bayes classifier
The Naïve Bayes assumption has two consequences:
All the structure and linear ordering of words within the context is ignored: a bag-of-words model.
The presence of one word in the model is independent of the others.
This is not true, but it makes the model "easier" and very "efficient"; "easier" and "efficient" mean something specific in the probabilistic framework, as we'll see later (easier to estimate the parameters and more efficient inference).
The Naïve Bayes assumption is inappropriate if there are strong dependencies, but it often does very well (partly because the decision may be optimal even if the assumption is not correct).

Naïve Bayes for WSD
Bayes decision rule
Naïve Bayes assumption
Estimation: the count of vj when the sense is sk, and the prior probability of sk
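The estimation formulas implied by these labels (the slide showed them as images; these are the standard maximum-likelihood estimates): P(vj | sk) = C(vj, sk) / C(sk), the number of times feature vj occurs in contexts of sense sk divided by the total number of features seen with sk, and P(sk) = C(sk) / C(w), the fraction of the occurrences of w that are labelled with sense sk.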

Naïve Bayes Algorithm for WSD
TRAINING (aka Estimation)
For all senses sk of w do
  For all words vj in the vocabulary, calculate P(vj | sk)
end
For all senses sk of w, calculate the prior P(sk)

Naïve Bayes Algorithm for WSD
TESTING (aka Inference or Disambiguation)
For all senses sk of w do
  For all words vj in the context window c, calculate the score of sk
end
Choose s' = the sense sk of w with the highest score
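A compact Python sketch of the training and disambiguation steps just described; this is an illustration, not the course's code. It assumes the labelled data comes as (context words, sense) pairs, and the add-one smoothing is an implementation choice not discussed on the slides.

import math
from collections import Counter, defaultdict

def train_naive_bayes(instances):
    """instances: list of (context_words, sense) pairs for one ambiguous word w."""
    sense_counts = Counter(sense for _, sense in instances)
    word_counts = defaultdict(Counter)          # C(v_j, s_k)
    vocab = set()
    for context, sense in instances:
        word_counts[sense].update(context)
        vocab.update(context)
    priors = {s: c / len(instances) for s, c in sense_counts.items()}   # P(s_k)
    likelihoods = {}                            # P(v_j | s_k), with add-one smoothing
    for sense in sense_counts:
        total = sum(word_counts[sense].values())
        likelihoods[sense] = {v: (word_counts[sense][v] + 1) / (total + len(vocab))
                              for v in vocab}
    return priors, likelihoods, vocab

def disambiguate(context, priors, likelihoods, vocab):
    """Choose s' = argmax_k  log P(s_k) + sum_j log P(v_j | s_k)."""
    scores = {}
    for sense, prior in priors.items():
        score = math.log(prior)
        for v in context:
            if v in vocab:                      # ignore unseen features
                score += math.log(likelihoods[sense][v])
        scores[sense] = score
    return max(scores, key=scores.get)

# Toy usage with two senses of "plant"
data = [(["water", "grow", "leaf"], "living"),
        (["worker", "labor", "strike"], "factory")]
priors, likelihoods, vocab = train_naive_bayes(data)
print(disambiguate(["labor", "strike"], priors, likelihoods, vocab))    # factory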

Next week
Introduction to Graphical Models
Part-of-speech tagging
Readings: Chapter 5 of the NLTK book; Chapter 10 of Foundations of Statistical NLP