CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics: Intro to Probability and Information Theory. Lecture 5. Giuseppe Carenini. CPSC 503, Spring 2004.

Today (28/1): Why do we need probabilities and information theory? Basic probability theory. Basic information theory. We cover the topics that matter most in the field, not just the ones used in the textbook. The overview is not complete: the aim is to clarify the key concepts and address the most common misunderstandings.

Why do we need probabilities? Spelling errors: what is the most probable correct word? Real-word spelling errors, speech and handwriting recognition: what is the most probable next word? Part-of-speech tagging, word-sense disambiguation, probabilistic parsing. Basic question: what is the probability of a sequence of words (e.g., of a sentence)?

Disambiguation Tasks. Example: "I made her duck". Part-of-speech tagging: duck: V / N; her: possessive adjective / dative pronoun. Word sense disambiguation: make: create / cook; duck is also semantically ambiguous as N (bird vs. cotton fabric) and as V (plunge under water, lower the head or body suddenly). Syntactic disambiguation: (I (made (her duck))) vs. (I (made (her) (duck))); make: transitive (single direct obj.) / ditransitive (two objs.) / causative (direct obj. + verb). The task is to find the most likely interpretation in the given context.

Why do we need information theory? How much information is contained in a particular probabilistic model (PM)? How predictive is a PM? Given two PMs, which one better matches a corpus? Key notions: entropy, mutual information, relative entropy, cross-entropy, perplexity.

Basic Probability and Information Theory. An overview (not complete, and sometimes imprecise!) to clarify basic concepts you may encounter in NLP and to address common misunderstandings.

Experiments and Sample Spaces. Uncertain situation: an experiment, process, test, etc. that produces exactly one out of several possible basic outcomes. The set of possible basic outcomes is the sample space Ω. Examples: coin toss (Ω = {head, tail}), die (Ω = {1..6}), opinion poll (Ω = {yes, no}), quality test (Ω = {bad, good}), lottery (|Ω| ≈ 10^5 to 10^7), number of traffic accidents in Canada in 2005 (Ω = N), missing word (|Ω| ≈ vocabulary size: what are the possible outcomes of a process trying to find out what that word is?).

Events. An event A is a set of basic outcomes, A ⊆ Ω, and A ∈ 2^Ω (the event space), where 2^Ω is the powerset of Ω. Ω is the certain event and Ø is the impossible event. Example: the experiment is three coin tosses, so Ω = {HHH, HHT, HTH, THH, TTH, HTT, THT, TTT}. Exactly two tails: A = {TTH, HTT, THT}. All heads: A = {HHH}.
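
As a quick illustration (a small sketch added here, not from the original slides), the three-coin-toss sample space and the "exactly two tails" event can be enumerated directly in Python:

```python
from itertools import product

# Sample space for three coin tosses: all length-3 sequences over {H, T}.
omega = {"".join(toss) for toss in product("HT", repeat=3)}

# Events are subsets of the sample space.
two_tails = {o for o in omega if o.count("T") == 2}   # exactly two tails
all_heads = {"HHH"}                                   # all heads

print(sorted(omega))       # 8 basic outcomes
print(sorted(two_tails))   # ['HTT', 'THT', 'TTH']
print(all_heads <= omega)  # True: an event is a subset of Omega
```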

Probability Function/Distribution. Intuition: a measure of how likely an event is. Formally (axiomatic definition): P: 2^Ω → [0,1] with P(Ω) = 1, and if A and B are disjoint events then P(A ∪ B) = P(A) + P(B). Immediate consequences: P(Ø) = 0; P(not A) = 1 - P(A); A ⊆ B implies P(A) ≤ P(B); Σ_{a∈Ω} P(a) = 1. How to estimate P(A): repeat the "experiment" n times, let c be the number of times the outcome is in A, then P(A) ≈ c/n. For the missing word we can "repeat" the experiment by considering each word in the text in turn as missing and counting how many times each word is the outcome.
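
A minimal sketch of the relative-frequency estimate P(A) ≈ c/n for the missing-word setting (not from the slides; the toy text below is made up):

```python
from collections import Counter

# Toy text: in the missing-word experiment each token is one trial,
# and the outcome is the word that fills the gap.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

counts = Counter(tokens)
n = len(tokens)

# Relative-frequency estimate: P(w) ~ c(w) / n
p = {w: c / n for w, c in counts.items()}

print(p["the"])                     # 4/13 ~ 0.308
print(round(sum(p.values()), 10))   # 1.0: the estimates form a distribution

# P(A) for an event A (a set of outcomes) is the sum over its members.
A = {"cat", "dog"}
print(sum(p[w] for w in A))         # 2/13 ~ 0.154
```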

Missing Word from Book. Find out what the missing word is, assuming it is a word already in the text and assuming words are morphologically disambiguated.

Joint and Conditional Probability. Joint: P(A,B) = P(A ∩ B). Conditional: P(A|B) = P(A,B) / P(B). Bayes' rule: P(A,B) = P(B,A) (since P(A ∩ B) = P(B ∩ A)), so P(A|B) P(B) = P(B|A) P(A), and therefore P(A|B) = P(B|A) P(A) / P(B).
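
A small sketch of Bayes' rule with made-up numbers (the candidate words and probabilities are assumptions for illustration, not estimates from real data):

```python
# Which intended word w best explains the observed typo "teh"?
prior = {"the": 0.07, "ten": 0.002}         # P(w), assumed values
likelihood = {"the": 0.01, "ten": 0.001}    # P("teh" | w), assumed values

# Bayes' rule: P(w | "teh") = P("teh" | w) P(w) / P("teh")
unnormalized = {w: likelihood[w] * prior[w] for w in prior}
evidence = sum(unnormalized.values())       # P("teh") over these candidates
posterior = {w: v / evidence for w, v in unnormalized.items()}

print(posterior)  # {'the': ~0.997, 'ten': ~0.003}
```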

Missing Word: Independence. Find out what the missing word is, assuming it is a word already in the text and assuming words are morphologically disambiguated.

Independence. How does P(A|B) relate to P(A)? If knowing that B is the case does not change the probability of A, i.e., P(A|B) = P(A), then A and B are independent. Immediate consequence: P(A,B) = P(A) P(B).
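
A tiny check of independence for two fair coin tosses (a sketch, using exact fractions to avoid rounding):

```python
from itertools import product
from fractions import Fraction

# Two fair coin tosses: each of the 4 outcomes has probability 1/4.
omega = set(product("HT", repeat=2))
prob = {o: Fraction(1, 4) for o in omega}

def P(event):
    return sum(prob[o] for o in event)

A = {o for o in omega if o[0] == "H"}   # first toss is heads
B = {o for o in omega if o[1] == "H"}   # second toss is heads

print(P(A & B) == P(A) * P(B))          # True: A and B are independent
print(P(A & B) / P(B) == P(A))          # equivalently, P(A|B) = P(A)
```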

Chain Rule. The rule: P(A,B,C,D,…) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(…|A,B,C,D). The "proof": P(A,B,C,D,…) = P(A) · P(A,B)/P(A) · P(A,B,C)/P(A,B) · P(A,B,C,D)/P(A,B,C) · P(…,A,B,C,D)/P(A,B,C,D). General form: P(A_1,…,A_n) = Π_{i=1..n} P(A_i | A_1 ∩ … ∩ A_{i-1}).
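
The telescoping "proof" can be checked numerically; this sketch uses a hypothetical joint distribution over three binary variables (all numbers are made up):

```python
from itertools import product
from fractions import Fraction

# Hypothetical joint distribution P(A, B, C) over three binary variables.
weights = {(0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 1,
           (1, 0, 0): 2, (1, 0, 1): 1, (1, 1, 0): 4, (1, 1, 1): 2}
total = sum(weights.values())
joint = {abc: Fraction(w, total) for abc, w in weights.items()}

def P(pred):
    """Probability of the event {(a, b, c) : pred(a, b, c)}."""
    return sum(p for (a, b, c), p in joint.items() if pred(a, b, c))

for a, b, c in product([0, 1], repeat=3):
    # Chain rule: P(A,B,C) = P(A) * P(B|A) * P(C|A,B)
    p_a = P(lambda x, y, z: x == a)
    p_ab = P(lambda x, y, z: x == a and y == b)
    chain = p_a * (p_ab / p_a) * (joint[(a, b, c)] / p_ab)
    assert chain == joint[(a, b, c)]

print("chain rule holds for all 8 outcomes")
```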

Random Variables and pmf. Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space. Examples: a die with its "natural numbering" [1,6]; English word length [1,?]. So far the event space has differed with every problem we look at; a random variable is a function from the sample space to the real numbers (continuous RV) or to the integers (discrete RV). Because an RV has a numerical range, we can often do the math more easily by working with the values of the RV rather than directly with events, and for a discrete RV X we can define the probability mass function (pmf) p(x) = P(X = x).

Example: English Word Length. [Plot of the pmf p(x) for English word length (or for the number of words in a document), with x ranging roughly from 1 to 25.] Sampling? How to do it?
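
One simple way to "sample" is to treat every token of a text as one observation of the RV; here is a sketch on a toy text (a real estimate would of course use a large corpus):

```python
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog "
        "probability theory gives us a principled way to reason about language")
lengths = [len(w) for w in text.split()]

counts = Counter(lengths)
n = len(lengths)
pmf = {x: c / n for x, c in sorted(counts.items())}

for x, px in pmf.items():
    print(f"p(X = {x:2d}) = {px:.3f}")
print("sum:", round(sum(pmf.values()), 10))   # 1.0
```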

Expectation and Variance. The expectation E(X) = Σ_x x p(x) is the (expected) mean or average of an RV. Example: rolling one die, E(X) = 3.5. The variance Var(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2 is a measure of whether the values of the RV tend to be consistent over samples or to vary a lot; σ = √Var(X) is the standard deviation.
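
The die example, worked out exactly (a sketch using fractions so both variance formulas can be compared without rounding):

```python
from fractions import Fraction

# Fair six-sided die: p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                     # E(X)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())        # E((X - E(X))^2)
var_alt = sum(x * x * p for x, p in pmf.items()) - mean ** 2  # E(X^2) - E(X)^2

print(mean)               # 7/2, i.e. 3.5
print(var, var_alt)       # 35/12 by both formulas
print(float(var) ** 0.5)  # standard deviation, ~1.708
```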

Joint, Marginal and Conditional RVs/Distributions. Sometimes you need to define more than one RV or probability distribution over a sample space. We call all of them distributions, and Bayes' rule and the chain rule also apply to them!

Joint Distribution (word length X and word class Y). [Table of the joint distribution over Y ∈ {N, V, Adj, Adv} and X ∈ {1, 2, 3, 4, …}; the numbers are fictional.] Note: word class is not really an RV (its values are not numerical).

Conditional Distributions and Independence (word length X and word class Y). [Tables illustrating conditional distributions and an independence check for X and Y; values omitted in the original.] Again, word class is not really an RV.
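
A sketch of the joint/marginal/conditional picture for word length and word class; like the slide's table, the numbers below are fictional:

```python
# Hypothetical joint distribution over word class Y and word length X (1..4).
joint = {
    ("N", 1): 0.02, ("N", 2): 0.05, ("N", 3): 0.10, ("N", 4): 0.13,
    ("V", 1): 0.01, ("V", 2): 0.08, ("V", 3): 0.12, ("V", 4): 0.09,
    ("Adj", 1): 0.00, ("Adj", 2): 0.02, ("Adj", 3): 0.06, ("Adj", 4): 0.12,
    ("Adv", 1): 0.01, ("Adv", 2): 0.03, ("Adv", 3): 0.07, ("Adv", 4): 0.09,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginals: sum the joint over the other variable.
p_class, p_length = {}, {}
for (y, x), p in joint.items():
    p_class[y] = p_class.get(y, 0.0) + p
    p_length[x] = p_length.get(x, 0.0) + p

# Conditional: P(X = x | Y = "N") = P(x, "N") / P("N")
p_len_given_N = {x: joint[("N", x)] / p_class["N"] for x in sorted(p_length)}

print(p_class)        # marginal over word class
print(p_length)       # marginal over word length
print(p_len_given_N)  # conditional distribution of length given class N
```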

Standard Distributions. Discrete: binomial, multinomial. Continuous: normal. Go back to your stats textbook for details.

Today (28/1): Why do we need probabilities and information theory? Basic probability theory. Basic information theory. We cover the topics that matter most in the field, not just the ones used in the textbook. The overview is not complete: the aim is to clarify the key concepts and address the most common misunderstandings.

Entropy. Def. 1: a measure of uncertainty. Def. 2: a measure of the information that we need to resolve an uncertain situation. Def. 3: a measure of the information that we obtain from an experiment that resolves an uncertain situation. Here X is not limited to numbers: it can range over a set of basic elements such as words or parts of speech. Let p(x) = P(X = x), where x ∈ X. Then H(p) = H(X) = - Σ_{x∈X} p(x) log2 p(x). Entropy is normally measured in bits.
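
A minimal entropy function in Python (a sketch; the distributions below are just examples):

```python
import math

def entropy(pmf):
    """H(p) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

fair_coin = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}
fair_die = {x: 1 / 6 for x in range(1, 7)}

print(entropy(fair_coin))    # 1.0 bit of uncertainty
print(entropy(biased_coin))  # ~0.469 bits: more predictable, less uncertainty
print(entropy(fair_die))     # log2(6) ~ 2.585 bits
```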

Entropy (extra slides). Using the formula: examples, including the binary outcome; the limits (why exactly that formula?); entropy and expectation; the coding interpretation; joint and conditional entropy; a summary of key properties.

Mutual Information. Chain rule for entropy: H(X,Y) = H(X) + H(Y|X). By the chain rule (which can be proven), H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y); therefore H(X) - H(X|Y) = H(Y) - H(Y|X). This difference is called the mutual information between X and Y, I(X,Y): the reduction in uncertainty of one random variable due to knowing about another, i.e., the amount of information one random variable contains about another.
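
A sketch computing I(X,Y) = H(X) + H(Y) - H(X,Y) for a hypothetical joint distribution over two binary variables (the numbers are made up):

```python
import math

def H(pmf):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Hypothetical joint distribution over two binary RVs X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X,Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)
mi = H(p_x) + H(p_y) - H(joint)
print(mi)  # ~0.278 bits; it would be 0 if X and Y were independent
```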

Relative Entropy or Kullback-Leibler Divergence. Def.: the relative entropy is a measure of how different two probability distributions (over the same event space) are: D(p||q) = Σ_{x∈X} p(x) log2(p(x)/q(x)). The KL divergence between p and q can be seen as the average number of bits wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q. Mutual information is a relative entropy: I(X,Y) = D(p(x,y) || p(x)p(y)), so I(X,Y) measures how much the joint p(x,y) deviates from independence (i.e., from p(x)p(y)).
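
A sketch of D(p||q) with base-2 logs (the two example distributions are made up):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# "True" distribution p vs. a not-quite-right model q over the same event space.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

print(kl_divergence(p, q))  # ~0.085 bits wasted per event on average
print(kl_divergence(p, p))  # 0.0: no waste when the model is exactly right
```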

Next Time. Probabilistic models applied to spelling. Read Chapter 5 up to page 156.