1 Foundations of Statistical Natural Language Processing By Christopher Manning & Hinrich Schütze Course Book

2 Chapter 2: Mathematical Foundations January 8, 2007 Note: These notes are by Barbara Rosario from UC Berkeley and have been augmented with material drawn from Chris Manning's book.

3 Mathematical Foundations Elementary Probability Theory Essential Information Theory

4 Motivations Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.

5 Motivations (cont.) An example of statistical inference is the task of language modeling (e.g., how to predict the next word given the previous words). In order to do this, we need a model of the language. Probability theory helps us find such a model.

6 Probability Spaces Probability theory: a way of predicting how likely something is to happen. Experiment/Trial: tossing 3 coins, for example. Basic outcomes: the sample space Ω. Probability function: distributes a mass of 1 throughout the sample space Ω; P: F → [0,1], with P(Ω) = 1.

7 Example Ex 1 (pg 41): A fair coin is tossed 3 times. What is the probability of exactly 2 heads? Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} Event A = {HHT, HTH, THH} P(A) = |A| / |Ω| = 3/8
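A minimal Python sketch of this calculation (the variable names are illustrative, not from the slides): enumerate the sample space for three tosses of a fair coin and count the equally likely outcomes with exactly two heads.

```python
from itertools import product

# Sample space for three tosses of a fair coin: 2^3 = 8 equally likely outcomes
omega = list(product("HT", repeat=3))

# Event A: exactly two heads
A = [outcome for outcome in omega if outcome.count("H") == 2]

# For equally likely outcomes, P(A) = |A| / |Omega|
print(len(A) / len(omega))  # 0.375 = 3/8
```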

8 Probability Theory How likely it is that something will happen. The sample space Ω is a listing of all possible outcomes of an experiment. An event A is a subset of Ω. A probability function (or distribution) assigns a probability to each event.

9 Prior Probability Prior probability: the probability before we consider any additional knowledge

10 Conditional Probability Sometimes we have partial knowledge about the outcome of an experiment. Conditional (or posterior) probability: suppose we know that event B is true. The probability that A is true given the knowledge about B is expressed by P(A|B) = P(A ∩ B) / P(B).

11 Conditional probability and independence

12 Conditional Probability (cont.) The joint probability of A and B, P(A, B) = P(A ∩ B), can be laid out as a 2-dimensional table with a value in every cell giving the probability of that specific state occurring.

13 Chain Rule P(A,B) = P(A|B)P(B) = P(B|A)P(A) P(A,B,C,D,...) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) ...
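As a sketch (with a made-up two-event joint distribution, not taken from the slides), the first form of the chain rule can be checked numerically: every entry P(A, B) should equal P(A|B) P(B).

```python
# Hypothetical joint distribution over two binary events A and B
joint = {
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}

def p_b(b):
    # Marginal probability P(B = b)
    return sum(p for (_, bb), p in joint.items() if bb == b)

def p_a_given_b(a, b):
    # Conditional probability P(A = a | B = b)
    return joint[(a, b)] / p_b(b)

# Chain rule: P(A, B) = P(A | B) P(B) for every cell of the table
for (a, b), p_ab in joint.items():
    assert abs(p_ab - p_a_given_b(a, b) * p_b(b)) < 1e-12
print("chain rule holds on this table")
```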

14 (Conditional) independence Two events A & B are independent of each other if P(A) = P(A|B) Two events A and B are conditionally independent of each other given C if P(A|C) = P(A|B,C)

15 Bayes' Theorem Bayes' theorem lets us swap the order of dependence between events. We saw that P(A,B) = P(A|B)P(B) = P(B|A)P(A). Bayes' theorem: P(B|A) = P(A|B)P(B) / P(A).

16 Example S: stiff neck, M: meningitis. P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20. I have a stiff neck; should I worry?
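Filling in the arithmetic the slide leaves implicit (a standard application of Bayes' theorem to the numbers above):

```latex
P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)}
            = \frac{0.5 \times \tfrac{1}{50{,}000}}{\tfrac{1}{20}}
            = 0.0002
```

So the posterior probability of meningitis given a stiff neck is about 1 in 5,000: the stiff neck alone is weak evidence.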

17 Random Variables So far, we have used an event space that differs with every problem we look at. Random variables (RVs) X allow us to talk about the probabilities of numerical values that are related to the event space.

18 Probability Mass Function (pmf) pmf: p(x) = p(X = x) = P(A_x), where A_x = {ω ∈ Ω : X(ω) = x}. ∑_i p(x_i) = ∑_i P(A_{x_i}) = P(Ω) = 1

19 Expectation and Variance The expectation is the mean (average) of a random variable: E(X) = ∑_x x p(x)

20 Expectation The expectation is the mean or average of an RV.

21 Variance The variance of an RV is a measure of whether the values of the RV tend to be consistent over trials or to vary a lot: Var(X) = E((X - E(X))^2) = E(X^2) - E(X)^2. σ = √Var(X) is the standard deviation.

22 Example of variance calculation
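The slide's original worked example is not reproduced in this transcript. As a stand-in (a fair six-sided die, an assumption of this sketch rather than the slide's example), the expectation and variance can be computed directly from the pmf:

```python
# Stand-in example (not the slide's original): a fair six-sided die
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # uniform pmf

mean = sum(x * p for x in values)                     # E(X) = 3.5
variance = sum((x - mean) ** 2 * p for x in values)   # Var(X) = E((X - E(X))^2)
std_dev = variance ** 0.5                             # sigma

print(mean, variance, std_dev)  # 3.5 2.9166... 1.7078...
```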

23 Back to the Language Model In general, for language events, P is unknown. We need to estimate P (or a model M of the language). We'll do this by looking at evidence about what P must be, based on a sample of data.

24 Estimation of P Frequentist statistics Bayesian statistics

25 Frequentist Statistics Relative frequency: the proportion of times an outcome u occurs, f(u) = C(u)/N, where C(u) is the number of times u occurs in N trials. For N → ∞, the relative frequency tends to stabilize around some number: the probability estimate.
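A minimal sketch of relative-frequency estimation (the toy corpus below is invented for illustration): count C(u) for each word type and divide by the number of trials N.

```python
from collections import Counter

# Toy corpus of word tokens (illustrative only)
tokens = "the cat sat on the mat the cat slept".split()

counts = Counter(tokens)   # C(u) for each word type u
N = len(tokens)            # number of trials (here, tokens)

relative_freq = {u: c / N for u, c in counts.items()}
print(relative_freq["the"])  # 3/9 = 0.333...
```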

26 Frequentist Statistics (cont.) Two different approaches: Parametric Non-parametric (distribution-free)

27 Parametric Methods Assume that some phenomenon in language is acceptably modeled by one of the well-known families of distributions (such as the binomial or the normal). We have an explicit probabilistic model of the process by which the data was generated, and determining a particular probability distribution within the family requires only the specification of a few parameters (less training data).

28 Non-Parametric Methods No assumption about the underlying distribution of the data. For example, simply estimating P empirically by counting a large number of random events is a distribution-free method. Less prior information, so more training data is needed.

29 Binomial Distribution (Parametric) A series of trials with only two outcomes, each trial being independent of all the others. The number r of successes out of n trials, given that the probability of success in any trial is p: b(r; n, p) = C(n, r) p^r (1 - p)^(n - r)
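A sketch of the binomial probability mass function above, using only the Python standard library:

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials with success probability p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# E.g., probability of exactly 2 heads in 3 tosses of a fair coin
print(binomial_pmf(2, 3, 0.5))  # 0.375, matching the earlier coin example
```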

30 Normal (Gaussian) Distribution (Parametric) Continuous. Two parameters: mean μ and standard deviation σ: n(x; μ, σ) = 1/(σ√(2π)) exp(-(x - μ)^2 / (2σ^2)). Used in clustering.

31 Frequentist Statistics D: data; M: model (distribution P); θ: parameters (e.g., μ, σ). For M fixed, the maximum likelihood estimate chooses θ* such that θ* = argmax_θ P(D | M, θ).

32 Frequentist Statistics Model selection, by comparing the maximum likelihood: choose M* such that M* = argmax_M P(D | M, θ*_M), where θ*_M is the maximum likelihood estimate within model M.

33 Estimation of P Frequentist statistics Parametric methods Standard distributions: Binomial distribution (discrete) Normal (Gaussian) distribution (continuous) Maximum likelihood Non-parametric methods Bayesian statistics

34 Bayesian Statistics Bayesian statistics measures degrees of belief. Degrees are calculated by starting with prior beliefs and updating them in the face of the evidence, using Bayes' theorem.

35 Bayesian Statistics (cont)

36 Bayesian Statistics (cont.) M is the distribution; to fully describe the model, we need both the distribution M and the parameters θ.

37 Frequentist vs. Bayesian Bayesian Frequentist

38 Bayesian Updating How do we update P(M)? We start with an a priori probability distribution P(M), and when a new datum comes in, we update our beliefs by calculating the posterior probability P(M|D). This then becomes the new prior, and the process repeats on each new datum.

39 Bayesian Decision Theory Suppose we have two models M1 and M2; we want to evaluate which model better explains some new data. If P(M1|D) > P(M2|D), M1 is the more likely model; otherwise, M2 is.
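A sketch of this comparison (the priors and likelihoods below are invented numbers, not from the slides): compute the posterior odds P(M1|D)/P(M2|D) via Bayes' theorem; if the ratio exceeds 1, M1 better explains the data.

```python
# Hypothetical priors P(M) and likelihoods P(D | M) for two competing models
prior = {"M1": 0.5, "M2": 0.5}
likelihood = {"M1": 0.02, "M2": 0.005}

# Posterior odds: P(M1 | D) / P(M2 | D) = [P(D|M1) P(M1)] / [P(D|M2) P(M2)]
posterior_odds = (likelihood["M1"] * prior["M1"]) / (likelihood["M2"] * prior["M2"])

best = "M1" if posterior_odds > 1 else "M2"
print(posterior_odds, best)  # 4.0 M1
```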

40 Essential Information Theory Developed by Shannon in the 1940s. Concerned with maximizing the amount of information that can be transmitted over an imperfect communication channel. Data compression (entropy); transmission rate (channel capacity).

41 Entropy X: a discrete RV with pmf p(X). Entropy (or self-information): H(X) = -∑_x p(x) log2 p(x), measured in bits. Entropy measures the amount of information in an RV; it is the average length of the message needed to transmit an outcome of that variable using the optimal code.

42 Entropy (cont.) H(X) ≥ 0, and H(X) = 0 only when the value of X is determinate, hence providing no new information.
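A minimal sketch of the point above: H(X) is zero for a determinate outcome and grows with uncertainty.

```python
from math import log2

def entropy(pmf):
    """Entropy in bits of a discrete distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

print(entropy({"H": 1.0}))            # 0 bits (may print as -0.0): determinate outcome, no information
print(entropy({"H": 0.5, "T": 0.5}))  # 1.0: a fair coin toss carries one bit
print(entropy({"H": 0.9, "T": 0.1}))  # ~0.47: a biased coin is less uncertain
```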

43 Example of entropy calculation

44 Example: Simplified Polynesian
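The worked example itself is not reproduced in this transcript. Assuming the letter probabilities used in the textbook's Simplified Polynesian example (letters p, t, k, a, i, u with probabilities 1/8, 1/4, 1/8, 1/4, 1/8, 1/8), the per-letter entropy works out to 2.5 bits:

```python
from math import log2

# Letter probabilities as assumed from the textbook's Simplified Polynesian example
pmf = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

H = -sum(p * log2(p) for p in pmf.values())
print(H)  # 2.5 bits per letter
```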

45 Joint Entropy The joint entropy of two RVs X, Y is the amount of information needed on average to specify both their values: H(X,Y) = -∑_x ∑_y p(x,y) log2 p(x,y)

46 Conditional Entropy The conditional entropy of an RV Y given another X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party knows X: H(Y|X) = -∑_x ∑_y p(x,y) log2 p(y|x)

47 Chain Rule H(X,Y) = H(X) + H(Y|X); more generally, H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_1, ..., X_{n-1})

48 Mutual Information I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) is the mutual information between X and Y. It is the reduction in uncertainty of one RV due to knowing about the other, or the amount of information one RV contains about the other.

49 Mutual Information (cont.) I is 0 only when X and Y are independent: H(X|Y) = H(X). Since H(X) = H(X) - H(X|X) = I(X,X), entropy is the self-information of an RV.
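A sketch computing mutual information directly from a small (invented) joint distribution, using the equivalent form I(X,Y) = ∑_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ):

```python
from math import log2

# Hypothetical joint distribution over two binary RVs X and Y
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginals p(x) and p(y)
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X,Y) = sum over (x, y) of p(x,y) log2( p(x,y) / (p(x) p(y)) )
I = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
print(I)  # > 0: X and Y are not independent in this table
```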

50 Entropy and Linguistics Entropy is a measure of uncertainty. The more we know about something, the lower the entropy. If a language model captures more of the structure of the language, then its entropy should be lower. We can use entropy as a measure of the quality of our models.

51 Entropy and Linguistics H: entropy of the language; we don't know p(X), so what can we do? Suppose our model of the language is q(X). How good an estimate of p(X) is q(X)?

52 Entropy and Linguistics: Kullback-Leibler Divergence Relative entropy, or KL (Kullback-Leibler) divergence: D(p||q) = ∑_x p(x) log2 ( p(x) / q(x) )

53 Entropy and Linguistics (cont.) A measure of how different two probability distributions are: the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q. Goal: minimize the relative entropy D(p||q) to have a probabilistic model that is as accurate as possible.
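A minimal sketch of relative entropy, with invented distributions p (the "true" distribution) and q (our model estimate):

```python
from math import log2

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # "true" distribution
q = {"a": 1/3, "b": 1/3, "c": 1/3}     # model estimate

print(kl_divergence(p, q))  # > 0: bits wasted per event by using q instead of p
print(kl_divergence(p, p))  # 0.0: no waste when the model matches the true distribution
```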

54 The Noisy Channel Model The aim is to optimize, in terms of throughput and accuracy, the communication of messages in the presence of noise in the channel. There is a duality between compression (achieved by removing all redundancy) and transmission accuracy (achieved by adding controlled redundancy so that the input can be recovered in the presence of noise).

55 The Noisy Channel Model Goal: encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors. Diagram: message W → encoder → X (input to channel) → channel p(y|x) → Y (output from channel) → decoder → W* (attempt to reconstruct the message based on the output).

56 The Noisy Channel Model Channel capacity: the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output. We reach the channel capacity if we manage to design an input code X whose distribution p(X) maximizes the mutual information I between input and output.

57 Linguistics and the Noisy Channel Model In linguistics we can't control the encoding phase; we only want to decode the output to give the most likely input. Diagram: I → noisy channel p(o|i) → O → decoder → most likely I.

58 The Noisy Channel Model Decoding: Î = argmax_i p(i|o) = argmax_i p(i) p(o|i), where p(i) is the language model and p(o|i) is the channel probability. Examples: machine translation, optical character recognition, speech recognition.
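A hedged sketch of noisy-channel decoding (the candidate inputs and both probability tables are invented for illustration): pick the input that maximizes p(i) p(o|i), as in the decoding rule above.

```python
# Hypothetical language model p(i) and channel model p(o | i) for a fixed observation o
language_model = {"recognize speech": 0.7, "wreck a nice beach": 0.3}
channel_model = {                      # p(observed acoustics o | intended phrase i)
    "recognize speech": 0.2,
    "wreck a nice beach": 0.3,
}

# Noisy-channel decoding: argmax over candidate inputs i of p(i) * p(o | i)
best = max(language_model, key=lambda i: language_model[i] * channel_model[i])
print(best)  # "recognize speech": 0.7 * 0.2 = 0.14 beats 0.3 * 0.3 = 0.09
```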