7 - 1 Chapter 7 Mathematical Foundations

7 - 2 Notions of Probability Theory Probability theory deals with predicting how likely it is that something will happen. The process by which an observation is made is called an experiment or a trial. The collection of basic outcomes (or sample points) for our experiment is called the sample space Ω (Omega). An event is a subset of the sample space. Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 certainty. A probability function/distribution distributes a probability mass of 1 throughout the sample space.

7 - 3 Example A fair coin is tossed 3 times (an experiment). What is the chance of 2 heads (the event of interest)?
–sample space Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
–uniform distribution (the probability function): P(basic outcome) = 1/8
–A = {HHT, HTH, THH}, the event of getting 2 heads when tossing 3 coins (a subset of Ω)
–P(A) = 3/8

7 - 4 Conditional Probability Conditional probabilities measure the probability of events given some knowledge. Prior probabilities measure the probabilities of events before we consider our additional knowledge. Posterior probabilities are probabilities that result from using our additional knowledge: P(A|B) = P(A ∩ B) / P(B). Example (three tosses): event B: the 1st toss is H; event A: 2 Hs among the 1st, 2nd and 3rd tosses. P(B) = 1/2, P(A ∩ B) = P({HHT, HTH}) = 1/4, so P(A|B) = (1/4)/(1/2) = 1/2.
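
A minimal sketch (not part of the original slides) that enumerates the coin-toss sample space and checks the two results above, P(A) = 3/8 and P(A|B) = 1/2.

    from itertools import product

    # Sample space for three tosses of a fair coin: 8 equally likely outcomes.
    omega = [''.join(t) for t in product('HT', repeat=3)]

    A = {o for o in omega if o.count('H') == 2}   # event: exactly two heads
    B = {o for o in omega if o[0] == 'H'}         # event: first toss is heads

    p = lambda event: len(event) / len(omega)     # uniform distribution

    print(p(A))              # 0.375 (= 3/8)
    print(p(A & B) / p(B))   # 0.5   (= P(A|B))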

7 - 5 The multiplication rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A). The chain rule (used in Markov models, …): P(A1 ∩ … ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) … P(An|A1 ∩ … ∩ An-1).

7 - 6 Independence The chain rule relates intersection with conditionalization (important to NLP). Independence and conditional independence of events are two very important notions in statistics:
–independence: P(A ∩ B) = P(A) P(B)
–conditional independence: P(A ∩ B | C) = P(A|C) P(B|C)

7 - 7 Bayes’ Theorem Bayes’ Theorem lets us swap the order of dependence between events: P(B|A) = P(A|B) P(B) / P(A). This is important when the former quantity is difficult to determine. P(A) is a normalizing constant.

7 - 8 An Application Pick the best conclusion c given some evidence e:
–(1) evaluate the probability P(c|e) <--- unknown
–(2) select the c with the largest P(c|e).
By Bayes’ theorem, P(c|e) = P(e|c) P(c) / P(e), where P(c) and P(e|c) are known and P(e) is the same for every c. Example:
–P(c): the relative probability of a disease
–P(e|c): how often a symptom is associated with that disease

7 - 9 Bayes’ Theorem (expanding the normalizing constant over B and its complement): P(B|A) = P(A|B) P(B) / P(A), where P(A) = P(A ∩ B) + P(A ∩ ¬B) = P(A|B) P(B) + P(A|¬B) P(¬B).

Bayes’ Theorem (general form, used in the noisy channel model): if the sets B1, …, Bn partition A (Bi ∩ Bj = ∅ for i ≠ j and A ⊆ B1 ∪ … ∪ Bn) and P(A) > 0, then P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi).

An Example A parasitic gap occurs once in 100,000 sentences. A complicated pattern matcher attempts to identify sentences with parasitic gaps. The answer is positive with probability 0.95 when a sentence has a parasitic gap, and positive with probability 0.005 when it has no parasitic gap. When the test says that a sentence contains a parasitic gap, what is the probability that this is true? P(G) = 0.00001, P(¬G) = 0.99999, P(T|G) = 0.95, P(T|¬G) = 0.005, so P(G|T) = P(T|G) P(G) / (P(T|G) P(G) + P(T|¬G) P(¬G)) ≈ 0.002.
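
A quick check of the arithmetic above. The false-positive rate did not survive in this transcript, so the 0.005 used here is an assumption.

    # Bayes' theorem for the parasitic-gap pattern matcher.
    p_g = 1e-5       # P(G): a parasitic gap occurs once in 100,000 sentences
    p_t_g = 0.95     # P(T|G): test positive given a gap
    p_t_ng = 0.005   # P(T|~G): test positive given no gap (assumed value)

    p_t = p_t_g * p_g + p_t_ng * (1 - p_g)   # total probability of a positive test
    print(p_t_g * p_g / p_t)                 # ~0.0019: most positives are false alarms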

Random Variables A random variable is a function X: Ω (sample space) → R^n; it lets us talk about the probabilities of numerical values that are related to the event space. A discrete random variable is a function X: Ω → S, where S is a countable subset of R. If X: Ω → {0, 1}, then X is called a Bernoulli trial. The probability mass function (pmf) for a random variable X gives the probability that the random variable takes different numeric values: pmf p(x) = p(X = x) = P(Ax), where Ax = {ω ∈ Ω: X(ω) = x} and Σx p(x) = 1.

Example: toss two dice and sum their faces. Ω = {(1,1), (1,2), …, (1,6), (2,1), (2,2), …, (6,1), (6,2), …, (6,6)}, S = {2, 3, 4, …, 12}, X: Ω → S. pmf: p(3) = p(X=3) = P(A3) = P({(1,2),(2,1)}) = 2/36, where A3 = {ω: X(ω) = 3} = {(1,2),(2,1)}.
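
A small sketch that tabulates this pmf by enumerating the 36 outcomes; it reproduces p(X = 3) = 2/36.

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    # Enumerate the 36 equally likely outcomes of two dice and sum their faces.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    pmf = {s: Fraction(c, 36) for s, c in counts.items()}

    print(pmf[3])             # 1/18 (= 2/36)
    print(sum(pmf.values()))  # 1    (the probability masses sum to one)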

Expectation The expectation is the mean (μ, mu) or average of a random variable: E(X) = Σx x p(x). Example: roll one die and let Y be the value on its face; E(Y) = Σy=1..6 y × 1/6 = 21/6 = 3.5. Also E(aY+b) = aE(Y)+b and E(X+Y) = E(X)+E(Y).

Variance The variance (σ², sigma squared) of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot: Var(X) = E((X − E(X))²) = E(X²) − E²(X). The standard deviation (σ) is the square root of the variance.

Example X: toss two dice and sum their faces, i.e., X = Y1 + Y2, where Y1 and Y2 are independent copies of Y, the value on a single die. Then E(X) = 2 E(Y) = 7 and Var(X) = 2 Var(Y) = 2 × 35/12 = 35/6.
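
A brute-force check of these two numbers over the 36 outcomes.

    from itertools import product

    # All 36 equally likely (die1, die2) outcomes and their sums.
    sums = [a + b for a, b in product(range(1, 7), repeat=2)]

    mean = sum(sums) / 36
    var = sum((x - mean) ** 2 for x in sums) / 36

    print(mean)  # 7.0
    print(var)   # 5.833... (= 35/6, twice the single-die variance of 35/12)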

Joint and Conditional Distributions More than one random variable can be defined over a sample space. In this case, we talk about a joint or multivariate probability distribution. The joint probability mass function for two discrete random variables X and Y is p(x, y) = P(X=x, Y=y). The marginal probability mass function totals up the probability masses for the values of each variable separately: pX(x) = Σy p(x, y) and pY(y) = Σx p(x, y). If X and Y are independent, then p(x, y) = pX(x) pY(y).

Joint and Conditional Distributions Similar intersection rules hold for joint distributions as for events: the conditional pmf is p(y|x) = p(x, y) / pX(x) whenever pX(x) > 0. Chain rule in terms of random variables: p(w, x, y, z) = p(w) p(x|w) p(y|w, x) p(z|w, x, y).

Estimating Probability Functions What is the probability that the sentence “The cow chewed its cud” will be uttered? P is unknown, so it must be estimated from a sample of data. An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs. Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach. If no such assumption can be made, we must use a non-parametric approach or distribution-free approach.

Parametric approach Select an explicit probabilistic model. Specify a few parameters to determine a particular probability distribution. The amount of training data required is not great: the few parameters can be estimated well enough to make good probability estimates.

Standard Distributions In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed. Families of pmfs are called distributions and the constants that define the different possible pmfs in one family are called parameters. Discrete Distributions: the binomial distribution, the multinomial distribution, the Poisson distribution. Continuous Distributions: the normal distribution, the standard normal distribution.

Binomial distribution A series of trials with only two outcomes, each trial being independent of all the others. The number r of successes out of n trials, given that the probability of success in any trial is p, has pmf b(r; n, p) = C(n, r) p^r (1−p)^(n−r), for r = 0, 1, 2, …, n (parameters: n, p; variable: r). Expectation: np. Variance: np(1−p).
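
A minimal binomial pmf sketch; for instance, the probability of r = 2 heads in n = 3 fair tosses is 3/8, matching the earlier coin example.

    from math import comb

    def binomial_pmf(r, n, p):
        """b(r; n, p): probability of r successes in n independent trials."""
        return comb(n, r) * p**r * (1 - p)**(n - r)

    print(binomial_pmf(2, 3, 0.5))                           # 0.375 (= 3/8)
    print(sum(binomial_pmf(r, 10, 0.7) for r in range(11)))  # ~1.0 (the pmf sums to one)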

The normal distribution Two parameters, for the mean μ and the standard deviation σ; the curve is given by n(x; μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)). The standard normal distribution has μ = 0 and σ = 1.

Bayesian Statistics I: Bayesian Updating Frequentist statistics vs. Bayesian statistics:
–Toss a coin 10 times and get 8 heads → 8/10 (maximum likelihood estimate)
–But 8 heads out of 10 just happens sometimes given a small sample.
Assume that the data are coming in sequentially and are independent. Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution. The MAP probability becomes the new prior and the process repeats on each new datum.

Frequentist Statistics μm: the model that asserts P(head) = m. s: a particular sequence of observations yielding i heads and j tails. Then P(s|μm) = m^i (1 − m)^j. Find the MLE (maximum likelihood estimate) by differentiating this polynomial: the maximum is at m = i/(i + j). With 8 heads and 2 tails: 8/(8+2) = 0.8.

Bayesian Statistics A priori probability distribution: our belief in the fairness of the coin, e.g. P(μm) = 6m(1 − m), which favors a regular, fair coin. Given a particular sequence of observations s with i heads and j tails, the posterior is P(μm|s) ∝ P(s|μm) P(μm) = 6 m^(i+1) (1 − m)^(j+1), so the new belief in the fairness of the coin (the MAP estimate) is m = (i+1)/(i+j+2); when i = 8, j = 2 this gives 9/12 = 0.75, and this posterior becomes the new prior.
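
A sketch contrasting the MLE with the MAP estimate under the quadratic prior assumed above; the exact form of the prior, 6m(1−m), is an assumption where the slide's formula did not survive.

    # Grid search over coin-bias values m in (0, 1).
    i, j = 8, 2                                  # 8 heads, 2 tails
    grid = [k / 10000 for k in range(1, 10000)]

    def likelihood(m):                           # P(s | mu_m)
        return m**i * (1 - m)**j

    def prior(m):                                # assumed prior favoring a fair coin
        return 6 * m * (1 - m)

    mle = max(grid, key=likelihood)
    map_est = max(grid, key=lambda m: likelihood(m) * prior(m))

    print(mle)      # 0.8  (maximum likelihood estimate)
    print(map_est)  # 0.75 (MAP estimate, pulled toward 0.5 by the prior)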

Bayesian Statistics II: Bayesian Decision Theory Bayesian Statistics can be used to evaluate which model or family of models better explains some data. We define two different models of the event and calculate the likelihood ratio between these two models.

Entropy The entropy is the average uncertainty of a single random variable. Let p(x) = P(X=x), where x ∈ X. Then H(p) = H(X) = −Σx∈X p(x) log2 p(x). In other words, entropy measures the amount of information in a random variable. It is normally measured in bits. Example: toss two coins and count the number of heads; p(0) = 1/4, p(1) = 1/2, p(2) = 1/4, so H(X) = 1.5 bits.
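
A tiny entropy helper, checked on the two-coin distribution above (1.5 bits) and on a uniform 8-way choice (3 bits, the example on the next slide).

    from math import log2

    def entropy(probs):
        """H(p) = -sum p(x) log2 p(x), in bits; zero-probability outcomes contribute nothing."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.5, 0.25]))  # 1.5 bits (number of heads in two coin tosses)
    print(entropy([1/8] * 8))          # 3.0 bits (a uniform 8-sided die)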

Example Roll an 8-sided die: H(X) = −Σi=1..8 (1/8) log2 (1/8) = log2 8 = 3 bits. Entropy (another view):
–the average length of the message needed to transmit an outcome of that variable
–optimal code length to send a message of probability p(i): −log2 p(i) bits

Example Problem: send a friend a message that is a number from 0 to 3. How long a message must you send? (in terms of number of bits) With equiprobable numbers, 2 bits. Example: watch a house with two occupants.
Case / Situation / Probability1 / Probability2
0 / no occupants / 0.25 / 0.5
1 / 1st occupant / 0.25 / 0.125
2 / 2nd occupant / 0.25 / 0.125
3 / both occupants / 0.25 / 0.25
Under Probability1 a fixed 2-bit code per case is optimal; the skewed Probability2 admits a shorter variable-length code (next slide).

Variable-length encoding Code tree requirements:
–(1) all messages are handled;
–(2) it is clear when one message ends and the next starts (no codeword is a prefix of another).
Fewer bits for more frequent messages, more bits for less frequent messages:
0 - no occupants; 10 - both occupants; 110 - first occupant; 111 - second occupant
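
A sketch comparing the expected length of this variable-length code with the entropy of the skewed distribution. The probabilities are the ones implied by the code lengths (an assumption, since the table's numbers did not survive).

    from math import log2

    # Assumed distribution over the four situations (implied by the code lengths).
    probs = {'none': 0.5, 'both': 0.25, 'first': 0.125, 'second': 0.125}
    code = {'none': '0', 'both': '10', 'first': '110', 'second': '111'}

    expected_len = sum(probs[s] * len(code[s]) for s in probs)
    H = -sum(p * log2(p) for p in probs.values())

    print(expected_len)  # 1.75 bits per message
    print(H)             # 1.75 bits: this code meets the entropy lower bound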

W: a random variable for a message. V(W): the set of possible messages. P: their probability distribution. The lower bound on the number of bits needed to encode such a message is the entropy of the random variable: H(W) = −Σw∈V(W) P(w) log2 P(w).

(1) Entropy of a message: a lower bound for the average number of bits needed to transmit that message. (2) Encoding method: use ⌈−log2 P(w)⌉ bits for message w. Entropy (another view): a measure of the uncertainty about what a message says; fewer bits for a more certain message, more bits for a less certain message.

Simplified Polynesian Letter probabilities:
p 1/8, t 1/4, k 1/8, a 1/4, i 1/8, u 1/8
Per-letter entropy: H(P) = −Σ p(x) log2 p(x) = 2 × (1/4 × 2) + 4 × (1/8 × 3) = 2.5 bits. An optimal code therefore uses 2-bit codewords for t and a and 3-bit codewords for p, k, i, u: fewer bits are used to send more frequent letters.
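
A quick check of the per-letter entropy of Simplified Polynesian under this letter model, together with the code-length view (−log2 p(x) bits per letter).

    from math import log2

    letters = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}

    H = -sum(p * log2(p) for p in letters.values())
    print(H)  # 2.5 bits per letter

    # An optimal code spends -log2 p(x) bits on letter x.
    print({x: -log2(p) for x, p in letters.items()})  # {'p': 3.0, 't': 2.0, ...}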

Joint Entropy and Conditional Entropy The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values: H(X,Y) = −Σx Σy p(x,y) log2 p(x,y). The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X: H(Y|X) = Σx p(x) H(Y|X=x) = −Σx Σy p(x,y) log2 p(y|x).

Chain rule for entropy H(X,Y) = H(X) + H(Y|X), and more generally H(X1, …, Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1, …, Xn−1). Proof (two variables): log p(x,y) = log p(x) + log p(y|x); taking expectations of both sides under p(x,y) gives −H(X,Y) = −H(X) − H(Y|X).

Simplified Polynesian revisited Distinction between model and reality: Simplified Polynesian has syllable structure. All words consist of sequences of CV (consonant-vowel) syllables. A better model is in terms of two random variables C and V.

Per-syllable joint distribution P(C,V) over consonants {p, t, k} and vowels {a, i, u}, with the marginal distributions P(C,·) and P(·,V).
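
A sketch that computes H(C), H(V|C) and H(C,V) for the consonant-vowel model, and with them the mutual information I(C;V). The joint probabilities below are the ones commonly used for this example and should be treated as an assumption, since the table itself did not survive in this transcript; they yield the 2.44 bits quoted in the summary below.

    from math import log2

    # Assumed joint distribution P(C, V) for Simplified Polynesian syllables
    # (consonants p, t, k; vowels a, i, u).
    joint = {
        ('p', 'a'): 1/16, ('p', 'i'): 1/16, ('p', 'u'): 0,
        ('t', 'a'): 3/8,  ('t', 'i'): 3/16, ('t', 'u'): 3/16,
        ('k', 'a'): 1/16, ('k', 'i'): 0,    ('k', 'u'): 1/16,
    }

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    p_c = {}  # marginal over consonants
    p_v = {}  # marginal over vowels
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p
        p_v[v] = p_v.get(v, 0) + p

    H_C, H_V, H_CV = H(p_c.values()), H(p_v.values()), H(joint.values())

    print(round(H_C, 3))               # 1.061 bits
    print(round(H_CV - H_C, 3))        # 1.375 bits = H(V|C), by the chain rule
    print(round(H_CV, 2))              # 2.44 bits per syllable
    print(round(H_C + H_V - H_CV, 3))  # 0.125 bits = I(C;V)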

Short Summary Better understanding of the structure of the language means much less uncertainty: 2.44 bits per syllable under the syllable model vs. 5 bits per syllable (2 letters × 2.5 bits) under the per-letter model. An incorrect model's cross entropy is larger than the entropy of the correct model.

Entropy rate: per-letter/word entropy. The amount of information contained in a message depends on the length of the message, so we normalize by length: (1/n) H(X1n) = −(1/n) Σx1n p(x1n) log2 p(x1n). The entropy of a human language L is the limit of this rate as the message length grows: H(L) = lim n→∞ (1/n) H(X1, X2, …, Xn).

Mutual Information By the chain rule for entropy, we have H(X,Y) = H(X)+ H(Y|X) = H(Y)+H(X|Y) Therefore, H(X)-H(X|Y)=H(Y)-H(Y|X) This difference is called the mutual information between X and Y. It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.

Relationship between the quantities: H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X), so I(X;Y) = H(X) + H(Y) − H(X,Y), where I(X;Y) sits in the overlap of H(X) and H(Y).

Mutual information is 0 only when the two variables are independent; for two dependent variables it grows not only with the degree of dependence but also with the entropy of the variables. Conditional mutual information: I(X;Y|Z) = H(X|Z) − H(X|Y,Z). Chain rule for mutual information: I(X1n;Y) = Σi=1..n I(Xi;Y|X1, …, Xi−1).

Pointwise Mutual Information Defined between two particular outcomes rather than between random variables: I(x, y) = log2 ( p(x,y) / (p(x) p(y)) ). Applications: clustering words, word sense disambiguation.

Clustering by Next Word (Brown, et al., 1992) 1. Each word wi was characterized by the word that immediately followed it: c(wi) collects, over all wj, the number of times … wi wj … occurs in the corpus. 2. Define the distance measure on such vectors using mutual information I(x, y), the amount of information one outcome gives us about the other: I(x, y) = (−log P(x)) − (−log P(x|y)) = log ( P(x|y) / P(x) ) = log ( P(x,y) / (P(x) P(y)) ), i.e., the uncertainty of x minus the uncertainty of x once y is known.

Example: how much information does the word “pancake” give us about the following word “syrup”?
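
A sketch of pointwise mutual information estimated from bigram counts. The counts below are invented purely for illustration (hypothetical, not taken from any corpus).

    from math import log2

    # Hypothetical corpus statistics (for illustration only).
    N = 1_000_000    # total bigram tokens
    count_x = 500    # occurrences of "pancake"
    count_y = 800    # occurrences of "syrup"
    count_xy = 60    # occurrences of the bigram "pancake syrup"

    p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N

    pmi = log2(p_xy / (p_x * p_y))
    print(pmi)  # ~7.2 bits: "pancake" makes "syrup" far more likely than chance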

Physical meaning of MI (1) wi and wj have no particular relation to each other: P(wj|wi) = P(wj), so I(x; y) ≈ 0; x and y give no information about each other.

Physical meaning of MI (2) wi and wj are perfectly coordinated: P(wj|wi)/P(wj) is a very large number, so I(x; y) >> 0; once we know y, we gain a great deal of information about x.

Physical meaning of MI (3) wi and wj are negatively correlated: the ratio is a very small positive number, so I(x; y) << 0; the two words tend never to occur together.

The Noisy Channel Model Assuming that you want to communicate messages over a channel of restricted capacity, optimize (in terms of throughput and accuracy) the communication in the presence of noise in the channel. A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions: C = max p(X) I(X;Y).
W (message from a finite alphabet) → Encoder → X (input to channel) → noisy channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)

A binary symmetric channel: a 1 or 0 in the input gets flipped on transmission with probability p. Here I(X;Y) = H(Y) − H(Y|X) = H(Y) − H(p), where H(p) = −p log2 p − (1−p) log2 (1−p) is the binary entropy. The channel capacity is 1 bit only if the entropy H(p) is 0, i.e., if p = 0 the channel reliably transmits a 0 as 0 and a 1 as 1, and if p = 1 it always flips bits. The channel capacity is 0 when both 0s and 1s are transmitted with equal probability as 0s and 1s (i.e., p = 1/2): a completely noisy binary channel.
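
A sketch of the binary symmetric channel's capacity, C = 1 − H(p), evaluated at the cases discussed above.

    from math import log2

    def binary_entropy(p):
        """H(p) for a Bernoulli(p) variable, in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * log2(p) - (1 - p) * log2(1 - p)

    def bsc_capacity(p):
        """Capacity of a binary symmetric channel with flip probability p."""
        return 1.0 - binary_entropy(p)

    for p in (0.0, 0.1, 0.5, 1.0):
        print(p, bsc_capacity(p))  # capacities: 1.0, ~0.53, 0.0, 1.0 bits per use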

The noisy channel model in linguistics: I → noisy channel p(o|i) → O → Decoder → Î, where p(i) is the language model and p(o|i) is the channel probability. Decoding picks Î = argmax i p(i|o) = argmax i p(i) p(o|i) / p(o) = argmax i p(i) p(o|i).

Speech Recognition Find the sequence of words that maximizes P(words | Speech Signal) = P(words) P(Speech Signal | words) / P(Speech Signal); since P(Speech Signal) is fixed, maximize P(words) P(Speech Signal | words), where P(words) is the language model and P(Speech Signal | words) models the acoustic aspects of the speech signal.

Example: “The big/pig dog …”, assuming P(big | the) = P(pig | the).
P(the big dog) = P(the) P(big | the) P(dog | the big)
P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
Since P(dog | the big) > P(dog | the pig), “the big dog” is selected; in effect, “dog” selects “big”.

Relative Entropy or Kullback-Leibler Divergence For 2 pmfs, p(x) and q(x), their relative entropy is D(p || q) = Σx p(x) log2 ( p(x) / q(x) ). The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are. The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q; no bits are wasted when p = q, since then D(p || q) = 0.
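
A small sketch of D(p || q) for the two-coin distribution p against a uniform model q, checking that no bits are wasted when the model is right. Note that D(p || q) ≠ D(q || p) in general, so relative entropy is not a true distance.

    from math import log2

    def kl(p, q):
        """D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.25, 0.50, 0.25]  # number of heads in two fair coin tosses
    q = [1/3, 1/3, 1/3]     # a (wrong) uniform model of the same outcomes

    print(kl(p, p))  # 0.0 bits wasted when the model is right
    print(kl(p, q))  # ~0.085 bits wasted per outcome with the uniform model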

Application: measuring selectional preferences in selection. Mutual information is a measure of how far a joint distribution is from independence: I(X;Y) = D( p(x,y) || p(x) p(y) ). Conditional relative entropy: D( p(y|x) || q(y|x) ) = Σx p(x) Σy p(y|x) log2 ( p(y|x) / q(y|x) ). Chain rule for relative entropy: D( p(x,y) || q(x,y) ) = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) ). (Recall the definition of I(X;Y) above.)

The Relation to Language: Cross-Entropy Entropy can be thought of as a matter of how surprised we will be to see the next word given previous words we already saw. The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by H(X, q) = H(X) + D(p || q) = −Σx p(x) log2 q(x). Cross-entropy can help us find out what our average surprise for the next word is.

Cross Entropy The cross entropy of a language L = (Xi) ~ p(x) and a model m is H(L, m) = −lim n→∞ (1/n) Σx1n p(x1n) log2 m(x1n). If the language is “nice” (stationary and ergodic), this equals −lim n→∞ (1/n) log2 m(x1n), so it can be estimated from a single large body of utterances.

How much good does the approximate model do? For any approximate model m, H(L) ≤ H(L, m): the cross entropy of the approximate model is never below the entropy of the correct model.

Proof sketch (using ln y ≤ y − 1): Σx p(x) log2 ( q(x)/p(x) ) ≤ (1/ln 2) Σx p(x) ( q(x)/p(x) − 1 ) = (1/ln 2) ( Σx q(x) − Σx p(x) ) = 0, hence (1) −Σx p(x) log2 p(x) ≤ (2) −Σx p(x) log2 q(x): the correct model (1) never needs more bits than the approximate model (2).

The cross entropy of a language L given a model M (over “all” English text) would in principle require infinite representative samples of English text; in practice it is approximated on a sufficiently long sample: H(L, M) ≈ −(1/n) log2 M(x1n) for large n.

Cross Entropy as a Model Evaluator Example: find the best model to produce messages of 20 words. The correct probabilistic model:
Message / Probability
M1 / 0.05
M2 / 0.05
M3 / 0.05
M4 / 0.10
M5 / 0.10
M6 / 0.20
M7 / 0.20
M8 / 0.25

Per-word cross entropy of an approximate model, measured on 100 sample messages drawn according to the correct model (each message is independent of the next): M1 ×5, M2 ×5, M3 ×5, M4 ×10, M5 ×10, M6 ×20, M7 ×20, M8 ×25. The per-word cross entropy is −(1/20) (0.05 log P(M1) + 0.05 log P(M2) + 0.05 log P(M3) + 0.10 log P(M4) + 0.10 log P(M5) + 0.20 log P(M6) + 0.20 log P(M7) + 0.25 log P(M8)), where the P(Mi) are the approximate model's probabilities.

The probability the approximate model assigns to the whole test set is P(M1) × … × P(M1) (5 times) × P(M2) × … × P(M2) (5 times) × … × P(M8) × … × P(M8) (25 times), so log(…) = 5 log P(M1) + 5 log P(M2) + 5 log P(M3) + 10 log P(M4) + 10 log P(M5) + 20 log P(M6) + 20 log P(M7) + 25 log P(M8).

Here the per-word cross entropy equals the per-word entropy of the correct model because the test suite of 100 examples was exactly indicative of the probabilistic model. Measuring the cross entropy of a model in this way is meaningful only if the test sequence has not been used by the model builder: closed test (inside test) vs. open test (outside test).
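
A sketch of this per-word cross-entropy computation. The message probabilities are the reconstructed ones above; evaluating the correct model on its own perfectly representative sample gives the per-word entropy itself, while a worse model scores higher.

    from math import log2

    words_per_message = 20
    # Correct model over the eight 20-word messages (as reconstructed above).
    p = {'M1': 0.05, 'M2': 0.05, 'M3': 0.05, 'M4': 0.10,
         'M5': 0.10, 'M6': 0.20, 'M7': 0.20, 'M8': 0.25}
    # Test set: 100 messages whose relative frequencies exactly match p.
    counts = {m: round(100 * prob) for m, prob in p.items()}

    def per_word_cross_entropy(model):
        total_logprob = sum(c * log2(model[m]) for m, c in counts.items())
        total_words = sum(counts.values()) * words_per_message
        return -total_logprob / total_words

    print(per_word_cross_entropy(p))        # ~0.137 bits/word (the entropy itself)
    uniform = {m: 1/8 for m in p}           # a worse, uniform model
    print(per_word_cross_entropy(uniform))  # 0.15 bits/word: strictly larger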

The Composition of Brown and LOB Corpora Brown Corpus: –Brown University Standard Corpus of Present-day American English. LOB Corpus: –Lancaster/Oslo-Bergen Corpus of British English.

Text Category (Number of Texts: Brown / LOB)
A Press: reportage
B Press: editorial
C Press: reviews
D Religion
E Skills and hobbies
F Popular lore
G Belles lettres, biography, memoirs, etc.
H Miscellaneous (mainly government documents)
J Learned (including science and technology)
K General fiction
L Mystery and detective fiction
M Science fiction (6 / 6)
N Adventure and western fiction
P Romance and love story
R Humour (9 / 9)

The Entropy of English We can model English using n-gram models (also known as Markov chains). These models assume limited memory, i.e., we assume that the next word depends only on the previous k words (a kth order Markov approximation).

P(w1,n) is approximated as Πi P(wi | wi−1) by a bigram model, and as Πi P(wi | wi−2, wi−1) by a trigram model.
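
A minimal sketch of a bigram model estimated by relative frequency (MLE) from a toy corpus, and the per-word cross entropy it assigns to a short test string. The toy corpus is invented for illustration, and a real estimate would need smoothing for unseen bigrams.

    from math import log2
    from collections import Counter

    corpus = "the dog barks the dog runs the cat runs".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_bigram(w, prev):
        """MLE estimate P(w | prev) = count(prev, w) / count(prev)."""
        return bigrams[(prev, w)] / unigrams[prev]

    test = "the dog runs".split()
    logprob = sum(log2(p_bigram(w, prev)) for prev, w in zip(test, test[1:]))
    print(-logprob / (len(test) - 1))  # per-word cross entropy over the test bigrams, in bits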

The Entropy of English What is the Entropy of English?

Perplexity A measure related to the notion of cross-entropy and used in the speech recognition community is called the perplexity: perplexity(x1n, m) = 2^H(x1n, m) = m(x1n)^(−1/n). A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
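
A last small sketch converting a per-word cross entropy into a perplexity; for example, 3 bits per word corresponds to guessing among 8 equiprobable choices.

    def perplexity(cross_entropy_bits):
        """Perplexity = 2 ** (per-word cross entropy in bits)."""
        return 2 ** cross_entropy_bits

    print(perplexity(3.0))    # 8.0: like guessing among 8 equiprobable choices
    print(perplexity(1.585))  # ~3.0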