
1 COMP 791A: Statistical Language Processing Mathematical Essentials Chap. 2

2 Motivations Statistical NLP aims to perform statistical inference for the field of natural language Statistical inference consists of:  taking some data (generated in accordance with some unknown probability distribution)  then making some inference about this distribution Ex. of statistical inference: language modeling  how to predict the next word given the previous words  to do this, we need a model of the language  probability theory helps us find such a model

3 Notions of Probability Theory Probability theory  deals with predicting how likely it is that something will happen Experiment (or trial)  the process by which an observation is made  Ex. tossing a coin twice

4 Sample Spaces and events Sample space Ω :  set of all possible basic outcomes of an experiment Coin toss: Ω = {head, tail} Tossing a coin twice: Ω = {HH, HT, TH, TT} Uttering a word: |Ω| = vocabulary size  Every observation (element in Ω ) is a basic outcome or sample point An event A is a set of basic outcomes with A ⊆ Ω  Ω is then the certain event  Ø is the impossible (or null) event Example - rolling a die:  Sample space Ω = {1, 2, 3, 4, 5, 6}  Event A that an even number occurs: A = {2, 4, 6}

5 Events and Probability The probability of an event A is denoted p(A)  also called the prior probability  i.e. the probability before we consider any additional knowledge Example: experiment of tossing a coin 3 times  Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}  events with two or more tails: A = {HTT, THT, TTH, TTT} P(A) = |A|/|Ω| = ½ (assuming uniform distribution)  events with all heads: A = {HHH} P(A) = |A|/|Ω| = ⅛
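A minimal Python sketch (not from the slides) that reproduces these two probabilities by enumerating Ω explicitly:

```python
# Enumerate the sample space of three coin tosses and compute event
# probabilities under a uniform distribution.
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))              # {HHH, HHT, ..., TTT}, |omega| = 8

two_or_more_tails = [w for w in omega if w.count("T") >= 2]
all_heads = [w for w in omega if w.count("H") == 3]

print(Fraction(len(two_or_more_tails), len(omega)))  # 1/2
print(Fraction(len(all_heads), len(omega)))          # 1/8
```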

6 Probability Properties A probability function P (or probability distribution):  distributes a probability mass of 1 over the sample space Ω; every event gets a probability in [0,1]  P(Ω) = 1  For disjoint events A_i (i.e. A_i ∩ A_j = Ø for all i ≠ j): P(∪_i A_i) = Σ_i P(A_i) Immediate consequences:  P(Ø) = 0  P(Ā) = 1 - P(A)  A ⊆ B ⟹ P(A) ≤ P(B)  Σ_{a∈Ω} P(a) = 1

7 Joint probability Joint probability of A and B:  P(A,B) = P(A ∩ B) (the shaded overlap of A and B in the Venn diagram of Ω on the slide)

8 Conditional probability Prior (or unconditional) probability  Probability of an event before any evidence is obtained  e.g. P(A) = 0.1, as in P(rain today) = 0.1  i.e. your belief about A given that you have no evidence Posterior (or conditional) probability  Probability of an event given that all we know is B (some evidence)  e.g. P(A|B) = 0.8, as in P(rain today | cloudy) = 0.8  i.e. your belief about A given that all you know is B

9 Conditional probability (con’t) P(A|B) = P(A,B) / P(B) (illustrated on the slide with a Venn diagram of A and B inside Ω)

10 Chain rule With 3 events, the probability that A, B and C occur is:  the probability that A occurs  times the probability that B occurs, assuming that A occurred  times the probability that C occurs, assuming that A and B have occurred With multiple events, we can generalize to the chain rule: P(A_1, A_2, A_3, A_4, ..., A_n) = P(∩_i A_i) = P(A_1) × P(A_2|A_1) × P(A_3|A_1,A_2) × ... × P(A_n|A_1,A_2,A_3,…,A_n-1) (important to NLP)
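A small sketch of the chain rule on a toy word sequence; the corpus size and counts below are made up purely for illustration:

```python
# P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
corpus_size = 1000                          # hypothetical number of tokens/positions
count = {
    ("the",): 600,                          # C(the)
    ("the", "cat"): 60,                     # C(the cat)
    ("the", "cat", "sat"): 12,              # C(the cat sat)
}

p_w1 = count[("the",)] / corpus_size                                     # P(the)
p_w2_given_w1 = count[("the", "cat")] / count[("the",)]                  # P(cat | the)
p_w3_given_w1w2 = count[("the", "cat", "sat")] / count[("the", "cat")]   # P(sat | the, cat)

joint = p_w1 * p_w2_given_w1 * p_w3_given_w1w2
print(joint)   # ~0.012 = 12/1000, i.e. the counts telescope back to C(the cat sat)/corpus_size
```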

11 Bayes’ theorem P(B|A) = P(A|B) × P(B) / P(A)

12 So? We typically want to know: P(Cause | Effect) ex: P(Disease | Symptoms) ex: P(linguistic phenomenon | linguistic observations) But this information is hard to gather However P(Effect | Cause) is easier to gather (from training data) So we use Bayes’ theorem: P(Cause | Effect) = P(Effect | Cause) × P(Cause) / P(Effect)

13 Example Rare syntactic construction occurs in 1/100,000 sentences A system identifies sentences with such a construction, but it is not perfect  If sentence has the construction --> system identifies it 95% of the time  If sentence does not have the construction --> system says it does 0.5% of the time Question:  if the system says that sentence S has the construction… what is the probability that it is right?

14 Example (con’t) What is P(sentence has the construction | the system says yes)? Let:  cons = sentence has the construction  yes = system says yes  not_cons = sentence does not have the construction We have:  P(cons) = 1/100,000 = 0.00001  P(yes | cons) = 95% = 0.95  P(yes | not_cons) = 0.5% = 0.005 P(yes) = ? Using P(B) = P(B|A) P(A) + P(B|Ā) P(Ā): P(yes) = P(yes | cons) × P(cons) + P(yes | not_cons) × P(not_cons) = 0.95 × 0.00001 + 0.005 × 0.99999 ≈ 0.005
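The same computation as a short sketch (values taken from the slide):

```python
# Bayes' theorem on the rare-construction example.
p_cons = 1 / 100_000           # P(cons)
p_yes_given_cons = 0.95        # P(yes | cons)
p_yes_given_not = 0.005        # P(yes | not_cons)

p_yes = p_yes_given_cons * p_cons + p_yes_given_not * (1 - p_cons)
p_cons_given_yes = p_yes_given_cons * p_cons / p_yes    # Bayes' theorem

print(p_yes)              # about 0.005
print(p_cons_given_yes)   # about 0.0019  (roughly 1 in 500)
```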

15 Example (con’t) So: P(cons | yes) = P(yes | cons) × P(cons) / P(yes) = 0.95 × 0.00001 / 0.005 ≈ 0.002 So in only about 1 sentence out of 500 where the system says yes is it actually right!

16 Statistical Independence vs. Statistical Dependence How likely are we to have a head in a coin toss, given that it is raining today?  A: having a head in a coin toss  B: raining today  Some variables are independent… How likely is the word “ambulance” to appear, given that we’ve seen “car accident”?  Words in text are not independent

17 Independent events Two events A and B are independent:  if the occurrence of one of them does not influence the occurrence of the other  i.e. A is independent of B if P(A) = P(A|B) If A and B are independent, then:  P(A,B) = P(A|B) x P(B) (by chain rule) = P(A) x P(B) (by independence) In NLP, we often assume independence of variables

18 Bayes’ Theorem revisited (a golden rule in statistical NLP) If we are interested in which event B is most likely to occur given an observation A, we can choose the B with the largest P(B|A) P(A):  is a normalization constant (to ensure the result lies in 0…1)  is the same for all possible Bs (and is hard to gather anyway)  so we can drop it So Bayesian reasoning picks: B* = argmax_B P(A|B) × P(B) In NLP: the most likely hypothesis = argmax over hypotheses of P(observations | hypothesis) × P(hypothesis)
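A sketch of this argmax decision with made-up priors and likelihoods (the event names and numbers are hypothetical, only the dropping of P(A) is the point):

```python
# Choose the most likely event B given observation A, without normalizing by P(A).
priors = {"noun": 0.5, "verb": 0.3, "adj": 0.2}            # P(B), hypothetical
likelihood = {"noun": 0.01, "verb": 0.04, "adj": 0.005}    # P(A | B), hypothetical

best = max(priors, key=lambda b: likelihood[b] * priors[b])
print(best)    # "verb": 0.04 * 0.3 = 0.012 beats 0.005 (noun) and 0.001 (adj)
```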

19 Application of Bayesian Reasoning Diagnostic systems:  P(Disease | Symptoms) Categorization:  P(Category of object| Features of object) Text classification: P(sports-news | words in text) Character recognition: P(character | bitmap) Speech recognition: P(words | signals) Image processing: P(face-person | image) …

20 Random Variables A random variable X is a function  X: Ω → Rⁿ (typically n = 1) Example – tossing 2 dice  Ω = {(1,1), (1,2), (1,3), … (6,6)}  X: Ω → R_X assigns to each point in Ω the sum of the 2 dice  X(1,1) = 2, X(1,2) = 3, … X(6,6) = 12  R_X = {2,3,4,5,6,7,8,9,10,11,12} A random variable X is discrete if:  X: Ω → S where S is a countable subset of R  In particular, if X: Ω → {0,1} then X is called a Bernoulli trial A random variable X is continuous if:  X: Ω → S where S is a continuum of numbers

21 Probability distribution of an RV Let X be a finite random variable  R_X = {x_1, x_2, x_3, … x_n} A probability mass function f gives the probability of X at the different points in R_X  f(x_k) = P(X = x_k) = p(x_k)  p(x_k) ≥ 0  Σ_k p(x_k) = 1 In table form: X: x_1, x_2, x_3, …, x_n p(X): p(x_1), p(x_2), p(x_3), …, p(x_n)

22 Example: Tossing 2 dice X = sum of the faces  X: Ω → S  Ω = {(1,1), (1,2), (1,3), …, (6,6)}  S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}  p(X): 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 X = maximum of the faces  X: Ω → S  Ω = {(1,1), (1,2), (1,3), …, (6,6)}  S = {1, 2, 3, 4, 5, 6}  p(X): 1/36, 3/36, 5/36, 7/36, 9/36, 11/36
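A sketch that derives both probability mass functions by enumerating the 36 equiprobable outcomes:

```python
# pmf of SUM and of MAX for two fair dice, by enumeration.
from itertools import product
from collections import Counter
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))     # 36 equiprobable outcomes

pmf_sum = Counter(a + b for a, b in omega)
pmf_max = Counter(max(a, b) for a, b in omega)

print({x: Fraction(c, 36) for x, c in sorted(pmf_sum.items())})
# {2: 1/36, 3: 2/36 (=1/18), ..., 7: 6/36 (=1/6), ..., 12: 1/36}
print({x: Fraction(c, 36) for x, c in sorted(pmf_max.items())})
# {1: 1/36, 2: 3/36, 3: 5/36, 4: 7/36, 5: 9/36, 6: 11/36}
```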

23 Expectation The expectation (μ) is the mean (or average or expected value) of a random variable X: E(X) = Σ_k x_k p(x_k) Intuitively, it is:  the weighted average of the outcomes  where each outcome is weighted by its probability ex: the average sum of the dice If X and Y are 2 random variables on the same sample space, then:  E(X+Y) = E(X) + E(Y)

24 Example What is the expectation of the sum of the faces on two dice? (the average sum of the dice)  If all sums were equiprobable, it would be (2+3+…+12)/11 = 7  But the sums are not equiprobable; using the pmf above: E(SUM) = Σ_i x_i p(SUM = x_i) = 2(1/36) + 3(2/36) + … + 7(6/36) + … + 12(1/36) = 7  Or more simply: E(SUM) = E(Die1 + Die2) = E(Die1) + E(Die2) Each face on 1 die is equiprobable, so E(Die) = (1+2+3+4+5+6)/6 = 3.5 E(SUM) = 3.5 + 3.5 = 7
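A sketch that computes E(SUM) both ways, as a direct weighted average and via linearity of expectation:

```python
# Expectation of the sum of two fair dice.
from itertools import product
from fractions import Fraction

sums = [a + b for a, b in product(range(1, 7), repeat=2)]
e_sum = sum(Fraction(s, 36) for s in sums)          # each outcome has weight 1/36
print(e_sum)                                        # 7

e_die = Fraction(sum(range(1, 7)), 6)               # E(Die) = 7/2 = 3.5
print(e_die + e_die)                                # 7 = E(Die1) + E(Die2)
```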

25 Variance and standard deviation The variance of a random variable X is a measure of whether the values of the RV tend to be consistent over trials or to vary a lot: Var(X) = σ² = E((X - E(X))²) = E(X²) - E(X)² The standard deviation of X is the square root of the variance Both measure the weighted “spread” of the values x_i around the mean E(X)

26 Example What is the variance of the sum of the faces on two dice? Using the same pmf as before (p(SUM = x_i) = 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 for sums 2 to 12): Var(SUM) = E(SUM²) - E(SUM)² = 35/6 ≈ 5.83
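A sketch of the computation, using Var(X) = E(X²) - E(X)²:

```python
# Variance of the sum of two fair dice.
from itertools import product
from fractions import Fraction

sums = [a + b for a, b in product(range(1, 7), repeat=2)]
mean = Fraction(sum(sums), 36)                                # 7
var = Fraction(sum(s * s for s in sums), 36) - mean ** 2      # E(SUM^2) - E(SUM)^2
print(var, float(var))                                        # 35/6 ~ 5.83
```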

27 Back to NLP What is the probability that someone says the sentence: “Mary is reading a book.”? In general, for language events, the probability function P is unknown (the table of language events x_1, x_2, x_3, … has unknown probabilities p = ?, ?, ?) We need to estimate P (or a model M of the language) by looking at a sample of data (training set) 2 approaches:  Frequentist statistics  Bayesian statistics (which we will not cover)

28 Frequentist Statistics To estimate P, we use the relative frequency of the outcome in a sample of data  i.e. the proportion of times a certain outcome o occurs: P(o) ≈ C(o)/N  where C(o) is the number of times o occurs in N trials  for N → ∞ the relative frequency stabilizes to some number: the estimate of the probability function Two approaches to estimate the probability function:  Parametric (assuming a known distribution)  Non-parametric (distribution free), which we will not cover in detail
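A minimal sketch of relative-frequency estimation on a toy token list (the data is invented for illustration):

```python
# Relative-frequency estimate: P(o) ~ C(o) / N
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"]   # toy "training data"
counts = Counter(tokens)
N = len(tokens)

p_hat = {w: c / N for w, c in counts.items()}
print(p_hat["the"])     # 3/8 = 0.375
```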

29 Parametric Methods Assume that some phenomenon in language is modeled by a well-known family of distributions (ex. binomial, normal) The advantages:  we have an explicit probabilistic model of the process by which the data was generated  determining a particular probability distribution within the family requires only the specification of a few parameters (so, less training data) But:  Our assumption on the probability distribution may be wrong…

30 Non-Parametric Methods No assumption is made about the underlying distribution of the data For ex, we can simply estimate P empirically by counting a large number of random events But: because we use less prior information (no assumption on the distribution), more training data is needed

31 Standard Distributions Many applications give rise to the same basic form of a probability distribution - but with different parameters. Discrete Distributions:  the binomial distribution (2 outcomes)  the multinomial distribution (more than 2 outcomes)  … Continuous Distributions:  the normal distribution (Gaussian)  …

32 Binomial Distribution (discrete) Built from Bernoulli trials (a Bernoulli distribution is the special case with a single trial) Each trial has only two outcomes (success or failure) The probability of success is the same for each trial The trials are independent There are a fixed number of trials Distribution has 2 parameters:  nb of trials n  probability of success p in 1 trial Ex: Flipping a coin 10 times and counting the number of heads that occur  Can only get a head or a tail (2 outcomes)  For each flip there is the same chance of getting a head (same prob.)  The coin flips do not affect each other (independence)  There are 10 coin flips (n = 10)

33 Examples (two plotted binomial distributions)  b(n,p) = B(10, 0.7): nb trials = 10, prob(head) = 0.7  b(n,p) = B(10, 0.1): nb trials = 10, prob(head) = 0.1

34 Binomial probability function Let:  n = nb of trials  p = probability of success in any trial  r = nb of successes out of the n trials Then: P(X = r) = C(n,r) × p^r × (1-p)^(n-r)  C(n,r) is the number of ways of having r successes in n trials  p^r is the probability of having r successes  (1-p)^(n-r) is the probability of having n-r failures

35 Example What is the probability of rolling a number higher than 4 in exactly 2 of 3 dice rolls?  p = probability of success in 1 trial = 2/6 = 1/3 (rolling a 5 or a 6)  n = 3 trials, r = 2 successes  The 3 ways of placing the 2 successes among the 1st, 2nd and 3rd rolls each have probability (1/3)(1/3)(2/3) = 2/27  P(X = 2) = C(3,2) × (1/3)² × (2/3) = 3 × 2/27 = 2/9 ≈ 0.22
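A sketch of the binomial probability function applied to this example (assuming "success" means rolling a 5 or a 6, so p = 1/3):

```python
# Binomial pmf: P(X = r) = C(n, r) * p^r * (1 - p)^(n - r)
from math import comb

def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binom_pmf(2, 3, 1/3))                            # 3 * (1/3)^2 * (2/3) = 2/9 ~ 0.222
print(sum(binom_pmf(r, 3, 1/3) for r in range(4)))     # ~1.0: the pmf sums to 1
```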

36 Properties of binomial distribution B(n,p)  Mean E(X) = μ = np Ex:  Flipping a coin 10 times  E(head) = 10 × ½ = 5  Variance σ² = np(1-p) Ex:  Flipping a coin 10 times  σ² = 10 × ½ × (1 - ½) = 2.5

37 Binomial distribution in NLP Works well for tossing a coin But, in NLP we do not always have complete independence from one trial to the next  Consecutive sentences are not independent  Consecutive POS tags are not independent So, binomial distribution in NLP is an approximation (but a fair one)  When we count how many times something is present or absent  And we ignore the possibility of dependencies between one trial and the next  Then, we implicitly use the binomial distribution Ex:  Count how many sentences contain the word “the”  Assume each sentence is independent  Count how many times a verb is used as transitive  Assume each occurrence of the verb is independent of the others…

38 Normal Distribution (continuous) Also known as Gaussian distribution (or Bell curve); used to model a random variable X on an infinite sample space (ex. height, length…) X is a continuous random variable if there is a function f(x) defined on the real line R = (-∞, +∞) such that:  f is non-negative: f(x) ≥ 0  The area under the curve of f is one  The probability that X lies in the interval [a,b] is equal to the area under f between x=a and x=b: P(a ≤ X ≤ b) = ∫_a^b f(x) dx

39 Normal Distribution (con’t) The normal distribution n(μ,σ) has 2 parameters:  mean μ  standard deviation σ Its density is f(x) = 1/(σ√(2π)) · exp(-(x-μ)²/(2σ²)) Plotted examples:  n(0,1): μ = 0, σ = 1  n(1.5,2): μ = 1.5, σ = 2

40 The standard normal distribution If μ = 0 and σ = 1, the distribution is called the standard normal distribution, usually denoted Z
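A sketch of the n(μ,σ) density and the standard-normal CDF, using only the Python standard library:

```python
# Normal density and standard-normal CDF (via the error function).
from math import sqrt, pi, exp, erf

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def standard_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

print(normal_pdf(0))                 # ~0.3989, peak of the standard normal
print(standard_normal_cdf(1.96))     # ~0.975
```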

41 Frequentist vs Bayesian Statistics Assume we toss a coin 10 times, and get 8 heads:  Frequentists will conclude (from the observations) that a head comes up with probability 8/10 -- the Maximum Likelihood Estimate (MLE)  but if we look at the coin, we would be reluctant to accept 8/10… because we have prior beliefs  Bayesian statisticians will use an a-priori probability distribution (their belief) and will update the beliefs when new evidence comes in (a sequence of observations) by calculating the Maximum A Posteriori (MAP) distribution. The MAP probability becomes the new prior probability and the process repeats on each new observation

42 Essential Information Theory Developed by Shannon in the 40s To maximize the amount of information that can be transmitted over an imperfect communication channel (the noisy channel) Notion of entropy (informational content):  How informative is a piece of information? ex. How informative is the answer to a question If you already have a good guess about the answer, the actual answer is less informative… low entropy

43 Entropy - intuition Ex: Betting 1$ on the flip of a coin  If the coin is fair: Expected gain is ½(+1) + ½(-1) = 0$ So you’d be willing to pay up to 1$ for advance information (1$ - 0$ average win)  If the coin is rigged: P(head) = 0.99, P(tail) = 0.01, assuming you bet on head (!) Expected gain is 0.99(+1) + 0.01(-1) = 0.98$ So you’d be willing to pay up to 2¢ for advance information (1$ - 0.98$ average win)  The value of advance information for the fair coin (1$) is greater than for the rigged coin (0.02$): the fair coin has higher entropy

44 Entropy Let X be a discrete RV Entropy (or self-information) measures the amount of information in a RV: H(X) = -Σ_x p(x) log₂ p(x)  the average uncertainty of a RV  the average length of the message needed to transmit an outcome x_i of that variable  the size of the search space consisting of the possible values of a RV and its associated probabilities It is measured in bits Properties:  H(X) ≥ 0  If H(X) = 0 then the RV provides no new information

45 Example: The coin flip Fair coin: H(X) = -(½ log₂ ½ + ½ log₂ ½) = 1 bit Rigged coin (P(head) = 0.99): H(X) = -(0.99 log₂ 0.99 + 0.01 log₂ 0.01) ≈ 0.08 bits (The slide plots entropy as a function of P(head): it peaks at 1 bit when P(head) = 0.5 and falls to 0 at P(head) = 0 or 1.)
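A sketch of the binary entropy computation behind this plot:

```python
# Entropy of a coin with P(head) = p:  H(p) = -p log2 p - (1-p) log2 (1-p)
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))    # 1.0 bit  (fair coin)
print(binary_entropy(0.99))   # ~0.08 bits (rigged coin)
```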

46 Example: Simplified Polynesian In simplified Polynesian, we have 6 letters with frequencies: p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8  The per-letter entropy is H = -Σ p(letter) log₂ p(letter) = 2.5 bits  We can design a code that on average takes 2.5 bits to transmit a letter (shorter codes for the more frequent letters t and a, longer codes for the rest)  Can be viewed as the average nb of yes/no questions you need to ask to identify the outcome (ex: is it a ‘t’? Is it a ‘p’?)
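A sketch that checks the 2.5-bit figure from the distribution above:

```python
# Per-letter entropy of the simplified Polynesian letter distribution.
from math import log2
from fractions import Fraction

p = {"p": Fraction(1, 8), "t": Fraction(1, 4), "k": Fraction(1, 8),
     "a": Fraction(1, 4), "i": Fraction(1, 8), "u": Fraction(1, 8)}

H = -sum(float(q) * log2(float(q)) for q in p.values())
print(H)     # 2.5 bits per letter
```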

47 Entropy in NLP Entropy is a measure of uncertainty  The more we know about something, the lower its entropy So if a language model captures more of the structure of the language, then its entropy should be lower In NLP, language models are compared by using their entropy  ex: given 2 grammars and a corpus, we use entropy to determine which grammar better matches the corpus