Introduction to Natural Language Processing: Probability. Jan Hajič, JHU CS 600.465 (9/14/1999). AI-Lab, 2003.09.03.

Presentation transcript:


Experiments & Sample Spaces
Experiment, process, test, ... Set of possible basic outcomes: sample space Ω.
– coin toss (Ω = {head, tail}), die (Ω = {1..6})
– yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
– lottery (|Ω| very large)
– number of traffic accidents somewhere per year (Ω = N, the natural numbers)
– spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over that alphabet
– missing word (|Ω| ≅ vocabulary size)

Events
An event A is a set of basic outcomes. Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space).
– Ω is then the certain event, ∅ is the impossible event.
Example:
– experiment: toss a coin three times; Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– count cases with exactly two tails: then A = {HTT, THT, TTH}
– all heads: A = {HHH}
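To make the sets concrete, here is a small illustrative sketch (plain Python; the variable names are chosen here, not taken from the slides) that enumerates Ω for the three-toss experiment and builds the two events from the example:

```python
from itertools import product

# Sample space for three coin tosses: all strings over {H, T} of length 3.
omega = {''.join(toss) for toss in product('HT', repeat=3)}
print(sorted(omega))    # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']

# Event A: exactly two tails.
A = {o for o in omega if o.count('T') == 2}
print(sorted(A))        # ['HTT', 'THT', 'TTH']

# Event "all heads".
all_heads = {o for o in omega if o == 'HHH'}
print(all_heads)        # {'HHH'}
```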

Probability
Repeat the experiment many times and record how many times a given event A occurred (the "count" c1). Do this whole series many times; remember all the ci's. Observation: if repeated really many times, the ratios ci/Ti (where Ti is the number of experiments run in the i-th series) are close to some (unknown but) constant value. Call this constant the probability of A. Notation: p(A).
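As an illustration of this frequency idea (not part of the original slides), the sketch below simulates several series of the three-toss experiment and prints the ratios ci/Ti, which stabilize near a constant:

```python
import random

def run_series(num_runs, seed):
    """One series: repeat the three-toss experiment num_runs times,
    counting how often event A (exactly two tails) occurs."""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_runs):
        toss = ''.join(rng.choice('HT') for _ in range(3))
        if toss.count('T') == 2:
            count += 1
    return count

T = 1000
for i in range(5):
    c_i = run_series(T, seed=i)
    print(f"series {i}: c_i/T_i = {c_i}/{T} = {c_i / T:.3f}")
# The printed ratios hover around 0.375, the constant we call p(A).
```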

Estimating Probability
Remember: ... close to an unknown constant. We can only estimate it:
– from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): set p(A) = c1/T1.
– otherwise, take the weighted average of all ci/Ti (or, if the data allow, simply treat the set of series as a single long series). This is the best estimate.

Example
Recall our example:
– experiment: three coin tosses; Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– count cases with exactly two tails: A = {HTT, THT, TTH}
Run the experiment 1000 times (i.e. 3000 tosses).
Counted: 386 cases with two tails (HTT, THT, or TTH); estimate: p(A) = 386 / 1000 = .386.
Run again: 373, 399, 382, 355, 372, 406, 359.
– p(A) = .379 (weighted average), or simply 3032 / 8000.
Uniform distribution assumption: p(A) = 3/8 = .375.
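The arithmetic can be checked directly; this snippet (added here for illustration) pools the eight reported counts into one long series, which is the weighted-average estimate mentioned above:

```python
# Counts of the two-tails event A in eight series of 1000 runs each.
counts = [386, 373, 399, 382, 355, 372, 406, 359]
series_sizes = [1000] * len(counts)

# Weighted average of c_i/T_i == pooled estimate over one long series.
p_hat = sum(counts) / sum(series_sizes)
print(sum(counts), sum(series_sizes), round(p_hat, 3))   # 3032 8000 0.379

# Compare with the uniform-distribution value 3/8.
print(3 / 8)                                             # 0.375
```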

Basic Properties
Basic properties:
– p: 2^Ω → [0, 1]
– p(Ω) = 1
– disjoint events: p(∪i Ai) = Σi p(Ai)
[NB: axiomatic definition of probability: take the above three conditions as axioms.]
Immediate consequences:
– p(∅) = 0
– p(Ā) = 1 − p(A)
– A ⊆ B ⇒ p(A) ≤ p(B)
– Σa∈Ω p(a) = 1
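A quick sanity check of these properties on the small uniform space from the coin example, offered as an illustrative sketch only:

```python
from itertools import product

omega = [''.join(t) for t in product('HT', repeat=3)]
p = {o: 1 / len(omega) for o in omega}        # uniform distribution on Omega

def prob(event):
    """p(A) = sum of p(a) over the basic outcomes a in A."""
    return sum(p[o] for o in event)

A = {o for o in omega if o.count('T') == 2}   # exactly two tails
B = {'HHH'}                                   # all heads (disjoint from A)

assert abs(prob(set(omega)) - 1.0) < 1e-12                 # p(Omega) = 1
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12      # additivity, disjoint events
assert abs(prob(set(omega) - A) - (1 - prob(A))) < 1e-12   # p(not A) = 1 - p(A)
print(prob(A))   # 0.375
```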

Joint and Conditional Probability
p(A, B) = p(A ∩ B)
p(A|B) = p(A, B) / p(B)
– Estimating from counts: p(A|B) = p(A, B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)
(Venn diagram on the slide: A, B, and A ∩ B within Ω.)
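The count-based estimate c(A ∩ B) / c(B) is what one computes from data in practice. A minimal sketch, using an invented toy list of (word, tag) observations and assuming we want p(tag = NN | word = "book"):

```python
# Toy "corpus" (invented for illustration): (word, tag) observations.
corpus = [("book", "NN"), ("book", "VB"), ("book", "NN"),
          ("a", "DT"), ("flight", "NN"), ("book", "NN")]

# Event B: the word is "book"; event A: the tag is NN.
c_B = sum(1 for w, t in corpus if w == "book")
c_A_and_B = sum(1 for w, t in corpus if w == "book" and t == "NN")

# p(A|B) = (c(A ∩ B)/T) / (c(B)/T) = c(A ∩ B) / c(B); the total T cancels.
p_A_given_B = c_A_and_B / c_B
print(p_A_given_B)   # 0.75
```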

Bayes Rule
p(A, B) = p(B, A), since p(A ∩ B) = p(B ∩ A)
– therefore p(A|B) p(B) = p(B|A) p(A), and therefore
p(A|B) = p(B|A) p(A) / p(B) !
(Venn diagram on the slide: A, B, and A ∩ B within Ω.)
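Bayes' rule can be verified numerically on joint counts. In the sketch below the counts are toy numbers invented for illustration; p(A|B) computed from its definition and via p(B|A) p(A) / p(B) come out identical:

```python
# Joint counts for two binary events A and B (toy numbers for illustration).
c = {("A", "B"): 30, ("A", "notB"): 10, ("notA", "B"): 20, ("notA", "notB"): 40}
T = sum(c.values())

p_A = (c[("A", "B")] + c[("A", "notB")]) / T
p_B = (c[("A", "B")] + c[("notA", "B")]) / T
p_A_and_B = c[("A", "B")] / T

direct = p_A_and_B / p_B                    # p(A|B) from its definition
p_B_given_A = p_A_and_B / p_A
via_bayes = p_B_given_A * p_A / p_B         # p(B|A) * p(A) / p(B)
print(direct, via_bayes)                    # both 0.6
```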

Independence
Can we compute p(A, B) from p(A) and p(B)? Recall from the previous foil:
p(A|B) = p(B|A) p(A) / p(B)
p(A|B) p(B) = p(B|A) p(A)
p(A, B) = p(B|A) p(A)
... we're almost there: how does p(B|A) relate to p(B)?
– p(B|A) = p(B) iff A and B are independent.
Example: two coin tosses; the weather today and the weather on March 4th, 1789. Any two events for which p(B|A) = p(B)!
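A hedged illustration with two coin tosses: estimating p(B|A) and p(B) by simulation for A = "first toss is heads" and B = "second toss is heads" gives (nearly) equal values, as independence predicts:

```python
import random

rng = random.Random(0)
N = 100_000
count_A = count_B = count_AB = 0
for _ in range(N):
    first, second = rng.choice('HT'), rng.choice('HT')
    A = (first == 'H')      # event A: first toss is heads
    B = (second == 'H')     # event B: second toss is heads
    count_A += A
    count_B += B
    count_AB += (A and B)

p_B = count_B / N
p_B_given_A = count_AB / count_A
print(round(p_B, 3), round(p_B_given_A, 3))   # both close to 0.5

# Also: p(A,B) ≈ p(A) * p(B) for independent events.
print(round(count_AB / N, 3), round((count_A / N) * p_B, 3))
```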

Chain Rule
p(A1, A2, A3, A4, ..., An) = p(A1|A2, A3, A4, ..., An) × p(A2|A3, A4, ..., An) × p(A3|A4, ..., An) × ... × p(An−1|An) × p(An)
This is a direct consequence of the Bayes rule.
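The decomposition can be checked mechanically on a small joint distribution; the numbers below are invented for illustration, and the helper functions are named here, not in the slides:

```python
from itertools import product

# Toy joint distribution p(A1, A2, A3) over binary variables (numbers invented).
joint = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
         (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marginal(fix):
    """Sum the joint over all outcomes consistent with the fixed positions."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[i] == v for i, v in fix.items()))

def conditional(i, v, given):
    """p(A_i = v | the assignments in `given`)."""
    return marginal({**given, i: v}) / marginal(given)

# Chain rule: p(a1, a2, a3) = p(a1|a2,a3) * p(a2|a3) * p(a3)
a1, a2, a3 = 1, 0, 1
chained = (conditional(0, a1, {1: a2, 2: a3})
           * conditional(1, a2, {2: a3})
           * marginal({2: a3}))
print(round(chained, 6), joint[(a1, a2, a3)])   # both ≈ 0.2
```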

The Golden Rule (of Classic Statistical NLP)
Interested in an event A given B (where it is not easy, practical, or desirable to estimate p(A|B) directly): take the Bayes rule and maximize over all A:
argmaxA p(A|B) = argmaxA p(B|A) p(A) / p(B) = argmaxA p(B|A) p(A) !
... as p(B) is constant when changing A.
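This is the shape of many classic statistical NLP decisions (e.g. noisy-channel-style models). A sketch under invented assumptions: A ranges over candidate tags for an observed word B, with made-up prior p(A) and likelihood p(B|A) tables; we pick the A that maximizes p(B|A) · p(A) and never need p(B):

```python
# Candidate events A (here: tags) with made-up prior and likelihood tables.
prior = {"NN": 0.5, "VB": 0.3, "DT": 0.2}              # p(A)
likelihood = {"NN": 0.02, "VB": 0.05, "DT": 0.0001}    # p(B | A) for the observed B

# argmax_A p(A|B) = argmax_A p(B|A) * p(A); p(B) is the same for every A.
best = max(prior, key=lambda a: likelihood[a] * prior[a])
print(best)   # 'VB' (0.05 * 0.3 = 0.015 beats 0.02 * 0.5 = 0.010)
```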

Random Variables
A random variable is a function X: Ω → Q
– in general Q = R^n, typically R
– it is easier to handle real numbers than real-world events.
A random variable is discrete if Q is countable (hence also if it is finite).
Example: die: natural "numbering" [1, 6]; coin: {0, 1}.
Probability distribution:
– pX(x) = p(X = x) =df p(Ax), where Ax = {a ∈ Ω : X(a) = x}
– often just p(x) if it is clear from context what X is.
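A random variable is literally just a function on Ω. The illustrative sketch below maps each three-toss outcome to its number of tails and derives the induced distribution pX(x) = p(Ax):

```python
from itertools import product
from collections import defaultdict

omega = [''.join(t) for t in product('HT', repeat=3)]
p = {o: 1 / len(omega) for o in omega}     # uniform probability on Omega

def X(outcome):
    """Random variable X: Omega -> {0, 1, 2, 3}, the number of tails."""
    return outcome.count('T')

# p_X(x) = p(A_x) where A_x = {a in Omega : X(a) = x}
p_X = defaultdict(float)
for o in omega:
    p_X[X(o)] += p[o]
print(dict(p_X))   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```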

Expectation; Joint and Conditional Distributions
Expectation is the mean of a random variable (a weighted average):
E(X) = Σx∈X(Ω) x · pX(x)
Example: one six-sided die: 3.5; two dice (sum): 7.
Joint and conditional distribution rules:
– analogous to probabilities of events.
Bayes: pX|Y(x|y) = (notation) pXY(x|y) = (even simpler notation) p(x|y) = p(y|x) · p(x) / p(y)
Chain rule: p(w, x, y, z) = p(z) · p(y|z) · p(x|y, z) · p(w|x, y, z)
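The expectations quoted on the slide (3.5 for one die, 7 for the sum of two dice) can be recomputed in a few lines; a small illustrative sketch:

```python
from itertools import product

# One fair six-sided die: E(X) = sum over x of x * p_X(x).
one_die = {x: 1 / 6 for x in range(1, 7)}
E_one = sum(x * p for x, p in one_die.items())
print(round(E_one, 10))   # 3.5

# Sum of two fair dice: build p_X for X = d1 + d2, then take the mean.
two_dice = {}
for d1, d2 in product(range(1, 7), repeat=2):
    two_dice[d1 + d2] = two_dice.get(d1 + d2, 0) + 1 / 36
E_two = sum(x * p for x, p in two_dice.items())
print(round(E_two, 10))   # 7.0
```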

Standard Distributions
Binomial (discrete):
– outcome: 0 or 1 (thus: binomial)
– make n trials
– interested in the (probability of the) number of successes r.
Must be careful: it's not uniform!
pb(r|n) = (n choose r) / 2^n (for equally likely outcomes)
(n choose r) counts how many possibilities there are for choosing r objects out of n: n! / ((n − r)! r!)
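For the fair-coin case the formula is just C(n, r) / 2^n. A minimal sketch using Python's math.comb (the function name p_binomial_fair is invented here):

```python
from math import comb

def p_binomial_fair(r, n):
    """p_b(r | n) = C(n, r) / 2**n for n trials with equally likely 0/1 outcomes."""
    return comb(n, r) / 2**n

n = 3
for r in range(n + 1):
    print(r, p_binomial_fair(r, n))   # 0.125, 0.375, 0.375, 0.125
print(sum(p_binomial_fair(r, n) for r in range(n + 1)))   # 1.0
```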

Continuous Distributions
The normal distribution ("Gaussian"):
pnorm(x | μ, σ) = e^(−(x − μ)² / (2σ²)) / (σ √(2π))
where:
– μ is the mean (the x-coordinate of the peak; 0 in the plotted example)
– σ is the standard deviation (1 in the plotted example)
Other: hyperbolic, t.
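The density can be coded directly from the formula; a minimal sketch, defaulting to the standard normal (μ = 0, σ = 1):

```python
from math import exp, pi, sqrt

def p_norm(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(round(p_norm(0.0), 4))          # 0.3989, the peak height for sigma = 1
print(round(p_norm(1.0), 4))          # 0.242
print(round(p_norm(2.0, mu=2.0), 4))  # the peak again, now centred at mu = 2
```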