Intro to Probability. Slides from Professor Pan, Yan, SYSU.

Presentation transcript:


Probability Theory. Example of a random experiment: we poll 60 users who are using one of two search engines and record, for each user, which search engine was used and the number of "good hits" it returned. [Figure: scatter plot with one point per user; axes: number of "good hits" returned by the search engine (X) and which of the two search engines was used (Y).]

Probability Theory. Random variables: X and Y are called random variables. Each has its own sample space: S_X = {0,1,2,3,4,5,6,7,8} and S_Y = {1,2}.

Probability Theory. Probability: P(X=i, Y=j) is the probability (relative frequency) of observing X=i and Y=j. P(X,Y) refers to the whole table of probabilities. Properties: 0 ≤ P(X=i, Y=j) ≤ 1 and Σ_i Σ_j P(X=i, Y=j) = 1.

Probability Theory. Marginal probability: P(X=i) is the marginal probability that X=i, i.e., the probability that X=i, ignoring Y.

Probability Theory. Marginal probability: P(X=i) is the marginal probability that X=i, i.e., the probability that X=i, ignoring Y. From the table: P(X=i) = Σ_j P(X=i, Y=j). Note that Σ_i P(X=i) = 1 and Σ_j P(Y=j) = 1. (SUM RULE)
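
The slides work from a table of the 60-user data that is not reproduced here. As a rough illustration only (the numbers below are made up, not the lecture's data), here is a minimal Python sketch of the sum rule applied to a small joint table:

    import numpy as np

    # Hypothetical joint table P(X=i, Y=j): rows index Y in {1,2}, columns index X in {0,...,8}.
    P_XY = np.array([
        [0.00, 0.02, 0.05, 0.08, 0.10, 0.08, 0.05, 0.02, 0.00],   # Y = 1
        [0.02, 0.05, 0.08, 0.10, 0.15, 0.10, 0.05, 0.03, 0.02],   # Y = 2
    ])
    assert np.isclose(P_XY.sum(), 1.0)       # the joint probabilities sum to one

    # Sum rule: marginalize out the other variable.
    P_X = P_XY.sum(axis=0)                   # P(X=i) = sum_j P(X=i, Y=j)
    P_Y = P_XY.sum(axis=1)                   # P(Y=j) = sum_i P(X=i, Y=j)
    print(P_X, P_X.sum())                    # marginal over X, sums to 1
    print(P_Y, P_Y.sum())                    # marginal over Y, sums to 1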

Probability Theory. Conditional probability: P(X=i|Y=j) is the probability that X=i, given that Y=j. From the table: P(X=i|Y=j) = P(X=i, Y=j) / P(Y=j).

Probability Theory. Conditional probability: how about the opposite conditional probability, P(Y=j|X=i)? P(Y=j|X=i) = P(X=i, Y=j) / P(X=i). Note that Σ_j P(Y=j|X=i) = 1.
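
Continuing the same made-up table (not the lecture's data), a short Python sketch of both conditional probabilities; dividing the joint by the appropriate marginal normalizes each row or column so it sums to 1:

    import numpy as np

    P_XY = np.array([                        # same hypothetical joint table as above
        [0.00, 0.02, 0.05, 0.08, 0.10, 0.08, 0.05, 0.02, 0.00],
        [0.02, 0.05, 0.08, 0.10, 0.15, 0.10, 0.05, 0.03, 0.02],
    ])
    P_X = P_XY.sum(axis=0)                   # P(X=i)
    P_Y = P_XY.sum(axis=1)                   # P(Y=j)

    # Conditional probabilities: divide the joint by the marginal being conditioned on.
    P_X_given_Y = P_XY / P_Y[:, None]        # row j holds P(X=i | Y=j)
    P_Y_given_X = P_XY / P_X[None, :]        # column i holds P(Y=j | X=i)
    print(P_X_given_Y.sum(axis=1))           # each row sums to 1
    print(P_Y_given_X.sum(axis=0))           # each column sums to 1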

Summary of types of probability. Joint probability: P(X,Y). Marginal probability (ignore the other variable): P(X) and P(Y). Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X).

Probability Theory. Constructing the joint probability: suppose we know the probability that the user will pick each search engine, P(Y=j), and, for each search engine, the probability of each number of good hits, P(X=i|Y=j). Can we construct the joint probability P(X=i, Y=j)? Yes. Rearranging P(X=i|Y=j) = P(X=i, Y=j) / P(Y=j), we get P(X=i, Y=j) = P(X=i|Y=j) P(Y=j). (PRODUCT RULE)
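
A minimal Python sketch of this construction, again with hypothetical numbers for P(Y) and P(X|Y) rather than the lecture's data:

    import numpy as np

    P_Y = np.array([0.4, 0.6])               # hypothetical: prob. of picking each search engine
    P_X_given_Y = np.array([                 # hypothetical: P(X=i | Y=j), one row per engine
        [0.00, 0.05, 0.125, 0.20, 0.25, 0.20, 0.125, 0.05, 0.00],
        [0.30, 0.25, 0.20, 0.10, 0.05, 0.05, 0.03, 0.01, 0.01],
    ])
    assert np.allclose(P_X_given_Y.sum(axis=1), 1.0)   # each conditional sums to 1

    # Product rule: P(X=i, Y=j) = P(X=i | Y=j) * P(Y=j)
    P_XY = P_X_given_Y * P_Y[:, None]
    assert np.isclose(P_XY.sum(), 1.0)       # the result is a valid joint distribution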

Summary of computational rules. SUM RULE: P(X) = Σ_Y P(X,Y) and P(Y) = Σ_X P(X,Y). Notation: we write P(X,Y) as shorthand for P(X=i, Y=j) for clarity. PRODUCT RULE: P(X,Y) = P(X|Y) P(Y) and P(X,Y) = P(Y|X) P(X).

Ordinal variables. In our example, X has a natural order 0,…,8: X is a number of hits, and for that ordering of the columns of the joint table, nearby X's have similar probabilities. Y does not have a natural order.

Probabilities for real numbers. Can't we treat real numbers as IEEE doubles with 2^64 possible values? Hah, hah. No! How about quantizing real variables to a reasonable number of values? Sometimes works, but we need to carefully account for ordinality, and doing so can lead to cumbersome mathematics.

Probability theory for real numbers. Quantize X using bins of width Δ. Then X ∈ {…, -2Δ, -Δ, 0, Δ, 2Δ, …}. Define P_Q(X=x) = probability that x < X ≤ x+Δ. Problem: P_Q(X=x) depends on the choice of Δ. Solution: let Δ → 0. Problem: in that case, P_Q(X=x) → 0. Solution: define a probability density P(x) = lim_{Δ→0} P_Q(X=x) / Δ = lim_{Δ→0} (probability that x < X ≤ x+Δ) / Δ.
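
As an illustration of this limit (not part of the original slides), the Python sketch below quantizes a standard Gaussian, chosen here only as a convenient example, and shows P_Q(X=x)/Δ approaching the density value as Δ shrinks:

    import math

    def normal_cdf(x, mu=0.0, sigma=1.0):
        # CDF of a Gaussian, written via the error function
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    def normal_pdf(x, mu=0.0, sigma=1.0):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

    x = 0.5
    for delta in (1.0, 0.1, 0.01, 0.001):
        # P(x < X <= x + delta) / delta  ->  density P(x) as delta -> 0
        ratio = (normal_cdf(x + delta) - normal_cdf(x)) / delta
        print(delta, ratio)
    print(normal_pdf(x))                     # the limiting value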

Probability theory for real numbers. Probability density: suppose P(x) is a probability density. Properties: P(x) ≥ 0; it is NOT necessary that P(x) ≤ 1; ∫_x P(x) dx = 1. Probabilities of intervals: P(a < X ≤ b) = ∫_a^b P(x) dx.
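
A small numerical sketch of these properties, using a deliberately narrow Gaussian (σ = 0.1, an arbitrary choice) to show that a density may exceed 1 pointwise while still integrating to 1:

    import numpy as np

    # A narrow Gaussian density (sigma = 0.1): its peak value exceeds 1,
    # yet it still integrates to 1 over the real line.
    mu, sigma = 0.0, 0.1
    dx = 1e-4
    xs = np.arange(-1.0, 1.0, dx)
    pdf = np.exp(-(xs - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    print(pdf.max())                         # about 3.99 -- a density value larger than 1 is fine
    print((pdf * dx).sum())                  # about 1.0  -- total probability
    # Probability of an interval: P(a < X <= b) is the integral of the density from a to b
    a, b = -0.1, 0.2
    mask = (xs > a) & (xs <= b)
    print((pdf[mask] * dx).sum())            # about 0.82 for this example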

Probability theory for real numbers. Joint, marginal and conditional densities: suppose P(x,y) is a joint probability density. Then ∫_x ∫_y P(x,y) dx dy = 1 and P((X,Y) ∈ R) = ∫_R P(x,y) dx dy for a region R of the (x,y) plane. Marginal density: P(x) = ∫_y P(x,y) dy. Conditional density: P(x|y) = P(x,y) / P(y).

The Gaussian distribution: N(x | μ, σ²) = (1 / √(2πσ²)) exp( -(x-μ)² / (2σ²) ), where μ is the mean and σ is the standard deviation.

Mean and variance. The mean of X is E[X] = Σ_X X P(X) or E[X] = ∫_x x P(x) dx. The variance of X is VAR(X) = Σ_X (X - E[X])² P(X) or VAR(X) = ∫_x (x - E[X])² P(x) dx. The standard deviation of X is STD(X) = SQRT(VAR(X)). The covariance of X and Y is COV(X,Y) = Σ_X Σ_Y (X - E[X])(Y - E[Y]) P(X,Y) or COV(X,Y) = ∫_x ∫_y (x - E[X])(y - E[Y]) P(x,y) dx dy.
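
A Python sketch computing these quantities for the same hypothetical joint table used earlier; the values are illustrative only:

    import numpy as np

    xs = np.arange(9)                        # values of X: 0..8
    ys = np.array([1, 2])                    # values of Y
    P_XY = np.array([                        # same hypothetical joint table as earlier
        [0.00, 0.02, 0.05, 0.08, 0.10, 0.08, 0.05, 0.02, 0.00],
        [0.02, 0.05, 0.08, 0.10, 0.15, 0.10, 0.05, 0.03, 0.02],
    ])
    P_X, P_Y = P_XY.sum(axis=0), P_XY.sum(axis=1)

    E_X = (xs * P_X).sum()                   # E[X] = sum_X X P(X)
    E_Y = (ys * P_Y).sum()
    VAR_X = ((xs - E_X) ** 2 * P_X).sum()    # VAR(X) = sum_X (X - E[X])^2 P(X)
    STD_X = np.sqrt(VAR_X)
    # COV(X,Y) = sum_X sum_Y (X - E[X]) (Y - E[Y]) P(X,Y)
    COV_XY = ((ys[:, None] - E_Y) * (xs[None, :] - E_X) * P_XY).sum()
    print(E_X, VAR_X, STD_X, COV_XY)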

Mean and variance of the Gaussian: E[X] = μ, VAR(X) = σ², STD(X) = σ.

How can we use probability as a framework for machine learning?

Maximum likelihood estimation. Say we have a density P(x|θ) with parameter θ. The likelihood of a set of independently and identically distributed (i.i.d.) data x = (x_1, …, x_N) is P(x|θ) = Π_{n=1}^{N} P(x_n|θ). The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^{N} ln P(x_n|θ). The maximum likelihood (ML) estimate of θ is θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^{N} ln P(x_n|θ). Example: for Gaussian likelihood P(x|θ) = N(x | μ, σ²), L = -(N/2) ln(2πσ²) - (1/(2σ²)) Σ_{n=1}^{N} (x_n - μ)².
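
A minimal Python sketch of ML estimation for the Gaussian case, on synthetic data (the "true" parameter values below are arbitrary); for this model the maximizers have the familiar closed form, the sample mean and the 1/N sample variance:

    import numpy as np

    rng = np.random.default_rng(0)
    true_mu, true_sigma = 2.0, 1.5
    x = rng.normal(true_mu, true_sigma, size=1000)   # i.i.d. data x_1, ..., x_N

    def log_likelihood(mu, sigma2, x):
        # L = sum_n ln N(x_n | mu, sigma^2)
        return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

    # For a Gaussian the ML estimates have a closed form:
    mu_ml = x.mean()                                 # argmax over mu is the sample mean
    sigma2_ml = ((x - mu_ml) ** 2).mean()            # argmax over sigma^2 is the 1/N variance
    print(mu_ml, np.sqrt(sigma2_ml))                 # close to the true values for large N
    print(log_likelihood(mu_ml, sigma2_ml, x))       # the maximized log-likelihood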

Comments on notation from now on. Instead of Σ_j P(X=i, Y=j), we write Σ_X P(X,Y). P() and p() are used interchangeably. Discrete and continuous variables are treated the same, so Σ_X, Σ_x, ∫_X and ∫_x are interchangeable. θ_ML and θ̂_ML are interchangeable. argmax_θ f(θ) is the value of θ that maximizes f(θ). In the context of data x_1, …, x_N, the symbols x and X (in any typeface) refer to the entire set of data. N(x | μ, σ²) = (1 / √(2πσ²)) exp( -(x-μ)² / (2σ²) ). log() = ln() and exp(x) = e^x. p_context(x) and p(x|context) are interchangeable.


Questions?


Maximum likelihood estimation. Example: for Gaussian likelihood P(x|θ) = N(x | μ, σ²), L = -(N/2) ln(2πσ²) - (1/(2σ²)) Σ_n (x_n - μ)². Objective of regression: minimize the error E(w) = ½ Σ_n ( t_n - y(x_n, w) )². Under a Gaussian noise model for the targets t_n, maximizing the likelihood with respect to w is equivalent to minimizing E(w).
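
A Python sketch of the connection this slide draws, on synthetic data with an assumed linear model y(x, w) = w_0 + w_1 x (the data and model choice are illustrative, not from the slides): fitting w by least squares minimizes E(w), and under Gaussian noise the same w is the ML estimate.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=50)
    t = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=50)   # targets with Gaussian noise

    # Linear model y(x, w) = w0 + w1*x; least squares minimizes E(w) = 1/2 sum_n (t_n - y(x_n, w))^2
    A = np.column_stack([np.ones_like(x), x])            # design matrix
    w, *_ = np.linalg.lstsq(A, t, rcond=None)            # w minimizing the squared error
    E_w = 0.5 * np.sum((t - A @ w) ** 2)
    print(w, E_w)
    # Under the Gaussian noise assumption, this w is also the maximum-likelihood estimate.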