Recitation 1 Probability Review


Recitation 1 Probability Review Parts of the slides are from previous years’ recitation and lecture notes

Basic Concepts A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment. (S can be finite or infinite.) E.g., S may be the set of all possible outcomes of a die roll: S = {1, 2, 3, 4, 5, 6}. An event A is any subset of S. E.g., A = the event that the die roll is < 3, i.e., A = {1, 2}.

Probability A probability P(A) is a function that maps an event A onto the interval [0, 1]; P(A) is also called the probability measure or probability mass of A. Picture the sample space as a set of possible worlds: P(A) is the measure of the worlds in which A is true (the area of the oval in the Venn diagram), as opposed to the worlds in which A is false. What is "probability", precisely? A rigorous definition involves measure theory, but for now we are content to say that a probability is a function mapping events to [0, 1]. One thing that is easy to confuse: probability is defined on the event space, not the sample space. When I throw a die, I can ask for the probability of getting a 6, and I can also ask for the probability of getting either a 5 or a 6.

Kolmogorov Axioms Since we define probability as a measure on a sample space, a probability function must satisfy the Kolmogorov axioms: all probabilities are between 0 and 1, i.e., 0 ≤ P(A) ≤ 1; P(S) = 1 (and consequently P(Φ) = 0); and for disjoint events A and B, P(A U B) = P(A) + P(B). For arbitrary events, the probability of a disjunction is given by inclusion-exclusion: P(A U B) = P(A) + P(B) − P(A ∩ B), since the region A ∩ B would otherwise be counted twice.
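As a quick sanity check, the axioms and the inclusion-exclusion rule can be verified on the die example. This is a minimal sketch using exact rational arithmetic; the `prob` helper is illustrative, not from the slides.

```python
from fractions import Fraction

# Sample space for a fair die; each outcome has probability 1/6.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) = |A| / |S| for a uniform sample space."""
    return Fraction(len(event & S), len(S))

A = {1, 2}       # the die roll is < 3
B = {2, 4, 6}    # the die roll is even

# Inclusion-exclusion: P(A U B) = P(A) + P(B) - P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)  # 2/3 2/3
```

Note that without the − P(A ∩ B) term the right-hand side would be 5/6, double-counting the outcome 2.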

Random Variable A random variable is a function X(w) that associates a unique number with every outcome w of an experiment. Discrete r.v.: the outcome of a die roll, D = {1, 2, 3, 4, 5, 6}. Binary event and indicator variable: seeing a "6" on a toss ⇒ X = 1, otherwise X = 0; this describes the true/false outcome of a random event. Continuous r.v.: the measured location of an aircraft. Why introduce random variables? Quite often, people find it inconvenient to work with samples directly, because samples could be anything: times, locations, colors. It is more convenient to work with numbers, at least for mathematicians and statisticians, so a random variable maps the outcome of an experiment to a numerical value. Let us forget red/blue/green and instead talk about 1/2/3.

Probability distributions For each value that an r.v. X can take, assign a number in [0, 1]. Suppose X takes values v1, …, vn. Then P(X = v1) + … + P(X = vn) = 1. Intuitively, the probability of X taking value vi is the long-run frequency of getting an outcome represented by vi.

Discrete Distributions Bernoulli distribution Ber(p): P(X = 1) = p, P(X = 0) = 1 − p. Binomial distribution Bin(n, p): suppose a coin with head probability p is tossed n times; the probability of getting k heads is P(X = k) = C(n, k) p^k (1 − p)^(n − k), where C(n, k) counts how many ways you can get k heads in a sequence of k heads and n − k tails. The simplest discrete distribution is the Bernoulli: you throw a coin and get 1 with probability p and 0 with probability 1 − p. Another discrete distribution that will be used frequently in this course is the multinomial.
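The binomial pmf above can be sketched directly from the formula using only the standard library; the function name `binomial_pmf` is illustrative, not from the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k heads in n tosses of a coin with head probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Bernoulli is the n = 1 special case: Ber(p) = Bin(1, p).
assert binomial_pmf(1, 1, 0.3) == 0.3

# The pmf sums to 1 over k = 0..n, as any distribution must.
total = sum(binomial_pmf(k, 10, 0.7) for k in range(11))
print(round(total, 10))  # 1.0
```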

Continuous Prob. Distribution A continuous random variable X is defined on a continuous sample space: an interval on the real line, a region in a high-dimensional space, etc. X usually corresponds to a real-valued measurement of some property, e.g., length, position, … It is meaningless to talk about the probability of the random variable assuming a particular value: P(x) = 0. Instead, we talk about the probability of the random variable assuming a value within a given interval, or half-interval, etc.

Probability Density If the probability of x falling into [x, x + dx] is given by p(x)dx as dx → 0, then p(x) is called the probability density over x. The probability of the random variable assuming a value within a given interval from x1 to x2 is the area under the graph of the probability density function between x1 and x2: P(x1 ≤ X ≤ x2) = ∫ p(x) dx over [x1, x2]. A standard example of a density is the Gaussian distribution.

Continuous Distributions Uniform density function: p(x) = 1/(b − a) for x in [a, b], and 0 otherwise. Normal (Gaussian) density function: p(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)). The distribution is symmetric, and is often illustrated as a bell-shaped curve. Two parameters, μ (mean) and σ (standard deviation), determine the location and shape of the distribution. The highest point on the normal curve is at the mean, which is also the median and mode. Another common continuous distribution is the exponential, which models, e.g., the time until the next bus comes or the phone rings; it is tightly related to the Poisson distribution and the Poisson process, and its parameter λ is called the rate parameter.
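Because a density is only meaningful under an integral, a quick numerical check is to verify that the area under the standard normal curve is 1. This sketch uses a simple trapezoid rule over [−8, 8], where essentially all of the mass of N(0, 1) lies; `normal_pdf` is an illustrative helper.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2)."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

# Trapezoid-rule approximation of the area under the curve over [-8, 8].
n, lo, hi = 10000, -8.0, 8.0
h = (hi - lo) / n
area = sum(normal_pdf(lo + i * h) for i in range(n + 1)) * h
area -= 0.5 * h * (normal_pdf(lo) + normal_pdf(hi))
print(round(area, 6))  # 1.0

# The highest point of the curve is at the mean.
assert normal_pdf(0.0) > normal_pdf(0.5)
```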

Statistical Characterizations Expectation (the centre of mass, mean, first moment): E[X] = Σ x P(x) in the discrete case, or ∫ x p(x) dx in the continuous case. Sample mean: x̄ = (1/N) Σ xi. Variance (the spread): Var[X] = E[(X − E[X])²]. Sample variance: (1/N) Σ (xi − x̄)².
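The sample versions of these quantities are one-liners; a minimal sketch (helper names are illustrative):

```python
def sample_mean(xs):
    """Sample mean: the average of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance: mean squared deviation from the sample mean."""
    m = sample_mean(xs)
    return sum((x - m)**2 for x in xs) / len(xs)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(xs), sample_variance(xs))  # 5.0 4.0
```

Note this is the 1/N version of the variance, matching the slide; dividing by N − 1 instead gives the unbiased estimator.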

Conditional Probability P(X|Y) = fraction of worlds in which X is true among those in which Y is true. Example: H = "having a headache", F = "coming down with flu", with P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. P(H|F) is the fraction of headaches among the flu cases: P(H|F) = P(H ∧ F)/P(F). Definition: P(X|Y) = P(X ∧ Y)/P(Y). Corollary, the chain rule: P(X ∧ Y) = P(X|Y) P(Y).
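The chain rule lets us recover the joint from the slide's headache/flu numbers; a sketch in exact rational arithmetic:

```python
from fractions import Fraction

P_F = Fraction(1, 40)         # P(flu)
P_H_given_F = Fraction(1, 2)  # P(headache | flu)

# Chain rule: P(H ∧ F) = P(H | F) * P(F)
P_H_and_F = P_H_given_F * P_F
print(P_H_and_F)  # 1/80

# The definition recovers the conditional: P(H | F) = P(H ∧ F) / P(F)
assert P_H_and_F / P_F == P_H_given_F
```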

The Bayes Rule What we just did leads to the following general expression: P(Y|X) = P(X|Y) P(Y) / P(X). This is Bayes' rule.

Probabilistic Inference H = "having a headache", F = "coming down with flu", P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. The problem: P(F|H) = ?
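Applying Bayes' rule to the numbers above answers the posed question directly:

```python
from fractions import Fraction

P_H = Fraction(1, 10)         # P(headache)
P_F = Fraction(1, 40)         # P(flu)
P_H_given_F = Fraction(1, 2)  # P(headache | flu)

# Bayes' rule: P(F | H) = P(H | F) P(F) / P(H)
P_F_given_H = P_H_given_F * P_F / P_H
print(P_F_given_H)  # 1/8
```

So even though half of all flu cases come with a headache, only one in eight headaches indicates flu, because flu itself is rare.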

Joint Probability A joint probability distribution for a set of RVs gives the probability of every atomic event (sample point). P(Flu, DrinkBeer) is a 2 × 2 matrix of values:

            B        ¬B
    F     0.005     0.02
    ¬F    0.195     0.78

Every question about a domain can be answered by the joint distribution, as we will see later.
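Marginals fall out of the joint table by summing over the other variable; a sketch using the table's values (the dict representation is illustrative):

```python
# Joint distribution P(Flu, DrinkBeer) from the slide;
# keys are (flu, beer) truth values.
joint = {
    (True,  True):  0.005,
    (True,  False): 0.02,
    (False, True):  0.195,
    (False, False): 0.78,
}

# A marginal is a sum over the other variable's values.
P_flu = sum(p for (f, b), p in joint.items() if f)
P_beer = sum(p for (f, b), p in joint.items() if b)
print(round(P_flu, 3), round(P_beer, 3))  # 0.025 0.2
```

Note that P(Flu) = 0.025 = 1/40, consistent with the headache/flu slides.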

Inference with the Joint: Compute Marginals. A marginal probability is obtained by summing the joint over all values of the remaining variables, e.g., P(F) = Σ_B Σ_H P(H, F, B). [The slide shows a joint probability table over H, F, and B.]

Inference with the Joint: Compute Conditionals. General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables. For example, P(F | H) = P(F, H) / P(H) = Σ_B P(H, F, B) / Σ_F Σ_B P(H, F, B). [The slide shows a joint probability table over H, F, and B.]
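The fix-evidence-and-sum-out-hidden recipe is mechanical. The slide's exact table values are not fully legible in this transcript, so the numbers below are hypothetical; only the procedure matters.

```python
from itertools import product

# Hypothetical joint P(H, F, B) over three binary variables
# (illustrative values, not the slide's table).
joint = {
    (h, f, b): p
    for (h, f, b), p in zip(
        product([True, False], repeat=3),
        [0.01, 0.04, 0.02, 0.03, 0.05, 0.15, 0.10, 0.60],
    )
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # a valid distribution

# P(F | H): fix the evidence H = True, sum out the hidden variable B.
num = sum(p for (h, f, b), p in joint.items() if h and f)
den = sum(p for (h, f, b), p in joint.items() if h)
print(round(num / den, 4))  # 0.5
```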

Independence Random variables X and Y are said to be independent if: P(X ∩ Y) = P(X) · P(Y). Alternatively, this can be written as P(X | Y) = P(X) and P(Y | X) = P(Y). Intuitively, this means that telling you that Y happened does not make X more or less likely. Note: this does not mean X and Y are disjoint!
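The definition gives a direct numerical test. Applied to the Flu/DrinkBeer joint table from the earlier slide, it shows those two variables are in fact independent:

```python
joint = {
    (True,  True):  0.005,   # P(F ∧ B)
    (True,  False): 0.02,
    (False, True):  0.195,
    (False, False): 0.78,
}

P_F = sum(p for (f, b), p in joint.items() if f)   # 0.025
P_B = sum(p for (f, b), p in joint.items() if b)   # 0.2

# Independent iff P(F ∧ B) = P(F) P(B), up to float rounding.
independent = abs(joint[(True, True)] - P_F * P_B) < 1e-12
print(independent)  # True
```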

Rules of Independence, by example: P(Virus | DrinkBeer) = P(Virus) iff Virus is independent of DrinkBeer. P(Flu | Virus, DrinkBeer) = P(Flu | Virus) iff Flu is independent of DrinkBeer, given Virus. P(Headache | Flu, Virus, DrinkBeer) = P(Headache | Flu, DrinkBeer) iff Headache is independent of Virus, given Flu and DrinkBeer.

Marginal and Conditional Independence Recall that for events E (i.e., X = x) and H (say, Y = y), the conditional probability of E given H, written P(E|H), is P(E and H)/P(H) (the probability that both E and H are true, given H is true). E and H are (statistically) independent if P(E) = P(E|H) (i.e., the probability that E is true doesn't depend on whether H is true), or equivalently P(E and H) = P(E)P(H). E and F are conditionally independent given H if P(E|H, F) = P(E|H), or equivalently P(E, F|H) = P(E|H)P(F|H).

Why knowledge of Independence is useful Lower complexity (time, space, search, …). Motivates efficient inference for all kinds of queries. Structured knowledge about the domain: easy to learn (both from experts and from data), easy to grow.

Density Estimation A density estimator learns a mapping from a set of attributes to a probability. It is often known as parameter estimation if the distribution form is specified (Binomial, Gaussian, …). Four important issues: the nature of the data (iid, correlated, …); the objective function (MLE, MAP, …); the algorithm (simple algebra, gradient methods, EM, …); and the evaluation scheme (likelihood on test data, predictability, consistency, …).

Parameter Learning from iid data Goal: estimate distribution parameters θ from a dataset of N independent, identically distributed (iid), fully observed training cases D = {x1, …, xN}. Maximum likelihood estimation (MLE) is one of the most common estimators. With the iid and full-observability assumptions, write L(θ) as the likelihood of the data: L(θ) = P(x1, …, xN; θ) = ∏ᵢ P(xᵢ; θ), and pick the setting of parameters most likely to have generated the data we saw: θ̂_MLE = argmax_θ L(θ).

Example 1: Bernoulli model Data: we observed N iid coin tosses, D = {1, 0, 1, …, 0}. Representation: binary r.v. xᵢ ∈ {0, 1}. Model: P(x; θ) = θˣ(1 − θ)¹⁻ˣ, so the likelihood of a single observation xᵢ is θ^{xᵢ}(1 − θ)^{1−xᵢ}. The likelihood of the dataset D = {x1, …, xN} is L(θ) = ∏ᵢ θ^{xᵢ}(1 − θ)^{1−xᵢ} = θ^{n_h}(1 − θ)^{N−n_h}, where n_h is the number of heads.

MLE Objective function (the log-likelihood): ℓ(θ) = log L(θ) = n_h log θ + (N − n_h) log(1 − θ). We need to maximize this w.r.t. θ: take the derivative w.r.t. θ and set it to zero, n_h/θ − (N − n_h)/(1 − θ) = 0, which gives θ̂_MLE = n_h/N, i.e., the frequency of heads, which is the sample mean of the observations.
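The closed-form MLE is just the frequency of heads; a one-function sketch (the name `bernoulli_mle` is illustrative):

```python
def bernoulli_mle(xs):
    """MLE for the Bernoulli head probability: the sample frequency of heads."""
    return sum(xs) / len(xs)

D = [1, 0, 1, 1, 0, 1, 0, 1]   # 5 heads in 8 tosses
print(bernoulli_mle(D))  # 0.625
```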

Overfitting Recall that for the Bernoulli distribution, θ̂_MLE = n_h/N. What if we tossed too few times, so that we saw zero heads? We have θ̂_MLE = 0, and we will predict that the probability of seeing a head next is zero! The rescue: add pseudo-counts, θ̂ = (n_h + n′)/(N + n″), where n′ is known as the pseudo- (imaginary) count. But can we make this more formal?
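The pseudo-count rescue in code, assuming the common symmetric choice of one imaginary head and one imaginary tail (Laplace smoothing); the helper name is illustrative:

```python
def smoothed_estimate(xs, pseudo_heads=1, pseudo_tails=1):
    """Bernoulli estimate with imaginary (pseudo-) counts added.

    pseudo_heads = pseudo_tails = 1 is Laplace smoothing.
    """
    n_h = sum(xs)
    return (n_h + pseudo_heads) / (len(xs) + pseudo_heads + pseudo_tails)

D = [0, 0, 0]                # three tosses, zero heads
print(sum(D) / len(D))       # MLE: 0.0, predicts heads are impossible
print(smoothed_estimate(D))  # 0.2, i.e. (0 + 1) / (3 + 2)
```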

Example 2: univariate normal Data: we observed N iid real samples, D = {−0.1, 10, 1, −5.2, …, 3}. Model: p(x) = (2πσ²)^{−1/2} exp(−(x − μ)²/(2σ²)). Log-likelihood: ℓ(θ) = log p(D; θ) = −(N/2) log(2πσ²) − (1/(2σ²)) Σᵢ (xᵢ − μ)². MLE: take derivatives and set to zero: μ̂_MLE = (1/N) Σᵢ xᵢ, and σ̂²_MLE = (1/N) Σᵢ (xᵢ − μ̂)².
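The two closed-form estimators are again the sample mean and the (biased, 1/N) sample variance; a sketch on the slide's example values, completed to a concrete five-sample dataset for illustration:

```python
def normal_mle(xs):
    """MLE for a univariate normal: sample mean and (biased) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu)**2 for x in xs) / n
    return mu, var

D = [-0.1, 10.0, 1.0, -5.2, 3.0]
mu, var = normal_mle(D)
print(round(mu, 2))  # 1.74
```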

The Bayesian Theory The Bayesian theory (e.g., for data D and model M): P(M|D) = P(D|M)P(M)/P(D), i.e., the posterior equals the likelihood times the prior, up to a normalizing constant. This allows us to capture uncertainty about the model in a principled way.

Hierarchical Bayesian Models θ are the parameters for the likelihood p(x|θ); α are the hyper-parameters for the prior p(θ|α). We can have hyper-hyper-parameters, etc. We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the top-level hyper-parameters constants. Where do we get the prior? Intelligent guesses; or empirical Bayes (Type-II maximum likelihood), which computes a point estimate of α: α̂ = argmax_α p(D|α).

Bayesian estimation for Bernoulli Beta distribution prior: P(θ; a, b) = [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}. Posterior distribution of θ: p(θ | x1, …, xN) ∝ θ^{n_h + a − 1}(1 − θ)^{n_t + b − 1}. Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.

Bayesian estimation for Bernoulli, cont'd Posterior distribution of θ: p(θ|D) = Beta(θ; a + n_h, b + n_t). Maximum a posteriori (MAP) estimation: θ̂_MAP = argmax_θ p(θ|D) = (n_h + a − 1)/(N + a + b − 2). Posterior mean estimation: θ̂_Bayes = ∫ θ p(θ|D) dθ = (n_h + a)/(N + a + b). Prior strength: A = a + b can be interpreted as the size of an imaginary dataset from which we obtain the pseudo-counts; the Beta parameters can be understood as pseudo-counts.
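This makes the overfitting rescue formal: the conjugate update is just count bookkeeping, and the MAP estimate reproduces the pseudo-count formula. A sketch with an illustrative Beta(2, 2) prior on the all-tails dataset from the overfitting slide (function names are mine):

```python
def beta_bernoulli_posterior(xs, a, b):
    """Posterior Beta(a + n_h, b + n_t) for a Bernoulli likelihood."""
    n_h = sum(xs)
    n_t = len(xs) - n_h
    return a + n_h, b + n_t

def map_estimate(a_post, b_post):
    # Mode of Beta(a, b); well-defined for a, b > 1.
    return (a_post - 1) / (a_post + b_post - 2)

def posterior_mean(a_post, b_post):
    return a_post / (a_post + b_post)

D = [0, 0, 0]                                        # zero heads observed
a_post, b_post = beta_bernoulli_posterior(D, a=2, b=2)
print(map_estimate(a_post, b_post))                  # 0.2, no longer zero
print(round(posterior_mean(a_post, b_post), 4))      # 0.2857
```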

Bayesian estimation for normal distribution Normal prior on the mean: p(μ) = N(μ; μ0, σ0²). Joint probability: p(D, μ) = p(D|μ) p(μ). Posterior: p(μ|D) = N(μ; μ̃, σ̃²), where (for known variance σ²) μ̃ = (N x̄/σ² + μ0/σ0²)/(N/σ² + 1/σ0²) and σ̃² = (N/σ² + 1/σ0²)^{−1}; the posterior mean μ̃ is a precision-weighted combination of the prior mean μ0 and the sample mean x̄ = (1/N) Σᵢ xᵢ.