1
Probability Theory for ML
School of Computer Science
Introduction to Machine Learning
Probability Theory for ML
Matt Gormley
Lecture 2, August 31, 2016
Readings: Mitchell Ch. 1, 2, 6.1–6.3; Murphy Ch. 2; Bishop Ch.
2
Reminders
Homework 1: released later today; due Sept. 7 at 5:30pm
3
Outline
- Motivation
- Probability Theory
  - Sample space, outcomes, events
  - Kolmogorov’s axioms of probability
- Random Variables
  - Random variables; probability mass function (pmf), probability density function (pdf), cumulative distribution function (cdf)
  - Examples and notation
  - Expectation and variance
  - Joint, conditional, and marginal probabilities
  - Independence
  - Bayes’ rule
- Common Probability Distributions
  - Beta, Dirichlet, etc.
- Recap of Decision Trees
  - Entropy
  - Information gain
- Probability in ML
4
Why Probability?
The goal of this course is to provide you with a toolbox: machine learning draws on statistics, probability, computer science, and optimization.
5
Why Probability?
Machine learning connects a domain of interest with computer science, optimization, statistics, and probability; probability in turn rests on calculus, measure theory, and linear algebra.
6
Probability Theory
7
Probability Theory: Definitions
Example 1: Flipping a coin
- Sample space: {Heads, Tails}
- Outcome: e.g., Heads
- Event: e.g., {Heads}
- Probability: P({Heads}) = 0.5, P({Tails}) = 0.5
8
Probability Theory: Definitions
Probability provides a science for inference about interesting events.
- Sample space: the set of all possible outcomes
- Outcome: a possible result of an experiment. Each outcome is unique, and only one outcome can occur per experiment.
- Event: any subset of the sample space. An outcome can be in multiple events; an elementary event consists of exactly one outcome.
- Probability: the non-negative number assigned to each event in the sample space
9
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
- Sample space: {1, 2, 3, 4, 5, 6}
- Outcome: e.g., 3
- Event: e.g., {3} (the event “the die came up 3”)
- Probability: P({3}) = 1/6, P({4}) = 1/6
10
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
- Sample space: {1, 2, 3, 4, 5, 6}
- Outcome: e.g., 3
- Event: e.g., {2, 4, 6} (the event “the roll was even”)
- Probability: P({2, 4, 6}) = 0.5, P({1, 3, 5}) = 0.5
11
Probability Theory: Definitions
Example 3: Timing how long it takes a monkey to reproduce Shakespeare
- Sample space: [0, +∞)
- Outcome: e.g., 1,433,600 hours
- Event: e.g., [1, 6] hours
- Probability: P([1, 6]) = …, P([1,433,600, +∞)) = 0.99
12
Kolmogorov’s Axioms
13
Kolmogorov’s Axioms
All of probability can be derived from just these! In words:
1. Each event has non-negative probability.
2. The probability that some event will occur is one.
3. The probability of the union of countably many disjoint events is the sum of their probabilities. (“Many” here means a countably infinite sequence.)
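In symbols, for a sample space Ω and events E (a standard statement of the axioms):

```latex
\begin{aligned}
&\text{(1) Non-negativity:} && P(E) \ge 0 \ \text{ for every event } E \subseteq \Omega \\
&\text{(2) Normalization:} && P(\Omega) = 1 \\
&\text{(3) Countable additivity:} && P\Big(\textstyle\bigcup_{i=1}^{\infty} E_i\Big)
  = \sum_{i=1}^{\infty} P(E_i) \ \text{ for pairwise disjoint } E_1, E_2, \ldots
\end{aligned}
```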
14
Deriving Probability Theorems
Theorem (Monotonicity): if A ⊆ B, then P(A) ≤ P(B).
Proof: Let C = B \ A, so that B = A ∪ C with A and C disjoint. Then P(B) = P(A ∪ C) = P(A) + P(C), and P(C) ≥ 0 by the first axiom, so P(B) ≥ P(A).
Slide adapted from William Cohen (10-601B, Spring 2016)
15
Probability Theory: Definitions
The complement of an event E, denoted ~E, is the event that E does not occur.
16
Deriving Probability Theorems
Theorem: P(~A) = 1 − P(A).
Proof: P(A or ~A) = P(Ω) = 1. A and ~A are disjoint, so P(A) + P(~A) = P(A or ~A) = 1; solving gives P(~A) = 1 − P(A).
Slide adapted from William Cohen (10-601B, Spring 2016)
17
Deriving Probability Theorems
Theorem: P(A or B) = P(A) + P(B) − P(A and B).
Proof: Let E1 = A and ~(A and B), E2 = (A and B), E3 = B and ~(A and B).
Then E1 ∪ E2 ∪ E3 = A ∪ B with E1, E2, E3 disjoint, so P(A or B) = P(E1) + P(E2) + P(E3).
Further, P(A) = P(E1) + P(E2) and P(B) = P(E3) + P(E2), so
P(A or B) = [P(E1) + P(E2)] + [P(E3) + P(E2)] − P(E2) = P(A) + P(B) − P(A and B).
Slide adapted from William Cohen (10-601B, Spring 2016)
18
These Axioms are Not to be Trifled With - Andrew Moore
There have been many other approaches to understanding “uncertainty”: fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, … Forty years ago people in AI argued about these; now they mostly don’t. Any scheme for combining uncertain information, uncertain “beliefs”, etc. really should obey these axioms to be internally consistent (from Jaynes, 1958; Cox, 1930s). If you gamble based on “uncertain beliefs” and your uncertainty formalism violates the axioms, then you can be exploited by an opponent. - de Finetti 1931 (the “Dutch book argument”)
19
Random Variables
20
Random Variables: Definitions
Random Variable (capital letters)
Def 1: a variable whose possible values are the outcomes of a random experiment
Value of a Random Variable (lowercase letters)
The value taken by a random variable
21
Random Variables: Definitions
Def 1: a variable whose possible values are the outcomes of a random experiment
Discrete random variable: values come from a countable set (e.g., the natural numbers or {True, False})
Continuous random variable: values come from an interval or collection of intervals (e.g., the real numbers or the range (3, 5))
22
Random Variables: Definitions
Def 1: a variable whose possible values are the outcomes of a random experiment
Def 2: a measurable function from the sample space to the real numbers, X: Ω → ℝ
Discrete random variable: values come from a countable set (e.g., the natural numbers or {True, False})
Continuous random variable: values come from an interval or collection of intervals (e.g., the real numbers or the range (3, 5))
23
Random Variables: Definitions
Discrete random variable: values come from a countable set (e.g., the natural numbers or {True, False})
Probability mass function (pmf): the function giving the probability that a discrete r.v. X takes value x: p(x) = P(X = x)
24
Random Variables: Definitions
Example 2: Rolling a 6-sided die
- Sample space: {1, 2, 3, 4, 5, 6}
- Outcome: e.g., 3
- Event: e.g., {3} (the event “the die came up 3”)
- Probability: P({3}) = 1/6, P({4}) = 1/6
25
Random Variables: Definitions
Example 2: Rolling a 6-sided die
- Sample space: {1, 2, 3, 4, 5, 6}
- Outcome: e.g., 3
- Event: e.g., {3} (the event “the die came up 3”)
- Probability: P({3}) = 1/6, P({4}) = 1/6
- Discrete random variable: e.g., the value on the top face of the die
- Prob. mass function (pmf): p(3) = 1/6, p(4) = 1/6
26
Random Variables: Definitions
Example 2: Rolling a 6-sided die
- Sample space: {1, 2, 3, 4, 5, 6}
- Outcome: e.g., 3
- Event: e.g., {2, 4, 6} (the event “the roll was even”)
- Probability: P({2, 4, 6}) = 0.5, P({1, 3, 5}) = 0.5
- Discrete random variable: e.g., 1 if the die landed on an even number and 0 otherwise
- Prob. mass function (pmf): p(1) = 0.5, p(0) = 0.5
27
Random Variables: Definitions
Discrete random variable: values come from a countable set (e.g., the natural numbers or {True, False})
Probability mass function (pmf): the function giving the probability that a discrete r.v. X takes value x: p(x) = P(X = x)
28
Random Variables: Definitions
Continuous random variable: values come from an interval or collection of intervals (e.g., the real numbers or the range (3, 5))
Probability density function (pdf): a function f that returns a nonnegative real indicating the relative likelihood that a continuous r.v. X takes value x
For any continuous random variable: P(X = x) = 0
Non-zero probabilities are only assigned to intervals: P(a ≤ X ≤ b) = ∫_a^b f(x) dx
29
Random Variables: Definitions
Example 3: Timing how long it takes a monkey to reproduce Shakespeare
- Sample space: [0, +∞)
- Outcome: e.g., 1,433,600 hours
- Event: e.g., [1, 6] hours
- Probability: P([1, 6]) = …, P([1,433,600, +∞)) = 0.99
- Continuous random variable: e.g., the time to reproduce Shakespeare (a value, not an interval!)
- Prob. density function: e.g., a Gamma distribution
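A sketch of working with such a density in code; the Gamma shape and scale below are made-up illustrative values, not from the lecture:

```python
from scipy import stats

# Hypothetical Gamma distribution over "hours until the monkey reproduces Shakespeare".
X = stats.gamma(a=2.0, scale=500_000.0)  # shape/scale chosen only for illustration

print(X.pdf(6.0))                # a density value, NOT a probability; P(X = 6) is 0
print(X.cdf(6.0) - X.cdf(1.0))   # P(1 <= X <= 6): probability of an interval
print(1.0 - X.cdf(1_433_600.0))  # P(X >= 1,433,600 hours)
```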
30
Random Variables: Definitions
“Region”-valued random variables
- Sample space Ω: {1, 2, 3, 4, 5}
- Events x: the sub-regions 1, 2, 3, 4, or 5
- Discrete random variable X: represents a random selection of a sub-region
- Prob. mass fn. P(X = x): proportional to the size of the sub-region
[Figure: a region partitioned into sub-regions labeled X=1 through X=5]
Some of you may have seen an abstract picture like this before. Does anyone know what it’s depicting?
31
Random Variables: Definitions
“Region”-valued random variables
- Sample space Ω: all points in the region
- Events x: the sub-regions 1, 2, 3, 4, or 5
- Discrete random variable X: represents a random selection of a sub-region
- Prob. mass fn. P(X = x): proportional to the size of the sub-region
Recall that an event is any subset of the sample space, so both definitions of the sample space here are valid.
[Figure: the same region partitioned into sub-regions X=1 through X=5]
32
Random Variables: Definitions
String-valued random variables
- Sample space Ω: all Korean sentences (an infinitely large set)
- Event x: a translation of an English sentence into Korean (i.e., an elementary event)
- Discrete random variable X: represents a translation
- Probability P(X = x): given by a model
English: machine learning requires probability and statistics
Korean candidates:
P(X = 기계 학습은 확률과 통계를 필요)
P(X = 머신 러닝은 확률 통계를 필요)
P(X = 머신 러닝은 확률 통계를 이 필요합니다)
…
33
Random Variables: Definitions
Cumulative distribution function (cdf): the function that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x)
For discrete random variables: F(x) = Σ_{x′ ≤ x} p(x′)
For continuous random variables: F(x) = ∫_{−∞}^{x} f(x′) dx′
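A minimal sketch of the discrete case, reusing the fair-die pmf from Example 2:

```python
# pmf of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}
pmf = {x: 1 / 6 for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the pmf over all values x' <= x."""
    return sum(p for xp, p in pmf.items() if xp <= x)

print(cdf(3))  # 0.5
print(cdf(6))  # 1.0
```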
34
Random Variables and Events
Def 2: a measurable function from the sample space to the real numbers, X: Ω → ℝ
Question: Something seems wrong… We defined P(E) (the capital ‘P’) as a function mapping events to probabilities. So why do we write P(X=x)?
A good guess: X=x is an event…
Answer: P(X=x) is just shorthand for P({ω ∈ Ω : X(ω) = x}), and these sets are events!
Example 1: for the top face of the die, P(X=3) abbreviates P({3}).
Example 2: for the even-roll indicator, P(X=1) abbreviates P({2, 4, 6}).
35
Notational Shortcuts
A convenient shorthand: write P(x) for P(X = x), letting the lowercase value identify its random variable.
36
Notational Shortcuts
But then how do we tell P(E), an event probability, apart from P(X), a random variable’s distribution?
Instead of writing P(x), we should subscript by the random variable and write P_X(x) …but only probability theory textbooks go to such lengths.
37
Expectation and Variance
The expected value of X is E[X], also called the mean.
Discrete random variables: E[X] = Σ_x x p(x)
Continuous random variables: E[X] = ∫ x f(x) dx
38
Expectation and Variance
The variance of X is Var(X) = E[(X − E[X])²], a measure of spread around the mean.
Discrete random variables: Var(X) = Σ_x (x − E[X])² p(x)
Continuous random variables: Var(X) = ∫ (x − E[X])² f(x) dx
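Both definitions computed directly for the fair six-sided die (a small sketch):

```python
pmf = {x: 1 / 6 for x in range(1, 7)}  # fair six-sided die

mean = sum(x * p for x, p in pmf.items())               # E[X] = sum_x x p(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X) = E[(X - E[X])^2]

print(mean)  # 3.5
print(var)   # 35/12 ≈ 2.9167
```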
39
Multiple Random Variables
Joint probability: P(X = x, Y = y)
Marginal probability: P(X = x) = Σ_y P(X = x, Y = y)
Conditional probability: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
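Since the formulas live on Roweis’s figure slides below, here is a small numeric sketch of all three operations on a made-up joint table for two binary random variables (the entries are illustrative only):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index x, columns index y; sums to 1.
P_XY = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

P_X = P_XY.sum(axis=1)             # marginal: P(x) = sum_y P(x, y)  -> [0.4, 0.6]
P_Y_given_X = P_XY / P_X[:, None]  # conditional: P(y|x) = P(x, y) / P(x)

print(P_X)
print(P_Y_given_X)  # each row sums to 1
```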
40
Joint Probability Slide from Sam Roweis (MLSS, 2005)
41
Marginal Probabilities
Slide from Sam Roweis (MLSS, 2005)
42
Conditional Probability
Slide from Sam Roweis (MLSS, 2005)
43
Independence and Conditional Independence
Slide from Sam Roweis (MLSS, 2005)
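For reference, the standard definitions:

```latex
\begin{aligned}
&X \perp Y &&\iff P(X{=}x,\, Y{=}y) = P(X{=}x)\,P(Y{=}y) \ \text{ for all } x, y \\
&X \perp Y \mid Z &&\iff P(X{=}x,\, Y{=}y \mid Z{=}z)
  = P(X{=}x \mid Z{=}z)\,P(Y{=}y \mid Z{=}z) \ \text{ for all } x, y, z
\end{aligned}
```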
44
A practical problem
I have two d20 dice: one loaded, one standard.
The loaded die gives a 19 or 20 (“critical hit”) half the time. In the game, someone hands me a random die, which is fair (A) with P(A) = 0.5. Then I roll, and either get a critical hit (B) or not (~B). What is P(B)?
P(B) = P(B and A) + P(B and ~A) = 0.1*(0.5) + 0.5*(0.5) = 0.3
Slide from William Cohen (10-601B, Spring 2016)
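A quick Monte Carlo sketch (not from the lecture) to sanity-check P(B) = 0.3; the helper function and trial count are illustrative:

```python
import random

def is_crit(fair: bool) -> bool:
    """Fair d20 crits (rolls 19 or 20) w.p. 2/20 = 0.1; loaded die w.p. 0.5."""
    return random.random() < (0.1 if fair else 0.5)

trials = 1_000_000
# Pick a fair die (A) with probability 0.5, then roll it.
hits = sum(is_crit(random.random() < 0.5) for _ in range(trials))
print(hits / trials)  # ≈ 0.3 = 0.1*0.5 + 0.5*0.5
```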
45
A practical problem “mixture model”
I have lots of standard d20 dice and lots of loaded dice, all identical in appearance. A loaded die gives a 19 or 20 (“critical hit”) half the time. In the game, someone hands me a random die, which is fair (A) or loaded (~A), with P(A) depending on how I mix the dice. Then I roll, and either get a critical hit (B) or not (~B).
Can I mix the dice together so that P(B) is anything I want? Say, P(B) = 0.137?
P(B) = P(B and A) + P(B and ~A) = 0.1*λ + 0.5*(1 − λ) = 0.137
λ = (0.5 − 0.137)/0.4 = 0.9075
This is a “mixture model”: mixtures let us build more complex models from simpler ones.
Slide from William Cohen (10-601B, Spring 2016)
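Inverting the mixture in code (a hypothetical helper, using the slide’s target P(B) = 0.137):

```python
def mixing_weight(target_pB: float, p_fair: float = 0.1, p_loaded: float = 0.5) -> float:
    # Solve target = p_fair*lam + p_loaded*(1 - lam) for lam, the fraction of fair dice.
    lam = (p_loaded - target_pB) / (p_loaded - p_fair)
    assert 0.0 <= lam <= 1.0, "target P(B) must lie between p_fair and p_loaded"
    return lam

print(mixing_weight(0.137))  # 0.9075 = (0.5 - 0.137)/0.4
```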
46
Another picture for this problem
It’s more convenient to say “if you’ve picked a fair die then …”, i.e. Pr(critical hit | fair die) = 0.1, and “if you’ve picked the loaded die then …”, Pr(critical hit | loaded die) = 0.5.
Conditional probability: Pr(B|A) = P(B ∧ A) / P(A)
[Figure: Venn diagram with regions A (fair die) and ~A (loaded), showing the events “A and B” and “~A and B”]
Slide from William Cohen (10-601B, Spring 2016)
47
Definition of Conditional Probability
P(A|B) = P(A ∧ B) / P(B)
Corollary, the Chain Rule: P(A ∧ B) = P(A|B) P(B)
Slide from William Cohen (10-601B, Spring 2016)
48
Some practical problems
I have 3 standard d20 dice and 1 loaded die.
Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = “the d20 picked is fair” and B = “roll 19 or 20 with that die”. What is P(B)? Which expression is correct?
A. P(B) = P(B|A) P(A) + P(B|~A) P(~A)
B. P(B) = P(A|B) P(B) + P(~A|B) P(B)
C. P(B) = P(B, A) P(A) + P(B, ~A) P(~A)
D. P(B) = P(B, A) / P(A) + P(B, ~A) / P(~A)
Answer: A. P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
Slide from William Cohen (10-601B, Spring 2016)
49
“Marginalizing out” A:
P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
[Figure: Venn diagram with regions A (fair die) and ~A (loaded), showing P(A), P(~A), P(B|A), P(B|~A), and the events “A and B” and “~A and B”]
Slide from William Cohen (10-601B, Spring 2016)
50
Bayes’ Rule
51
Some practical problems
I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ? Slide from William Cohen (10-601B, Spring 2016)
52
P(A|B) = ?
P(A and B) = P(A|B) * P(B)
P(A and B) = P(B|A) * P(A)
So P(A|B) * P(B) = P(B|A) * P(A), which gives
P(A|B) = P(B|A) * P(A) / P(B)
[Figure: Venn diagram with regions A (fair die) and ~A (loaded), showing P(A), P(~A), P(B|A), P(B|~A), and the events “A and B” and “~A and B”]
Slide from William Cohen (10-601B, Spring 2016)
53
Slide from William Cohen (10-601B, Spring 2016)
Bayes’ rule: P(A|B) = P(B|A) * P(A) / P(B), where P(A|B) is the posterior and P(A) the prior. Equivalently, P(B|A) = P(A|B) * P(B) / P(A).
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53.
“…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…”
Slide from William Cohen (10-601B, Spring 2016)
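Plugging the numbers from the four-dice problem into Bayes’ rule (a small check, assuming 3 fair dice out of 4):

```python
p_A = 0.75            # prior P(fair): 3 of the 4 dice are fair
p_B_given_A = 0.1     # P(crit | fair d20), i.e. rolling 19 or 20
p_B_given_notA = 0.5  # P(crit | loaded)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # marginalize out A: 0.2
p_A_given_B = p_B_given_A * p_A / p_B                 # Bayes' rule
print(p_A_given_B)  # 0.375: seeing a crit lowers P(fair) from the 0.75 prior
```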
54
Common Probability Distributions
55
Common Probability Distributions
For discrete random variables:
- Bernoulli
- Binomial
- Multinomial
- Categorical
- Poisson
For continuous random variables:
- Exponential
- Gamma
- Beta
- Dirichlet
- Laplace
- Gaussian (1D)
- Multivariate Gaussian
56
Common Probability Distributions
Beta Distribution
Probability density function: f(x; α, β) = Γ(α+β) / (Γ(α)Γ(β)) · x^(α−1) (1 − x)^(β−1)
- Support on values between 0 and 1
- The two-parameter (K = 2) special case of the Dirichlet
- Comparison to the Normal/Gaussian: the Beta need not be symmetric, and its density can bend upward near the boundaries
57
Common Probability Distributions
Dirichlet Distribution
Probability density function: p(θ; α) = (1 / B(α)) ∏_{k=1}^{K} θ_k^(α_k − 1), where B(α) = ∏_k Γ(α_k) / Γ(Σ_k α_k)
- For K = 2 the plot looks just like the Beta again, but it really is a Dirichlet
58
Common Probability Distributions
Dirichlet Distribution
Probability density function plots: left, α = (3, 3, 3); right, α = (0.9, 0.9, 0.9). Compare to the Gaussian.
A draw from a Dirichlet gives the parameters of a multinomial; note the triangle/simplex arising from the sum-to-one constraint.
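A small numeric sketch of the simplex constraint: draws from a Dirichlet are nonnegative and sum to one, so each draw can serve as the parameter vector of a categorical/multinomial:

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in [(3.0, 3.0, 3.0), (0.9, 0.9, 0.9)]:  # the two plotted settings
    theta = rng.dirichlet(alpha)
    print(alpha, theta, theta.sum())  # components >= 0; sum is exactly 1
```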
59
Recap of Decision Trees and ID3
68
Use Probability
69
Oh, the Places You’ll Use Probability!
Supervised Classification
- Naïve Bayes (next lecture)
- Logistic regression (the week after)
Note: This is just motivation – we’ll cover these topics later!
70
Oh, the Places You’ll Use Probability!
ML Theory (Example: Sample Complexity) Note: This is just motivation – we’ll cover these topics later!
71
Oh, the Places You’ll Use Probability!
Deep Learning (Example: Deep Bi-directional RNN)
[Figure: a deep bi-directional RNN with inputs x1…x4, forward and backward hidden layers h1…h4, and outputs y1…y4]
Note: This is just motivation – we’ll cover these topics later!
72
Oh, the Places You’ll Use Probability!
Graphical Models: Hidden Markov Model (HMM), Conditional Random Field (CRF)
[Figure: a factor graph for tagging the sentence “time flies like an arrow” with part-of-speech tags (n, v, p, d), a <START> state, and factors ψ0…ψ9]
Note: This is just motivation – we’ll cover these topics later!
73
Oh, the Places You’ll Use Probability!
Matt’s Uncertain Week
Quiz: Which of the following happened to Matt this past week?
- Matt got Millie’s ice cream during a power outage
- Matt’s basement flooded with water
- A car flipped on its side in front of Matt’s house
- A tree branch fell through the windshield of Matt’s car
These are just silly examples from the real world. Certainly, machine learning doesn’t need to worry about modeling events like these… …Right?
74
Important Topics NOT Covered Today
Statistics
- Maximum likelihood estimation (MLE)
- Maximum a posteriori (MAP) estimation
- Significance testing
Probability
- Central limit theorem
- Monte Carlo approximation
Other Topics
- Calculus in multiple dimensions
- Linear algebra
- Continuous optimization
(Covered next lecture: MLE and MAP estimation)
75
Summary
1. Probability theory is rooted in (simple) axioms
2. Random variables provide an important tool for modeling the world
3. Our favorite probability distributions are just functions! (usually with interesting properties)
4. Probability and statistics are essential to machine learning