10-601 Introduction to Machine Learning, School of Computer Science
Probability Theory for ML
Matt Gormley, Lecture 2, August 31, 2016
Readings: Mitchell Ch. 1, 2, 6.1-6.3; Murphy Ch. 2; Bishop Ch. 1-2
Reminders: Homework 1 will be released later today and is due Sept. 7 at 5:30pm.
Outline
- Motivation
- Probability Theory: sample space, outcomes, events; Kolmogorov's axioms of probability
- Random Variables: random variables, probability mass function (pmf), probability density function (pdf), cumulative distribution function (cdf); examples; notation; expectation and variance; joint, conditional, and marginal probabilities; independence; Bayes' rule
- Common Probability Distributions: Beta, Dirichlet, etc.
- Recap of Decision Trees: entropy, information gain
- Probability in ML
Why Probability? The goal of this course is to provide you with a toolbox drawing on Machine Learning, Statistics, Probability, Computer Science, and Optimization.
Why Probability? [Figure: a diagram relating Machine Learning to Computer Science, the Domain of Interest, Optimization, Statistics, Probability, Calculus, Measure Theory, and Linear Algebra.]
Probability Theory
Probability Theory: Definitions. Example 1: Flipping a coin. Sample Space: {Heads, Tails}. Outcome example: Heads. Event example: {Heads}. Probability: P({Heads}) = 0.5, P({Tails}) = 0.5.
Probability Theory: Definitions. Probability provides a science for inference about interesting events. Sample Space: the set of all possible outcomes. Outcome: a possible result of an experiment; each outcome is unique, and only one outcome can occur per experiment. Event: any subset of the sample space; an outcome can be in multiple events, and an elementary event consists of exactly one outcome. Probability: the non-negative number assigned to each event in the sample space.
Probability Theory: Definitions. Example 2: Rolling a 6-sided die. Sample Space: {1,2,3,4,5,6}. Outcome example: 3. Event example: {3} (the event "the die came up 3"). Probability: P({3}) = 1/6, P({4}) = 1/6.
Probability Theory: Definitions. Example 2: Rolling a 6-sided die. Sample Space: {1,2,3,4,5,6}. Outcome example: 3. Event example: {2,4,6} (the event "the roll was even"). Probability: P({2,4,6}) = 0.5, P({1,3,5}) = 0.5.
Probability Theory: Definitions. Example 3: Timing how long it takes a monkey to reproduce Shakespeare. Sample Space: [0, +∞). Outcome example: 1,433,600 hours. Event example: [1, 6] hours. Probability: P([1,6]) = 0.000000000001, P([1,433,600, +∞)) = 0.99.
Kolmogorov’s Axioms
Kolmogorov's Axioms. All of probability can be derived from just these! In words: (1) Each event has non-negative probability. (2) The probability that some outcome in the sample space occurs is one. (3) The probability of the union of many disjoint events is the sum of their probabilities ("many" here means a countably infinite sequence).
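For reference, here is a standard symbolic statement of the three axioms (my own LaTeX rendering; the notation matches the verbal version above, with Ω the sample space):

```latex
% Kolmogorov's axioms for a probability measure P on a sample space \Omega
\begin{align*}
\text{(1) Non-negativity:}       &\quad P(E) \ge 0 \ \text{ for every event } E \subseteq \Omega \\
\text{(2) Normalization:}        &\quad P(\Omega) = 1 \\
\text{(3) Countable additivity:} &\quad P\Big(\bigcup_{i=1}^{\infty} E_i\Big) = \sum_{i=1}^{\infty} P(E_i)
    \ \text{ for pairwise disjoint events } E_1, E_2, \ldots
\end{align*}
```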
Deriving Probability Theorems from the Axioms. Monotonicity: if A is a subset of B, then P(A) <= P(B). Proof: Since A is a subset of B, we can write B = A ∪ C where C = B \ A. A and C are disjoint, so P(B) = P(A or C) = P(A) + P(C). Since P(C) >= 0, it follows that P(B) >= P(A). Slide adapted from William Cohen (10-601B, Spring 2016)
Probability Theory: Definitions. The complement of an event E, denoted ~E, is the event that E does not occur. [Figure: Venn diagram showing E and its complement ~E.]
Deriving Probability Theorems from the Axioms. Theorem: P(~A) = 1 - P(A). Proof: A or ~A covers the whole sample space, so P(A or ~A) = P(Ω) = 1. A and ~A are disjoint, so P(A) + P(~A) = P(A or ~A) = 1. Solving for P(~A) gives P(~A) = 1 - P(A). Slide adapted from William Cohen (10-601B, Spring 2016)
Deriving Probability Theorems from the Axioms. Theorem: P(A or B) = P(A) + P(B) - P(A and B). Proof: Let E1 = A and ~(A and B), E2 = A and B, and E3 = B and ~(A and B). Then E1 or E2 or E3 = A or B, and E1, E2, E3 are disjoint, so P(A or B) = P(E1) + P(E2) + P(E3). Further, P(A) = P(E1) + P(E2) and P(B) = P(E3) + P(E2). Substituting, P(A or B) = [P(E1) + P(E2)] + [P(E3) + P(E2)] - P(E2) = P(A) + P(B) - P(A and B). Slide adapted from William Cohen (10-601B, Spring 2016)
These Axioms are Not to be Trifled With - Andrew Moore. There have been many other approaches to understanding "uncertainty": fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, ... 40 years ago people in AI argued about these; now they mostly don't. Any scheme for combining uncertain information, uncertain "beliefs", etc. really should obey these axioms to be internally consistent (from Jaynes, 1958; Cox, 1930s). If you gamble based on "uncertain beliefs" and your uncertainty formalism violates the axioms, then you can be exploited by an opponent - de Finetti, 1931 (the "Dutch book argument").
Random Variables
Random Variables: Definitions. Random Variable (written with capital letters). Def 1: a variable whose possible values are the outcomes of a random experiment. Value of a Random Variable (written with lowercase letters): the value taken by a random variable.
Random Variables: Definitions. Def 1: a variable whose possible values are the outcomes of a random experiment. Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False}). Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)).
Random Variables: Definitions. Def 1: a variable whose possible values are the outcomes of a random experiment. Def 2: a measurable function from the sample space to the real numbers, X: Ω → ℝ. Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False}). Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)).
Random Variables: Definitions. Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False}). Probability mass function (pmf): the function giving the probability that the discrete r.v. X takes value x, p(x) := P(X = x).
Random Variables: Definitions. Example 2: Rolling a 6-sided die. Sample Space: {1,2,3,4,5,6}. Outcome example: 3. Event example: {3} (the event "the die came up 3"). Probability: P({3}) = 1/6, P({4}) = 1/6. Discrete Random Variable example: the value on the top face of the die. Prob. Mass Function (pmf): p(3) = 1/6, p(4) = 1/6.
Random Variables: Definitions. Example 2: Rolling a 6-sided die. Sample Space: {1,2,3,4,5,6}. Outcome example: 3. Event example: {2,4,6} (the event "the roll was even"). Probability: P({2,4,6}) = 0.5, P({1,3,5}) = 0.5. Discrete Random Variable example: 1 if the die landed on an even number and 0 otherwise. Prob. Mass Function (pmf): p(1) = 0.5, p(0) = 0.5.
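Below is a minimal sketch (my own, not from the slides) of Example 2 in Python; the dictionaries `pmf_top_face` and `pmf_even` are illustrative names for the two pmfs just described:

```python
from fractions import Fraction

# Outcome probabilities for a fair 6-sided die: each outcome has probability 1/6.
outcome_prob = {omega: Fraction(1, 6) for omega in range(1, 7)}

# pmf of X = value on the top face: p(x) = P({x}).
pmf_top_face = {x: outcome_prob[x] for x in range(1, 7)}

# pmf of Y = 1 if the roll is even, 0 otherwise: sum outcome probabilities per value.
pmf_even = {
    1: sum(outcome_prob[o] for o in (2, 4, 6)),
    0: sum(outcome_prob[o] for o in (1, 3, 5)),
}

assert sum(pmf_top_face.values()) == 1   # a pmf sums to 1
assert pmf_even[1] == Fraction(1, 2)     # P(even) = 0.5, matching the slide
```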
Random Variables: Definitions. Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)). Probability density function (pdf): a function f that returns a nonnegative real indicating the relative likelihood that the continuous r.v. X takes value x. For any continuous random variable, P(X = x) = 0. Non-zero probabilities are only assigned to intervals: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
Random Variables: Definitions. Example 3: Timing how long it takes a monkey to reproduce Shakespeare. Sample Space: [0, +∞). Outcome example: 1,433,600 hours. Event example: [1, 6] hours. Probability: P([1,6]) = 0.000000000001, P([1,433,600, +∞)) = 0.99. Continuous Random Variable example: represents the time to reproduce Shakespeare (a number, not an interval!). Prob. Density Function example: a Gamma distribution.
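A small sketch of the continuous case in Python, assuming a made-up Gamma shape and scale (the slide does not give parameter values); it computes the probability of the interval [1, 6] both by integrating the pdf and via the cdf:

```python
from scipy.integrate import quad
from scipy.stats import gamma

# Hypothetical Gamma pdf for "hours until the monkey reproduces Shakespeare";
# the shape/scale parameters below are made up for illustration.
rv = gamma(a=2.0, scale=500_000.0)

# P(X = x) is zero for any single point of a continuous r.v. ...
# ... but intervals get nonzero probability: P(1 <= X <= 6) = integral of the pdf.
p_interval, _ = quad(rv.pdf, 1.0, 6.0)

print(p_interval)                 # tiny, but strictly positive
print(rv.cdf(6.0) - rv.cdf(1.0))  # same quantity via the cdf
```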
Random Variables: Definitions. "Region"-valued Random Variables. Sample Space Ω: {1,2,3,4,5}. Events x: the sub-regions 1, 2, 3, 4, or 5. Discrete Random Variable X: represents a random selection of a sub-region. Prob. Mass Fn. P(X=x): proportional to the size of the sub-region. Some of you may have seen an abstract picture like this before. Does anyone know what it's depicting? [Figure: a region divided into sub-regions labeled X=1 through X=5.]
Random Variables: Definitions. "Region"-valued Random Variables. Sample Space Ω: all points in the region. Events x: the sub-regions 1, 2, 3, 4, or 5. Discrete Random Variable X: represents a random selection of a sub-region. Prob. Mass Fn. P(X=x): proportional to the size of the sub-region. Recall that an event is any subset of the sample space, so both definitions of the sample space here are valid. [Figure: a region divided into sub-regions labeled X=1 through X=5.]
Random Variables: Definitions. String-valued Random Variables. Sample Space Ω: all Korean sentences (an infinitely large set). Event x: a translation of an English sentence into Korean (i.e. an elementary event). Discrete Random Variable X: represents a translation. Probability P(X=x): given by a model. English: "machine learning requires probability and statistics". Korean candidates: P(X = 기계 학습은 확률과 통계를 필요), P(X = 머신 러닝은 확률 통계를 필요), P(X = 머신 러닝은 확률 통계를 이 필요합니다), ...
Random Variables: Definitions. Cumulative distribution function (cdf): the function that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x). For discrete random variables: F(x) = Σ_{x' ≤ x} p(x'). For continuous random variables: F(x) = ∫_{-∞}^{x} f(x') dx'.
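A quick numerical check (standard scipy distributions, my own example) that the continuous cdf equals the integral of the pdf and the discrete cdf is a running sum of the pmf:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, binom

# Continuous case: F(x) = integral of the pdf from -inf to x.
x = 1.3
approx_cdf, _ = quad(norm.pdf, -np.inf, x)   # numerical integral of the standard normal pdf
print(approx_cdf, norm.cdf(x))               # the two values agree closely

# Discrete case: F(k) = sum of the pmf over values <= k.
n, p, k = 10, 0.3, 4
print(sum(binom.pmf(j, n, p) for j in range(k + 1)), binom.cdf(k, n, p))
```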
Random Variables and Events. Def 2: a measurable function from the sample space to the real numbers, X: Ω → ℝ. Question: Something seems wrong... We defined P(E) (the capital "P") as a function mapping events to probabilities, so why do we write P(X=x)? A good guess: X=x is an event... Answer: P(X=x) is just shorthand for P({ω ∈ Ω : X(ω) = x}), and sets of this form are events!
Notational Shortcuts. A convenient shorthand: we often write P(x) for P(X = x), and P(x, y) for P(X = x, Y = y).
Notational Shortcuts. But then how do we tell P(E), where E is an event, apart from P(X), where X is a random variable? Instead of writing P(x) we should write something like p_X(x), making the random variable explicit, ...but only probability theory textbooks go to such lengths.
Expectation and Variance. The expected value of X is E[X], also called the mean. Discrete random variables: E[X] = Σ_x x p(x). Continuous random variables: E[X] = ∫ x f(x) dx.
Expectation and Variance. The variance of X is Var(X) = E[(X - E[X])^2]. Discrete random variables: Var(X) = Σ_x (x - E[X])^2 p(x). Continuous random variables: Var(X) = ∫ (x - E[X])^2 f(x) dx.
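A tiny sketch of the discrete formulas using a fair die (my own example):

```python
from fractions import Fraction

# pmf of a fair 6-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum_x x p(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E[(X - E[X])^2] = sum_x (x - E[X])^2 p(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean)  # 7/2
print(var)   # 35/12
```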
Multiple Random Variables Joint probability Marginal probability Conditional probability
Joint Probability Slide from Sam Roweis (MLSS, 2005)
Marginal Probabilities Slide from Sam Roweis (MLSS, 2005)
Conditional Probability Slide from Sam Roweis (MLSS, 2005)
Independence and Conditional Independence Slide from Sam Roweis (MLSS, 2005)
A practical problem. I have two d20 dice, one loaded and one standard. The loaded die gives a 19 or 20 (a "critical hit") half the time; a fair d20 gives one with probability 2/20 = 0.1. In the game, someone hands me a random die, which is fair (A) with P(A) = 0.5. Then I roll, and either get a critical hit (B) or not (~B). What is P(B)? P(B) = P(B and A) + P(B and ~A) = 0.1*0.5 + 0.5*0.5 = 0.3. Slide from William Cohen (10-601B, Spring 2016)
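A Monte Carlo sanity check of P(B) = 0.3, as a rough sketch (the function `roll_crit` and the simulation setup are my own, not from the slide):

```python
import random

def roll_crit(fair: bool) -> bool:
    """Return True on a 'critical hit' (a 19 or 20)."""
    if fair:
        return random.randint(1, 20) >= 19   # P = 2/20 = 0.1 on a fair d20
    return random.random() < 0.5             # the loaded die crits half the time

trials = 1_000_000
hits = 0
for _ in range(trials):
    fair = random.random() < 0.5             # handed a fair die with P(A) = 0.5
    hits += roll_crit(fair)

print(hits / trials)   # approximately 0.3 = 0.1*0.5 + 0.5*0.5
```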
A practical problem “mixture model” I have lots of standard d20 die, lots of loaded die, all identical. Loaded die will give a 19/20 (“critical hit”) half the time. In the game, someone hands me a random die, which is fair (A) or loaded (~A), with P(A) depending on how I mix the die. Then I roll, and either get a critical hit (B) or not (~B) Can I mix the dice together so that P(B) is anything I want - say, p(B)= 0.137 ? P(B) = P(B and A) + P(B and ~A) = 0.1*λ + 0.5*(1- λ) = 0.137 “mixture model” λ = (0.5 - 0.137)/0.4 = 0.9075 Mixtures let us build more complex models from simpler ones. Slide from William Cohen (10-601B, Spring 2016)
Another picture for this problem. It's more convenient to say "if you've picked a fair die then ...", i.e. Pr(critical hit | fair die) = 0.1, and "if you've picked the loaded die then ...", Pr(critical hit | loaded die) = 0.5. Conditional probability: Pr(B|A) = P(B and A) / P(A). [Figure: the sample space split into A (fair die) and ~A (loaded); the events "A and B" and "~A and B" are shaded, with P(B|A) and P(B|~A) as the shaded fractions.] Slide from William Cohen (10-601B, Spring 2016)
Definition of Conditional Probability: P(A|B) = P(A and B) / P(B). Corollary, the Chain Rule: P(A and B) = P(A|B) P(B). Slide from William Cohen (10-601B, Spring 2016)
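A small sketch (my own numbers, reusing the first dice problem) showing the definition and the chain rule applied to a 2x2 joint table:

```python
# Joint probabilities over (A, B) for the first dice problem: P(A) = 0.5,
# P(B|A) = 0.1, P(B|~A) = 0.5 give the four joint entries below.
joint = {
    (True, True):   0.5 * 0.1,   # P(A and B)
    (True, False):  0.5 * 0.9,
    (False, True):  0.5 * 0.5,
    (False, False): 0.5 * 0.5,
}

p_b = sum(p for (a, b), p in joint.items() if b)   # marginal P(B)
p_a_given_b = joint[(True, True)] / p_b            # P(A|B) = P(A and B) / P(B)
print(p_b, p_a_given_b)                            # 0.3 and ~0.1667

# Chain rule check: P(A and B) = P(A|B) * P(B)
print(p_a_given_b * p_b, joint[(True, True)])
```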
Some practical problems. I have 3 standard d20 dice and 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll a 19 or 20 with that die. What is P(B)?
A. P(B) = P(B|A) P(A) + P(B|~A) P(~A)
B. P(B) = P(A|B) P(B) + P(~A|B) P(B)
C. P(B) = P(B, A) P(A) + P(B, ~A) P(~A)
D. P(B) = P(B, A) / P(A) + P(B, ~A) / P(~A)
Answer: A. P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2. Slide from William Cohen (10-601B, Spring 2016)
"Marginalizing out" A: P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2. [Figure: the same A (fair die) / ~A (loaded) picture, with P(A) and P(~A) as the region sizes and P(B|A), P(B|~A) as the shaded fractions of "A and B" and "~A and B".] Slide from William Cohen (10-601B, Spring 2016)
Bayes’ Rule
Some practical problems I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ? Slide from William Cohen (10-601B, Spring 2016)
P(A|B) = ? We know P(A and B) = P(A|B) * P(B) and also P(A and B) = P(B|A) * P(A). Setting these equal, P(A|B) * P(B) = P(B|A) * P(A), so P(A|B) = P(B|A) * P(A) / P(B). [Figure: the same A (fair die) / ~A (loaded) picture, with P(A), P(~A), P(B|A), P(B|~A) and the events "A and B" and "~A and B".] Slide from William Cohen (10-601B, Spring 2016)
Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), where P(A|B) is the posterior and P(A) is the prior. Equivalently, P(B|A) = P(A|B) * P(B) / P(A). Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418. "...by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter... necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning..." Slide from William Cohen (10-601B, Spring 2016)
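A sketch applying Bayes' rule to the question two slides back (3 fair dice, 1 loaded, so the prior is P(A) = 0.75); the variable names are my own:

```python
# Prior and likelihoods
p_fair = 0.75               # P(A): 3 of the 4 dice are fair
p_crit_given_fair = 0.1     # P(B|A)
p_crit_given_loaded = 0.5   # P(B|~A)

# Evidence: P(B) by the law of total probability
p_crit = p_crit_given_fair * p_fair + p_crit_given_loaded * (1 - p_fair)   # 0.2

# Posterior: P(A|B) = P(B|A) P(A) / P(B)
p_fair_given_crit = p_crit_given_fair * p_fair / p_crit
print(p_fair_given_crit)    # 0.375: seeing a critical hit makes the loaded die more likely
```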
Common Probability Distributions
Common Probability Distributions For Discrete Random Variables: Bernoulli Binomial Multinomial Categorical Poisson For Continuous Random Variables: Exponential Gamma Beta Dirichlet Laplace Gaussian (1D) Multivariate Gaussian
Common Probability Distributions. Beta Distribution. Probability density function: f(x; α, β) = x^(α-1) (1-x)^(β-1) / B(α, β), where B(α, β) is the Beta function. Support is the interval of values between 0 and 1. The Beta is the two-dimensional (K = 2) special case of the Dirichlet. Compare to the Normal/Gaussian: the Beta need not be symmetric, and for α, β < 1 the density bends up toward the endpoints.
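A brief sketch of the Beta pdf using scipy (the shape parameters α = 2, β = 5 are my own illustrative choices):

```python
from scipy.integrate import quad
from scipy.stats import beta

a, b = 2.0, 5.0               # illustrative shape parameters
rv = beta(a, b)

# Support is [0, 1] and the density integrates to 1 over it.
total, _ = quad(rv.pdf, 0.0, 1.0)
print(total)                  # ~1.0

# Mean of Beta(a, b) is a / (a + b).
print(rv.mean(), a / (a + b)) # both ~0.2857
```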
Common Probability Distributions. Dirichlet Distribution. Probability density function: p(θ; α) = (1 / B(α)) Π_{k=1}^K θ_k^(α_k - 1), where θ lies on the probability simplex (θ_k ≥ 0 and Σ_k θ_k = 1) and B(α) is the multivariate Beta function. (The plotted K = 2 case looks just like the Beta again, but it really is a Dirichlet.)
Common Probability Distributions. Dirichlet Distribution. Probability density function plots: left, α = (3, 3, 3); right, α = (0.9, 0.9, 0.9). Unlike a Gaussian, the support is the triangle (the probability simplex). A draw from a Dirichlet gives the parameters of a multinomial/categorical distribution; note the sum-to-one constraint.
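A sketch of drawing from a Dirichlet with numpy, using the two α settings from the plots, and checking that each draw sums to one (i.e. is a valid set of categorical/multinomial parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha = (3, 3, 3): draws concentrate near the center of the simplex.
# alpha = (0.9, 0.9, 0.9): draws concentrate near the corners/edges.
for alpha in ([3.0, 3.0, 3.0], [0.9, 0.9, 0.9]):
    theta = rng.dirichlet(alpha, size=5)
    print(alpha)
    print(theta)
    print(theta.sum(axis=1))   # each row sums to 1: a point on the simplex
```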
Recap of Decision Trees and ID3
Oh, the Places You'll Use Probability!
Oh, the Places You’ll Use Probability! Supervised Classification Naïve Bayes (next lecture) Logistic regression (the week after) Note: This is just motivation – we’ll cover these topics later!
Oh, the Places You’ll Use Probability! ML Theory (Example: Sample Complexity) Note: This is just motivation – we’ll cover these topics later!
Oh, the Places You'll Use Probability! Deep Learning (Example: Deep Bi-directional RNN). [Figure: a deep bi-directional RNN with inputs x1...x4, forward and backward hidden states h1...h4, and outputs y1...y4.] Note: This is just motivation; we'll cover these topics later!
Oh, the Places You'll Use Probability! Graphical Models: Hidden Markov Model (HMM), Conditional Random Field (CRF). [Figure: a factor graph over the sentence "time flies like an arrow" with tags n, v, p, d, a <START> symbol, and factors ψ0 through ψ9.] Note: This is just motivation; we'll cover these topics later!
Oh, the Places You’ll Use Probability! Matt’s Uncertain Week Quiz: Which of the following happened to Matt this past week? Matt got Millie’s ice cream during a power outage Matt’s basement flooded with water A car flipped on its side in front of Matt’s house Tree branch fell through the windshield of Matt’s car These are just silly examples from the real world. Certainly, Machine Learning doesn’t need to worry about modeling events like these… …Right?
Important Topics NOT Covered Today Statistics Maximum likelihood estimation (MLE) Maximum a posteriori (MAP) estimation Significance testing Probability Central limit theorem Monte Carlo approximation Other Topics Calculus in multiple dimensions Linear algebra Continuous optimization Covered Next Lecture
Summary Probability theory is rooted in (simple) axioms Random variables provide an important tool for modeling the world Our favorite probability distributions are just functions! (usually with interesting properties) Probability and Statistics are essential to Machine Learning