Probability Theory for ML

Probability Theory for ML
10-601 Introduction to Machine Learning, School of Computer Science
Matt Gormley, Lecture 2, August 31, 2016
Readings: Mitchell Ch. 1, 2, 6.1-6.3; Murphy Ch. 2; Bishop Ch. 1-2

Reminders
Homework 1: released later today; due Sept. 7 at 5:30pm.

Outline
- Motivation
- Probability Theory: sample space, outcomes, events; Kolmogorov's axioms of probability
- Random Variables: random variables, probability mass function (pmf), probability density function (pdf), cumulative distribution function (cdf); examples; notation
- Expectation and Variance
- Joint, conditional, and marginal probabilities; independence; Bayes' rule
- Common Probability Distributions: Beta, Dirichlet, etc.
- Recap of Decision Trees: entropy, information gain
- Probability in ML

Why Probability?
The goal of this course is to provide you with a toolbox drawn from machine learning, statistics, probability, computer science, and optimization.

Why Probability?
[Diagram: machine learning sits at the intersection of computer science, statistics, probability, optimization, and a domain of interest, resting on foundations of calculus, measure theory, and linear algebra.]

Probability Theory

Probability Theory: Definitions
Example 1: Flipping a coin
Sample Space: {Heads, Tails}
Outcome: e.g., Heads
Event: e.g., {Heads}
Probability: P({Heads}) = 0.5, P({Tails}) = 0.5

Probability Theory: Definitions
Probability provides a science for inference about interesting events.
- Sample Space: the set of all possible outcomes
- Outcome: a possible result of an experiment
- Event: any subset of the sample space
- Probability: the non-negative number assigned to each event in the sample space
Each outcome is unique, and only one outcome can occur per experiment; an outcome can, however, belong to multiple events. An elementary event consists of exactly one outcome.

Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space: {1,2,3,4,5,6}
Outcome: e.g., 3
Event: e.g., {3} (the event "the die came up 3")
Probability: P({3}) = 1/6, P({4}) = 1/6

Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space: {1,2,3,4,5,6}
Outcome: e.g., 3
Event: e.g., {2,4,6} (the event "the roll was even")
Probability: P({2,4,6}) = 0.5, P({1,3,5}) = 0.5
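
For a quick sanity check, here is a minimal simulation sketch (not from the slides) that estimates an event's probability by relative frequency:

```python
import random

def estimate_probability(event, n_trials=100_000):
    """Estimate P(event) for a fair 6-sided die by relative frequency."""
    hits = sum(1 for _ in range(n_trials) if random.randint(1, 6) in event)
    return hits / n_trials

print(estimate_probability({2, 4, 6}))  # close to 0.5
print(estimate_probability({3}))        # close to 1/6 = 0.1667
```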

Probability Theory: Definitions
Example 3: Timing how long it takes a monkey to reproduce Shakespeare
Sample Space: [0, +∞)
Outcome: e.g., 1,433,600 hours
Event: e.g., [1, 6] hours
Probability: P([1, 6]) = 0.000000000001, P([1,433,600, +∞)) = 0.99

Kolmogorov’s Axioms

Kolmogorov's Axioms
All of probability can be derived from just these! In words:
1. Each event has non-negative probability.
2. The probability that some event will occur is one.
3. The probability of the union of many disjoint events is the sum of their probabilities ("many" here should read "a sequence of countably infinitely many").
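
In symbols, for a sample space Ω, the standard statement of the axioms is:
1. P(E) ≥ 0 for every event E
2. P(Ω) = 1
3. P(E1 ∪ E2 ∪ …) = Σi P(Ei) for any sequence E1, E2, … of pairwise disjoint events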

Deriving Probability Theorems from the Axioms
Monotonicity: if A is a subset of B, then P(A) ≤ P(B).
Proof: A ⊆ B implies B = A ∪ C for C = B - A. A and C are disjoint, so P(B) = P(A ∪ C) = P(A) + P(C). Since P(C) ≥ 0, we get P(B) ≥ P(A).
Slide adapted from William Cohen (10-601B, Spring 2016)

Probability Theory: Definitions
The complement of an event E, denoted ~E, is the event that E does not occur.
[Venn diagram: E and its complement ~E]

Deriving Probability Theorems from the Axioms
Theorem: P(~A) = 1 - P(A).
Proof: P(A or ~A) = P(Ω) = 1. A and ~A are disjoint, so P(A) + P(~A) = P(A or ~A). Thus P(A) + P(~A) = 1; then solve for P(~A).
Slide adapted from William Cohen (10-601B, Spring 2016)

Deriving Probability Theorems from the Axioms
Theorem: P(A or B) = P(A) + P(B) - P(A and B).
Proof: Let E1 = A and ~(A and B), E2 = (A and B), and E3 = B and ~(A and B). Then E1 or E2 or E3 = A or B, and E1, E2, E3 are disjoint, so P(A or B) = P(E1) + P(E2) + P(E3). Further, P(A) = P(E1) + P(E2) and P(B) = P(E3) + P(E2), so P(A or B) = [P(E1) + P(E2)] + [P(E3) + P(E2)] - P(E2) = P(A) + P(B) - P(A and B).
Slide adapted from William Cohen (10-601B, Spring 2016)

These Axioms are Not to be Trifled With - Andrew Moore
There have been many, many other approaches to understanding "uncertainty": fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, … Forty years ago people in AI argued about these; now they mostly don't. Any scheme for combining uncertain information, uncertain "beliefs", etc. really should obey these axioms to be internally consistent (from Jaynes, 1958; Cox, 1930s). If you gamble based on "uncertain beliefs", then [you can be exploited by an opponent] if and only if [your uncertainty formalism violates the axioms] - de Finetti, 1931 (the "Dutch book argument")

Random Variables

Random Variables: Definitions
Random Variable (written with capital letters): Def 1: a variable whose possible values are the outcomes of a random experiment.
Value of a Random Variable (written with lowercase letters): the value taken by a random variable.

Random Variables: Definitions
Def 1: a variable whose possible values are the outcomes of a random experiment.
- Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
- Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))

Random Variables: Definitions
Def 1: a variable whose possible values are the outcomes of a random experiment.
Def 2: a measurable function from the sample space to the real numbers, X: Ω → ℝ.
- Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
- Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))

Random Variables: Definitions
Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False}).
Probability mass function (pmf): the function giving the probability that discrete r.v. X takes value x: p(x) := P(X = x).


Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space: {1,2,3,4,5,6}
Outcome: e.g., 3
Event: e.g., {3} (the event "the die came up 3")
Probability: P({3}) = 1/6, P({4}) = 1/6
Discrete Random Variable: e.g., the value on the top face of the die
Prob. Mass Function (pmf): p(3) = 1/6, p(4) = 1/6

Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space: {1,2,3,4,5,6}
Outcome: e.g., 3
Event: e.g., {2,4,6} (the event "the roll was even")
Probability: P({2,4,6}) = 0.5, P({1,3,5}) = 0.5
Discrete Random Variable: e.g., 1 if the die landed on an even number and 0 otherwise
Prob. Mass Function (pmf): p(1) = 0.5, p(0) = 0.5
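
A pmf is just a lookup from values to probabilities that is non-negative and sums to one. A minimal sketch for the two die-based random variables above (names are illustrative):

```python
from fractions import Fraction

# pmf of X: the value on the top face of a fair die
pmf_top_face = {x: Fraction(1, 6) for x in range(1, 7)}

# pmf of Y: 1 if the roll was even, 0 otherwise
pmf_even = {1: Fraction(1, 2), 0: Fraction(1, 2)}

# A valid pmf sums to one
assert sum(pmf_top_face.values()) == 1
assert sum(pmf_even.values()) == 1
print(pmf_top_face[3])  # 1/6
```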

Random Variables: Definitions
Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)).
Probability density function (pdf): a function f that returns a non-negative real indicating the relative likelihood that a continuous r.v. X takes value x. For any continuous random variable, P(X = x) = 0; non-zero probabilities are assigned only to intervals: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

Random Variables: Definitions
Example 3: Timing how long it takes a monkey to reproduce Shakespeare
Sample Space: [0, +∞)
Outcome: e.g., 1,433,600 hours
Event: e.g., [1, 6] hours
Probability: P([1, 6]) = 0.000000000001, P([1,433,600, +∞)) = 0.99
Continuous Random Variable: e.g., represents the time to reproduce (a number, not an interval!)
Prob. Density Function: e.g., a Gamma distribution
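
Interval probabilities for a continuous r.v. come from integrating the pdf, most easily via the cdf. A sketch using SciPy's Gamma distribution, where the shape and scale below are made up purely for illustration:

```python
from scipy import stats

# Hypothetical Gamma model of hours-to-Shakespeare; parameters are illustrative
time_to_shakespeare = stats.gamma(a=2.0, scale=500_000.0)

# P(1 <= X <= 6) via the cdf; note pdf values themselves are not probabilities
p_1_to_6 = time_to_shakespeare.cdf(6) - time_to_shakespeare.cdf(1)
print(p_1_to_6)                    # vanishingly small, as on the slide
print(time_to_shakespeare.pdf(6))  # a density, not a probability
```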

Random Variables: Definitions
"Region"-valued Random Variables
Sample Space Ω: {1,2,3,4,5}
Events x: the sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X: represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): proportional to the size of the sub-region
[Figure: a region partitioned into sub-regions labeled X=1 through X=5]

Random Variables: Definitions
"Region"-valued Random Variables
Sample Space Ω: all points in the region
Events x: the sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X: represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): proportional to the size of the sub-region
Recall that an event is any subset of the sample space, so both definitions of the sample space here are valid.
[Figure: a region partitioned into sub-regions labeled X=1 through X=5]

Random Variables: Definitions
String-valued Random Variables
Sample Space Ω: all Korean sentences (an infinitely large set)
Event x: the translation of an English sentence into Korean (i.e. an elementary event)
Discrete Random Variable X: represents a translation
Probability P(X=x): given by a model
English: "machine learning requires probability and statistics"
P(X = "기계 학습은 확률과 통계를 필요"), P(X = "머신 러닝은 확률 통계를 필요"), P(X = "머신 러닝은 확률 통계를 이 필요합니다"), …

Random Variables: Definitions
Cumulative distribution function (cdf): the function that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x).
For discrete random variables: F(x) = Σ_{x' ≤ x} p(x')
For continuous random variables: F(x) = ∫_{-∞}^{x} f(x') dx'
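
In the discrete case the cdf is just a running sum of the pmf; a minimal sketch for the fair die:

```python
pmf = {x: 1/6 for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x) for the fair-die pmf above."""
    return sum(p for value, p in pmf.items() if value <= x)

print(cdf(3))    # 0.5
print(cdf(6))    # 1.0
print(cdf(0.5))  # 0.0
```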

Random Variables and Events
Def 2: a random variable is a measurable function from the sample space to the real numbers, X: Ω → ℝ.
Question: Something seems wrong… We defined P(E) (the capital 'P') as a function mapping events to probabilities. So why do we write P(X=x)?
A good guess: X=x is an event…
Answer: P(X=x) is just shorthand! It abbreviates P({ω ∈ Ω : X(ω) = x}), and these sets are events.

Notational Shortcuts
A convenient shorthand: when the random variable is clear from context, write P(x) for P(X = x).

Notational Shortcuts
But then how do we tell P(E) (an event) apart from P(X) (a random variable)? Instead of writing P(x), we should write P_X(x), with the random variable as a subscript… but only probability theory textbooks go to such lengths.

Expectation and Variance
The expected value of X is E[X], also called the mean.
Discrete random variables: E[X] = Σ_x x p(x)
Continuous random variables: E[X] = ∫ x f(x) dx

Expectation and Variance
The variance of X is Var(X) = E[(X - E[X])²], which measures spread about the mean μ = E[X].
Discrete random variables: Var(X) = Σ_x (x - μ)² p(x)
Continuous random variables: Var(X) = ∫ (x - μ)² f(x) dx
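
A quick numeric check of both formulas on the fair-die pmf:

```python
pmf = {x: 1/6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                    # E[X]
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X)

print(mean)      # 3.5
print(variance)  # 35/12 = 2.9166...
```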

Multiple Random Variables
Joint probability: P(A, B), the probability that A and B both occur.
Marginal probability: P(A) = Σ_b P(A, B = b), obtained by summing the other variable out of the joint.
Conditional probability: P(A | B) = P(A, B) / P(B).
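
These three operations are easiest to see on a small joint table; a sketch with made-up numbers:

```python
# Illustrative joint distribution P(W, U) over weather and umbrella-carrying
joint = {('rain', 'yes'): 0.3, ('rain', 'no'): 0.1,
         ('sun',  'yes'): 0.1, ('sun',  'no'): 0.5}

# Marginal: P(W = rain) = sum over u of P(rain, u)
p_rain = sum(p for (w, u), p in joint.items() if w == 'rain')  # 0.4

# Conditional: P(U = yes | W = rain) = P(rain, yes) / P(rain)
p_yes_given_rain = joint[('rain', 'yes')] / p_rain             # 0.75

print(p_rain, p_yes_given_rain)
```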

Joint Probability Slide from Sam Roweis (MLSS, 2005)

Marginal Probabilities Slide from Sam Roweis (MLSS, 2005)

Conditional Probability Slide from Sam Roweis (MLSS, 2005)

Independence and Conditional Independence Slide from Sam Roweis (MLSS, 2005)

A practical problem
I have two d20 dice: one loaded, one standard. The loaded die gives a 19 or 20 (a "critical hit") half the time; the standard die does so with probability 0.1. In the game, someone hands me one of the two dice at random, and it is fair (A) with P(A) = 0.5. Then I roll, and either get a critical hit (B) or not (~B). What is P(B)?
P(B) = P(B and A) + P(B and ~A) = 0.1*0.5 + 0.5*0.5 = 0.3
Slide from William Cohen (10-601B, Spring 2016)

A practical problem: "mixture model"
I have lots of standard d20 dice and lots of loaded dice, all identical-looking. The loaded dice give a 19 or 20 ("critical hit") half the time. In the game, someone hands me a random die, which is fair (A) or loaded (~A), with P(A) depending on how I mix the dice. Then I roll, and either get a critical hit (B) or not (~B). Can I mix the dice together so that P(B) is anything I want, say P(B) = 0.137?
P(B) = P(B and A) + P(B and ~A) = 0.1*λ + 0.5*(1 - λ) = 0.137
⇒ λ = (0.5 - 0.137)/0.4 = 0.9075
Mixtures let us build more complex models from simpler ones.
Slide from William Cohen (10-601B, Spring 2016)
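
A simulation sketch of this two-stage mixture (draw a die, then roll it), using the λ computed above:

```python
import random

LAM = 0.9075                           # fraction of fair dice in the mix
P_CRIT = {'fair': 0.1, 'loaded': 0.5}  # P(critical hit | die type)

def roll_is_crit():
    # Stage 1: draw a die from the mixture. Stage 2: roll it.
    die = 'fair' if random.random() < LAM else 'loaded'
    return random.random() < P_CRIT[die]

n = 1_000_000
print(sum(roll_is_crit() for _ in range(n)) / n)  # close to 0.137
```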

Another picture for this problem
It's more convenient to say "if you've picked a fair die then …", i.e. Pr(critical hit | fair die) = 0.1, and "if you've picked the loaded die then …", Pr(critical hit | loaded die) = 0.5.
Conditional probability: Pr(B|A) = P(B ∧ A) / P(A)
[Venn diagram: regions A (fair die) and ~A (loaded), with sub-regions "A and B" and "~A and B"]
Slide from William Cohen (10-601B, Spring 2016)

Definition of Conditional Probability
P(A|B) = P(A ∧ B) / P(B)
Corollary, the Chain Rule: P(A ∧ B) = P(A|B) P(B)
Slide from William Cohen (10-601B, Spring 2016)

Some practical problems
I have 3 standard d20 dice and 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "roll 19 or 20 with that die". What is P(B)?
A. P(B) = P(B|A) P(A) + P(B|~A) P(~A)
B. P(B) = P(A|B) P(B) + P(~A|B) P(B)
C. P(B) = P(B, A) P(A) + P(B, ~A) P(~A)
D. P(B) = P(B, A) / P(A) + P(B, ~A) / P(~A)
Answer: A, giving P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
Slide from William Cohen (10-601B, Spring 2016)

"Marginalizing out" A
P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
[Venn diagram: P(A) and P(~A), with regions A (fair die), ~A (loaded), "A and B", and "~A and B"]
Slide from William Cohen (10-601B, Spring 2016)

Bayes’ Rule

Some practical problems
I have 3 standard d20 dice and 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "roll 19 or 20 with that die". Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair, i.e. what is P(A|B)?
Slide from William Cohen (10-601B, Spring 2016)

P(A|B) = ?
P(A and B) = P(A|B) * P(B)
P(A and B) = P(B|A) * P(A)
⇒ P(A|B) * P(B) = P(B|A) * P(A)
⇒ P(A|B) = P(B|A) * P(A) / P(B)
[Venn diagram: P(A) and P(~A), with regions A (fair die), ~A (loaded), "A and B", and "~A and B"]
Slide from William Cohen (10-601B, Spring 2016)

Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), where P(A|B) is the posterior and P(A) is the prior. Equivalently, P(B|A) = P(A|B) * P(B) / P(A).
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
"…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…"
Slide from William Cohen (10-601B, Spring 2016)
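
Applying Bayes' rule to the running dice problem, a short sketch:

```python
p_fair = 0.75              # prior: 3 of the 4 dice are fair
p_crit_given_fair = 0.1    # P(B | A)
p_crit_given_loaded = 0.5  # P(B | ~A)

# Evidence by marginalization: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_crit = p_crit_given_fair * p_fair + p_crit_given_loaded * (1 - p_fair)  # 0.2

# Posterior by Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_fair_given_crit = p_crit_given_fair * p_fair / p_crit
print(p_fair_given_crit)  # 0.375
```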

Common Probability Distributions

Common Probability Distributions
For discrete random variables: Bernoulli, Binomial, Multinomial, Categorical, Poisson
For continuous random variables: Exponential, Gamma, Beta, Dirichlet, Laplace, Gaussian (1D), Multivariate Gaussian

Common Probability Distributions: Beta Distribution
Probability density function: f(x; α, β) = x^(α-1) (1-x)^(β-1) / B(α, β), where B(α, β) is the Beta function.
Support: values between 0 and 1. The Beta is the 2-dimensional version of the Dirichlet. Compare to the Normal/Gaussian: the Beta is symmetric when α = β, and for α, β < 1 it bends up at the boundaries.
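
A quick numeric sketch of the Beta using SciPy (the parameters are chosen arbitrarily):

```python
from scipy import stats

beta = stats.beta(a=2, b=5)
print(beta.pdf(0.2))  # density (relative likelihood) at x = 0.2
print(beta.cdf(0.5))  # P(X <= 0.5)
print(beta.mean())    # alpha / (alpha + beta) = 2/7
```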

Common Probability Distributions: Dirichlet Distribution
Probability density function: f(x; α) = (1 / B(α)) ∏_{k=1}^{K} x_k^(α_k - 1), where x lies on the probability simplex (x_k ≥ 0 and Σ_k x_k = 1) and B(α) is the multivariate Beta function. The pdf has the same form as the Beta; the Beta is just the K = 2 case.

Common Probability Distributions: Dirichlet Distribution
[Figure: Dirichlet densities over the 3-dimensional probability simplex. Left: α = (3, 3, 3). Right: α = (0.9, 0.9, 0.9). Compare to the Gaussian.]
A draw from a Dirichlet gives the parameters of a multinomial; note the triangle/simplex and the sum-to-one constraint.
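
A sampling sketch showing that Dirichlet draws land on the simplex:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
draws = rng.dirichlet(alpha=[3.0, 3.0, 3.0], size=5)

print(draws)              # each row is a point on the 3-simplex
print(draws.sum(axis=1))  # rows sum to 1: valid multinomial parameters
```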

Recap of Decision Trees and ID3

Oh, the Places You'll Use Probability!

Oh, the Places You'll Use Probability!
Supervised classification: Naïve Bayes (next lecture), logistic regression (the week after).
Note: this is just motivation; we'll cover these topics later!

Oh, the Places You'll Use Probability!
ML theory (example: sample complexity).
Note: this is just motivation; we'll cover these topics later!

Oh, the Places You'll Use Probability!
Deep learning (example: deep bi-directional RNN).
[Figure: a bi-directional RNN with inputs x1…x4, forward and backward hidden states h1…h4, and outputs y1…y4]
Note: this is just motivation; we'll cover these topics later!

Oh, the Places You'll Use Probability!
Graphical models (examples: Hidden Markov Model (HMM), Conditional Random Field (CRF)).
[Figure: a factor graph over the sentence "time flies like an arrow" with tag sequence <START>, n, v, p, d, n and factors ψ0 through ψ9]
Note: this is just motivation; we'll cover these topics later!

Oh, the Places You'll Use Probability! Matt's Uncertain Week
Quiz: which of the following happened to Matt this past week?
- Matt got Millie's ice cream during a power outage
- Matt's basement flooded with water
- A car flipped on its side in front of Matt's house
- A tree branch fell through the windshield of Matt's car
These are just silly examples from the real world. Certainly machine learning doesn't need to worry about modeling events like these… right?

Important Topics NOT Covered Today
- Statistics: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, significance testing
- Probability: central limit theorem, Monte Carlo approximation
- Other topics: calculus in multiple dimensions, linear algebra, continuous optimization
Some of these are covered next lecture.

Summary
1. Probability theory is rooted in (simple) axioms.
2. Random variables provide an important tool for modeling the world.
3. Our favorite probability distributions are just functions (usually with interesting properties)!
4. Probability and statistics are essential to machine learning.