Probability for Machine Learning
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Probabilistic Machine Learning
- Not all machine learning models are probabilistic, but most of them have probabilistic interpretations.
- Predictions need to have an associated confidence; confidence = probability.
- Arguments for the probabilistic approach:
  - A complete framework for machine learning
  - Makes assumptions explicit
  - Recovers most non-probabilistic models as special cases
  - Modular: easily extensible
References
- Sheldon Ross, "Introduction to Probability Models"
- Sheldon Ross, "Introduction to Probability and Statistics for Engineers and Scientists"
- Dimitri P. Bertsekas and John N. Tsitsiklis, "Introduction to Probability"
Basics
- Random experiment $E$, outcome $\omega \in \Omega$, events $\mathcal{F}$, sample space $(\Omega, \mathcal{F})$
- Probability measure $P: \mathcal{F} \to \mathbb{R}$
- Axioms of probability, basic laws of probability
- Discrete sample space, discrete probability measure
- Continuous sample space, continuous probability measure
- Conditional probability, multiplicative rule, theorem of total probability, Bayes theorem
- Independence: pairwise, mutual, and conditional independence
Random Variables
- A random variable is a mapping $X: \Omega \to \mathbb{R}$.
- Example. Experiment: toss two coins, scoring 1 for heads and 0 for tails. Random variable: $X$ = sum of the two scores.
  $\{X = 2\} \equiv \{\omega : \text{sum of scores} = 2\} = \{(1,1)\}$
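A minimal sketch of this example (not from the slides): enumerate the sample space of the two-coin experiment and read off the pmf of the sum.

```python
# Two fair coins, scored 1 for heads and 0 for tails; X(omega) = sum of scores.
from itertools import product
from collections import Counter

omega = list(product([0, 1], repeat=2))        # sample space: (0,0), (0,1), (1,0), (1,1)
X = {w: sum(w) for w in omega}                 # the random variable X(omega)

# Each outcome is equally likely (fair coins assumed), so the pmf of X is
counts = Counter(X.values())
pmf = {x: c / len(omega) for x, c in counts.items()}
print(pmf)   # {0: 0.25, 1: 0.5, 2: 0.25}; P(X = 2) = P({(1,1)}) = 1/4
```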
Discrete Random Variables
- Probability mass function: $p(x) = P(X = x)$, with $p(x) \ge 0$ and $\sum_x p(x) = 1$
Example distributions: Discrete
- Bernoulli: $x \sim \mathrm{Ber}(p)$, $x \in \{0,1\}$: $p(x) = p^x (1-p)^{1-x}$
- Binomial: $x \sim \mathrm{Bin}(n,p)$, $x \in \{0,\dots,n\}$: $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$
- Poisson: $x \sim \mathrm{Poisson}(\lambda)$, $x \in \{0,1,\dots\}$: $p(x) = \frac{e^{-\lambda}\lambda^x}{x!}$
- Geometric: $x \sim \mathrm{Geo}(p)$, $x \in \{1,2,\dots\}$: $p(x) = (1-p)^{x-1} p$
- Empirical distribution: given $D = \{x_1,\dots,x_N\}$, $p_{\mathrm{emp}}(A) = \frac{1}{N}\sum_i \delta_{x_i}(A)$, where $\delta_{x_i}(A)$ is the Dirac measure
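A minimal sketch (assumed, not from the slides) that checks the pmf formulas above against scipy.stats at a single point each; the parameter values are arbitrary illustrations.

```python
# Compare the closed-form discrete pmfs to scipy.stats implementations.
from math import comb, exp, factorial
from scipy import stats

p, n, lam = 0.3, 10, 2.5

x = 1
print(stats.bernoulli.pmf(x, p), p**x * (1 - p)**(1 - x))             # Bernoulli
x = 4
print(stats.binom.pmf(x, n, p), comb(n, x) * p**x * (1 - p)**(n - x)) # Binomial
x = 3
print(stats.poisson.pmf(x, lam), exp(-lam) * lam**x / factorial(x))   # Poisson
x = 5
print(stats.geom.pmf(x, p), (1 - p)**(x - 1) * p)                     # Geometric
```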
Continuous Random Variables
- Probability density function: $f(x) \ge 0$, $\int f(x)\,dx = 1$, and $P(a \le X \le b) = \int_a^b f(x)\,dx$
Example density functions
- Uniform: $x \sim U(a,b)$: $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$
- Exponential: $x \sim \mathrm{Exp}(\lambda)$: $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
- Standard normal: $x \sim N(0,1)$: $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
- Gaussian: $x \sim N(\mu,\sigma^2)$: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/2\sigma^2}$
- Laplace: $x \sim \mathrm{Lap}(\mu,b)$: $f(x) = \frac{1}{2b} e^{-|x-\mu|/b}$
- Gamma: $x \sim \mathrm{Gam}(\alpha,\beta)$: $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$
- Beta: $x \sim \mathrm{Beta}(\alpha,\beta)$: $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$
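A minimal sketch (assumed, not from the slides) that evaluates a few of these densities from the formulas and compares them to scipy.stats; scipy's Gamma is parameterized by shape and scale, so the rate $\beta$ maps to scale $= 1/\beta$.

```python
# Compare closed-form densities to scipy.stats at a single point x.
from math import gamma, exp, sqrt, pi
from scipy import stats

x, alpha, beta = 0.4, 2.0, 5.0

# Gamma with shape alpha and rate beta.
manual = beta**alpha / gamma(alpha) * x**(alpha - 1) * exp(-beta * x)
print(manual, stats.gamma.pdf(x, a=alpha, scale=1.0 / beta))

# Beta(alpha, beta) on (0, 1).
manual = gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) * x**(alpha - 1) * (1 - x)**(beta - 1)
print(manual, stats.beta.pdf(x, alpha, beta))

# Gaussian N(mu, sigma^2).
mu, sigma = 1.0, 2.0
manual = 1.0 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))
print(manual, stats.norm.pdf(x, loc=mu, scale=sigma))
```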
Random Variables
- Cumulative distribution function: $F(x) = P(X \le x)$; for a continuous random variable, $F(x) = \int_{-\infty}^{x} f(t)\,dt$
Moments
- Mean: $E[X] = \int x f(x)\,dx$
- Variance: $\mathrm{Var}(X) = E[(X - E[X])^2]$
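A minimal sketch (assumed, not from the slides): Monte Carlo estimates of the mean and variance of an Exponential($\lambda$) variable, compared to the closed-form moments $E[X] = 1/\lambda$ and $\mathrm{Var}(X) = 1/\lambda^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1.0 / lam, size=100_000)

print(x.mean(), 1.0 / lam)       # sample mean vs E[X]
print(x.var(), 1.0 / lam**2)     # sample variance vs Var(X)
```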
Random Vectors and Joint Distributions
- Discrete random vector: joint pmf $p(x_1,\dots,x_k) = P(X_1 = x_1,\dots,X_k = x_k)$
- Continuous random vector: joint pdf $f(x_1,\dots,x_k)$
Example multivariate distributions
- Multivariate Gaussian: $x \sim N(\mu,\Sigma)$: $f(x) = (2\pi)^{-k/2}\,|\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
- Multinomial: $x \sim \mathrm{Mult}(n, p_1,\dots,p_k)$: $f(x_1,\dots,x_k) = \frac{n!}{x_1!\cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$
- Dirichlet: $x \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_k)$: $f(x_1,\dots,x_k) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1}$
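A minimal sketch (assumed, not from the slides): the multivariate Gaussian density evaluated directly from the formula above and via scipy.stats, plus a Dirichlet density at one point of the simplex; the particular $\mu$, $\Sigma$, and $\alpha$ values are arbitrary.

```python
import numpy as np
from scipy import stats

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])
k = len(mu)

# Density from the formula: (2*pi)^(-k/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
diff = x - mu
manual = (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * \
         np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
print(manual, stats.multivariate_normal.pdf(x, mean=mu, cov=Sigma))

# Dirichlet density at a point on the probability simplex.
alpha = np.array([2.0, 3.0, 4.0])
p = np.array([0.2, 0.3, 0.5])
print(stats.dirichlet.pdf(p, alpha))
```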
Random Vectors and Joint Distributions
Given $f(x_1,\dots,x_k)$:
- Marginal distribution: $f_{X_1}(x_1) = \int_{x_2} \int_{x_3} \cdots f(x_1,\dots,x_k)\, dx_2\, dx_3 \cdots$
- Expectation: $E[X] = \int_{x_1} \int_{x_2} \cdots (x_1,\dots,x_k)\, f(x_1,\dots,x_k)\, dx_1\, dx_2 \cdots$
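A minimal sketch (assumed, not from the slides) of the same operations in the discrete case: marginals and expectations computed from a small joint pmf table, with the integrals replaced by sums.

```python
import numpy as np

# Rows index X1 in {0, 1}, columns index X2 in {0, 1, 2}; entries sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
x1_vals = np.array([0, 1])
x2_vals = np.array([0, 1, 2])

p_x1 = joint.sum(axis=1)        # marginal of X1: sum over x2
p_x2 = joint.sum(axis=0)        # marginal of X2: sum over x1
E_x1 = (x1_vals * p_x1).sum()   # E[X1] from the marginal of X1
E_x2 = (x2_vals * p_x2).sum()   # E[X2] from the marginal of X2
print(p_x1, p_x2, E_x1, E_x2)
```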
Conditional Probability
- Conditional pmf / conditional pdf: given $f_{X_1 X_2}(x_1, x_2)$,
  $f_{X_1 \mid X_2}(x_1 \mid x_2) = \frac{f_{X_1 X_2}(x_1, x_2)}{f_{X_2}(x_2)}$
- Multiplication rule: $f_{X_1 X_2}(x_1, x_2) = f_{X_1 \mid X_2}(x_1 \mid x_2)\, f_{X_2}(x_2)$
- Bayes rule:
  $f_{X_1 \mid X_2}(x_1 \mid x_2) = \frac{f_{X_2 \mid X_1}(x_2 \mid x_1)\, f_{X_1}(x_1)}{\int_{x_1} f_{X_2 \mid X_1}(x_2 \mid x_1)\, f_{X_1}(x_1)\, dx_1}$
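A minimal sketch (assumed, not from the slides): Bayes rule on a small discrete example, where $X_1$ is a latent class and $X_2$ an observation; the prior and likelihood tables are arbitrary illustrations.

```python
import numpy as np

p_x1 = np.array([0.7, 0.3])                  # prior f(x1) over x1 in {0, 1}
p_x2_given_x1 = np.array([[0.9, 0.1],        # rows: x1, cols: x2 -> f(x2 | x1)
                          [0.2, 0.8]])

x2 = 1                                       # observed value of X2
numerator = p_x2_given_x1[:, x2] * p_x1      # f(x2 | x1) f(x1) for each x1
posterior = numerator / numerator.sum()      # normalize by the total probability of x2
print(posterior)                             # f(x1 | x2 = 1)
```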
Conditional Probability
Given $f_{X_1 X_2}(x_1, x_2)$:
- Conditional expectation: $E[X_1 \mid x_2] = \int x_1\, f_{X_1 \mid X_2}(x_1 \mid x_2)\, dx_1$
- Law of total expectation: $E[X_1] = \int E[X_1 \mid x_2]\, f_{X_2}(x_2)\, dx_2$
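A minimal sketch (assumed, not from the slides) checking the law of total expectation $E[X_1] = \sum_{x_2} E[X_1 \mid x_2]\, f(x_2)$ on the same kind of small discrete joint table.

```python
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],      # rows: x1 in {0, 1}, cols: x2 in {0, 1, 2}
                  [0.25, 0.15, 0.20]])
x1_vals = np.array([0, 1])

p_x2 = joint.sum(axis=0)                   # marginal f(x2)
cond_x1_given_x2 = joint / p_x2            # f(x1 | x2), each column normalized
E_x1_given_x2 = (x1_vals[:, None] * cond_x1_given_x2).sum(axis=0)

lhs = (x1_vals * joint.sum(axis=1)).sum()  # E[X1] directly from the marginal of X1
rhs = (E_x1_given_x2 * p_x2).sum()         # averaged conditional expectation
print(lhs, rhs)                            # the two agree
```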
Independence and Conditional Independence
- Independence: $f_{X_1 X_2}(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2)$
- Conditional independence: $f_{X_1 X_2 \mid X_3}(x_1, x_2 \mid x_3) = f_{X_1 \mid X_3}(x_1 \mid x_3)\, f_{X_2 \mid X_3}(x_2 \mid x_3)$
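A minimal sketch (assumed, not from the slides): testing independence of two discrete variables by comparing the joint pmf to the product of its marginals; both example tables below are made up for illustration and share the same marginals.

```python
import numpy as np

# An independent joint: the outer product of two marginals.
p_x1 = np.array([0.4, 0.6])
p_x2 = np.array([0.1, 0.3, 0.6])
joint_indep = np.outer(p_x1, p_x2)

# A dependent joint with the same marginals, which does not factor.
joint_dep = np.array([[0.10, 0.05, 0.25],
                      [0.00, 0.25, 0.35]])

for joint in (joint_indep, joint_dep):
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    print(np.allclose(joint, product))   # True for the first table, False for the second
```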
Covariance
- Covariance: $\mathrm{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])]$
- Correlation coefficient: $\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}$
- Covariance matrix for a random vector $X$: $\mathrm{Cov}(X) = E[(X - E[X])(X - E[X])^T]$
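A minimal sketch (assumed, not from the slides): the sample covariance and correlation matrices of draws from a correlated bivariate Gaussian recover the parameters used to generate them.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
samples = rng.multivariate_normal(mu, Sigma, size=100_000)

print(np.cov(samples, rowvar=False))        # approximately Sigma
print(np.corrcoef(samples, rowvar=False))   # off-diagonal approx Cov(X, Y) / (sd(X) sd(Y))
```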
Central Limit Theorem
- $N$ i.i.d. random variables $X_i$ with mean $\mu$ and variance $\sigma^2$
- $S_N = \sum_i X_i$, $\qquad Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}}$
- As $N$ increases, the distribution of $Z_N$ approaches the standard normal distribution
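A minimal sketch (assumed, not from the slides): simulate $Z_N$ for sums of exponential variables and compare the empirical probabilities $P(Z_N \le z)$ to the standard normal CDF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam, N, trials = 2.0, 200, 20_000
mu, sigma = 1.0 / lam, 1.0 / lam              # mean and std of Exp(lambda)

X = rng.exponential(scale=1.0 / lam, size=(trials, N))
S = X.sum(axis=1)
Z = (S - N * mu) / (sigma * np.sqrt(N))       # standardized sums

for z in (-1.0, 0.0, 1.0):
    print((Z <= z).mean(), stats.norm.cdf(z)) # empirical vs N(0, 1) probabilities
```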
Notions from Information Theory
- Entropy: $H(X) = -\sum_k P(X = k) \log_2 P(X = k)$
- KL divergence: $KL(p \,\|\, q) = \sum_k p(k) \log \frac{p(k)}{q(k)}$
- Mutual information: $I(X, Y) = KL\big(p(X,Y) \,\|\, p(X)p(Y)\big) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
- Pointwise mutual information: $PMI(x, y) = \log \frac{p(x,y)}{p(x)\,p(y)}$
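A minimal sketch (assumed, not from the slides): entropy, KL divergence, and mutual information computed directly from small discrete distributions; the example pmfs are arbitrary.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

entropy = -(p * np.log2(p)).sum()            # H(p), in bits
kl = (p * np.log(p / q)).sum()               # KL(p || q), in nats
print(entropy, kl)

# Mutual information of a joint pmf: I(X, Y) = KL(p(X, Y) || p(X) p(Y)).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
mi = (joint * np.log(joint / indep)).sum()
print(mi)
```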
Jensen's Inequality
- For a convex function $f(\cdot)$ and a random variable $X$: $f(E[X]) \le E[f(X)]$
- Equality holds if $f(x)$ is linear
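A minimal sketch (assumed, not from the slides): a numerical check of Jensen's inequality $f(E[X]) \le E[f(X)]$ for the convex function $f(x) = e^x$ on samples from a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

lhs = np.exp(x.mean())          # f(E[X])
rhs = np.exp(x).mean()          # E[f(X)]; for X ~ N(0, 1) this approaches exp(1/2)
print(lhs, rhs, lhs <= rhs)
```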