Probabilistic Models for Linear Regression
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Regression Problem
- $N$ i.i.d. training samples $\{x_n, y_n\}$
- Response / output / target: $y_n \in \mathbb{R}$
- Input / feature vector: $x_n \in \mathbb{R}^d$
- Linear regression: $y_n = w^T x_n + \epsilon_n$
- Polynomial regression: $y_n = w^T \phi(x_n) + \epsilon_n$, with $\phi_j(x) = x^j$
- Still a linear function of $w$
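As a minimal sketch (NumPy, with made-up weights and a made-up input, purely for illustration), the polynomial case simply replaces the raw input with a feature vector $\phi(x)$; the prediction is still linear in $w$:

```python
import numpy as np

def polynomial_features(x, degree):
    """Feature map phi(x) = (1, x, x^2, ..., x^degree) for a scalar input x."""
    return np.array([x ** j for j in range(degree + 1)])

# Made-up weights and input, purely for illustration
w = np.array([0.5, -1.0, 2.0])           # weights of a degree-2 polynomial model
x = 1.5
y_hat = w @ polynomial_features(x, 2)     # prediction is still linear in w
print(y_hat)
```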
Least Squares Formulation
- Deterministic error term $\epsilon_n = y_n - w^T x_n$
- Minimize total error $E(w) = \sum_n \epsilon_n^2$
- $w^* = \arg\min_w E(w)$
- Take the gradient with respect to $w$ and set it to zero: $w^* = (X^T X)^{-1} X^T y$
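A small sketch of the closed-form solution, assuming a dense $(N, d)$ design matrix $X$ and response vector $y$ (the toy data below is made up for illustration):

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least-squares estimate w* = (X^T X)^{-1} X^T y.

    X is the (N, d) design matrix and y the (N,) response vector.
    np.linalg.lstsq is used instead of an explicit inverse for numerical stability.
    """
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Toy data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)
print(least_squares(X, y))   # should be close to true_w
```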
Regularization for Regression
- How does regression overfit?
- Add a regularization term to the regression objective: minimize $E_1(w, D) + \lambda E_2(w)$, where $E_1$ measures fit to the data $D$ and $E_2$ penalizes model complexity
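A sketch of this objective with one concrete choice of terms (squared-error data fit and an $\ell_2$ penalty, i.e. the ridge case from the next slide); the slide itself leaves $E_1$ and $E_2$ generic:

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """E_1(w, D) + lambda * E_2(w), with squared-error data fit and an l2 penalty.

    One concrete instance; the slide leaves E_1 and E_2 generic.
    """
    data_fit = np.sum((y - X @ w) ** 2)   # E_1(w, D): fit to the data
    penalty = w @ w                        # E_2(w) = w^T w: complexity penalty
    return data_fit + lam * penalty
```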
Regularization for Regression
- Possibilities for regularizers:
  - $\ell_2$ norm $w^T w$ (ridge regression): quadratic, continuous, convex; closed-form solution $w^* = (\lambda I + X^T X)^{-1} X^T y$
  - $\ell_1$ norm (lasso)
- Choosing $\lambda$: cross-validation (wastes training data), …
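A minimal sketch of the ridge closed form; solving the linear system is used here instead of forming the inverse explicitly:

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Ridge closed form: w* = (lambda * I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the linear system rather than forming the inverse explicitly
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```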
Probabilistic Formulation
- Model $X$ and $Y$ as random variables; directly model the conditional distribution of $Y$
- IID: $Y_n \mid X_n = x \sim_{iid} p(y \mid x)$
- Linear: $Y_n = w^T X_n + \epsilon_n$, with $\epsilon_n \sim_{iid} p(\epsilon)$
- Gaussian noise: $p(\epsilon) = N(0, \sigma^2)$, so $p(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y - w^T x)^2}{2\sigma^2}\right\}$
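A short sketch of this conditional density in log form (function name and arguments are illustrative, not from the slide):

```python
import numpy as np

def gaussian_log_density(y, x, w, sigma):
    """log p(y | x) under the model y = w^T x + eps, eps ~ N(0, sigma^2)."""
    mean = w @ x
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (y - mean) ** 2 / (2 * sigma ** 2)
```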
Probabilistic Formulation
[Figure: image from Michael Jordan's book]
Maximum Likelihood Estimation
- Formulate the likelihood: $L(w) = \prod_n p(y_n \mid x_n; w) = \left(\frac{1}{2\pi\sigma^2}\right)^{N/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_n (y_n - w^T x_n)^2\right\}$
- Up to constants, maximizing the log-likelihood amounts to minimizing $\ell(w) = \sum_n (y_n - w^T x_n)^2$
- Recovers the LMS formulation!
- Maximize to get the MLE: $w_{ML} = (X^T X)^{-1} X^T y$, $\quad \sigma^2_{ML} = \frac{1}{N}\sum_n (y_n - w_{ML}^T x_n)^2$
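A sketch of both maximum-likelihood estimates, reusing the least-squares solver for $w_{ML}$ (the function name is illustrative):

```python
import numpy as np

def fit_mle(X, y):
    """Maximum-likelihood estimates under the Gaussian noise model.

    w_ML      = (X^T X)^{-1} X^T y       (identical to least squares)
    sigma2_ML = (1/N) sum_n (y_n - w_ML^T x_n)^2
    """
    w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ w_ml
    return w_ml, np.mean(residuals ** 2)
```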
Bayesian Linear Regression
- Model $W$ as a random variable with prior distribution $p(w) = N(w \mid m_0, S_0)$, where $w, m_0$ are $M \times 1$ and $S_0$ is $M \times M$
- Derive the posterior distribution $p(w \mid y) = N(w \mid m_N, S_N)$ (for some $m_N$, $S_N$)
- Derive the mean of the posterior distribution: $w_B = E[W \mid y] = m_N$
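The slide does not write out $m_N$ and $S_N$; the sketch below uses the standard Gaussian-conjugacy expressions for them, assuming a known noise variance $\sigma^2$:

```python
import numpy as np

def posterior(X, y, m0, S0, sigma2):
    """Posterior p(w | y) = N(m_N, S_N) for the Gaussian prior N(m_0, S_0)
    and Gaussian noise with variance sigma^2 (standard conjugacy result):

      S_N^{-1} = S_0^{-1} + (1 / sigma^2) X^T X
      m_N      = S_N (S_0^{-1} m_0 + (1 / sigma^2) X^T y)
    """
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + (X.T @ X) / sigma2)
    mN = SN @ (S0_inv @ m0 + X.T @ y / sigma2)
    return mN, SN   # the point estimate w_B = E[W | y] is m_N
```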
Iterative Solutions for Normal Equations
- Direct solutions have limitations
- Iterative solutions
- First-order method: gradient descent
  $w^{(t+1)} \leftarrow w^{(t)} + \rho \sum_n \left(y_n - w^{(t)T} x_n\right) x_n$
- Convergence guarantees:
  - Convergence in probability to the correct solution for an appropriate fixed step size
  - Sure convergence with decreasing step sizes
- Stochastic gradient descent: update based on a single data point at each step; often converges faster
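A sketch of both variants; the step size $\rho$ and iteration counts are arbitrary illustrative choices, and a fixed step size is used even though the slide notes that decreasing step sizes give the stronger guarantee:

```python
import numpy as np

def gradient_descent(X, y, rho=0.01, steps=1000):
    """Batch update: w <- w + rho * sum_n (y_n - w^T x_n) x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w + rho * X.T @ (y - X @ w)
    return w

def sgd(X, y, rho=0.01, epochs=10, seed=0):
    """Stochastic variant: update from a single data point at each step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            w = w + rho * (y[n] - w @ X[n]) * X[n]
    return w
```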
Advantages of Probabilistic Modeling
- Makes assumptions explicit
- Modularity: conceptually simple to change a model by replacing its components with appropriate distributions
Summary
- Probabilistic formulation of linear regression
- Recovers the least squares formulation
- Iterative algorithms for training
- Forms of regularization