Introduction to Machine Learning: Parametric Methods
Parametric Estimation
$\mathcal{X} = \{x^t\}_{t=1}^N$ where $x^t \sim p(x)$
Parametric estimation: assume a form for $p(x|\theta)$ and estimate $\theta$, its sufficient statistics, using $\mathcal{X}$.
Example: $p(x) = \mathcal{N}(\mu, \sigma^2)$ where $\theta = \{\mu, \sigma^2\}$
Maximum Likelihood Estimation
Likelihood of $\theta$ given the sample $\mathcal{X}$: $l(\theta|\mathcal{X}) = p(\mathcal{X}|\theta) = \prod_t p(x^t|\theta)$
Log likelihood: $\mathcal{L}(\theta|\mathcal{X}) = \log l(\theta|\mathcal{X}) = \sum_t \log p(x^t|\theta)$
Maximum likelihood estimator: $\theta^* = \arg\max_\theta \mathcal{L}(\theta|\mathcal{X})$
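Not part of the original slides: a minimal numerical MLE sketch, assuming a Gaussian model and using SciPy's general-purpose optimizer; the names `neg_log_likelihood`, `mu_hat`, and `sigma_hat` are illustrative.

```python
# Sketch: numerical maximum likelihood for a Gaussian, maximizing
# sum_t log p(x^t | theta) over theta = (mu, log_sigma).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)      # sample X = {x^t}

def neg_log_likelihood(theta, x):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                     # parameterize so sigma > 0
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(X,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the closed-form MLE: X.mean(), X.std()
```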
Examples: Bernoulli / Multinomial
Bernoulli: two states, failure/success, $x \in \{0, 1\}$: $P(x) = p_0^x (1 - p_0)^{1-x}$
$\mathcal{L}(p_0|\mathcal{X}) = \log \prod_t p_0^{x^t} (1 - p_0)^{1 - x^t}$
MLE: $\hat{p}_0 = \sum_t x^t / N$
Multinomial: $K > 2$ states, $x_i \in \{0, 1\}$: $P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$
$\mathcal{L}(p_1, p_2, \ldots, p_K|\mathcal{X}) = \log \prod_t \prod_i p_i^{x_i^t}$
MLE: $\hat{p}_i = \sum_t x_i^t / N$
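A small NumPy sketch of these closed-form MLEs (not from the slides); the simulated data and the names `p0_hat`, `p_hat` are illustrative.

```python
# Sketch: closed-form MLEs for Bernoulli and multinomial samples.
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: p0_hat = sum_t x^t / N
x = rng.binomial(n=1, p=0.3, size=1000)
p0_hat = x.mean()

# Multinomial with K states and one-hot rows x_i^t: p_i_hat = sum_t x_i^t / N
K, N = 4, 1000
labels = rng.integers(0, K, size=N)
X = np.eye(K)[labels]        # N x K one-hot indicator matrix
p_hat = X.mean(axis=0)       # relative frequency of each state

print(p0_hat, p_hat)
```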
Gaussian (Normal) Distribution
$p(x) = \mathcal{N}(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$
MLE for $\mu$ and $\sigma^2$: $\quad m = \frac{1}{N}\sum_t x^t$, $\quad s^2 = \frac{1}{N}\sum_t (x^t - m)^2$
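For illustration (not in the slides), the same estimates computed directly in NumPy; note the use of `ddof=0` so that the variance divides by $N$, as the MLE does.

```python
# Sketch: closed-form Gaussian MLE; s2 uses 1/N (ddof=0), not 1/(N-1).
import numpy as np

x = np.random.default_rng(2).normal(5.0, 2.0, size=500)
m = x.mean()          # m = (1/N) sum_t x^t
s2 = x.var(ddof=0)    # s^2 = (1/N) sum_t (x^t - m)^2, the (slightly biased) MLE
print(m, s2)
```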
Bias and Variance
Unknown parameter $\theta$; estimator $d_i = d(\mathcal{X}_i)$ on sample $\mathcal{X}_i$
Bias: $b_\theta(d) = E[d] - \theta$
Variance: $E\left[(d - E[d])^2\right]$
Mean square error: $r(d, \theta) = E\left[(d - \theta)^2\right] = (E[d] - \theta)^2 + E\left[(d - E[d])^2\right] = \text{Bias}^2 + \text{Variance}$
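A Monte Carlo sketch of this decomposition (assumed setup, not from the slides): the MLE variance estimator above is biased, and its empirical bias squared plus variance matches its empirical mean square error.

```python
# Sketch: check MSE = Bias^2 + Variance for d = (1/N) sum (x - m)^2.
import numpy as np

rng = np.random.default_rng(3)
theta = 4.0                      # true variance (sigma = 2)
N, trials = 20, 100_000

d = np.array([rng.normal(0.0, 2.0, N).var(ddof=0) for _ in range(trials)])
bias = d.mean() - theta
variance = d.var()
mse = ((d - theta) ** 2).mean()
print(bias**2 + variance, mse)   # agree up to Monte Carlo noise
```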
Bayes' Estimator
Treat $\theta$ as a random variable with prior $p(\theta)$
Bayes' rule: $p(\theta|\mathcal{X}) = \frac{p(\mathcal{X}|\theta)\, p(\theta)}{p(\mathcal{X})}$
Full Bayes: $p(x|\mathcal{X}) = \int p(x|\theta)\, p(\theta|\mathcal{X})\, d\theta$
Maximum a posteriori (MAP): $\theta_{MAP} = \arg\max_\theta p(\theta|\mathcal{X})$
Maximum likelihood (ML): $\theta_{ML} = \arg\max_\theta p(\mathcal{X}|\theta)$
Bayes' estimator: $\theta_{Bayes} = E[\theta|\mathcal{X}] = \int \theta\, p(\theta|\mathcal{X})\, d\theta$
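An illustrative comparison of the three estimators for a Bernoulli parameter, assuming a conjugate Beta(a, b) prior (the prior choice and hyperparameters are assumptions, not from the slides).

```python
# Sketch: ML, MAP, and Bayes estimates under a Beta(a, b) prior,
# where the posterior is Beta(a + k, b + N - k).
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.7, size=30)
k, N = x.sum(), len(x)
a, b = 2.0, 2.0                              # assumed prior hyperparameters

theta_ml = k / N                             # argmax_theta p(X|theta)
theta_map = (a + k - 1) / (a + b + N - 2)    # mode of the Beta posterior
theta_bayes = (a + k) / (a + b + N)          # posterior mean E[theta|X]
print(theta_ml, theta_map, theta_bayes)
```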
Parametric Classification
Discriminant: $g_i(x) = p(x|C_i)\, P(C_i)$, or equivalently $g_i(x) = \log p(x|C_i) + \log P(C_i)$
With Gaussian class likelihoods $p(x|C_i) = \mathcal{N}(\mu_i, \sigma_i^2)$:
$g_i(x) = -\frac{1}{2}\log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$
Parametric Classification (continued)
Given the sample $\mathcal{X} = \{x^t, r^t\}_t$, the ML estimates are
$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$, $\quad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}$, $\quad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$
and the discriminant becomes
$g_i(x) = -\log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$
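A minimal sketch of this classifier in NumPy (the data, and the helper names `fit` and `discriminant`, are illustrative assumptions): each class gets its own Gaussian fit by ML, and the class with the largest log-discriminant is chosen.

```python
# Sketch: one-dimensional Gaussian parametric classifier with per-class
# ML estimates and the log-discriminant above.
import numpy as np

def fit(x, y, n_classes):
    """Return prior, mean, and std estimates per class (labels 0..K-1)."""
    priors = np.array([(y == i).mean() for i in range(n_classes)])
    means = np.array([x[y == i].mean() for i in range(n_classes)])
    stds = np.array([x[y == i].std(ddof=0) for i in range(n_classes)])
    return priors, means, stds

def discriminant(x, priors, means, stds):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)
    return (-np.log(stds) - (x[:, None] - means) ** 2 / (2 * stds**2)
            + np.log(priors))

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(2, 1.5, 300)])
y = np.concatenate([np.zeros(200, int), np.ones(300, int)])
params = fit(x, y, 2)
pred = discriminant(x, *params).argmax(axis=1)  # pick class with largest g_i(x)
print((pred == y).mean())
```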
(a) and (b) for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.
Parametric Classification
(a) and (b) for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. In (c), the expected risks are shown for the two classes and for the reject option.
Regression
$r = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$
Estimator $g(x|\theta)$, so $p(r|x) \sim \mathcal{N}(g(x|\theta), \sigma^2)$
Regression: From Log Likelihood to Error
$\mathcal{L}(\theta|\mathcal{X}) = \log \prod_t p(x^t, r^t) = \sum_t \log p(r^t|x^t) + \sum_t \log p(x^t)$
Under Gaussian noise, maximizing the log likelihood is equivalent to minimizing the squared error
$E(\theta|\mathcal{X}) = \frac{1}{2} \sum_t \left[r^t - g(x^t|\theta)\right]^2$
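A worked expansion of that step, assuming $p(r|x) = \mathcal{N}(g(x|\theta), \sigma^2)$ with a fixed noise variance $\sigma^2$ (only the term that depends on $\theta$ matters for the maximization):

```latex
\begin{aligned}
\mathcal{L}(\theta \mid \mathcal{X})
  &= \sum_t \log \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\left[-\frac{\left(r^t - g(x^t\mid\theta)\right)^2}{2\sigma^2}\right] \\
  &= -N \log\!\left(\sqrt{2\pi}\,\sigma\right)
     - \frac{1}{2\sigma^2} \sum_t \left(r^t - g(x^t\mid\theta)\right)^2 .
\end{aligned}
```

The first term and the factor $1/\sigma^2$ do not depend on $\theta$, so maximizing $\mathcal{L}(\theta|\mathcal{X})$ is the same as minimizing $E(\theta|\mathcal{X})$.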
Linear Regression
$g(x|w_1, w_0) = w_1 x + w_0$
Minimizing $E(w_1, w_0|\mathcal{X}) = \frac{1}{2}\sum_t \left[r^t - (w_1 x^t + w_0)\right]^2$ gives the normal equations, with solution
$w_1 = \frac{\sum_t x^t r^t - N \bar{x} \bar{r}}{\sum_t (x^t)^2 - N \bar{x}^2}$, $\quad w_0 = \bar{r} - w_1 \bar{x}$
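A least-squares sketch in NumPy (the synthetic data is an assumption); solving the normal equations via `np.linalg.lstsq` on the design matrix gives the same $w_1$, $w_0$ as the closed form above.

```python
# Sketch: ordinary least squares for g(x|w1, w0) = w1*x + w0.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 100)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)   # noisy targets r^t = f(x^t) + eps

A = np.column_stack([x, np.ones_like(x)])     # design matrix rows [x^t, 1]
w1, w0 = np.linalg.lstsq(A, r, rcond=None)[0]
print(w1, w0)                                  # close to the true 2.0 and 1.0
```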
Polynomial Regression
$g(x|w_k, \ldots, w_1, w_0) = w_k x^k + \cdots + w_1 x + w_0$
In matrix form, with design matrix $\mathbf{D}$ (rows $[1, x^t, (x^t)^2, \ldots, (x^t)^k]$) and target vector $\mathbf{r}$:
$\mathbf{w} = (\mathbf{D}^T\mathbf{D})^{-1}\mathbf{D}^T\mathbf{r}$
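A sketch of the matrix solution (the helper name `poly_fit` and the sample function are illustrative): build the Vandermonde design matrix and solve the normal equations.

```python
# Sketch: order-k polynomial regression, w = (D^T D)^{-1} D^T r.
import numpy as np

def poly_fit(x, r, k):
    D = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
    return np.linalg.solve(D.T @ D, D.T @ r)   # normal equations

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, 50)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 50)
w = poly_fit(x, r, k=3)
print(w)                                        # coefficients w_0 ... w_3
```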
Other Error Measures
Square error: $E(\theta|\mathcal{X}) = \sum_t \left[r^t - g(x^t|\theta)\right]^2$
Relative square error: $E(\theta|\mathcal{X}) = \frac{\sum_t \left[r^t - g(x^t|\theta)\right]^2}{\sum_t \left[r^t - \bar{r}\right]^2}$
Absolute error: $E(\theta|\mathcal{X}) = \sum_t \left|r^t - g(x^t|\theta)\right|$
$\epsilon$-sensitive error: $E(\theta|\mathcal{X}) = \sum_t \mathbf{1}\!\left(\left|r^t - g(x^t|\theta)\right| > \epsilon\right)\left(\left|r^t - g(x^t|\theta)\right| - \epsilon\right)$
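The same four measures written out in NumPy (the function names and sample values are illustrative):

```python
# Sketch: the error measures from the slide, vectorized over r^t and g(x^t).
import numpy as np

def square_error(r, g):
    return np.sum((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps=0.1):
    d = np.abs(r - g)
    return np.sum((d > eps) * (d - eps))   # zero penalty inside the eps tube

r = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.1, 1.8, 3.3, 3.9])
print(square_error(r, g), relative_square_error(r, g),
      absolute_error(r, g), eps_sensitive_error(r, g))
```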
Bias and Variance
The expected squared error at $x$ decomposes into noise plus estimation error:
$E\left[(r - g(x))^2 \mid x\right] = \underbrace{E\left[(r - E[r|x])^2 \mid x\right]}_{\text{noise}} + \underbrace{\left(E[r|x] - g(x)\right)^2}_{\text{squared error}}$
Averaging the estimation error over samples $\mathcal{X}$ gives bias and variance:
$E_{\mathcal{X}}\left[\left(E[r|x] - g(x)\right)^2 \mid x\right] = \underbrace{\left(E[r|x] - E_{\mathcal{X}}[g(x)]\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{X}}\left[\left(g(x) - E_{\mathcal{X}}[g(x)]\right)^2\right]}_{\text{variance}}$
Estimating Bias and Variance
$M$ samples $\mathcal{X}_i = \{x^t_i, r^t_i\}$, $i = 1, \ldots, M$, are used to fit $g_i(x)$, $i = 1, \ldots, M$
$\bar{g}(x) = \frac{1}{M}\sum_i g_i(x)$
$\text{Bias}^2(g) = \frac{1}{N}\sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2$, $\quad \text{Variance}(g) = \frac{1}{NM}\sum_t \sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$
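A simulation sketch of these estimates for polynomial fits to $f(x) = 2\sin(1.5x)$ (the evaluation grid, $M$, $N$, and the helper name `bias2_variance` are assumptions chosen to mirror the figure setup):

```python
# Sketch: estimate Bias^2 and Variance of order-k polynomial fits,
# using M datasets of N noisy instances each.
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: 2 * np.sin(1.5 * x)
x_eval = np.linspace(0, 5, 41)
M, N = 100, 20

def bias2_variance(order):
    fits = []
    for _ in range(M):
        x = rng.uniform(0, 5, N)
        r = f(x) + rng.normal(0, 1, N)        # noise ~ N(0, 1)
        w = np.polyfit(x, r, order)
        fits.append(np.polyval(w, x_eval))
    fits = np.array(fits)                     # shape M x len(x_eval)
    g_bar = fits.mean(axis=0)
    bias2 = np.mean((g_bar - f(x_eval)) ** 2)
    variance = np.mean((fits - g_bar) ** 2)
    return bias2, variance

for k in range(1, 6):
    print(k, bias2_variance(k))   # bias falls and variance grows with order
```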
Bias/Variance Dilemma
Example: $g_i(x) = 2$ has no variance and high bias
$g_i(x) = \frac{1}{N}\sum_t r^t_i$ has lower bias, but with variance
As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data)
Bias/Variance Dilemma
(a) Function $f(x) = 2\sin(1.5x)$ and one noisy ($\mathcal{N}(0, 1)$) dataset sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), and (d) show five polynomial fits $g_i(\cdot)$ of order 1, 3, and 5, respectively; in each case, the dotted line is the average of the five fits, $\bar{g}(\cdot)$.
Polynomial Regression
The best fit is the one that minimizes the error. In the same setting as the previous figure, but using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5.
Model Selection
Cross-validation: measure generalization accuracy by testing on data unused during training
Regularization: penalize complex models, $E = \text{error on data} + \lambda \cdot \text{model complexity}$
Akaike's information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, the shortest description of the data
Structural risk minimization (SRM)
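A cross-validation sketch for choosing the polynomial order (the fold count, data, and helper name `cv_error` are assumptions): the order with the lowest held-out error is selected.

```python
# Sketch: K-fold cross-validation over polynomial order, scoring each order
# by mean squared error on the held-out folds.
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 5, 100)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 100)

def cv_error(x, r, order, folds=5):
    idx = rng.permutation(len(x))
    splits = np.array_split(idx, folds)
    errs = []
    for k in range(folds):
        test = splits[k]
        train = np.concatenate([splits[j] for j in range(folds) if j != k])
        w = np.polyfit(x[train], r[train], order)
        errs.append(np.mean((r[test] - np.polyval(w, x[test])) ** 2))
    return np.mean(errs)

for order in range(1, 8):
    print(order, cv_error(x, r, order))   # validation error shows an "elbow"
```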
Model Selection
The best fit is at the "elbow" of the error curve, where adding further complexity stops improving the validation error.
Bayesian Model Selection
Prior on models, $p(\text{model})$
Posterior by Bayes' rule: $p(\text{model}|\text{data}) = \frac{p(\text{data}|\text{model})\, p(\text{model})}{p(\text{data})}$
Regularization corresponds to a prior that favors simpler models
Choose the MAP model, or average over a number of models with high posterior
Regression example
The fitted coefficients increase in magnitude as the polynomial order increases, a sign of an increasingly complex fit that varies more with the data.