1
Machine Learning for Signal Processing Linear Gaussian Models
Class Oct 2014 Instructor: Bhiksha Raj 11755/18797
2
Recap: MAP Estimators
MAP (Maximum A Posteriori): find a statistical "best guess" for y, given known x:
ŷ = argmax_y P(y | x)
3
Recap: MAP estimation
x and y are jointly Gaussian; the stacked vector z = [x; y] is Gaussian.
4
MAP estimation: Gaussian PDF
[Figure: the joint Gaussian density over X and Y]
5
MAP estimation: The Gaussian at a particular value of X
6
Conditional Probability of y|x
The conditional probability of y given x is also Gaussian: the slice in the figure is Gaussian.
The mean of this Gaussian is a function of x.
The variance of y is reduced if x is known: uncertainty is reduced.
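As a concrete illustration of this slice (a minimal numpy sketch, not from the slides; the joint parameters mu_x, mu_y, C_xx, C_yy, C_xy are made-up values):

import numpy as np

# Illustrative joint-Gaussian parameters for (x, y); any values with C_xx > 0 work
mu_x, mu_y = 1.0, 2.0
C_xx, C_yy, C_xy = 2.0, 3.0, 1.5

def conditional_y_given_x(x0):
    """Mean and variance of the Gaussian P(y | x = x0)."""
    mean = mu_y + (C_xy / C_xx) * (x0 - mu_x)   # the mean is a linear function of x0
    var = C_yy - C_xy**2 / C_xx                 # the variance shrinks once x is known
    return mean, var

print(conditional_y_given_x(0.5))   # the returned mean is the MAP/MMSE guess for y at x = 0.5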
7
MAP estimation: The Gaussian at a particular value of X
[Figure: the Gaussian slice at x = x0; its peak marks the most likely value of y]
8
MAP Estimation of a Gaussian RV
9
It is also a minimum mean-squared error (MMSE) estimate
Minimize the expected squared error; differentiating and equating to 0 shows that the MMSE estimate is the mean of the distribution.
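Reconstructing that one-line argument (a standard result; the expectation is taken under P(y|x)):

\frac{d}{d\hat{y}} E\big[(y - \hat{y})^2\big] = -2\big(E[y] - \hat{y}\big) = 0 \;\;\Rightarrow\;\; \hat{y} = E[y]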
10
For the Gaussian: MAP = MMSE
The most likely value is also the mean value.
This would be true of any symmetric (unimodal) distribution.
11
A Likelihood Perspective
y = aᵀx + e: y is a noisy reading of aᵀx.
The error e is Gaussian.
Estimate a from the observed (x, y) pairs.
12
The Likelihood of the data
Probability of observing a specific y, given x, for a particular a.
Probability of the whole collection: the product over instances, assuming IID for convenience (not necessary).
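Spelled out for scalar y with an assumed noise variance σ² (a reconstruction of what the slide's equations likely expressed):

P(y_i \mid x_i; a) = \mathcal{N}\!\left(y_i;\, a^{T} x_i,\, \sigma^{2}\right),
\qquad
P(y_1, \ldots, y_N \mid x_1, \ldots, x_N; a) = \prod_{i=1}^{N} \mathcal{N}\!\left(y_i;\, a^{T} x_i,\, \sigma^{2}\right)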
13
A Maximum Likelihood Estimate
Maximizing the log probability is identical to minimizing the total squared error.
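A minimal numpy sketch of the resulting least-squares solve, with one D-dimensional instance per column of X as in these slides (the data below are synthetic and all names are placeholders):

import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 100
X = rng.standard_normal((D, N))                   # one instance per column
a_true = rng.standard_normal(D)
y = a_true @ X + 0.1 * rng.standard_normal(N)     # y_i = a^T x_i + Gaussian noise

# ML estimate = least squares: a = (X X^T)^{-1} X y
a_ml = np.linalg.solve(X @ X.T, X @ y)
print(np.linalg.norm(a_ml - a_true))              # small for this well-posed setup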
14
A problem with regressions
The ML fit is sensitive: the error is squared, so small variations in the data cause large variations in the weights.
Outliers affect it adversely.
It is unstable: if the dimension of X ≥ the number of instances, (XXᵀ) is not invertible.
15
MAP estimation of weights
y = aᵀX + e
Assume the weights are drawn from a Gaussian: P(a) = N(0, σ²I).
Maximum likelihood estimate vs. maximum a posteriori estimate.
16
MAP estimation of weights
P(a) = N(0, σ²I); log P(a) = C − D log σ − ½ σ⁻² ‖a‖²
Similar to the ML estimate, with an additional term.
17
MAP estimate of weights
Equivalent to diagonal loading of the correlation matrix:
Improves the condition number of the correlation matrix, so it can be inverted with greater stability.
Will not affect the estimation from well-conditioned data.
Also called Tikhonov regularization; dual form: ridge regression.
This is a MAP estimate of the weights, not to be confused with the MAP estimate of Y.
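A hedged sketch of the corresponding MAP solve; lam plays the role of the ratio of noise variance to prior variance and its value here is arbitrary:

import numpy as np

def ridge_weights(X, y, lam=1.0):
    """MAP / Tikhonov estimate: diagonal loading of X X^T before inversion."""
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)

Even when the dimension exceeds the number of instances and X Xᵀ alone is singular, the loaded matrix X Xᵀ + lam·I is invertible, which is exactly the stability argument made above.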
18
MAP estimate priors. Left: Gaussian prior on w. Right: Laplacian prior.
19
MAP estimation of weights with a Laplacian prior
Assume the weights are drawn from a Laplacian: P(a) = λ⁻¹ exp(−‖a‖₁ / λ)
Maximum a posteriori estimate: no closed-form solution; a quadratic-programming solution is required. Non-trivial.
20
MAP estimation of weights with a Laplacian prior
Assume the weights are drawn from a Laplacian: P(a) = λ⁻¹ exp(−‖a‖₁ / λ)
Maximum a posteriori estimate: identical to L1-regularized least-squares estimation.
21
L1-regularized LSE
No closed-form solution; quadratic-programming solutions are required.
Dual formulation: "LASSO" (least absolute shrinkage and selection operator): minimize the squared error subject to ‖a‖₁ ≤ t.
22
LASSO Algorithms Various convex optimization algorithms
LARS: least angle regression.
Pathwise coordinate descent.
Matlab code available from the web.
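As one concrete instance of such an iterative solver, here is a minimal proximal-gradient (ISTA-style) sketch for the L1-regularized least-squares objective; the step size, iteration count, and lam are illustrative choices, not prescribed by the slides:

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.1, n_iter=500):
    """Minimize ||y - X^T a||^2 + lam * ||a||_1 by proximal gradient descent."""
    a = np.zeros(X.shape[0])
    step = 0.5 / np.linalg.norm(X @ X.T, 2)            # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X @ (X.T @ a - y)                 # gradient of the squared-error term
        a = soft_threshold(a - step * grad, step * lam)  # prox of the L1 term = soft thresholding
    return a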
23
Regularized least squares
Image credit: Tibshirani.
Regularization results in selection of a suboptimal (in the least-squares sense) solution: one of the loci outside the center.
Tikhonov regularization selects the shortest solution; L1 regularization selects the sparsest solution.
24
LASSO and Compressive Sensing
Y = Xa. Given Y and X, estimate the sparse a.
LASSO: X = explanatory variables, Y = dependent variable, a = regression weights.
Compressive sensing (CS): X = measurement matrix, Y = measurements, a = data.
25
MAP / ML / MMSE General statistical estimators
All are used to predict a variable based on other variables related to it.
Most common assumption: the data are Gaussian; all RVs are Gaussian. Other probability densities may also be used.
For Gaussians the relationships are linear, as we saw.
26
Gaussians and more Gaussians..
Linear Gaussian Models. But first, a recap.
27
A Brief Recap
Principal component analysis: find the K bases that best explain the given data.
Find B and C such that the difference between D and BC is minimized, while constraining the columns of B to be orthonormal.
28
Remember Eigenfaces
Approximate every face f as f = w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + … + w_f,k V_k.
Estimate V to minimize the squared error.
The error is unexplained by V_1 … V_k; the error is orthogonal to the eigenfaces.
29
Karhunen-Loeve vs. PCA
Eigenvectors of the correlation matrix: principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy.
30
Karhunen-Loeve vs. PCA
Eigenvectors of the correlation matrix: principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy.
Eigenvectors of the covariance matrix: principal directions of the tightest ellipse centered on the data; directions that retain maximum variance.
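A small numpy illustration of this distinction; here "correlation matrix" means the second-moment matrix (1/N) X Xᵀ as on the slide, and the offset added to the data is arbitrary:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 500)) + np.array([[5.0], [0.0], [0.0]])   # data not centered at the origin

corr = (X @ X.T) / X.shape[1]                  # second-moment ("correlation") matrix
Xc = X - X.mean(axis=1, keepdims=True)
cov = (Xc @ Xc.T) / X.shape[1]                 # covariance matrix

_, klt_dirs = np.linalg.eigh(corr)             # KLT directions: ellipse centered on the origin
_, pca_dirs = np.linalg.eigh(cov)              # PCA directions: ellipse centered on the data mean
# The last column of each (largest eigenvalue) is the corresponding principal direction.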
31
Karhunen-Loeve vs. PCA
KARHUNEN-LOEVE TRANSFORM: eigenvectors of the correlation matrix; principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy.
Eigenvectors of the covariance matrix: principal directions of the tightest ellipse centered on the data; directions that retain maximum variance.
32
Karhunen-Loeve vs. PCA
KARHUNEN-LOEVE TRANSFORM: eigenvectors of the correlation matrix; principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy.
PRINCIPAL COMPONENT ANALYSIS: eigenvectors of the covariance matrix; principal directions of the tightest ellipse centered on the data; directions that retain maximum variance.
33
Karhunen-Loeve vs. PCA
If the data are naturally centered at the origin, KLT == PCA.
The following slides refer to PCA: assume data centered at the origin for simplicity. Not essential, as we will see.
34
Remember Eigenfaces
Approximate every face f as f = w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + … + w_f,k V_k.
Estimate V to minimize the squared error.
The error is unexplained by V_1 … V_k; the error is orthogonal to the eigenfaces.
35
Eigen Representation: x_1 = w_11 V_1 + e_1
K-dimensional representation (illustration assuming 3D space).
The error is orthogonal to the representation.
Weight and error are specific to the data instance.
36
Representation: x_2 = w_12 V_1 + e_2
K-dimensional representation (illustration assuming 3D space).
The error is at 90° to the eigenface; the error is orthogonal to the representation.
Weight and error are specific to the data instance.
37
Representation
K-dimensional representation: all data with the same representation wV_1 lie on a plane orthogonal to wV_1.
The error is orthogonal to the representation.
38
With 2 bases: x_1 = w_11 V_1 + w_21 V_2 + e_1
K-dimensional representation (illustration assuming 3D space).
The error is at 90° to the eigenfaces; the error is orthogonal to the representation.
Weight and error are specific to the data instance.
39
With 2 bases: x_2 = w_12 V_1 + w_22 V_2 + e_2
K-dimensional representation (illustration assuming 3D space).
The error is at 90° to the eigenfaces; the error is orthogonal to the representation.
Weight and error are specific to the data instance.
40
In Vector Form: X_i = w_1i V_1 + w_2i V_2 + e_i
K-dimensional representation.
The error is at 90° to the eigenfaces; the error is orthogonal to the representation.
Weight and error are specific to the data instance.
41
In Vector Form: X_i = w_1i V_1 + w_2i V_2 + e_i, i.e. x = Vw + e
K-dimensional representation.
x is a D-dimensional vector, V is a D x K matrix, w is a K-dimensional vector, e is a D-dimensional vector.
42
Learning PCA
For the given data, find the K-dimensional subspace that captures most of the variance in the data; the variance in the remaining subspace is minimal.
43
Constraints
VᵀV = I: the eigenvectors are orthogonal to each other.
For every vector, the error is orthogonal to the eigenvectors: eᵀV = 0.
Over the collection of data, the average of wwᵀ is diagonal: the eigen representations are uncorrelated.
eᵀe is minimum: the error variance is minimum.
The mean of the error is 0.
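These constraints can be checked numerically with a plain eigendecomposition PCA; the following is an illustrative sketch on synthetic, centered data (all names are placeholders):

import numpy as np

rng = np.random.default_rng(2)
D, N, K = 5, 1000, 2
X = rng.standard_normal((D, N))                # assume the data are already centered

cov = (X @ X.T) / N
vals, vecs = np.linalg.eigh(cov)
V = vecs[:, -K:]                               # top-K eigenvectors, one per column

W = V.T @ X                                    # K-dimensional representations
Err = X - V @ W                                # per-instance error
print(np.allclose(V.T @ V, np.eye(K)))         # V^T V = I
print(np.allclose(Err.T @ V, 0.0, atol=1e-9))  # error is orthogonal to the eigenvectors
print(np.round((W @ W.T) / N, 3))              # average w w^T is (numerically) diagonal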
44
A Statistical Formulation of PCA
x is a random variable generated according to a linear relation: x = Vw + e.
w is drawn from a K-dimensional Gaussian with diagonal covariance B.
e is drawn from a 0-mean, (D−K)-rank, D-dimensional Gaussian.
Estimate V (and B) given examples of x.
45
Linear Gaussian Models!!
x is a random variable generated according to a linear relation: x = μ + Vw + e.
w is drawn from a Gaussian; e is drawn from a 0-mean Gaussian.
Estimate V given examples of x; in the process, also estimate B and E.
46
Linear Gaussian Models!!
PCA is a specific instance of a linear Gaussian model with particular constraints: B = diagonal, VᵀV = I, E is low rank.
x is a random variable generated according to a linear relation; w is drawn from a Gaussian; e is drawn from a 0-mean Gaussian.
Estimate V given examples of x; in the process, also estimate B and E.
47
Linear Gaussian Models
Observations are linear functions of two uncorrelated Gaussian random variables: a "weight" variable w and an "error" variable e.
The error is not correlated to the weight: E[eᵀw] = 0.
Learning LGMs: estimate the parameters of the model given instances of x.
This is the problem of learning the distribution of a Gaussian RV.
48
LGMs: Probability Density
The mean of x and the covariance of x follow directly from the linear relation; a reconstruction is sketched below.
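A reconstruction of those two quantities for x = μ + Vw + e with w ~ N(0, B) and e ~ N(0, E) independent (standard for this model):

E[x] = \mu,
\qquad
\operatorname{Cov}(x) = E\big[(x - \mu)(x - \mu)^{T}\big] = V B V^{T} + E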
49
The probability of x
x is a linear function of Gaussians, so x is also Gaussian; its mean and covariance are as given above.
50
Estimating the variables of the model
Estimating the variables of the LGM is equivalent to estimating P(x).
The variables are μ, V, B and E.
51
Estimating the model The model is indeterminate:
Vw = VCC⁻¹w = (VC)(C⁻¹w) for any invertible C.
We need extra constraints to make the solution unique.
Usual constraint: B = I, i.e. the covariance of w is an identity matrix.
52
Estimating the variables of the model
Estimating the variables of the LGM is equivalent to estimating P(x).
The variables are μ, V, and E.
53
The Maximum Likelihood Estimate
Given the training set x_1, x_2, …, x_N, find μ, V, E.
The ML estimate of μ does not depend on the covariance of the Gaussian: it is the sample mean of the data.
54
Centered Data We can safely assume “centered” data
If the data are not centered, "center" them: estimate the mean of the data (which is its maximum likelihood estimate) and subtract it from the data.
55
Simplified Model
Estimating the variables of the LGM is equivalent to estimating P(x).
The variables are V and E.
56
Estimating the model
Given a collection x_1, x_2, …, x_N, estimate V and E.
w is unknown for each x.
But if we assume we know w for each x, what do we get?
57
Estimating the Parameters
We will use a maximum-likelihood estimate: the log-likelihood of x_1 … x_N, knowing their w_i's.
58
Maximizing the log-likelihood
Differentiating w.r.t. V and setting to 0; differentiating w.r.t. E⁻¹ and setting to 0. The resulting estimates are reconstructed below.
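A hedged reconstruction of what those two derivatives yield when every w_i is known (the standard complete-data estimates, assuming centered data):

\hat{V} = \Big(\sum_i x_i w_i^{T}\Big)\Big(\sum_i w_i w_i^{T}\Big)^{-1},
\qquad
\hat{E} = \frac{1}{N}\sum_i \big(x_i - \hat{V} w_i\big)\big(x_i - \hat{V} w_i\big)^{T}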
59
Estimating LGMs: If we know w
But in reality we don't know the w for each x. So how do we deal with this? EM.
60
Recall EM
Each instance comes from either the blue dice or the red dice, but which dice is unknown.
We figured out how to compute the parameters if we knew the missing information.
Then we "fragmented" the observations according to the posterior probability P(z|x) and counted as usual.
In effect, we took the expectation with respect to the a posteriori probability of the missing data, P(z|x).
61
EM for LGMs: replace unseen data terms with expectations taken w.r.t. P(w|x_i).
62
EM for LGMs: replace unseen data terms with expectations taken w.r.t. P(w|x_i).
63
Expected Value of w given x
x and w are jointly Gaussian: x is Gaussian, w is Gaussian, and they are linearly related.
64
Expected Value of w given x
x and w are jointly Gaussian!
65
The conditional expectation of w given x
P(w|x) is a Gaussian; a reconstruction of its parameters is sketched below.
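A reconstruction of its parameters for the model x = Vw + e with w ~ N(0, I) and e ~ N(0, E), centered data assumed (these are the standard factor-analysis posterior formulas):

E[w \mid x] = V^{T}\big(V V^{T} + E\big)^{-1} x,
\qquad
\operatorname{Cov}(w \mid x) = I - V^{T}\big(V V^{T} + E\big)^{-1} V

The second-order statistic needed by EM is E[w wᵀ | x] = Cov(w | x) + E[w | x] E[w | x]ᵀ.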
66
LGM: The complete EM algorithm
Initialize V and E.
E step: compute the posterior statistics E[w_i | x_i] and E[w_i w_iᵀ | x_i] for every x_i.
M step: re-estimate V and E from these expectations (see the sketch below).
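A compact numpy sketch of this loop (essentially factor analysis with B = I; the initialization, iteration count, and the unconstrained full-covariance E are implementation choices, not prescribed by the slides):

import numpy as np

def lgm_em(X, K, n_iter=100, seed=0):
    """EM for x = V w + e with w ~ N(0, I), e ~ N(0, E).  X is D x N and assumed centered."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    V = rng.standard_normal((D, K))
    E = np.eye(D)
    S = (X @ X.T) / N                                 # sample covariance of the data

    for _ in range(n_iter):
        # E step: posterior statistics of w given each x
        G = np.linalg.inv(V @ V.T + E)                # (V V^T + E)^{-1}
        W_mean = V.T @ G @ X                          # columns are E[w_i | x_i]
        W_cov = np.eye(K) - V.T @ G @ V               # Cov(w | x), identical for every instance
        sum_xw = X @ W_mean.T                         # sum_i x_i E[w_i]^T
        sum_ww = W_mean @ W_mean.T + N * W_cov        # sum_i E[w_i w_i^T]

        # M step: plug the expectations into the complete-data estimates
        V = sum_xw @ np.linalg.inv(sum_ww)
        E = S - V @ sum_xw.T / N                      # noise-covariance update
    return V, E

Constraining E to be diagonal at each M step gives classical factor analysis; setting E = σ²I gives probabilistic PCA.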
67
So what have we achieved?
We employed a complicated EM algorithm to learn a Gaussian PDF for a variable x. What have we gained?
Next class: PCA, sensible PCA, EM algorithms for PCA, factor analysis, FA for feature extraction.
68
LGMs: Application 1: Learning principal components
Find the directions that capture most of the variation in the data; the error is orthogonal to these variations.
69
LGMs: Application 2: Learning with insufficient data
The full covariance matrix of a Gaussian has D² terms and fully captures the relationships between variables.
Problem: it needs a lot of data to estimate robustly.
70
To be continued: other applications, next class.