Machine Learning for Signal Processing Linear Gaussian Models


1 Machine Learning for Signal Processing Linear Gaussian Models
Class, Oct 2014. Instructor: Bhiksha Raj (11755/18797)

2 Recap: MAP Estimators
MAP (Maximum A Posteriori): find a "best guess" for y (statistically), given known x: y = argmax_Y P(Y|x)

3 Recap: MAP estimation
x and y are jointly Gaussian: the stacked variable z = [x; y] is Gaussian, with mean and covariance built from those of x and y.

4 MAP estimation: Gaussian PDF
[Figure: joint Gaussian density over X and Y]

5 MAP estimation: The Gaussian at a particular value of X

6 Conditional Probability of y|x
The conditional probability of y given x is also Gaussian: the slice in the figure is Gaussian. The mean of this Gaussian is a function of x, and the variance of y is reduced if x is known: uncertainty is reduced.
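The slide's equations are not reproduced in the transcript. For reference, the standard conditional-Gaussian formulas (a general fact about jointly Gaussian variables, not copied from the slide) are:
\[
E[y \mid x] = \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x), \qquad
\mathrm{Var}(y \mid x) = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}.
\]
The conditional variance does not depend on x and is never larger than C_yy, which is the "uncertainty is reduced" statement above.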

7 MAP estimation: The Gaussian at a particular value of X
[Figure: the slice of the joint density at X = x0; its peak is the most likely value of y]

8 MAP Estimation of a Gaussian RV
[Figure: the conditional Gaussian at X = x0 and its peak, the MAP estimate]

9 It's also a minimum-mean-squared-error estimate
Minimize the expected squared error; differentiating and equating to 0 shows that the MMSE estimate is the mean of the (conditional) distribution.
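A one-line version of the argument the slide sketches (my notation, not the slide's):
\[
\frac{\partial}{\partial \hat{y}}\, E\!\left[(y-\hat{y})^2 \mid x\right]
= -2\,E[y \mid x] + 2\hat{y} = 0
\;\;\Rightarrow\;\; \hat{y}_{\mathrm{MMSE}} = E[y \mid x].
\]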

10 For the Gaussian: MAP = MMSE
The most likely value is also the mean value. The same would be true of any symmetric (unimodal) distribution.

11 A Likelihood Perspective
y = a^T x + e: y is a noisy reading of a^T x, and the error e is Gaussian. Estimate a from the observed (x, y) pairs.

12 The Likelihood of the data
The probability of observing a specific y, given x, for a particular parameter a. The probability of the collection is the product over instances, assuming IID for convenience (not necessary).

13 A Maximum Likelihood Estimate
Maximizing the log probability is identical to minimizing the squared error.
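A minimal numpy sketch of this equivalence (the data, names, and noise level are illustrative, not from the lecture): under Gaussian noise the ML estimate is exactly the least-squares solution a = (X X^T)^{-1} X y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = a^T x + Gaussian noise (all values illustrative)
D, N = 5, 200
a_true = rng.normal(size=D)
X = rng.normal(size=(D, N))              # columns are data instances
y = a_true @ X + 0.1 * rng.normal(size=N)

# ML estimate under Gaussian noise = least-squares solution a = (X X^T)^{-1} X y
a_ml = np.linalg.solve(X @ X.T, X @ y)
print(np.round(a_ml - a_true, 2))        # residual error is small
```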

14 A problem with regressions
The ML fit is sensitive: the error is squared, so small variations in the data lead to large variations in the weights, and outliers affect it adversely. It is also unstable: if the dimension of X >= the number of instances, (X X^T) is not invertible.

15 MAP estimation of weights
y = a^T X + e. Assume the weights are drawn from a Gaussian prior, P(a) = N(0, σ²I). Compare the maximum-likelihood estimate with the maximum a posteriori estimate.

16 MAP estimation of weights
With P(a) = N(0, σ²I), log P(a) = C − 0.5 σ⁻² ||a||², where the constant absorbs the terms in σ. The MAP objective is therefore similar to the ML estimate, with an additional penalty term on ||a||².

17 MAP estimate of weights
Equivalent to diagonal loading of the correlation matrix, which improves its condition number: it can be inverted with greater stability, and the estimate from well-conditioned data is not noticeably affected. Also called Tikhonov regularization; in its dual form, ridge regression. (This is a MAP estimate of the weights, not to be confused with the MAP estimate of y.)
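A hedged numpy sketch of the diagonally-loaded (ridge / Tikhonov) solution; the function name and the choice of the loading weight lam are mine, not the lecture's:

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Diagonally-loaded least squares: a = (X X^T + lam*I)^(-1) X y.

    X   : (D, N) array, columns are data instances
    y   : (N,) array of targets
    lam : loading weight; under the Gaussian-prior view it plays the role
          of (noise variance) / (prior variance)
    """
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
```

Even when D exceeds the number of instances, X X^T + lam*I remains invertible, which addresses the instability noted on slide 14.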

18 MAP estimate priors
[Figure: left, a Gaussian prior on the weights; right, a Laplacian prior]

19 MAP estimation of weights with a Laplacian prior
Assume the weights are drawn from a Laplacian: P(a) ∝ λ⁻¹ exp(−λ⁻¹ ||a||₁). The maximum a posteriori estimate has no closed-form solution; a quadratic-programming solution is required, which is non-trivial.

20 MAP estimation of weights with a Laplacian prior
Assume the weights are drawn from a Laplacian: P(a) ∝ λ⁻¹ exp(−λ⁻¹ ||a||₁). The maximum a posteriori estimate is identical to L1-regularized least-squares estimation.

21 L1-regularized LSE
No closed-form solution; quadratic-programming solutions are required. Dual formulation: "LASSO" (least absolute shrinkage and selection operator) minimizes the squared error subject to a bound on ||a||₁.

22 LASSO Algorithms
Various convex optimization algorithms: LARS (least angle regression), pathwise coordinate descent, etc. Matlab code is available on the web.
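For completeness, a small Python sketch using scikit-learn's Lasso, which implements pathwise coordinate descent internally (the data and the alpha value are illustrative, and the library is my choice rather than the lecture's):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, D = 100, 50
X = rng.normal(size=(N, D))                # rows are instances (sklearn layout)
a_true = np.zeros(D)
a_true[:5] = rng.normal(size=5)            # only 5 of 50 weights are non-zero
y = X @ a_true + 0.01 * rng.normal(size=N)

model = Lasso(alpha=0.1)                   # alpha scales the L1 penalty
model.fit(X, y)
print(np.count_nonzero(model.coef_))       # most coefficients are exactly zero
```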

23 Regularized least squares
[Image credit: Tibshirani] Regularization results in selection of a suboptimal (in the least-squares sense) solution, one of the loci outside the center. Tikhonov regularization selects the shortest solution; L1 regularization selects the sparsest solution.

24 LASSO and Compressive Sensing
Y = Xa. Given Y and X, estimate the sparse a. In LASSO, X is the explanatory variable, Y the dependent variable, and a the regression weights. In compressive sensing (CS), X is the measurement matrix, Y the measurements, and a the data.

25 MAP / ML / MMSE: General statistical estimators
All are used to predict a variable based on other parameters related to it. The most common assumption is that the data are Gaussian (all RVs are Gaussian); other probability densities may also be used. For Gaussians the relationships are linear, as we saw.

26 Gaussians and more Gaussians..
Linear Gaussian Models.. but first a recap.

27 A Brief Recap
Principal component analysis: find the K bases that best explain the given data. Find B and C such that the difference between D and BC is minimal, while constraining the columns of B to be orthonormal.

28 Remember Eigenfaces
Approximate every face f as f = w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + ... + w_f,k V_k. Estimate V to minimize the squared error. The error is what is unexplained by V_1 .. V_k and is orthogonal to the eigenfaces.

29–32 Karhunen Loeve vs. PCA (built up over four slides)
Karhunen–Loève transform: eigenvectors of the correlation matrix; the principal directions of the tightest ellipse centered on the origin; the directions that retain maximum energy.
Principal component analysis: eigenvectors of the covariance matrix; the principal directions of the tightest ellipse centered on the data; the directions that retain maximum variance.

33 Karhunen Loeve vs. PCA
If the data are naturally centered at the origin, KLT == PCA. The following slides refer to PCA. Assume the data are centered at the origin for simplicity; this is not essential, as we will see.

34 Remember Eigenfaces
Approximate every face f as f = w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + ... + w_f,k V_k. Estimate V to minimize the squared error. The error is what is unexplained by V_1 .. V_k and is orthogonal to the eigenfaces.

35 Eigen Representation
x_1 = w_11 V_1 + e_1 (illustration assuming 3D space). K-dimensional representation: the error is orthogonal to the representation, and the weight and error are specific to the data instance.

36 Representation
x_2 = w_12 V_1 + e_2, with the error at 90° to the eigenface (illustration assuming 3D space). K-dimensional representation: the error is orthogonal to the representation, and the weight and error are specific to the data instance.

37 Representation
All data with the same representation w V_1 lie on a plane orthogonal to V_1 through w V_1. K-dimensional representation: the error is orthogonal to the representation.

38 With 2 bases
x_1 = w_11 V_1 + w_21 V_2 + e_1, with the error at 90° to the eigenfaces (illustration assuming 3D space). K-dimensional representation: the error is orthogonal to the representation, and the weight and error are specific to the data instance.

39 With 2 bases
x_2 = w_12 V_1 + w_22 V_2 + e_2, with the error at 90° to the eigenfaces (illustration assuming 3D space). K-dimensional representation: the error is orthogonal to the representation, and the weight and error are specific to the data instance.

40 In Vector Form
x_i = w_1i V_1 + w_2i V_2 + e_i, with the error at 90° to the eigenfaces. K-dimensional representation: the error is orthogonal to the representation, and the weight and error are specific to the data instance.

41 In Vector Form
x_i = w_1i V_1 + w_2i V_2 + e_i, i.e. x = Vw + e, where x is a D-dimensional vector, V is a D x K matrix, w is a K-dimensional vector, and e is a D-dimensional vector.

42 Learning PCA
For the given data, find the K-dimensional subspace that captures most of the variance in the data; the variance in the remaining subspace is minimal.
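A compact numpy sketch of this procedure (my own function and variable names), computing the K directions of maximum variance as the leading eigenvectors of the covariance:

```python
import numpy as np

def pca_basis(X, K):
    """PCA as described above: a sketch, assuming X is (D, N) with columns
    as data instances.  Returns the K leading eigenvectors of the covariance
    and the K-dimensional projection weights."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    C = Xc @ Xc.T / Xc.shape[1]                 # covariance matrix (D x D)
    evals, evecs = np.linalg.eigh(C)            # ascending eigenvalues
    V = evecs[:, ::-1][:, :K]                   # K directions of max variance
    W = V.T @ Xc                                # K-dimensional representation
    return V, W
```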

43 Constraints
V^T V = I: the eigenvectors are orthogonal to each other. For every vector, the error is orthogonal to the eigenvectors: e^T V = 0. Over the collection of data, the average w w^T is diagonal (the eigen representations are uncorrelated), e^T e is minimal (the error variance is minimum), and the mean of the error is 0.

44 A Statistical Formulation of PCA
x is a random variable generated according to a linear relation, x = Vw + e. w is drawn from a K-dimensional Gaussian with diagonal covariance; e is drawn from a 0-mean, (D−K)-rank, D-dimensional Gaussian. Estimate V (and B) given examples of x.

45 Linear Gaussian Models!!
x is a random variable generated according to a linear relation, x = μ + Vw + e. w is drawn from a Gaussian, and e is drawn from a 0-mean Gaussian. Estimate V given examples of x; in the process also estimate B and E (the covariances of w and e).

46 Linear Gaussian Models!!
PCA is a specific instance of a linear Gaussian model with particular constraints: B is diagonal, V^T V = I, and E is low rank.

47 Linear Gaussian Models
Observations are linear functions of two uncorrelated Gaussian random variables: a "weight" variable w and an "error" variable e. The error is not correlated with the weight: E[e^T w] = 0. Learning LGMs means estimating the parameters of the model given instances of x; this is the problem of learning the distribution of a Gaussian RV.

48 LGMs: Probability Density
The mean of x: E[x] = μ. The covariance of x: Cov(x) = V B V^T + E.
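Putting the model together (reconstructed from the surrounding slides; the notation is the slides', the explicit formulas are standard facts about linear functions of Gaussians):
\[
x = \mu + Vw + e, \qquad w \sim \mathcal{N}(0, B), \qquad e \sim \mathcal{N}(0, E),
\]
\[
P(x) = \mathcal{N}\!\left(x;\ \mu,\ V B V^{T} + E\right).
\]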

49 The probability of x
x is a linear function of Gaussians, so x is also Gaussian; its mean and covariance are as given above.

50 Estimating the variables of the model
Estimating the variables of the LGM is equivalent to estimating P(x). The variables are μ, V, B and E.

51 Estimating the model
The model is indeterminate: Vw = V C C^{-1} w = (VC)(C^{-1}w) for any invertible C, so we need extra constraints to make the solution unique. The usual constraint is B = I: the covariance of w is the identity matrix.

52 Estimating the variables of the model
Estimating the variables of the LGM is equivalent to estimating P(x). With B = I, the variables are μ, V, and E.

53 The Maximum Likelihood Estimate
Given the training set x_1, x_2, .., x_N, find μ, V, E. The ML estimate of μ does not depend on the covariance of the Gaussian: it is simply the sample mean.

54 Centered Data
We can safely assume "centered" data. If the data are not centered, "center" them: estimate the mean of the data (which is its maximum likelihood estimate) and subtract it from the data.

55 Simplified Model
Estimating the variables of the LGM is equivalent to estimating P(x). The variables are V and E.

56 Estimating the model
Given a collection of observations x_1, x_2, .., x_N, estimate V and E. The weight w is unknown for each x. But if we assume we know w for each x, what do we get?

57 Estimating the Parameters
We will use a maximum-likelihood estimate: the log-likelihood of x_1 .. x_N, knowing their w_i.

58 Maximizing the log-likelihood
Differentiate w.r.t. V and set to 0; differentiate w.r.t. E^{-1} and set to 0.
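The transcript drops the slide's equations. For the model x_i = V w_i + e_i with e ~ N(0, E) and the w_i assumed known, the log-likelihood and the resulting closed-form maximizers are the standard ones (reconstructed here, not copied from the slide):
\[
\log P(x_1,\dots,x_N \mid w_1,\dots,w_N)
= C - \tfrac{N}{2}\log|E| - \tfrac{1}{2}\sum_i (x_i - V w_i)^T E^{-1} (x_i - V w_i),
\]
\[
\hat{V} = \Big(\sum_i x_i w_i^T\Big)\Big(\sum_i w_i w_i^T\Big)^{-1}, \qquad
\hat{E} = \frac{1}{N}\sum_i \big(x_i - \hat{V} w_i\big)\big(x_i - \hat{V} w_i\big)^T .
\]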

59 Estimating LGMs: If we know w
But in reality we don't know the w for each x. How do we deal with this? EM.

60 Recall EM
[Figure: dice example; each observation comes from a blue or a red die, but which die is unknown, and each observed number is split between a "blue" collection and a "red" collection.] We figured out how to compute parameters if we knew the missing information. Then we "fragmented" the observations according to the posterior probability P(z|x) and counted as usual. In effect we took the expectation with respect to the a posteriori probability of the missing data, P(z|x).

61–62 EM for LGMs
Replace the unseen data terms with expectations taken w.r.t. P(w|x_i).

63 Expected Value of w given x
x and w are jointly Gaussian: x is Gaussian, w is Gaussian, and they are linearly related.

64 Expected Value of w given x
x and w are jointly Gaussian.

65 The conditional expectation of w given z
P(w|z) is a Gaussian.
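The slide's expressions are missing from the transcript. For centered data and B = I, the standard posterior of w given an observation x in this model (stated here for reference, not copied from the slide) is:
\[
P(w \mid x) = \mathcal{N}\!\left(w;\ m_w,\ \Sigma_w\right), \qquad
\Sigma_w = \left(I + V^T E^{-1} V\right)^{-1}, \qquad
m_w = \Sigma_w V^T E^{-1} x .
\]
The E step therefore needs E[w|x] = m_w and E[w w^T | x] = Σ_w + m_w m_w^T.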

66 LGM: The complete EM algorithm
Initialize V and E. E step: compute the expectations of the unseen terms, E[w_i | x_i] and E[w_i w_i^T | x_i], for every x_i. M step: re-estimate V and E using these expectations in place of the unknown w_i.
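A self-contained numpy sketch of such an EM loop, assuming centered data, B = I, and a diagonal E (the diagonal restriction, as in factor analysis, keeps the model identifiable; the function and variable names are mine, not the lecture's):

```python
import numpy as np

def lgm_em(X, K, n_iters=100, seed=0):
    """EM for a linear Gaussian model x = V w + e,  w ~ N(0, I),  e ~ N(0, E).

    A teaching sketch, not the lecture's exact algorithm.  X is (D, N) with
    centered columns; E is kept diagonal here so the model stays identifiable.
    """
    D, N = X.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(D, K))
    E = np.ones(D)                         # diagonal of the noise covariance

    for _ in range(n_iters):
        # E step: posterior of w for each x (shared covariance, per-point mean)
        Einv_V = V / E[:, None]            # E^{-1} V  (E is diagonal)
        Sigma_w = np.linalg.inv(np.eye(K) + V.T @ Einv_V)   # (I + V^T E^-1 V)^-1
        W = Sigma_w @ Einv_V.T @ X         # K x N matrix of E[w_i | x_i]
        S_ww = N * Sigma_w + W @ W.T       # sum_i E[w_i w_i^T | x_i]

        # M step: closed-form updates using the expected statistics
        V = (X @ W.T) @ np.linalg.inv(S_ww)
        E = np.mean(X * X, axis=1) - np.mean((V @ W) * X, axis=1)

    return V, E
```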

67 So what have we achieved?
We employed a complicated EM algorithm to learn a Gaussian PDF for a variable x. What have we gained? Next class: PCA, sensible PCA, EM algorithms for PCA, factor analysis, and FA for feature extraction.

68 LGMs: Application 1: Learning principal components
Find the directions that capture most of the variation in the data; the error is orthogonal to these variations.

69 LGMs: Application 2: Learning with insufficient data
[Figure: a full covariance matrix] The full covariance matrix of a Gaussian has D² terms and fully captures the relationships between variables. Problem: it needs a lot of data to estimate robustly.

70 To be continued..
Other applications: next class.

