Scientific Methods 1: ‘Scientific evaluation, experimental design & statistical methods’, COMP80131 Lecture
Barry & Goran, 11 Dec 2012, COMP80131-SEEDSM12_10


1 Scientific Methods 1
Barry & Goran
‘Scientific evaluation, experimental design & statistical methods’
COMP80131 Lecture 10: Statistical Methods - Intro to PCA & Monte Carlo Methods
www.cs.man.ac.uk/~barry/mydocs/MyCOMP80131

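2 Correlation & covariance
Covariance between two samples x and y of M variables:
$$\mathrm{cov}(x, y) = \frac{1}{M-1} \sum_{i=1}^{M} (x_i - \bar{x})(y_i - \bar{y})$$
The factor is 1/(M-1) if $\bar{x}$, $\bar{y}$ are sample means; 1/M if they are population means.
Pearson correlation coefficient for two samples of M variables:
$$r = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$
where $\sigma_x$, $\sigma_y$ are the standard deviations of x and y. Lies between -1 & 1.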

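3 In vector notation
$$x = \begin{pmatrix} 1 \\ 2 \\ 3 \\ -4 \\ 5 \\ 6 \end{pmatrix}, \qquad y = \begin{pmatrix} 2 \\ 3 \\ -3 \\ 5 \\ 4 \\ 1 \end{pmatrix}$$
Both measure similarity between 2 columns of numbers (vectors).
When the means have been subtracted:
Correlation: $\dfrac{x^T y}{M \sqrt{\mathrm{var}_x \, \mathrm{var}_y}}$, with $\mathrm{var}_x = \frac{1}{M} x^T x$ and $\mathrm{var}_y = \frac{1}{M} y^T y$
Covariance: $\dfrac{1}{M} x^T y$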
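
As a rough check of the vector formulas, the following MATLAB sketch computes the covariance and Pearson correlation of the two example columns from their mean-subtracted versions and compares the results with the built-in `cov` and `corrcoef`. It assumes the first column of numbers on the slide is x and the second is y, and it uses the 1/(M-1) normalisation; the names `xc`, `yc`, `covXY` and `r` are illustrative and not from the lecture.

```matlab
% Two example columns of numbers (assumed assignment to x and y)
x = [1 2 3 -4 5 6]';
y = [2 3 -3 5 4 1]';
M = numel(x);

% Subtract the sample means
xc = x - mean(x);
yc = y - mean(y);

% Covariance with the 1/(M-1) factor (sample means used)
covXY = (xc' * yc) / (M - 1);

% Pearson correlation: normalised inner product, lies in [-1, 1]
r = (xc' * yc) / (norm(xc) * norm(yc));

% Compare with the built-in functions applied to the two columns
Cref = cov([x y]);        % 2x2 covariance matrix
Rref = corrcoef([x y]);   % 2x2 correlation matrix
fprintf('cov = %.4f (built-in %.4f)\n', covXY, Cref(1,2));
fprintf('r   = %.4f (built-in %.4f)\n', r, Rref(1,2));
```

The normalisation factor cancels in the correlation, which is why `r` can be written with vector norms alone.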

4 Principal Components Analysis
PCA converts samples of M variables into samples of a smaller number of variables called principal components. Produces shorter columns.
Exploits interdependency or correlation among the M variables in each column.
Evidence is the similarity between columns as seen in lots of examples. If there is none, PCA cannot reduce the number of variables.
First principal component has the highest variance. It accounts for as much variability as possible.
Each succeeding principal component has the highest variance possible while being orthogonal to (uncorrelated with) the previous ones.

5 PCA
Reduces number of variables (dimensionality) without significant loss of information.
Also named: ‘discrete Karhunen–Loève transform (KLT)’, ‘Hotelling transform’, ‘proper orthogonal decomposition (POD)’.
Related to (but not the same as): ‘Factor analysis’.

6 Example
Assume 5 observations of behaviour of 3 variables x1, x2, x3:
    x1 :   1     2     3     1     4      sample mean = 11/5 = 2.2
    x2 :   2     1     3    -1     2      sample mean = 1.4
    x3 :   3     4     7     1     8      sample mean = 4.6
Subtract means (call this matrix X):
    x1' : -1.2  -0.2   0.8  -1.2   1.8
    x2' :  0.6  -0.4   1.6  -2.4   0.6
    x3' : -1.6  -0.6   2.4  -3.6   3.4
Calculate ‘Covariance Matrix’ (C):
           x1'    x2'    x3'
    x1' :  1.7    1.15   3.6
    x2' :  1.15   2.3    3.45
    x3' :  3.6    3.45   8.3
C(1,2) = average value of x1'.x2', with divisor 4 = 5 - 1 because sample means were used:
$$C(1,2) = \frac{(-1.2)(0.6) + (-0.2)(-0.4) + (0.8)(1.6) + (-1.2)(-2.4) + (1.8)(0.6)}{4} = \frac{4.6}{4} = 1.15$$

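7 Eigenvalues & eigenvectors
    [U, diagV] = eig(C);
U (columns are the eigenvectors, labelled u3, u2, u1 from left to right):
     0.811    0.458    0.364
     0.324   -0.87     0.372
    -0.487    0.184    0.854
      u3       u2       u1
D (diagonal matrix of eigenvalues):
     0        0        0
     0        0.964    0
     0        0       11.34
$$D = \begin{pmatrix} \lambda_3 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_1 \end{pmatrix}$$
so $\lambda_3 = 0$, $\lambda_2 = 0.964$, $\lambda_1 = 11.34$.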
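
The eigen-decomposition can be checked with a short MATLAB sketch like the one below. It takes the covariance matrix of the worked example, confirms that the eigenvectors are orthonormal and satisfy C u = λ u, and illustrates why one eigenvalue is zero: the example data happen to satisfy 3·x3 = 5·x1 + 2·x2, so only two directions carry variance. The variable names (`orthErr`, `resErr`, `depErr`) are illustrative, and the ordering and signs of the eigenvectors returned by `eig` may differ from those printed on the slide.

```matlab
% Covariance matrix from the worked example
C = [1.7 1.15 3.6; 1.15 2.3 3.45; 3.6 3.45 8.3];

[U, D] = eig(C);          % columns of U are eigenvectors, D is diagonal
lambda = diag(D);

% U is orthogonal, so U'*U should be the identity
orthErr = norm(U' * U - eye(3));

% Each eigenpair should satisfy C*u = lambda*u
resErr = norm(C * U - U * D);

% The original data obey 3*x3 = 5*x1 + 2*x2, so C has rank 2
X = [1 2 3 1 4; 2 1 3 -1 2; 3 4 7 1 8];
depErr = norm(5 * X(1,:) + 2 * X(2,:) - 3 * X(3,:));

disp(sort(lambda)');      % eigenvalues in increasing order
```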

8 Transforming the measurements
For each column x' of matrix X, multiply by $U^T$ to transform it to a different set of numbers, $U^T x'$.
Or do it all at once by calculating $Y = U^T X$. We get:
    Y =   0       0       0       0       0
         -1.37    0.146  -0.583   0.874   0.929
         -1.58   -0.734   2.936  -4.404   3.781
The rows of Y follow the column order of U (u3, u2, u1), so the first column of X is now expressed as: 0 × u3 - 1.37 × u2 - 1.58 × u1.
Similarly for all the other four columns of X.
Each column is now expressed in terms of the eigenvectors of C.

9 Reducing dimensionality
$U^T C U = D$, therefore $C = U D U^T$ (since $U^T$ is the inverse of U).
Now we can express:
$$C = \lambda_1 (u_1 u_1^T) + \lambda_2 (u_2 u_2^T) + \lambda_3 (u_3 u_3^T)$$
Now look at the eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_3$.
Strike out the zero-valued one ($\lambda_3$) with its corresponding eigenvector ($u_3$).
Leaves $u_1$ & $u_2$ as principal components. Can represent all the data, without loss, with just these two.
Can also remove smaller eigenvalues (such as $\lambda_2$) with their eigenvectors. (If they do not affect C much, they should not affect X much.)
Whole data can be represented by just $u_1$ without serious loss of accuracy.
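
The decomposition of C into rank-one terms can be tried out with a sketch along these lines. It rebuilds C from all three terms and then from the largest term alone, reporting the relative error in each case. Here the eigenvalues are sorted in decreasing order so that `lambda(1)` is the largest; `Csum`, `C1`, `fullErr` and `relErr1` are illustrative names, not part of the lecture code.

```matlab
% Covariance matrix from the worked example
C = [1.7 1.15 3.6; 1.15 2.3 3.45; 3.6 3.45 8.3];

[U, D] = eig(C);
[lambda, idx] = sort(diag(D), 'descend');   % lambda(1) is the largest
U = U(:, idx);                              % reorder eigenvectors to match

% Rebuild C as a sum of rank-one terms lambda_i * u_i * u_i'
Csum = zeros(3);
for i = 1:3
    Csum = Csum + lambda(i) * (U(:, i) * U(:, i)');
end
fullErr = norm(C - Csum);

% Keep only the term with the largest eigenvalue
C1 = lambda(1) * (U(:, 1) * U(:, 1)');
relErr1 = norm(C - C1) / norm(C);

fprintf('Error using all three terms: %.2e\n', fullErr);
fprintf('Relative error using one term: %.4f\n', relErr1);
```

In the 2-norm the relative error of the one-term approximation equals the ratio of the second-largest to the largest eigenvalue, which is why a small second eigenvalue suggests that one principal component may be enough.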

10 Reconstructing original data from principal components
Because $Y = U^T X$, then $X = U Y$.
If we don’t strike out any eigenvalues & eigenvectors, this gets us back to the original data.
If we strike out row 1 of Y and the first column of U (the eigenvector whose eigenvalue is zero), we still get back to the original data.
If we also strike out row 2 of Y and the second column of U (eigenvalue 0.964), we get back something close to the original data.
We do not lose much information by keeping just one principal component.
Dimensionality reduces from 3 to 2 or 1.
(Normally, eigenvalues are reordered in decreasing magnitude, but I have not done that here.)

11 In MATLAB
    clear all;
    origData = [1 2 3 1 4 ; 2 1 3 -1 2 ; 3 4 7 1 8]
    [M,N] = size(origData);
    meanofCols = mean(origData,2);
    % subtract off mean for EACH dimension
    zmData = origData - repmat(meanofCols,1,N)
    covarMat = 1 / (N-1) * (zmData * zmData')
    % find the eigenvectors and eigenvalues of covarMat
    [eigVecs, diagV] = eig(covarMat)
    eigVals = diag(diagV)
    [reigVals, Ind] = sort(eigVals,'descend'); % sort the variances in decreasing order
    reigVecs = eigVecs(:,Ind); % Reorder eigenvectors accordingly
    proj_zmData = reigVecs' * zmData
    disp('Approx to original data taking just a few principal components');
    nPC = input('How many PCs do you need (look at eigenvals to decide):');
    PCproj_zmData = proj_zmData(1:nPC,:)
    PCVecs = reigVecs(:,1:nPC) %Only keep the first few reordered eigVecs
    RecApprox_zmData = PCVecs * PCproj_zmData
    RecApprox_origData = RecApprox_zmData + repmat(meanofCols,1,N)
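
A non-interactive variant of the same steps may be useful for seeing how the reconstruction error depends on the number of principal components kept. The sketch below loops over nPC = 1, 2, 3 instead of asking for input and records the largest absolute difference from the original data; `rowMeans`, `proj`, `approx` and `maxErr` are illustrative names introduced here.

```matlab
origData = [1 2 3 1 4; 2 1 3 -1 2; 3 4 7 1 8];
[M, N] = size(origData);

% Subtract the mean of each variable (each row)
rowMeans = mean(origData, 2);
zmData = origData - repmat(rowMeans, 1, N);
covarMat = (zmData * zmData') / (N - 1);

% Eigenvectors reordered by decreasing eigenvalue
[eigVecs, diagV] = eig(covarMat);
[~, Ind] = sort(diag(diagV), 'descend');
reigVecs = eigVecs(:, Ind);
proj = reigVecs' * zmData;

% Reconstruct with 1, 2 and 3 principal components
maxErr = zeros(1, M);
for nPC = 1:M
    approx = reigVecs(:, 1:nPC) * proj(1:nPC, :) + repmat(rowMeans, 1, N);
    maxErr(nPC) = max(abs(approx(:) - origData(:)));
    fprintf('nPC = %d: max abs error = %.4g\n', nPC, maxErr(nPC));
end
```

With this data set two components should reproduce the data to rounding error, because the third eigenvalue is zero.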

12 Monte Carlo Methods
Use of repeated random sampling of the behaviour of highly complex multidimensional mathematical equations describing real or simulated systems, to determine their properties.
The repeated random sampling produces observations to which statistical inference can be applied.

13 Pseudo-random processes
Name Monte Carlo refers to the famous casino.
Gambling requires a random process such as the spinning of a roulette wheel.
Monte Carlo methods use pseudo-random processes implemented in software.
‘Pseudo-random’ processes are not truly random. The variables produced can be predicted by a person who knows the algorithm being used.
However, they can be used to simulate the effects of true randomness.
Simulations not required to be numerically identical to real processes.
Aim is to produce statistical results such as averages & distributions.
Requires a ‘sampling’ of the population of all possible modes of behaviour of the system.
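
The predictability of a pseudo-random generator is easy to see in MATLAB. The sketch below, which assumes a MATLAB version that provides the `rng` function (R2011a or later), seeds the generator twice with the same value and obtains the same ‘random’ numbers both times; the seed values are arbitrary choices for illustration.

```matlab
rng(12345);            % seed the generator (arbitrary seed value)
a = rand(1, 5);

rng(12345);            % same seed again
b = rand(1, 5);

rng(54321);            % a different seed
c = rand(1, 5);

disp(isequal(a, b));   % same seed gives the same sequence
disp(isequal(a, c));   % a different seed gives a different sequence
```

Fixing the seed in this way is a common means of making a Monte Carlo experiment repeatable.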

14 Illustration
Monte Carlo methods may be used to evaluate multi-dimensional integrals.
Consider the problem of calculating the area of an ellipse by generating a set of N pseudo-random number pairs (xi, yi) uniformly covering the area -1 < x < 1, -1 < y < 1, as illustrated next:

15 Area of an ellipse
[Figure: ellipse inside the square -1 < x < 1, -1 < y < 1; axes x and y]
Area of square is 2 × 2 = 4.
If there are N points and M of them fall inside the ellipse, area of ellipse ≈ 4 × M / N as N → ∞.
(Frequentist approach)
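
A minimal MATLAB sketch of this estimate is given below. The slide does not state the dimensions of the ellipse, so semi-axes a = 1 and b = 0.5 are assumed purely for illustration; the exact area πab is then available for comparison. The names `inside`, `areaMC` and `areaExact` are also illustrative.

```matlab
a = 1.0; b = 0.5;          % assumed semi-axes (not given on the slide)
N = 100000;                % number of pseudo-random points

% Points uniformly covering the square -1 < x < 1, -1 < y < 1
x = 2 * rand(N, 1) - 1;
y = 2 * rand(N, 1) - 1;

% Count the points that fall inside the ellipse
inside = (x / a).^2 + (y / b).^2 <= 1;
M = sum(inside);

areaMC = 4 * M / N;        % area of square times fraction inside
areaExact = pi * a * b;
fprintf('MC estimate %.4f, exact %.4f\n', areaMC, areaExact);
```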

16 Simplicity of MC methods
This example illustrates the simplicity of MC techniques, but not their computational advantages.
We could have used a regularly placed grid of points rather than randomly placed points in the rectangle, as on the next slide.

17 Regular grid
[Figure: regular grid of sample points over the same region; axes x and y, with scale marks ‘1cm’ and ‘1’]
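
For comparison, the same area can be estimated from a regular grid. The sketch below places one point at the centre of each cell of an n × n grid over the square and counts those inside the ellipse; as before, the semi-axes a = 1 and b = 0.5 and the grid size n = 200 are assumptions made for illustration.

```matlab
a = 1.0; b = 0.5;               % assumed semi-axes (not given on the slide)
n = 200;                        % grid points per dimension
h = 2 / n;                      % spacing between adjacent points

% Cell-centre coordinates covering -1 < x < 1 (and the same for y)
g = -1 + h * ((1:n) - 0.5);
[X, Y] = meshgrid(g, g);

% Fraction of grid points inside the ellipse, times area of square
inside = (X / a).^2 + (Y / b).^2 <= 1;
areaGrid = 4 * sum(inside(:)) / n^2;
fprintf('Grid estimate %.4f, exact %.4f\n', areaGrid, pi * a * b);
```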

18 Advantages of Monte Carlo
In fact there are no advantages for such a 2-dimensional problem.
Consider a multi-dimensional integration:
$$I = \int \cdots \int f(x_1, x_2, \ldots, x_L) \, dx_1 \, dx_2 \cdots dx_L$$

19 Disadvantage of regular grid
f(x1, x2, …, xL) may be evaluated at regularly spaced points as a means of evaluating the integral.
Number of regularly spaced points, N, must increase exponentially with dimension L if error is not to increase exponentially with L.
If N = 100 when L = 2, then adjacent points will be 0.2 cm apart.
If L increases to 3, we need N = $10^3$ points to maintain the same separation between them.
When L = 4, we need N = $10^4$, etc. This is the ‘curse of dimensionality’.
Look at this another way: assume N remains fixed with regular sampling, and L increases.
Each dimension must be sampled more & more sparsely, and less and less efficiently.
More & more points with same value in each dimension.
Error is proportional to $N^{-2/L}$, so for fixed N it increases with L.
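
The growth in the number of grid points can be tabulated with a few lines of MATLAB. The sketch assumes, as in the 2-dimensional example, a region of side 2 cm sampled with 10 points per dimension; it also shows how the spacing widens when N is held at 100 while L grows. The names `side`, `perDim`, `Nfixed` and `spacing` are illustrative.

```matlab
side = 2;            % side length of the region in each dimension (cm)
perDim = 10;         % points per dimension, giving 0.2 cm spacing

% Points needed to keep the spacing at 0.2 cm as L grows
for L = 2:6
    N = perDim^L;
    fprintf('L = %d: N = %d points for spacing %.1f cm\n', L, N, side / perDim);
end

% Spacing obtained if N stays fixed at 100 while L grows
Nfixed = 100;
Ls = 2:6;
spacing = side ./ (Nfixed .^ (1 ./ Ls));
for k = 1:numel(Ls)
    fprintf('L = %d: N = %d gives spacing %.3f cm\n', Ls(k), Nfixed, spacing(k));
end
```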

20 Advantage of random sampling
Uniformly distributed random points in L-dimensional space.
Avoids inefficiency of rectangular grids created by regular sampling by using a purely random sample of N points uniformly distributed.
For high dimensions L, error is proportional to $1/\sqrt{N}$.
To reduce the error by a factor of 2, the sample size N must be increased by a factor of 4 regardless of the dimensionality.
There are ways of decreasing the Monte Carlo error to make the technique still more efficient.
One approach is to use ‘quasi-random’ or ‘low-discrepancy’ sampling.
The use of such quasi-random sampling for numerical integration is referred to as “quasi–Monte Carlo” integration.
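
The 1/√N behaviour can be observed empirically. The sketch below is an illustration under assumed settings rather than anything from the lecture: it estimates the volume of the unit ball in L = 6 dimensions by sampling the hypercube [-1, 1]^6, repeats the estimate many times for N = 1000 and N = 4000, and compares the spread of the results. If the error scales as 1/√N, the ratio of the two standard deviations should be close to 2.

```matlab
L = 6;                     % dimension (assumed for illustration)
nTrials = 400;             % repeated estimates for each sample size
Nvals = [1000 4000];       % second sample size is 4 times the first
sd = zeros(size(Nvals));

for k = 1:numel(Nvals)
    N = Nvals(k);
    est = zeros(nTrials, 1);
    for t = 1:nTrials
        P = 2 * rand(N, L) - 1;           % N points in [-1, 1]^L
        inside = sum(P.^2, 2) <= 1;       % inside the unit ball?
        est(t) = 2^L * mean(inside);      % cube volume times fraction inside
    end
    sd(k) = std(est);                     % spread of the estimates
end

ratio = sd(1) / sd(2);
fprintf('Mean estimate (N = %d): %.3f, exact %.3f\n', Nvals(2), mean(est), pi^3 / 6);
fprintf('Std ratio for a 4-fold increase in N: %.2f\n', ratio);
```

The exact volume of the 6-dimensional unit ball, π³/6, is used only as a reference value for the printed comparison.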

21 MATLAB: Area of Semicircle
    for N=1:200
        M=0;
        for i=1:N
            x=2*rand -1; y=rand*1.0;
            I = sqrt(1-x*x);
            if y <= I, M=M+1; end; %if y <= I point (x,y) is below curve !!!
        end;
        Int(N)=M*2/N;
    end; % of N loop
    figure(6); plot(Int); title('Area of semicircle');
    xlabel('Number of points');

22 Convergence as N → ∞
[Figure: plot of the estimated area of the semicircle against the number of points]

23 MATLAB code for scatter plot
    clear; N=6000; M=0;
    for i=1:N
        x(i)=rand*2-1; y(i)=rand*1.0;
        I = sqrt(1-x(i)*x(i));
        if y(i)<=I, M=M+1; C(i) = 2; else C(i)=1; end;
    end;
    scatter(x,y,6,C,'filled');
    colormap(jet); % so that C=2 plots red and C=1 blue whatever the default colormap
    Int = M*2/N ;
    title(sprintf('Scatter of MC area method: N=%d, Int=%g',N,Int)); % %g because Int is not an integer
    disp('Red if I >= y');
    xlabel('x Plot red if y <= I'); ylabel('y');

24 Integration of circle – scatter plot
[Figure: scatter plot of the sampled points, coloured by whether they fall below the curve]

