1
Unsupervised Learning II Feature Extraction
2
Feature Extraction Techniques Unsupervised methods can also be used to find features that are useful for categorization; they represent a form of smart feature extraction. Feature extraction transforms the input data into a set of features that still describes the data with sufficient accuracy. In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction.
3
What is feature reduction? A linear transformation maps the original (high-dimensional) data to reduced (lower-dimensional) data.
4
High-dimensional data Examples: gene expression, face images, handwritten digits
5
Example Car: km/hour, mile/hour. Machine Learning Course: learning time, interest rate, final grade.
6
Why feature reduction?
– Most machine learning and data mining techniques may not be effective for high-dimensional data.
– The input data to an algorithm may be too large to process and is often redundant (much data, but not much information).
– Analysis with a large number of variables generally requires a large amount of memory and computation power, or a classification algorithm that overfits the training sample and generalizes poorly to new samples.
– The intrinsic dimension may be small; for example, the number of genes responsible for a certain type of disease may be small.
7
Why feature reduction? Visualization: projection of high-dimensional data onto 2D or 3D. Data compression: efficient storage and retrieval. Noise removal: positive effect on query accuracy.
8
Feature reduction versus feature selection Feature reduction – All original features are used – The transformed features are linear combinations of the original features. Feature selection – Only a subset of the original features are used. Continuous versus discrete
9
Application of feature reduction Face recognition Handwritten digit recognition Text mining Image retrieval Microarray data analysis Protein classification
10
Algorithms Feature Extraction Techniques Principal component analysis Singular value decomposition Non-negative matrix factorization Independent component analysis
11
What is Principal Component Analysis? Principal component analysis (PCA) – Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables – Retains most of the sample's information. – Useful for the compression and classification of data. By information we mean the variation present in the sample, given by the correlations between the original variables. – The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
12
Geometric picture of principal components (PCs) The 1st PC is a minimum-distance fit to a line in X space. The 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC. PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
13
Principal Components Analysis (PCA) Principle – Linear projection method to reduce the number of parameters – Transforms a set of correlated variables into a new set of uncorrelated variables – Maps the data into a space of lower dimensionality – A form of unsupervised learning Properties – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables – The new axes are orthogonal and represent the directions of maximum variability
14
Background Mathematics Linear Algebra Calculus Probability and Computing
15
Linear Algebra A matrix is a set of elements, organized into rows and columns. A = [a_ij] is a matrix with m rows and n columns, where the entries of A are real numbers.
16
Vector The i-th element of a vector x is denoted x_i.
17
Basic Notation
18
Basic Operations Addition, subtraction, multiplication. Addition and subtraction are element-wise; multiplication pairs each row of the first matrix with each column of the second.
19
Multiplication Is AB = BA? Maybe, but maybe not! The product C = AB has entries c_ij = Σ_k a_ik b_kj. Heads up: matrix multiplication is NOT commutative – AB ≠ BA in general!
20
Multiplication Example Multiplying a [3 x 2] matrix by a [2 x 3] matrix gives a [3 x 3] result.
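A small numpy sketch of the shape rule and of non-commutativity (the matrix entries below are our own illustrative values, not taken from the slides):

```python
import numpy as np

# A is 3 x 2, B is 2 x 3 -- entries chosen only for illustration
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
B = np.array([[7, 8, 9],
              [0, 1, 2]])

AB = A @ B          # (3 x 2)(2 x 3) -> 3 x 3
BA = B @ A          # (2 x 3)(3 x 2) -> 2 x 2
print(AB.shape, BA.shape)       # (3, 3) (2, 2): AB and BA are not even the same size

# Even for square matrices, AB != BA in general
X = np.array([[0, 1], [0, 0]])
Y = np.array([[0, 0], [1, 0]])
print(np.array_equal(X @ Y, Y @ X))   # False
```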
21
Vector Operations Vector: an N x 1 matrix. Interpretation: a line (direction) in N-dimensional space. Dot product, cross product, and magnitude are defined on vectors only.
22
Basic concepts Transpose: reflect the vector/matrix across its diagonal. – Note: (Ax)^T = x^T A^T. Vector norms: – The L_p norm of v = (v_1, …, v_k) is (Σ_i |v_i|^p)^(1/p). – Common norms: L_1, L_2.
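These identities can be checked numerically; a minimal sketch with arbitrary example values:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([[5.0], [6.0]])              # a 2 x 1 column vector

# (Ax)^T equals x^T A^T
print(np.allclose((A @ x).T, x.T @ A.T))  # True

# L1 and L2 norms of v = (v_1, ..., v_k)
v = np.array([3.0, -4.0])
print(np.sum(np.abs(v)))                  # L1 norm: 7.0
print(np.sqrt(np.sum(v ** 2)))            # L2 norm: 5.0
print(np.linalg.norm(v, 1), np.linalg.norm(v, 2))   # the same, via numpy
```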
23
Vectors: Dot Product Interpretation: the dot product measures to what degree two vectors are aligned. (Vector addition A + B = C uses the head-to-tail method to combine vectors.)
24
Vectors: Dot Product Think of the dot product as a matrix multiplication: x · y = x^T y. The squared magnitude is the dot product of a vector with itself: ||x||^2 = x · x. The dot product is also related to the angle between the two vectors – but by itself it doesn't tell us the angle. The outer product of two vectors, x y^T, is a matrix rather than a scalar.
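The same facts written as numpy operations (the example vectors are ours, not the slide's):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, 1.0])

dot = x @ y                       # dot product as a matrix product x^T y
mag_sq = x @ x                    # squared magnitude = dot product of x with itself
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))   # angle information
outer = np.outer(x, y)            # outer product: a 3 x 3 matrix, not a scalar

print(dot, mag_sq, round(cos_theta, 3))   # 4.0 9.0 0.596
print(outer.shape)                        # (3, 3)
```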
25
Vectors: Dot Product cos(α+β)=cos(α) cos(β)-sin(α) sin(β)
26
Vectors: Dot Product Let α + β = θ; we want to prove that a · b = |a||b| cos θ for a = (a_1, a_2) and b = (b_1, b_2).
27
Linear Equations Linear algebra provides a way of representing and operating on sets of linear equations
28
Types of matrices A square matrix whose elements a_ij = 0 for i > j is called upper triangular. A square matrix whose elements a_ij = 0 for i < j is called lower triangular.
29
Types of matrices A matrix that is both upper and lower triangular, i.e. a_ij = 0 for i ≠ j, is called a diagonal matrix, written simply diag(a_11, a_22, …, a_nn).
30
Types of matrices In particular, when a_11 = a_22 = … = a_nn = 1, the diagonal matrix is called the identity matrix I. Properties: AI = IA = A.
31
Types of matrices The inverse of a matrix If matrices A and B are such that AB = BA = I, then B is called the inverse of A (symbol: A^-1) and A is called the inverse of B (symbol: B^-1). Example: to show that B is the inverse of a matrix A, check that AB = BA = I.
32
Types of matrices The transpose of a matrix The matrix obtained by interchanging the rows and columns of a matrix A is called the transpose of A (written A^T). For a matrix A = [a_ij], its transpose A^T = [b_ij] has b_ij = a_ji.
33
Types of matrices Symmetric matrix A matrix A such that A^T = A is called symmetric, i.e. a_ji = a_ij for all i and j. A + A^T must be symmetric. Why? A matrix A such that A^T = -A is called skew-symmetric, i.e. a_ji = -a_ij for all i and j. A - A^T must be skew-symmetric. Why?
34
Types of matrices Orthogonal matrix A matrix A is called orthogonal if AA^T = A^T A = I, i.e. A^T = A^-1. Example: to prove that a given matrix is orthogonal, show that AA^T = A^T A = I.
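For instance, a 2-D rotation matrix (a standard example of our choosing, not necessarily the matrix on the slide) is orthogonal:

```python
import numpy as np

theta = np.pi / 6                              # any angle works
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(A @ A.T, np.eye(2)))         # True: A A^T = I
print(np.allclose(A.T @ A, np.eye(2)))         # True: A^T A = I
print(np.allclose(A.T, np.linalg.inv(A)))      # True: A^T = A^-1
```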
35
Properties of matrices (AB)^-1 = B^-1 A^-1; (A^T)^T = A and (λA)^T = λA^T; (A + B)^T = A^T + B^T; (AB)^T = B^T A^T.
36
Example: Prove (AB)^-1 = B^-1 A^-1. Since (AB)(B^-1 A^-1) = A(B B^-1)A^-1 = A A^-1 = I and (B^-1 A^-1)(AB) = B^-1 (A^-1 A) B = B^-1 B = I, B^-1 A^-1 is the inverse of the matrix AB.
37
Determinants Determinant of order 2 Consider a 2 x 2 matrix A with entries a_11, a_12 (first row) and a_21, a_22 (second row). The determinant of A, denoted |A| or det(A), is a number and can be evaluated by |A| = a_11 a_22 - a_12 a_21.
38
Determinants Determinant of order 2 Easy to remember (for order 2 only): the product along the main diagonal (+) minus the product along the other diagonal (-). Example: evaluate the determinant of a 2 x 2 matrix this way.
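A quick check of the order-2 rule against numpy's determinant (the entries are arbitrary example values):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

det_by_rule = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]   # a11*a22 - a12*a21
print(det_by_rule)                                     # -2.0
print(np.linalg.det(A))                                # -2.0 (up to rounding)
```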
39
The following properties are true for determinants of any order. 1. If every element of a row (column) is zero, then |A| = 0. 2. |A^T| = |A| (the determinant of a matrix equals that of its transpose). 3. |AB| = |A||B|.
40
Determinants Example: Show that the determinant of any orthogonal matrix is either +1 or -1. For any orthogonal matrix, A A^T = I. Since |AA^T| = |A||A^T| = |I| = 1 and |A^T| = |A|, we get |A|^2 = 1, so |A| = ±1.
41
Determinants of order 3 Example:
42
Determinants Inversion number The inversion number t of a permutation counts, for each element, the smaller elements that appear after it (equivalently, the number of out-of-order pairs). Examples: – t[3 2 1 4] = 2 + 1 + 0 + 0 = 3 – t[1 3 5 2 4] = 0 + 1 + 2 + 0 + 0 = 3 – t[n, n-1, …, 2, 1] = n(n-1)/2
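A small helper that counts inversions with the pair convention used above; this is our own illustration, not code from the slides:

```python
def inversion_number(perm):
    """Count the pairs (i, j) with i < j and perm[i] > perm[j]."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

print(inversion_number([3, 2, 1, 4]))       # 3
print(inversion_number([1, 3, 5, 2, 4]))    # 3
n = 6
print(inversion_number(list(range(n, 0, -1))), n * (n - 1) // 2)   # 15 15
```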
43
Cofactor of a matrix The cofactor C_ij for element a_ij of matrix A is (-1)^(i+j) times the determinant of the submatrix obtained by deleting row i and column j of A.
44
Cofactor of a matrix The cofactor matrix of A is then given by C = [C_ij], the matrix of all these cofactors.
45
Determinants of order 3 Consider an example: the determinant of a 3 x 3 matrix can be obtained by cofactor expansion along a row, e.g. |A| = a_11 C_11 + a_12 C_12 + a_13 C_13.
46
Inverse of a 3 x 3 matrix The inverse of A is given by A^-1 = (1/|A|) adj(A), where the adjugate adj(A) is the transpose of the cofactor matrix.
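A sketch of the cofactor/adjugate route for a 3 x 3 matrix, checked against numpy's built-in inverse (the matrix below is an arbitrary invertible example):

```python
import numpy as np

def inverse_by_cofactors(A):
    """Invert a square matrix via A^-1 = adj(A) / det(A)."""
    n = A.shape[0]
    C = np.zeros_like(A, dtype=float)                 # cofactor matrix
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T / np.linalg.det(A)                     # adjugate = transpose of C

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 1.0, 2.0]])
print(np.allclose(inverse_by_cofactors(A), np.linalg.inv(A)))   # True
```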
47
Inverse of matrix Switch two rows of D and call the new matrix D_1. Based on the formula, |D| = -|D_1|. Therefore, if two rows of D are the same, |D| = -|D|, so |D| = 0.
48
Inverse of matrix
49
Eigenvalues & eigenvectors Example. Consider a matrix A and three column vectors C_1, C_2, C_3 for which A C_1 = 0·C_1, A C_2 = -4·C_2 and A C_3 = 3·C_3. In other words, 0, -4 and 3 are eigenvalues of A, and C_1, C_2 and C_3 are eigenvectors.
50
Eigenvalues & eigenvectors Consider the matrix P whose columns are C_1, C_2, and C_3. We have det(P) = 84, so this matrix is invertible. Easy calculations give P^-1; next we evaluate the matrix P^-1AP.
51
Eigenvalues & eigenvectors Using matrix multiplication, we obtain P^-1AP = diag(0, -4, 3), which implies that A is similar to a diagonal matrix. In particular, we have A = P D P^-1 with D = diag(0, -4, 3).
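The matrix itself did not survive extraction, but one matrix consistent with the numbers quoted on these slides (eigenvalues 0, -4, 3; det(P) = 84; the eigenvector systems solved later) is A = [[1, 2, 1], [6, -1, 0], [-1, -2, -1]]. Treating that as an assumption, the similarity calculation can be checked with numpy:

```python
import numpy as np

# Assumed matrix, consistent with the eigenvalues 0, -4, 3 and det(P) = 84
# quoted on these slides (the original matrix image is missing).
A = np.array([[ 1,  2,  1],
              [ 6, -1,  0],
              [-1, -2, -1]], dtype=float)

# Columns: eigenvectors for 0, -4, 3 (the first two appear on later slides; the third is inferred)
P = np.array([[  1, -1,  2],
              [  6,  2,  3],
              [-13,  1, -2]], dtype=float)

print(round(np.linalg.det(P)))            # 84, so P is invertible
D = np.linalg.inv(P) @ A @ P
print(np.round(D, 8))                     # diag(0, -4, 3): A is similar to a diagonal matrix
```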
52
Eigenvalues & eigenvectors Definition. Let A be a square matrix. A non-zero vector C is called an eigenvector of A if and only if there exists a number (real or complex) λ such that AC = λC. If such a number λ exists, it is called an eigenvalue of A, and the vector C is called an eigenvector associated to the eigenvalue λ. Remark. The eigenvector C must be non-zero, since A·0 = 0 = λ·0 holds for any number λ.
53
Eigenvalues & eigenvectors Example. Consider the matrix A from the example above. We have seen that A C_1 = 0·C_1, A C_2 = -4·C_2 and A C_3 = 3·C_3. So C_1 is an eigenvector of A associated to the eigenvalue 0, C_2 is an eigenvector associated to the eigenvalue -4, while C_3 is an eigenvector associated to the eigenvalue 3.
54
Eigenvalues & eigenvectors For a square matrix A of order n, the number λ is an eigenvalue if and only if there exists a non-zero vector C such that AC = λC. Using the matrix multiplication properties, we obtain the equivalent equation (A - λI)C = 0. This is a linear system whose coefficient matrix is A - λI. The zero vector is always a solution, and it is the only solution exactly when the coefficient matrix is invertible, i.e. det(A - λI) ≠ 0. Since C must be non-zero, we must therefore have det(A - λI) = 0.
55
Eigenvalues & eigenvectors Example. For a 2 x 2 matrix A, the equation det(A - λI) = 0 translates into a quadratic equation in λ. Solving this equation (use the quadratic formula) leads to at most two roots; in other words, such a matrix A has only two eigenvalues.
56
Eigenvalues & eigenvectors In general, for a square matrix A of order n, the equation det(A - λI) = 0 gives the eigenvalues of A. This equation is called the characteristic equation (its left-hand side the characteristic polynomial) of A. It is a polynomial function in λ of degree n, so this equation has no more than n roots or solutions. Hence a square matrix A of order n has no more than n eigenvalues.
57
Eigenvalues & eigenvectors Example. Consider the diagonal matrix D = diag(a, b, c, d). Its characteristic polynomial is det(D - λI) = (a - λ)(b - λ)(c - λ)(d - λ). So the eigenvalues of D are a, b, c, and d, i.e. the entries on the diagonal.
58
Eigenvalues & eigenvectors Remark. Any square matrix A has the same eigenvalues as its transpose A^T. For any square matrix A of order 2 with entries a, b (first row) and c, d (second row), the characteristic polynomial is λ^2 - (a + d)λ + (ad - bc) = 0. The number (a + d) is called the trace of A (denoted tr(A)), and the number (ad - bc) is the determinant of A. So the characteristic polynomial of A can be rewritten as λ^2 - tr(A)λ + det(A) = 0.
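A quick numerical check of λ^2 - tr(A)λ + det(A) = 0 at the computed eigenvalues, and of the transpose remark (the 2 x 2 matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
tr, det = np.trace(A), np.linalg.det(A)

for lam in np.linalg.eigvals(A):
    # every eigenvalue is a root of the characteristic polynomial
    print(np.isclose(lam ** 2 - tr * lam + det, 0.0))          # True, True

# A and A^T have the same eigenvalues
print(np.allclose(np.sort(np.linalg.eigvals(A)),
                  np.sort(np.linalg.eigvals(A.T))))            # True
```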
59
Theorem. Let A be a square matrix of order n. If λ is an eigenvalue of A, then: Eigenvalues & eigenvectors
60
Computation of Eigenvectors Let A be a square matrix of order n and λ one of its eigenvalues. Let X be an eigenvector of A associated to λ. We must have AX = λX, i.e. (A - λI)X = 0. This is a linear system whose coefficient matrix is A - λI. Since the zero vector is a solution, the system is consistent. Remark. Note that if X is a vector which satisfies AX = λX, then the vector Y = cX (for any arbitrary number c) satisfies the same equation, i.e. AY = λY. In other words, if we know that X is an eigenvector, then cX is also an eigenvector associated to the same eigenvalue.
61
Computation of Eigenvectors Example. Consider the matrix A from the earlier example. First we look for the eigenvalues of A. These are given by the characteristic equation det(A - λI) = 0. If we develop this determinant using the third column and simplify with some algebraic manipulations, we find that the eigenvalues of A are 0, -4, and 3.
62
Computation of Eigenvectors Eigenvectors associated with eigenvalues 1. Case λ = 0: the associated eigenvectors are given by the linear system AX = 0. The third equation is identical to the first. From the second equation, we have y = 6x, so the first equation reduces to 13x + z = 0. So this system is equivalent to y = 6x, z = -13x.
63
Computation of Eigenvectors So the unknown vector X is given by X = (x, y, z) = (c, 6c, -13c). Therefore, any eigenvector X of A associated to the eigenvalue 0 is given by X = c·(1, 6, -13), where c is an arbitrary number.
64
Computation of Eigenvectors 2. Case λ = -4: the associated eigenvectors are given by the linear system (A + 4I)X = 0. We use elementary row operations to solve it. First we consider the augmented matrix of the system.
65
Computation of Eigenvectors Then we use elementary row operations to reduce it to an upper-triangular form. First we interchange the first row with the last one. Next, we use the new first row to eliminate the 5 and the 6 in the first column of the other two rows.
66
Computation of Eigenvectors If we cancel the 8 and the 9 from the second and third rows (dividing each row by its common factor), the two rows become identical. Finally, we subtract the second row from the third to get a row of zeros.
67
Computation of Eigenvectors Next, we set z = c. From the second row, we get y = 2z = 2c. The first row then implies x = -2y + 3z = -c. Hence X = (-c, 2c, c). Therefore, any eigenvector X of A associated to the eigenvalue -4 is given by X = c·(-1, 2, 1), where c is an arbitrary number.
68
Computation of Eigenvectors 3. Case λ = 3: using ideas similar to those described above, one may easily show that any eigenvector X of A associated to the eigenvalue 3 is given by X = c·v for a fixed vector v, where c is an arbitrary number.
69
Computation of Eigenvectors Summary: Let A be a square matrix. Assume λ is an eigenvalue of A. In order to find the associated eigenvectors, we do the following steps: 1. Write down the associated linear system (A - λI)X = 0. 2. Solve the system. 3. Rewrite the unknown vector X as a linear combination of known vectors.
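These steps can be sketched in numpy by taking the null space of A - λI from an SVD. Using the same assumed matrix as in the earlier sketch and λ = -4, the result is proportional to (-1, 2, 1), as derived above:

```python
import numpy as np

def eigenvectors_for(A, lam, tol=1e-10):
    """Steps 1-3: solve (A - lam*I) X = 0 and return a basis of solutions."""
    M = A - lam * np.eye(A.shape[0])
    _, s, Vt = np.linalg.svd(M)
    # Right singular vectors whose singular value is (numerically) zero span the null space
    return Vt[s <= tol].T                  # columns are eigenvectors for this eigenvalue

# Assumed matrix, consistent with the numbers quoted on these slides
A = np.array([[ 1,  2,  1],
              [ 6, -1,  0],
              [-1, -2, -1]], dtype=float)

v = eigenvectors_for(A, -4.0)[:, 0]
print(np.round(v / v[2], 6))               # [-1.  2.  1.]  i.e. any c * (-1, 2, 1)
```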
70
Why Eigenvectors and Eigenvalues An eigenvector of a square matrix is a non-zero vector that, when multiplied by the matrix, yields a vector that differs from the original by at most a multiplicative scalar; that scalar is the corresponding eigenvalue. In the shear mapping shown on the slide, the red arrow changes direction but the blue arrow does not. The blue arrow is an eigenvector, and since its length is unchanged its eigenvalue is 1.
71
Diagonalizable Matrix A square matrix A is called diagonalizable if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix P such that P^-1AP is a diagonal matrix. Diagonalization is the process of finding a corresponding diagonal matrix for a diagonalizable matrix or linear map. Diagonalizable matrices and maps are of interest because diagonal matrices are especially easy to handle: their eigenvalues and eigenvectors are known, and one can raise a diagonal matrix to a power by simply raising the diagonal entries to that same power. Similar matrices have the same rank, the same eigenvalues, and the same determinant.
72
Probability Theory Expected Value Variance Standard Deviation Covariance Covariance Matrix
73
Random Variable A random variable x takes on a defined set of values with different probabilities. For example, if you roll a die, the outcome is random (not fixed) and there are 6 possible outcomes, each of which occurs with probability one-sixth. For example, if you poll people about their voting preferences, the percentage of the sample that responds “Yes on Proposition 100” is also a random variable (the percentage will be slightly different every time you poll). Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over (the “frequentist” view).
74
Random variables can be discrete or continuous Discrete random variables have a countable number of outcomes – Examples: Dead/alive, treatment/placebo, dice, counts, etc. Continuous random variables have an infinite continuum of possible values. – Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.
75
Probability functions A probability function maps the possible values of x to their respective probabilities of occurrence, p(x). p(x) is a number from 0 to 1.0. The area under a probability function is always 1.
76
Discrete example: roll of a die. The plot of p(x) is flat: p(x) = 1/6 for each x in {1, 2, 3, 4, 5, 6}.
77
Probability mass function (pmf)
x   p(x)
1   p(x=1) = 1/6
2   p(x=2) = 1/6
3   p(x=3) = 1/6
4   p(x=4) = 1/6
5   p(x=5) = 1/6
6   p(x=6) = 1/6
Total: 1.0
78
Cumulative distribution function (CDF): a step plot of P(x ≤ A) against x, rising from 1/6 at x = 1 through 1/3, 1/2, 2/3, 5/6 to 1.0 at x = 6.
79
Cumulative distribution function
x   P(x ≤ A)
1   P(x ≤ 1) = 1/6
2   P(x ≤ 2) = 2/6
3   P(x ≤ 3) = 3/6
4   P(x ≤ 4) = 4/6
5   P(x ≤ 5) = 5/6
6   P(x ≤ 6) = 6/6
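A minimal sketch of the die's pmf and CDF using exact fractions:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}       # p(x) = 1/6 for every face

# The CDF P(X <= A) accumulates the pmf
cdf, running = {}, Fraction(0)
for x in outcomes:
    running += pmf[x]
    cdf[x] = running

print(sum(pmf.values()))   # 1: the pmf sums to one
print(cdf[3])              # 1/2: probability of rolling a 3 or less (the review question below)
print(cdf[6])              # 1: the CDF reaches one at the largest outcome
```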
80
Review Question If you toss a die, what’s the probability that you roll a 3 or less? a.1/6 b.1/3 c.1/2 d.5/6 e.1.0
81
Review Question Two dice are rolled and the sum of the face values is six. What is the probability that at least one of the dice came up a 3? a. 1/5 b. 2/3 c. 1/2 d. 5/6 e. 1.0
82
Review Question Two dice are rolled and the sum of the face values is six. What is the probability that at least one of the dice came up a 3? a.1/5 b.2/3 c.1/2 d.5/6 e.1.0 How can you get a 6 on two dice? 1-5, 5-1, 2-4, 4-2, 3-3 One of these five has a 3. 1/5
83
Continuous case The probability function that accompanies a continuous random variable is a continuous mathematical function that integrates to 1. For example, recall the negative exponential function p(x) = e^(-x) for x ≥ 0 (in probability, this is called an “exponential distribution”). This function integrates to 1: ∫_0^∞ e^(-x) dx = 1.
84
Expected Value and Variance All probability distributions are characterized by an expected value (mean) and a variance (standard deviation squared). The mean (or expected value) of X gives the value that we would expect to observe on average in a large number of repetitions of the experiment
85
Expected value of a random variable Expected value is just the average or mean (µ) of random variable x. It’s sometimes called a “weighted average” because more frequent values of X are weighted more highly in the average. It’s also how we expect X to behave on-average over the long run (“frequentist” view again).
86
Expected Value Discrete case: E(X) = Σ_i x_i p(x_i). Continuous case: E(X) = ∫ x p(x) dx.
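Both cases in a short sketch: the exact sum for a fair die and a numerical integral for the exponential density e^(-x) mentioned earlier (the grid and truncation point are our own choices):

```python
import numpy as np

# Discrete case: E(X) = sum_i x_i * p(x_i), here for a fair die
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
print(np.sum(x * p))                  # 3.5

# Continuous case: E(X) = integral of x * p(x) dx, here p(x) = exp(-x) on [0, inf)
t = np.linspace(0.0, 50.0, 200001)    # truncate the infinite upper limit at 50
dx = t[1] - t[0]
print(np.sum(t * np.exp(-t)) * dx)    # ~1.0, the exact mean of this distribution
```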
87
Variance in General The variance is a measure of how far a set of numbers is spread out: Var(X) = E[(X - μ)^2] = E[X^2] - μ^2.
88
Practice Problem On the roulette wheel, X = 1 with probability 18/38 and X = -1 with probability 20/38. – We already calculated the mean to be -$0.053. What's the variance of X?
89
Answer Var(X) = E[(X - μ)^2] = (18/38)(1 + 0.053)^2 + (20/38)(-1 + 0.053)^2 ≈ 0.997, so the standard deviation is ≈ $0.99. Interpretation: On average, you're either about 1 dollar above or 1 dollar below the mean, which is just under zero. Makes sense!
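The same numbers computed directly from the definitions (a sketch, no new data):

```python
import numpy as np

values = np.array([1.0, -1.0])           # win $1 or lose $1
probs  = np.array([18 / 38, 20 / 38])

mean = np.sum(values * probs)                        # about -0.053
var  = np.sum(probs * (values - mean) ** 2)          # E[(X - mu)^2]
sd   = np.sqrt(var)

print(round(mean, 3), round(var, 3), round(sd, 4))   # -0.053 0.997 0.9986 (about $0.99)
```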
90
Sample Mean is a special case of Expected Value Sample mean, for a sample of n subjects: x̄ = (1/n) Σ_{i=1}^{n} x_i. The probability (frequency) of each person in the sample is 1/n.
91
Sample Standard Deviation and Variance Variance: s^2 = (1/(n-1)) Σ_i (x_i - x̄)^2. Standard deviation: s = sqrt(s^2).
92
Covariance A measure of how much two random variables change together: cov(X, Y) = E[(X - μ_X)(Y - μ_Y)]; for a sample, cov(x, y) = (1/(n-1)) Σ_i (x_i - x̄)(y_i - ȳ). Special case: cov(X, X) = Var(X), so variance is a special case of covariance.
93
Covariance Matrix For a data set with three variables x, y, z and observations (x_i, y_i, z_i), i = 1, …, n, the covariance matrix is the 3 x 3 matrix whose (j, k) entry is the covariance between the j-th and k-th variables: C = [[cov(x,x), cov(x,y), cov(x,z)], [cov(y,x), cov(y,y), cov(y,z)], [cov(z,x), cov(z,y), cov(z,z)]].
94
Covariance Example DATA (x, y): (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9). ZERO-MEAN DATA (x - 1.81, y - 1.91): (0.69, 0.49), (-1.31, -1.21), (0.39, 0.99), (0.09, 0.29), (1.29, 1.09), (0.49, 0.79), (0.19, -0.31), (-0.81, -0.81), (-0.31, -0.31), (-0.71, -1.01).
95
Covariance Matrix Example Calculate the covariance matrix: C = [[0.616555556, 0.615444444], [0.615444444, 0.716555556]]. Since the off-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together. Note that C is a symmetric matrix.
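The matrix above can be reproduced with numpy's sample covariance, which, like the example, divides by n - 1:

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

C = np.cov(x, y)             # sample covariance matrix, denominator n - 1
print(C)
# [[0.61655556 0.61544444]
#  [0.61544444 0.71655556]]
print(np.allclose(C, C.T))   # True: the covariance matrix is symmetric
```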
96
Finally We are ready for Principal Component Analysis
97
Principal Components Analysis (PCA) Principle – Linear projection method to reduce the number of parameters – Transforms a set of correlated variables into a new set of uncorrelated variables – Maps the data into a space of lower dimensionality – A form of unsupervised learning Properties – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables – The new axes are orthogonal and represent the directions of maximum variability
98
PCs and Variance The first PC retains the greatest amount of variation in the sample. The kth PC retains the kth greatest fraction of the variation in the sample. The kth largest eigenvalue of the covariance (correlation) matrix C is the variance in the sample along the kth PC.
99
Dimensionality Reduction Can ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don’t lose much – n dimensions in original data – calculate n eigenvectors and eigenvalues – choose only the first p eigenvectors, based on their eigenvalues – final data set has only p dimensions
100
PCA Example – STEP 1 Subtract the mean from each of the data dimensions: all the x values have the mean of x subtracted, and all the y values have the mean of y subtracted from them. This produces a data set whose mean is zero. Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations; the variance and covariance values are not affected by the mean value.
101
PCA Example – STEP 1 DATA (x, y): (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9). ZERO-MEAN DATA (x - 1.81, y - 1.91): (0.69, 0.49), (-1.31, -1.21), (0.39, 0.99), (0.09, 0.29), (1.29, 1.09), (0.49, 0.79), (0.19, -0.31), (-0.81, -0.81), (-0.31, -0.31), (-0.71, -1.01). http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
102
PCA Example –STEP 1 http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
103
PCA Example – STEP 2 Calculate the covariance matrix: cov = [[0.616555556, 0.615444444], [0.615444444, 0.716555556]]. Since the off-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together.
104
PCA Example – STEP 3 Calculate the eigenvectors and eigenvalues of the covariance matrix: eigenvalues = (0.0490834, 1.28402771); eigenvectors (as columns, in the same order) = [[-0.73517866, -0.677873399], [0.67787340, -0.735178656]].
105
PCA Example – STEP 3 http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf The eigenvectors are plotted as diagonal dotted lines on the plot. Note that they are perpendicular to each other, and that one of the eigenvectors goes through the middle of the points, like drawing a line of best fit. The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line, but are off to the side of the main line by some amount.
106
PCA Example – STEP 4 Reduce dimensionality and form the feature vector. The eigenvector with the highest eigenvalue is the principal component of the data set. In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data. Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance.
107
PCA Example – STEP 4 Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much: with n dimensions in your data you calculate n eigenvectors and eigenvalues, choose only the first p eigenvectors, and the final data set has only p dimensions.
108
PCA Example – STEP 4 Feature Vector FeatureVector = (eig_1 eig_2 eig_3 … eig_n). We can either form a feature vector with both of the eigenvectors: [[-0.677873399, -0.735178656], [-0.735178656, 0.677873399]], or we can choose to leave out the smaller, less significant component and only keep a single column: (-0.677873399, -0.735178656)^T.
109
PCA Example – STEP 5 Deriving the new data: FinalData = RowFeatureVector x RowZeroMeanData. RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top. RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension.
110
PCA Example – STEP 5 FinalData transpose: dimensions along columns
x             y
-0.827970186  -0.175115307
 1.77758033    0.142857227
-0.992197494   0.384374989
-0.274210416   0.130417207
-1.67580142   -0.209498461
-0.912949103   0.175282444
 0.0991094375 -0.349824698
 1.14457216    0.0464172582
 0.438046137   0.0177646297
 1.22382056   -0.162675287
111
PCA Example –STEP 5 http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
112
PCA Algorithm 1. Get some data. 2. Subtract the mean. 3. Calculate the covariance matrix. 4. Calculate the eigenvectors and eigenvalues of the covariance matrix. 5. Choose components and form a feature vector. 6. Derive the new data set.
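The whole algorithm in a compact numpy sketch on the tutorial data used above. Note that eigenvector signs are arbitrary, so the projections may differ from the slides by a sign flip:

```python
import numpy as np

# Step 1: get the data and subtract the mean
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])                 # shape (10, 2)
centered = data - data.mean(axis=0)

# Step 2: covariance matrix (n - 1 in the denominator, as in the example)
C = np.cov(centered, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)           # eigh: C is symmetric

# Step 4: order components by eigenvalue (highest first) and keep the top p
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
p = 1
feature_vector = eigvecs[:, :p]                # the principal component(s)

# Step 5: derive the new data set by projecting onto the chosen components
final_data = centered @ feature_vector         # shape (10, p)

print(np.round(eigvals, 8))           # [1.28402771 0.0490834 ]
print(np.round(final_data[:3, 0], 6)) # first three projections, matching the slide up to sign
```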
113
Why PCA? Maximum variance theory In general, the variance of the noise should be low and the variance of the signal should be high. A common measure is the signal-to-noise ratio (SNR); a high SNR indicates high-precision data.
114
Projection To project a data point x onto a direction given by a unit vector u, use the dot product: the projection u^T x is the (signed) length from the projected point to the origin of coordinates. Because the data have been mean-centred, the mean of the projections is zero.
115
Maximum Variance The variance of the projections is (1/n) Σ_i (u^T x_i)^2 = u^T C u, where C = (1/n) Σ_i x_i x_i^T is the covariance matrix (since the mean is zero). Maximizing u^T C u subject to u^T u = 1 gives C u = λ u. We got it: λ is an eigenvalue of the covariance matrix C, u is the corresponding eigenvector, and the variance along u equals λ. The goal of PCA is to find the u for which the variance of all projection points is maximum, i.e. the eigenvector with the largest eigenvalue.
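A small numerical experiment supporting this claim: for any unit vector u, the variance of the projections never exceeds the largest eigenvalue of C, and it is attained by the corresponding eigenvector (the synthetic data and random directions are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])   # correlated 2-D data
X = X - X.mean(axis=0)                                               # zero mean

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

# Variance of the projections u^T x for a few random unit vectors u
for _ in range(5):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    print(np.var(X @ u, ddof=1) <= top_val + 1e-12)        # True every time

# The maximum is attained along the top eigenvector, and equals its eigenvalue
print(np.isclose(np.var(X @ top_vec, ddof=1), top_val))    # True
```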