Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli

Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli

Roadmap Math Reminder Principal Components Analysis Clustering ANOVA

Standard Deviation Statistics analyzes data sets in terms of the relationships between the individual points. The standard deviation is a measure of the spread of the data. Calculation: roughly, the average distance of the data points from the mean.

Variance Another measure of the spread of the data in a data set. Calculation: Var(X) = E((x – μ)^2). Why have both variance and SD to measure the spread of data? Variance is often described as the original statistical measure of spread, but its unit is a square, e.g. cm^2, which is awkward for expressing heights or other measures. Hence SD, the square root of the variance, was born.
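As a quick illustration (a minimal sketch with made-up height values, not data from the slides), both measures can be computed with NumPy; np.var uses the population formula E((x − μ)^2), matching the definition above:

    import numpy as np

    # Hypothetical height measurements (illustrative values only).
    heights_cm = np.array([150.0, 160.0, 165.0, 170.0, 180.0])

    variance = np.var(heights_cm)   # population variance E((x - mean)^2), unit: cm^2
    std_dev = np.std(heights_cm)    # square root of the variance, unit: cm

    print(variance, std_dev)        # 100.0 and 10.0 for these values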

Covariance Variance measures the deviation from the mean for points in one dimension, e.g. heights. Covariance measures how much two dimensions vary from their means with respect to each other. Covariance is measured between 2 dimensions to see if there is a relationship between them, e.g. number of hours studied and marks obtained. The covariance of a dimension with itself is the variance.

Covariance Matrix Covariances between dimensions can be arranged as a matrix, e.g. for 3 dimensions: C = [cov(x,x) cov(x,y) cov(x,z); cov(y,x) cov(y,y) cov(y,z); cov(z,x) cov(z,y) cov(z,z)]. The diagonal holds the variances of x, y and z. Since cov(x,y) = cov(y,x), the matrix is symmetric about the diagonal. n-dimensional data gives an n×n covariance matrix.
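A hedged sketch of building such a matrix in NumPy (the data values below are invented for illustration):

    import numpy as np

    # Hypothetical 3-dimensional data: each row is one observation of (x, y, z).
    data = np.array([[2.0, 3.1, 0.5],
                     [1.5, 2.9, 0.7],
                     [3.2, 4.0, 0.2],
                     [2.8, 3.7, 0.4]])

    # rowvar=False treats columns as dimensions; np.cov uses the sample (n - 1)
    # normalisation. The result is the 3x3 matrix C described above.
    C = np.cov(data, rowvar=False)

    print(C)                    # variances on the diagonal, covariances off it
    print(np.allclose(C, C.T))  # True: cov(x, y) = cov(y, x), symmetric matrix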

Covariance The exact value is not as important as its sign. A positive covariance indicates that both dimensions increase or decrease together, e.g. as the number of hours studied increases, the marks in that subject increase. A negative value indicates that while one increases the other decreases, or vice versa, e.g. active social life at RIT vs. performance in the CS dept. If the covariance is zero, the two dimensions are uncorrelated, e.g. heights of students vs. the marks obtained in a subject.

Transformation matrices Consider: [2 3; 2 1] × [3; 2] = [12; 8] = 4 × [3; 2]. The square transformation matrix moves (3,2) from its original location. Now take a multiple of (3,2), e.g. 2 × (3,2) = (6,4): [2 3; 2 1] × [6; 4] = [24; 16] = 4 × [6; 4].

Transformation matrices Scale the vector (3,2) by a value of 2 to get (6,4), then multiply by the square transformation matrix: the result is still 4 times the vector. WHY? A vector consists of both length and direction; scaling a vector only changes its length, not its direction. This is an important observation about matrix transformations, leading to the definition of eigenvectors and eigenvalues. Irrespective of how much we scale (3,2) by, the result is always 4 times the (scaled) vector.
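These two multiplications can be checked directly in NumPy (a small sketch using the same matrix and vector as above):

    import numpy as np

    A = np.array([[2, 3],
                  [2, 1]])      # the square transformation matrix from the slide
    v = np.array([3, 2])

    print(A @ v)          # [12  8], i.e. 4 * [3 2]
    print(A @ (2 * v))    # [24 16], i.e. 4 * [6 4]: still 4 times the scaled vector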

The eigenvalue problem The eigenvalue problem is any problem having the following form: A · v = λ · v, where A is an n×n matrix, v is an n×1 non-zero vector, and λ is a scalar. Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.

The eigenvalue problem Example: [2 3; 2 1] × [3; 2] = [12; 8] = 4 × [3; 2], which has the form A · v = λ · v. Therefore (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A. Given a matrix A, how can we calculate the eigenvectors and eigenvalues of A?

Calculating eigenvectors & eigenvalues Given A · v = λ · v, we have A · v − λ · I · v = 0, i.e. (A − λ · I) · v = 0. Finding the roots of |A − λ · I| = 0 gives the eigenvalues, and for each of these eigenvalues there is a corresponding eigenvector. Example …

Calculating eigenvectors & eigenvalues If A = [0 1; -2 -3], then |A − λ · I| = |[−λ 1; −2 −3−λ]| = λ^2 + 3λ + 2 = 0. This gives us 2 eigenvalues: λ1 = −1 and λ2 = −2.
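In practice the roots are rarely found by hand; a sketch of the same computation with NumPy's eigen-solver:

    import numpy as np

    A = np.array([[0.0, 1.0],
                  [-2.0, -3.0]])

    # Roots of the characteristic polynomial lambda^2 + 3*lambda + 2 = 0.
    eigenvalues, eigenvectors = np.linalg.eig(A)

    print(eigenvalues)    # -1 and -2 (the order is not guaranteed)
    print(eigenvectors)   # one unit-length eigenvector per column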

Properties of eigenvectors and eigenvalues Note that irrespective of how much we scale (3,2) by, the result is always 4 times the scaled vector: any non-zero multiple of an eigenvector is an eigenvector with the same eigenvalue. Eigenvectors are defined only for square matrices, and not every square matrix has (real) eigenvectors. An n×n matrix has at most n linearly independent eigenvectors.

Roadmap Principal Components Analysis Clustering ANOVA

PCA Principal components analysis (PCA) is a technique that can be used to simplify a dataset. It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. PCA can be used to reduce dimensionality by eliminating the later principal components.

PCA By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset; the eigenvector with the largest eigenvalue is the first principal component. PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and in finding patterns in data of high dimension.

PCA process –STEP 1 Subtract the mean from each of the data dimensions: all the x values have the mean of x subtracted, and all the y values have the mean of y subtracted. This produces a data set whose mean is zero. Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations; the variance and covariance values themselves are not affected by the mean value.

PCA process –STEP 1 DATA (x, y): (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9). ZERO MEAN DATA (x, y): (.69, .49), (-1.31, -1.21), (.39, .99), (.09, .29), (1.29, 1.09), (.49, .79), (.19, -.31), (-.81, -.81), (-.31, -.31), (-.71, -1.01).

PCA process –STEP 1

PCA process –STEP 2 Calculate the covariance matrix: cov = [.616555556 .615444444; .615444444 .716555556]. Since the non-diagonal elements of this covariance matrix are positive, we should expect that the x and y variables increase together.

PCA process –STEP 3 Calculate the eigenvectors and eigenvalues of the covariance matrix: eigenvalues = (.0490833989, 1.28402771); eigenvectors = [-.735178656 -.677873399; .677873399 -.735178656] (one eigenvector per column).
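Steps 1 to 3 can be reproduced with a short NumPy sketch on the same ten points (the eigenvector signs and ordering returned by the solver may differ from the values above):

    import numpy as np

    # The ten (x, y) points from STEP 1.
    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    zero_mean = data - data.mean(axis=0)            # STEP 1: subtract the mean
    cov = np.cov(zero_mean, rowvar=False)           # STEP 2: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eig(cov)  # STEP 3: eigen-decomposition

    print(cov)           # approx. [.6166 .6154; .6154 .7166]
    print(eigenvalues)   # approx. 0.0490833989 and 1.28402771
    print(eigenvectors)  # one eigenvector per column; signs/order may differ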

PCA process –STEP 3 The eigenvectors are plotted as diagonal dotted lines on the plot. Note that they are perpendicular to each other, and that one of them goes through the middle of the points, like drawing a line of best fit. The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line but are offset from it by some amount.

PCA process –STEP 4 Reduce dimensionality and form the feature vector. The eigenvector with the highest eigenvalue is the principal component of the data set; in our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data. Once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

PCA process –STEP 4 Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much. With n dimensions in your data you calculate n eigenvectors and eigenvalues, choose only the first p eigenvectors, and the final data set has only p dimensions.

PCA process –STEP 4 Feature Vector: FeatureVector = (eig1 eig2 eig3 … eign). We can either form a feature vector with both of the eigenvectors, [-.677873399 -.735178656; -.735178656 .677873399], or we can choose to leave out the smaller, less significant component and keep only a single column, [-.677873399; -.735178656].

PCA process –STEP 5 Deriving the new data: FinalData = RowFeatureVector × RowZeroMeanData. RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top. RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in the columns, with each row holding a separate dimension.

PCA process –STEP 5 (Diagram: SVD view of the same factorization; the data matrix of samples × variables is decomposed into factor matrices U, R and V^T, with the leading factors significant and the remaining factors noise.)

PCA process –STEP 5 FinalData is the final data set, with data items in columns and dimensions along rows. What will this give us? It will give us the original data solely in terms of the vectors we chose. We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors.

PCA process –STEP 5 FinalData transposed (dimensions along columns), (x, y): (-.827970186, -.175115307), (1.77758033, .142857227), (-.992197494, .384374989), (-.274210416, .130417207), (-1.67580142, -.209498461), (-.912949103, .175282444), (.0991094375, -.349824698), (1.14457216, .0464172582), (.438046137, .0177646297), (1.22382056, -.162675287).
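The whole of STEP 5 is one matrix multiplication; a self-contained NumPy sketch (the signs of the projected coordinates depend on the eigenvector signs returned by the solver):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    zero_mean = data - data.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eig(np.cov(zero_mean, rowvar=False))

    order = np.argsort(eigenvalues)[::-1]          # most significant eigenvector first
    row_feature_vector = eigenvectors[:, order].T  # eigenvectors in the rows
    row_zero_mean_data = zero_mean.T               # data items in the columns

    final_data = row_feature_vector @ row_zero_mean_data
    print(final_data.T)   # one row per data item; matches the table above up to sign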

PCA process –STEP 5

Reconstruction of original Data If we reduced the dimensionality then, obviously, when reconstructing the data we lose those dimensions we chose to discard. In our example let us assume that we kept only the transformed x dimension, i.e. the first principal component…

Reconstruction of original Data x: -.827970186, 1.77758033, -.992197494, -.274210416, -1.67580142, -.912949103, .0991094375, 1.14457216, .438046137, 1.22382056
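A sketch of the reconstruction from that single retained component (re-adding the mean that was subtracted in STEP 1):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    mean = data.mean(axis=0)
    zero_mean = data - mean
    eigenvalues, eigenvectors = np.linalg.eig(np.cov(zero_mean, rowvar=False))

    top = eigenvectors[:, [np.argmax(eigenvalues)]]  # keep only the main component (p = 1)
    reduced = top.T @ zero_mean.T                    # 1-D representation of each point

    reconstructed = (top @ reduced).T + mean         # project back and re-add the mean
    print(reconstructed)  # points lie on the best-fit line; the minor component is lost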

Roadmap Principal Components Analysis Clustering ANOVA

What is Cluster Analysis? Cluster: a collection of data objects that are similar to the objects in the same cluster (intraclass similarity) and dissimilar to the objects in other clusters (interclass dissimilarity). Cluster analysis is a statistical method for grouping a set of data objects into clusters; a good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity. Clustering is unsupervised classification. It can be used as a stand-alone tool or as a preprocessing step for other algorithms.

Group objects according to their similarity Cluster: a set of objects that are similar to each other and separated from the other objects. Example: the green and red data points were generated from two different normal distributions.

Clustering an expression data matrix Objects and experiments/samples are given as the row and column vectors of an expression data matrix (o objects × n experiments). Clustering may be applied either to the objects (regarded as vectors in R^n) or to the experiments (regarded as vectors in R^o).

Pattern matrix → Proximity matrix Pattern matrix (n×p): n = number of objects, p = number of attributes. Proximity matrix (n×n): d(i,j) = difference/dissimilarity between objects i and j.

Proximity matrix Clustering methods require that an index of proximity, or alikeness, or affinity, or association be established between pairs of patterns. A proximity index is either a similarity or a dissimilarity. The crucial problem in identifying clusters in data is to specify what proximity is and how to measure it.

Proximity indices A proximity index between the ith and kth patterns is denoted d(i,k) and must satisfy the following three properties: 1. (a) for a dissimilarity: d(i,i) = 0, for all i; (b) for a similarity: d(i,i) ≥ max_k d(i,k), for all i. 2. d(i,k) = d(k,i), for all (i,k). 3. d(i,k) ≥ 0, for all (i,k).

Different proximity measures r = 2 (Euclidean distance): (4^2 + 2^2)^(1/2) = 4.472; r = 1 (Manhattan distance): 4 + 2 = 6; r → ∞ (“sup” distance): max{4, 2} = 4.
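The three special cases of the Minkowski metric can be checked with SciPy (a sketch; the two points below are chosen so that their difference is (4, 2), matching the example):

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock, chebyshev

    a = np.array([0.0, 0.0])
    b = np.array([4.0, 2.0])      # difference vector (4, 2), as in the example above

    print(euclidean(a, b))        # r = 2: sqrt(4^2 + 2^2) ≈ 4.472
    print(cityblock(a, b))        # r = 1 (Manhattan): 4 + 2 = 6
    print(chebyshev(a, b))        # r -> infinity ("sup" distance): max(4, 2) = 4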

K-Means Clustering Why it is called ‘K-means’ clustering: K points are used to represent the clustering result, and each point corresponds to the centre (mean) of a cluster. Each data point is assigned to the cluster with the closest center point. The number of clusters, K, must be specified. The basic algorithm follows.

The K-Means Clustering Method Given k, the k-means algorithm partitions the objects into k non-empty subsets as follows: 1. Arbitrarily choose k points as the initial centers (seed points). 2. Assign each object to the cluster with the nearest seed point (center). 3. Calculate the mean of each cluster and update its seed point. 4. Go back to Step 2; stop when no assignment changes.

The K-Means Clustering Method (cntd) The basic step of k-means clustering is simple. Iterate until stable (no object changes group): determine the centroid coordinates, determine the distance of each object to the centroids, and group each object with the centroid at minimum distance.
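A minimal NumPy sketch of this loop (an illustrative implementation, not the code used in the lecture; the example points are made up):

    import numpy as np

    def k_means(points, k, n_iter=100, seed=0):
        # Minimal sketch of the algorithm described above; empty clusters and
        # other edge cases are not handled.
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]  # arbitrary start
        for _ in range(n_iter):
            # Distance of each object to each centroid, then assign to the nearest one.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update each centroid to the mean of its cluster.
            new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):   # stable: no centroid moved
                break
            centers = new_centers
        return labels, centers

    pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    print(k_means(pts, k=2))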

The K-Means Clustering Method (cntd)

The K-Means Clustering Results (Example figure, K = 2: arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign; repeat until stable.)

Weaknesses of the K-Means Method Unable to handle noisy data and outliers: very large or very small values can skew the mean. Not suitable for discovering clusters with non-convex shapes.

Hierarchical Clustering Start with every data point in a separate cluster Keep merging the most similar pairs of data points/clusters until we have one big cluster left This is called a bottom-up or agglomerative method

Hierarchical Clustering (cont.) This produces a binary tree or dendrogram. The final cluster is the root and each data item is a leaf. The height of the bars indicates how close the items are.

Hierarchical Clustering Demo

Levels of Clustering

Linkage in Hierarchical Clustering We already know about distance measures between data items, but what about between a data item and a cluster or between two clusters? We just treat a data point as a cluster with a single item, so our only problem is to define a linkage method between clusters As usual, there are lots of choices…

Average Linkage Definition Each cluster ci is associated with a mean vector μi, which is the mean of all the data items in the cluster. The distance between two clusters ci and cj is then just d(μi, μj). This is somewhat non-standard: this method is usually referred to as centroid linkage, and average linkage is defined as the average of all pairwise distances between points in the two clusters.

Single Linkage The minimum of all pairwise distances between points in the two clusters Tends to produce long, “loose” clusters

Complete Linkage The maximum of all pairwise distances between points in the two clusters Tends to produce very tight clusters

Distances between clusters (summary) Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters. Complete linkage: largest distance between points Average linkage: average distance between points Single linkage: smallest distance between points Centroid: distance between centroids Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long stretched clusters.
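A hedged SciPy sketch of agglomerative clustering with these linkage criteria (the points are invented for illustration; scipy.cluster.hierarchy.dendrogram could draw the resulting tree):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    pts = np.array([[1.0, 1.0], [1.5, 1.1], [5.0, 5.0], [5.4, 5.2], [9.0, 1.0]])

    Z_single   = linkage(pts, method='single')    # smallest pairwise distance
    Z_complete = linkage(pts, method='complete')  # largest pairwise distance
    Z_average  = linkage(pts, method='average')   # average pairwise distance
    Z_centroid = linkage(pts, method='centroid')  # distance between centroids

    # Cut the complete-linkage dendrogram into three clusters.
    print(fcluster(Z_complete, t=3, criterion='maxclust'))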

EXAMPLE (Figure: hierarchical clustering of objects A, B, C, D, E, merged at levels 1 to 5.)

More on Hierarchical Clustering Methods Major advantage: conceptually very simple and easy to implement, which makes it the most commonly used technique. Major weaknesses of agglomerative clustering methods: they do not scale well (time complexity of at least O(n^2), where n is the total number of objects), and they can never undo what was done previously, so there is a high likelihood of getting stuck in local minima.

Roadmap Principal Components Analysis Clustering ANOVA

(M)ANOVA One-Way Analysis of Variance (ANOVA) takes a set of grouped data and determines whether the mean of a variable differs significantly between groups. Often there are multiple variables, and you are interested in determining whether the entire set of means differs from one group to the next; there is a multivariate version of analysis of variance (MANOVA) that can address that problem.
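A small sketch of the one-way case with SciPy (the group values are invented; the multivariate MANOVA case is not shown here):

    from scipy import stats

    # Hypothetical measurements of a single variable in three groups.
    group_a = [23.1, 25.4, 24.8, 26.0, 24.2]
    group_b = [27.5, 28.1, 26.9, 29.3, 27.7]
    group_c = [23.9, 24.5, 25.1, 23.8, 24.7]

    # One-way ANOVA: do the group means differ significantly?
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f_stat, p_value)   # a small p-value suggests the means are not all equal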