Unsupervised Learning

STT592-002: Intro. to Statistical Learning
Unsupervised Learning (Chapter 10)
Disclaimer: These slides are adapted from IOM 530: Intro. to Statistical Learning.

Outline
Principal Component Analysis (PCA)
What is Clustering?
K-Means Clustering
Hierarchical Clustering

Supervised vs. Unsupervised Learning
Supervised learning: both X and Y are known.
Unsupervised learning: only X is known.
[Figure panels: Supervised Learning, Unsupervised Learning]

Overview of Unsupervised Learning
We focus on two particular types of unsupervised learning:
Principal Components Analysis (PCA): a tool for data visualization or for data pre-processing (dimension reduction) before supervised techniques are applied.
Clustering: a broad class of methods for discovering unknown subgroups in data.

Challenge of Unsupervised Learning
Unsupervised learning is more challenging than supervised learning.
There is no simple goal for the analysis, such as predicting a response, and no standard measure of success such as classification error or MSE.
It is usually carried out as part of exploratory data analysis.
The results are hard to assess because there is no ground truth to compare against.

Examples of Unsupervised Learning
E.g.: A cancer researcher might assay gene expression levels in 100 patients with breast cancer and look for subgroups among the breast cancer samples, or among the genes, in order to obtain a better understanding of the disease.
E.g.: An online shopping site might identify groups of shoppers with similar browsing and purchase histories, as well as items of interest within each group. An individual shopper can then be preferentially shown the items they are likely to be interested in, based on the purchase histories of similar shoppers.
E.g.: A search engine might choose which search results to display to a particular individual based on the click histories of other individuals with similar search patterns.

Principal Component Analysis (PCA)
Review Chap. 6.3.

PCA
Idea: given a large set of correlated variables, principal components summarize the set with a smaller number of representative variables that together explain most of the original variability.
Recall that PCA serves for:
Dimension reduction: data pre-processing before supervised techniques are applied
Lossy data compression
Feature extraction
Data visualization

PCA

PCA
Two commonly used definitions of PCA:
1. The orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized.
2. Equivalently, the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections.
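The two definitions can be written out for the first component. The notation below is an assumption, not taken from the slides: x_ij is the centered value of variable j for observation i, x_i is the i-th observation as a vector, and phi_1 = (phi_11, ..., phi_p1) is the first loading vector.

```latex
% Definition 1: maximize the variance of the projected data
\max_{\phi_{11},\dots,\phi_{p1}} \; \frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{j1}\,x_{ij}\Bigr)^{2}
\quad\text{subject to}\quad \sum_{j=1}^{p}\phi_{j1}^{2}=1 .

% Why definition 2 is equivalent: for any unit vector \phi_1 (Pythagoras)
\|x_i\|^{2} \;=\; (\phi_1^{\top}x_i)^{2} \;+\; \bigl\|x_i-(\phi_1^{\top}x_i)\,\phi_1\bigr\|^{2} .
```

Summing the second line over i, the left-hand side does not depend on phi_1, so making the projected variance (first term) as large as possible is the same as making the squared distance to the projection (second term) as small as possible.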

USArrests Example
Violent Crime Rates by US State
This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states. Also given is the percent of the population living in urban areas.
A data frame with 50 observations on 4 variables:
X1, Murder: Murder arrests (per 100,000)
X2, Assault: Assault arrests (per 100,000)
X3, UrbanPop: Percent urban population
X4, Rape: Rape arrests (per 100,000)
Q: How can we summarize (X1, X2, X3, X4) with a smaller number of representative variables that retain most of the original variability?
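As a quick look at the data before any PCA, a minimal sketch in R follows. It assumes only the USArrests data frame that ships with base R; the names col_means and col_vars are illustrative.

```r
# USArrests ships with base R (package "datasets"), so no download is needed.
data(USArrests)

print(dim(USArrests))    # observations (states) by variables
print(head(USArrests))   # first few states

# Per-variable means and variances: the variables are on very different scales.
col_means <- apply(USArrests, 2, mean)
col_vars  <- apply(USArrests, 2, var)
print(col_means)
print(col_vars)
```

The variances differ by orders of magnitude across the four variables, which matters later when deciding whether to standardize before PCA.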

PCA

PC loadings and scores
PC loading vectors are the directions in feature space along which the data vary the most.
PC scores are the projections of the observations along these directions.
PCA vs. LDA: both produce scores and loadings.
PCA: PC1, PC2, ..., up to min(n − 1, p) components, where p is the number of variables in X.
LDA: LD1, LD2, ..., LD_(c−1), where c is the number of classes of Y.
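The relationship between loadings and scores can be checked directly. This is a sketch, assuming PCA is fitted with base R's prcomp() on standardized variables; the names loadings, scores, and manual_scores are illustrative.

```r
# Assumption: PCA on standardized variables via base R's prcomp().
pr.out <- prcomp(USArrests, scale. = TRUE)

loadings <- pr.out$rotation   # 4 x 4: one column of loadings per component
scores   <- pr.out$x          # 50 x 4: one row of scores per state

# Each loading vector has unit length, and distinct vectors are orthogonal,
# so this cross-product is approximately the identity matrix.
print(round(crossprod(loadings), 10))

# Scores are the standardized data projected onto the loading vectors.
X <- scale(USArrests)
manual_scores <- X %*% loadings
print(all.equal(unname(manual_scores), unname(scores)))
```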

PCA calculation
Biplot axes: the top and right axes give the scale for the loadings.
Standardize each variable to mean 0 and SD = 1.
## PC scores for California:
temp = c( , , , )
pr.out_x1 = sum(temp * c( , , , )); pr.out_x1
pr.out_x2 = sum(temp * c( , , , )); pr.out_x2
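The score calculation for one state can be reproduced from the data rather than from typed-in numbers. The sketch below assumes that pr.out is the result of prcomp() on the standardized USArrests data; it reuses the names temp, pr.out_x1, and pr.out_x2 from the slide.

```r
pr.out <- prcomp(USArrests, scale. = TRUE)

# Standardize every variable to mean 0 and SD 1, then take California's row.
X    <- scale(USArrests)
temp <- X["California", ]

# A score is the sum of standardized values times the matching loadings.
pr.out_x1 <- sum(temp * pr.out$rotation[, 1])   # score on PC1
pr.out_x2 <- sum(temp * pr.out$rotation[, 2])   # score on PC2

# Compare with the scores stored by prcomp().
print(c(manual = pr.out_x1, prcomp = pr.out$x["California", 1]))
print(c(manual = pr.out_x2, prcomp = pr.out$x["California", 2]))
```

The manual and stored values agree up to floating-point error. Their signs may be the opposite of those shown in a published figure, since each component is only defined up to a sign flip.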

PCA Biplot
The figure shows both the principal component scores and the loading vectors in a single biplot display.
States with large positive scores on the 1st component, such as California, Nevada and Florida, have high crime rates, while states like North Dakota, with negative scores on the 1st component, have low crime rates.
California also has a high score on the 2nd component, indicating a high level of urbanization, while the opposite is true for states like Mississippi.
States close to zero on both components, such as Indiana, have approximately average levels of both crime and urbanization.
1st component: serious crime. 2nd component: level of urbanization.

PCA Biplot
A biplot uses points to represent the scores of the observations on the principal components, and vectors to represent the coefficients (loadings) of the variables on the principal components.
Interpreting vectors:
Each vector points away from the origin in some direction.
A vector points in the direction that has the highest squared multiple correlation with the principal components.
The length of a vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.
1st component: serious crime. 2nd component: level of urbanization.
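A biplot of this kind can be drawn with base R. This is a sketch, assuming a prcomp() fit on standardized variables; the object name flipped and the cex value are illustrative choices.

```r
pr.out <- prcomp(USArrests, scale. = TRUE)

# scale = 0 draws the arrows on the scale of the loadings.
biplot(pr.out, scale = 0, cex = 0.6)

# The signs returned by prcomp() may be the opposite of a published figure.
# Flipping the sign of both loadings and scores gives a mirror-image biplot
# that carries the same information.
flipped <- pr.out
flipped$rotation <- -pr.out$rotation
flipped$x        <- -pr.out$x
biplot(flipped, scale = 0, cex = 0.6)
```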

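More on PCA
Scaling of variables: In general, scale each variable to have standard deviation one before performing PCA; otherwise the variables with the largest variances dominate the components. However, scaling may be unnecessary, or even undesirable, when all variables are measured in the same units (e.g., gene expression data).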
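The effect of scaling shows up clearly on USArrests. The sketch below fits PCA with and without standardization using prcomp(); the names pr.scaled and pr.unscaled are illustrative.

```r
pr.scaled   <- prcomp(USArrests, scale. = TRUE)
pr.unscaled <- prcomp(USArrests, scale. = FALSE)

# Assault has by far the largest variance of the four variables.
print(round(apply(USArrests, 2, var), 1))

# Without scaling, nearly all of the PC1 weight falls on Assault.
print(round(abs(pr.unscaled$rotation[, 1]), 3))

# With scaling, the PC1 weight is spread more evenly across the variables.
print(round(abs(pr.scaled$rotation[, 1]), 3))
```

Absolute values are printed because the sign of a loading vector is arbitrary.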

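More on PCA
Uniqueness of the principal components: Each principal component loading vector is unique, up to a sign flip. Flipping the sign has no effect, because the line along which the loading vector points does not change.
The Proportion of Variance Explained (PVE): the PVE of the m-th principal component is the variance of its scores divided by the total variance of the centered variables. The PVEs of all components sum to one.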
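The PVE can be computed from the component standard deviations that prcomp() returns. This is a sketch on the standardized USArrests data; pr.var and pve are illustrative names.

```r
pr.out <- prcomp(USArrests, scale. = TRUE)

pr.var <- pr.out$sdev^2          # variance explained by each component
pve    <- pr.var / sum(pr.var)   # proportion of variance explained
print(round(pve, 3))

# A sign flip leaves the variance of the scores unchanged.
print(all.equal(var(pr.out$x[, 1]), var(-pr.out$x[, 1])))
```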

More on PCA
Deciding how many principal components to use: An n × p data matrix X has min(n − 1, p) distinct PCs.
Goal: choose the smallest number of PCs that explains a sizable amount of the variation in the data.
Use the scree plot.

More on PCA
Scree plot: find the elbow in the scree plot.
From Figure 10.4, one might conclude that a fair amount of variance is explained by the first two PCs, and that there is an elbow after the 2nd component. After all, the 3rd principal component explains less than 10% of the variance in the data, and the 4th explains less than half of that and so is essentially worthless.
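A scree plot and a cumulative PVE plot can be sketched in base R as follows. The axis labels and two-panel layout are illustrative choices, not taken from the slides.

```r
pr.out <- prcomp(USArrests, scale. = TRUE)
pve <- pr.out$sdev^2 / sum(pr.out$sdev^2)

# Left: scree plot (PVE per component). Right: cumulative PVE.
par(mfrow = c(1, 2))
plot(pve, type = "b", ylim = c(0, 1),
     xlab = "Principal Component", ylab = "Proportion of Variance Explained")
plot(cumsum(pve), type = "b", ylim = c(0, 1),
     xlab = "Principal Component", ylab = "Cumulative PVE")

print(round(cumsum(pve), 3))
```

Reading the elbow is a judgment call; the cumulative curve helps by showing how much variance a given number of components retains.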

Another Interpretation of Principal Components
Principal components provide low-dimensional linear surfaces that are closest to the observations.

The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations, using average squared Euclidean distance as the measure of closeness.

The first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance.
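The closest-plane property can be checked numerically. The sketch below assumes a prcomp() fit on standardized USArrests data; M, V, X_hat, and the comparison plane are illustrative choices.

```r
pr.out <- prcomp(USArrests, scale. = TRUE)
X <- scale(USArrests)
n <- nrow(X)

# Project onto the plane spanned by the first M = 2 loading vectors,
# then map back to the original 4-dimensional space.
M <- 2
V <- pr.out$rotation[, 1:M]
X_hat <- pr.out$x[, 1:M] %*% t(V)

# Total squared Euclidean distance from the observations to that plane.
resid_ss <- sum((X - X_hat)^2)

# It equals (n - 1) times the variance of the discarded components.
discarded <- (n - 1) * sum(pr.out$sdev[(M + 1):4]^2)
print(c(resid_ss = resid_ss, discarded = discarded))

# An arbitrary comparison plane, spanned by the first two original axes:
# projecting onto it discards columns 3 and 4, which leaves a larger distance.
other_ss <- sum(X[, 3:4]^2)
print(other_ss)
```

Comparing against a single alternative plane only illustrates the property; it does not prove that no other plane is closer.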

