Unsupervised learning
Outline: supervised and unsupervised learning; general considerations; clustering; dimension reduction.
The lecture is partly based on:
Hastie, Tibshirani & Friedman (2009). The Elements of Statistical Learning. Springer. Chapter 2.
Duda, R. O., Hart, P. E. & Stork, D. G. (2000). Pattern Classification (2nd ed.). John Wiley & Sons. Chapter 2.
Dudoit, S., Fridlyand, J. & Speed, T. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. JASA.
General considerations
             Control 1  Control 2  ...  Control 25  Disease 1  Disease 2  ...  Disease 40
Gene 1         9.25       9.77     ...     9.4        8.58       5.62     ...     6.88
Gene 2         6.99       5.85     ...     5          5.14       5.43     ...     5.01
Gene 3         4.55       5.3      ...     4.73       3.66       4.27     ...     4.11
Gene 4         7.04       7.16     ...     6.47       6.79       6.87     ...     6.45
Gene 5         2.84       3.21     ...     3.2        3.06       3.26     ...     3.15
Gene 6         6.08       6.26     ...     7.19       6.12       5.93     ...     6.44
Gene 7         4          4.41     ...     4.22       4.42       4.09     ...     4.26
Gene 8         4.01       4.15     ...     3.45       3.77       3.55     ...     3.82
Gene 9         6.37       7.2      ...     8.14       5.13       7.06     ...     7.27
Gene 10        2.91       3.04     ...     3.03       2.83       3.86     ...     2.89
Gene 11        3.71       3.79     ...     3.39       5.15       6.23     ...     4.44
...
Gene 50000     3.65       3.73     ...     3.8        3.87       3.76     ...     3.62
This is the common structure of microarray gene expression data from a simple cross-sectional case-control design. Data from other high-throughput technologies are often similar.
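Such a data set is typically held in memory as a features-by-samples matrix. The snippet below is a minimal sketch of loading it with pandas; the file name expression_matrix.txt and the Control/Disease column-naming convention are hypothetical, and the point is only to show the N << p shape of the data.

```python
import pandas as pd

# Hypothetical tab-delimited file: rows = genes, columns = Control 1 ... Disease 40.
expr = pd.read_csv("expression_matrix.txt", sep="\t", index_col=0)

# Recover group labels from the (assumed) column-naming convention.
groups = ["Control" if c.startswith("Control") else "Disease" for c in expr.columns]

print(expr.shape)                        # e.g. (50000, 65): p = 50000 features, N = 65 samples
print(pd.Series(groups).value_counts())  # 25 controls, 40 disease samples
```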
Supervised learning. In supervised learning the problem is well-defined: given a set of observations {(x_i, y_i)}, estimate the conditional density Pr(Y|X). Usually the goal is to find the parameters that minimize the expected classification error, or some loss derived from it. Objective criteria exist to measure the success of a supervised learning mechanism: the error rate on test (or cross-validation) data. Typical tasks: disease classification, predicting survival, predicting cost, ... General considerations
Unsupervised learning. There is no output variable; all we observe is a set {x_i}. The goal is to infer Pr(X) and/or some of its properties. When the dimension is low, nonparametric density estimation is possible; when the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density. There are no objective criteria from the data itself; to justify a result we rely on: heuristic arguments, external information, a reasonable explanation of the outcome. Typical goals: find co-regulated sets, infer hidden regulation signals, infer regulatory networks, ... General considerations
Correlation structure. There are always correlations between features (genes, proteins, metabolites, ...) in biological data, caused by intrinsic biological interactions and regulation. The problem is: (1) we don't know what the correlation structure is (in some cases we have some idea, e.g. DNA); (2) we cannot reliably estimate it, because the dimension is too high and there is not enough data. General considerations
Curse of Dimensionality (Bellman, R.E., 1961). In p dimensions, to get a hypercube with volume r (within the unit cube), the edge length needed is r^{1/p}. In 10 dimensions, to capture 1% of the data for a local average, we need 63% of the range of each input variable, since 0.01^{1/10} ≈ 0.63. General considerations
Curse of Dimensionality. In other words, to get a "dense" sample: if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions. In high dimensions the data are always sparse and do not support density estimation. More data points are close to the boundary than to any other data point, so prediction is much harder near the edge of the training sample. General considerations
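A quick numerical check of these two claims (a minimal sketch; the values 1%, 10 dimensions, and N = 100 are the ones used above):

```python
# Edge length of a sub-hypercube capturing a fraction r of the unit cube in p dimensions.
r, p = 0.01, 10
edge = r ** (1 / p)
print(f"edge length to capture {r:.0%} of the data in {p} dimensions: {edge:.2f}")  # ~0.63

# Sample size needed to keep the 1-D density of 100 points per axis.
n_1d = 100
print(f"samples needed in {p} dimensions: {n_1d ** p:.0e}")  # 1e+20
```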
Curse of Dimensionality. We have talked about the curse of dimensionality in the sense of density estimation. In a classification problem we do not necessarily need density estimation: a generative model cares about the class density functions, while a discriminative model cares about the boundary. Example: classifying belt fish and carp. Looking at the length/width ratio is enough; why should we care about other variables such as the shape of the fins or the number of teeth? General considerations
N << p problem. We talk about the "curse of dimensionality" when N is not >> p. In bioinformatics, N is usually at most in the hundreds while p runs into the thousands or more. How to deal with this N << p issue? (1) Dramatically reduce p before model building: filter genes based on variation (see the sketch below), normal/disease test statistics, projection, ... (2) Use methods that are resistant to large numbers of nuisance variables: support vector machines, random forests, boosting, ... (3) Borrow other information: functional annotation, meta-analysis, ... General considerations
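As one example of reducing p before model building, a common heuristic is to keep only the most variable features. This is a minimal sketch, assuming expr is a genes-by-samples pandas DataFrame like the one loaded earlier; the cutoff of 2,000 genes is an arbitrary illustration.

```python
import pandas as pd

def filter_by_variance(expr: pd.DataFrame, n_keep: int = 2000) -> pd.DataFrame:
    """Keep the n_keep rows (genes) with the largest variance across samples."""
    variances = expr.var(axis=1)
    top_genes = variances.nlargest(n_keep).index
    return expr.loc[top_genes]

# Usage (with expr loaded as above):
# reduced = filter_by_variance(expr, n_keep=2000)
```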
A typical workflow:
(1) Obtain high-throughput data.
(2) Unsupervised learning (dimension reduction/clustering) to show that samples from different treatments are indeed separated, and to identify any interesting pattern.
(3) Feature selection based on testing: find features that are differentially expressed between treatments; FDR is used here.
(4) Experimental validation of the selected features, using more reliable biological techniques (e.g. real-time PCR is used to validate microarray expression data).
(5) Classification model building.
(6) From an independent group of samples, measure the feature levels using a reliable technique.
(7) Find the sensitivity/specificity of the model using the independent data.
Clustering. Finding features/samples that are similar. Can tolerate n < p: irrelevant features contribute random noise that shouldn't change strong clusters. Some false clusters may be due to noise, but their size should be limited.
Hierarchical clustering Agglomerative: build tree by joining nodes; Divisive: build tree by dividing groups of objects.
Example data: Hierarchical clustering
Single linkage: the distance between any two nodes (clusters) is the nearest-neighbor (minimum) distance between their members. Hierarchical clustering
Complete linkage: the distance between any two nodes is the farthest-neighbor (maximum) distance between their members. Average linkage: the distance between any two nodes is the average distance between their members. Hierarchical clustering
Comments: hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height. The complete linkage method favors compact, ball-shaped clusters; the single linkage method favors chain-shaped clusters; average linkage is somewhere in between. Hierarchical clustering
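A minimal sketch of agglomerative clustering with the three linkage rules, using SciPy on synthetic two-group data (the data and the choice of two clusters are arbitrary illustrations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic data: two well-separated groups in 2-D.
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # agglomerative tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```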
Average linkage on microarray data. Rows: genes; columns: samples. Hierarchical clustering
Figure 14.12: Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data. Hierarchical clustering
Center-based clustering. Center-based methods have objective functions that define how good a solution is; the goal is to minimize the objective function. They are efficient for large/high-dimensional datasets. The clusters are assumed to be convex-shaped, and the cluster center is representative of the cluster. Some model-based clustering methods, e.g. Gaussian mixtures, are center-based.
Center-based clustering. K-means clustering: let C_1, ..., C_k be k disjoint clusters. The error is defined as the sum of the distances of the points from their cluster centers,
E = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} d(x_i, μ_j),
where μ_j is the center (mean) of cluster C_j and d is typically the squared Euclidean distance.
The k-means algorithm: (1) choose k initial centers; (2) assign each point to its nearest center; (3) recompute each center as the mean of the points assigned to it; (4) repeat steps 2-3 until the assignments no longer change. Center-based clustering
Understanding k-means as an optimization procedure: the objective function is
P(W, Q) = Σ_{l=1}^{k} Σ_{i=1}^{n} w_{i,l} d(x_i, q_l),
where W = (w_{i,l}) are the cluster membership indicators and Q = (q_1, ..., q_k) are the cluster centers. Minimize P(W, Q) subject to
Σ_{l=1}^{k} w_{i,l} = 1 and w_{i,l} ∈ {0, 1} for every i.
The solution is obtained by iteratively solving two sub-problems: when Q is fixed, P(W, Q) is minimized if and only if each point is assigned to its closest center, i.e. w_{i,l} = 1 exactly when d(x_i, q_l) ≤ d(x_i, q_t) for all t; when W is fixed, P(W, Q) is minimized if and only if each q_l is the mean of the points currently assigned to cluster l (for squared Euclidean distance). Center-based clustering
In terms of optimization, the k-means procedure is greedy: every iteration decreases the value of the objective function, and the algorithm converges to a local minimum after a finite number of iterations. Results depend on the initialization values. The computational complexity is proportional to the size of the dataset, so it is efficient on large data. The clusters identified are mostly ball-shaped, and the method works only on numerical data.
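A minimal sketch of the two alternating steps described above, written directly in NumPy (squared Euclidean distance, random initialization; not a production implementation):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(n_iter):
        # Step 1 (Q fixed): assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2 (W fixed): recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # centers have converged
            break
        centers = new_centers
    return labels, centers
```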
K-means. Figure 14.8: Total within-cluster sum of squares for K-means clustering applied to the human tumor microarray data. How to decide the number of clusters? If there are truly k* groups, then for k < k*, increasing k keeps splitting heterogeneous clusters and the within-cluster dissimilarity drops sharply; for k > k*, some true groups are partitioned, and increasing k should not bring much improvement in within-cluster dissimilarity. So one looks for a kink in the curve. Center-based clustering
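A minimal sketch of this within-cluster sum of squares vs. k curve using scikit-learn (synthetic data with three true groups; KMeans.inertia_ is the total within-cluster sum of squares):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with 3 true groups.
X = np.vstack([rng.normal(c, 1, size=(50, 2)) for c in (0, 5, 10)])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # look for the kink where the drop levels off
```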
Automated selection of k? The x-means algorithm chooses among a family of models with different k using AIC/BIC, e.g.
BIC(M_j) = l̂_j(D) − (p_j / 2) log n,
where l̂_j(D) is the log-likelihood of the data D given the jth model, p_j is the number of parameters, and n is the sample size. We have to assume a model to get the likelihood; the convenient one is Gaussian.
Center-based clustering. Under the assumption of identical spherical Gaussians (n is the sample size, k is the number of centroids, d is the dimension, and μ_(i) is the centroid associated with x_i), the maximum-likelihood variance estimate is
σ̂² = (1 / (n − k)) Σ_i ||x_i − μ_(i)||²,
the maximized log-likelihood of the data is
l̂(D) = Σ_{j=1}^{k} n_j log(n_j / n) − (n d / 2) log(2π σ̂²) − (n − k) / 2,
with n_j the size of cluster j, and the number of parameters is
p = (k − 1) + d k + 1
(class probabilities + parameters for the means & the common variance).
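A minimal sketch that computes this spherical-Gaussian BIC for a k-means solution, following the formulas above (synthetic data; scikit-learn used only for the clustering):

```python
import numpy as np
from sklearn.cluster import KMeans

def spherical_gaussian_bic(X, labels, centers):
    n, d = X.shape
    k = centers.shape[0]
    resid = X - centers[labels]
    sigma2 = (resid ** 2).sum() / (n - k)        # pooled ML variance estimate
    n_j = np.bincount(labels, minlength=k)
    n_j = n_j[n_j > 0]                           # guard against empty clusters
    loglik = (n_j * np.log(n_j / n)).sum() \
             - n * d / 2 * np.log(2 * np.pi * sigma2) - (n - k) / 2
    p = (k - 1) + d * k + 1                      # class probs + means + variance
    return loglik - p / 2 * np.log(n)            # larger is better

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1, size=(60, 2)) for c in (0, 6, 12)])
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(spherical_gaussian_bic(X, km.labels_, km.cluster_centers_), 1))
```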
Dimension Reduction. The purposes of dimension reduction: data simplification; data visualization; noise reduction (if we can assume that only the dominating dimensions are signal); variable selection for prediction (in supervised learning).
PCA explains the variance-covariance structure among a set of random variables by a few linear combinations of the variables; it does not require normality!
The eigenvalues are the variance components: the variance of the kth principal component is the kth largest eigenvalue of the covariance matrix, Var(Y_k) = λ_k. The proportion of total variance explained by the kth PC is
λ_k / (λ_1 + λ_2 + ... + λ_p).
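A minimal sketch of these quantities computed from a sample covariance matrix (synthetic correlated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated variables

cov = np.cov(X, rowvar=False)                  # sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]        # eigenvalues, largest first
prop_explained = eigvals / eigvals.sum()       # proportion of variance per PC
print(np.round(prop_explained, 3))
print(np.round(np.cumsum(prop_explained), 3))  # cumulative proportion explained
```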
The geometrical interpretation of PCA:
PCA using the correlation matrix, instead of the covariance matrix? This is equivalent to first standardizing all the X variables. PCA
Using the correlation matrix avoids domination by one X variable due to scaling (unit changes), for example using inches instead of feet. PCA
Selecting the number of components? Based on the eigenvalues (% variation explained). Assumption: the small amount of variation explained by the lower-ranked PCs is noise.
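A minimal sketch combining the last two points: PCA on standardized variables (equivalent to using the correlation matrix) and choosing the number of components by the cumulative proportion of variance explained; the 90% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated variables

Xs = StandardScaler().fit_transform(X)   # standardize -> correlation-matrix PCA
pca = PCA().fit(Xs)

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.90) + 1)   # smallest m reaching 90%
print(np.round(cum_var, 3), "->", n_components, "components")
```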
Figure 14.21: The best rank-two linear approximation to the half-sphere data. The right panel shows the projected points with coordinates given by U_2 D_2, the first two principal components of the data. PCA
Factor Analysis. If we take the first several PCs that explain most of the variation in the data, we have one form of factor model:
X − μ = L F + ε,
where L is the loading matrix, F is an unobserved random vector (the latent variables, or factors), and ε is an unobserved random vector (noise).
Factor Analysis. Rotations in the m-dimensional subspace defined by the factors make the solution non-unique: for any orthogonal matrix T (with T T' = I),
L F = (L T)(T' F) = L* F*,
so the rotated loadings L* = L T and factors F* = T' F give exactly the same model. PCA gives one particular solution, as the vectors are sequentially selected; the maximum likelihood estimator is another solution.
Factor Analysis. As noted, rotations within the m-dimensional subspace don't change the overall amount of variation explained, so we rotate to make the results more interpretable:
Orthogonal simple factor rotation: rotate the orthogonal factors around the origin until the system is maximally aligned with the separate clusters of variables. Oblique simple structure rotation: allow the factors to become correlated; each factor is rotated individually to fit a cluster. Factor Analysis
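A minimal sketch of fitting a factor model and applying an orthogonal (varimax) rotation with scikit-learn; the rotation="varimax" option is assumed to be available (recent scikit-learn versions), and the data are synthetic with two latent factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Two latent factors driving six observed variables: X = F L' + noise.
F = rng.normal(size=(300, 2))                      # latent factors
L = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],  # loading matrix
              [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = F @ L.T + 0.3 * rng.normal(size=(300, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(np.round(fa.components_.T, 2))  # rotated loadings: one dominant factor per variable block
```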