1 Neighboring Feature Clustering Author: Z. Wang, W. Zheng, Y. Wang, J. Ford, F. Makedon, J. Pearlman Presenter: Prof. Fillia Makedon Dartmouth College
2 What is Neighboring Feature Clustering Given an m × n matrix M,where m denotes m samples and n denotes n (ordered) dimensional features, the goal is to find a intrinsic partition of the features based on their characteristics such that each cluster is a continuous piece of features. We assume there is a natural ordering of features that has relevance to the problem being solved – E.g., in spectral datasets, such characteristics could be correlations – For example, if we decide feature 1 and 10 belong to a cluster, feature 2 to 9 should also belong to that cluster. – ZHIFENG: PLEASE IMPROVE THIS SLIDE, PROVIDE AN INTUITIVE DIAGRAM
3 MR spectral features and DNA Copy Number??? MR spectral features are highly redundant suggesting that the data lie in some low-dimensional space (ZHIFENG: WHAT DO YOU MEAN BY LOW DIMENSIONAL SPACE - CLARIFY) Neighboring spectral features of MR spectra are highly correlated Using NFC, we can partition the features into clusters. A cluster can be represented by a single feature, hence reducing the dimensionality. This idea can be applied to DNA copy number analysis too. Zhifeng: Yuhang said these two are not related!! Please explain how these are related.
4 Use MDL method to solve NFC Reduce NFC into a one dimensional piece-wise linear approximation problem. Given a sequence of n one dimensional points, find the optimal step function-like line segments that can be fitted to the points Fig. 1. Piecewise linear approximation [3] [4] is usually 2D. Here we use its concept for a 1D situation. We use minimum description length (MDL) method [2] to solve this reduced problem. Zhifeng: define and explain MDL
5 Minimum Description Length (MDL) Zhifeng, please provide a slide to define this EXPLAIN HOW THE TRANSFORMATION IS DONE (AS IN [1]) TO GIVE 1D piece-wise linear approximation. Represent all the points by two line segments. Trade-off between approximation accuracy and number of line segments. A compromise can be made using MDL. ??? Zhifeng: it is all very cryptic, pieces of explanation are missing!
6 Outline The problem – Spectral data – The abstract problem Related work – HMM based, partial sum based, maximum likelihood based Our approach – Problem reduced to 1D linear approximation – MDL approach
7 Reducing NFC to 1D Piece-Wise Linear Approximation Problem 1 Let correlation coefficient matrix of M be denoted as C. LetC ∗ be the strictly upper triangular matrix derived from 1−|C| (entries near 0 imply high correlation between the corresponding two features). For features from i to j (1 ≤ i ≤ j ≤ n), the submatrix C ∗ i:j,i:j depicts pairwise correlations. We use its entries (excluding lower and diagonal entries) as the points to be explained by a line in the 1D piece-wise linear approximation problem. The objective is to find the optimal piece-wise line segments to fit those created points. Points near 0 mean high correlation. We need to force high correlations among a set. Thus the points are always approximated by 0.
8 example For example, suppose we have a set with points all around 0.3. In piece-wise linear approximation, it is better to use 0.3 as the approximation. However in NFC, we should penalize the points that stray away from 0. So we still use 0 as the approximation. Unlike usual 1D piece-wise linear approximation problem, the reduced problem has dynamic points (because they are created on the fly). Zhifeng: provide figure to illustrate above example
9 Spectral data MR spectral data – High dimensional data points – Spectral features are highly redundant (high correlation) – Find neighboring features with high correlation in a spectral dataset, such as a MR spectral dataset. frequeny intensit Fig. 1 high dimensional data points Fig. 2 correlation coefficient matrix Both axes are the features or the number of dimensions
10 Problem Finding a low-dimensional space - – zhifeng: define low dimensional space – Curse of dimensionality We extract an abstract problem: Neighboring Feature Clustering (NFC) – Features are ordered. Each cluster contains only neighboring features. – Find an optimal clusters according to certain criteria
11 Another application (with variation) Array Comparative Genomic Hybridization to detect copy number alterations. aCGH data are noisy – Smoothing – Segmentation Fig. 3 aCGH technology Fig. 4 aCGH data (smoothed). The X axis is log ratio Fig.5 aCGH data(segmented). The X axis is log ratio
12 Related work An algorithm trying to solve a similar problem – Baumgartner, et al, “Unsupervised feature dimension reduction for classification of MR spectra”, Magnetic Resonance Image, 22: ,2004 An extensive literature on the reduced problem – The, et al, “On the detection of dominant points on digital curves”, IEEE PAMI, 11(8) , 1989 – Statistical methods… Fig. 6 1D piece-wise approximation
13 Related work: statistical methods HMM based – Fridlyand, et al, “Hidden Markov models approach to the analysis of array CGH data”, J. Multivariate Anal., 90, Partial sum based – Lipson etc., ‘”Efficient Calculation of Interval Scores for DNA copy Number Data analysis”, RECOMB 2005 Maximum likelihood based – Picard, etc., “A statistical approach for array CGH data analysis”, BMC Bioinformatics, 6:27,2005
14 Framework of the method proposed 3. MDL code length (revised) frequency intensity 1. Correlation coefficient matrix 13 2 n-1 n C 1, 2 C 2, 3 … C 3, n-1 C 2, n-1 C 1, n-1 C n- 1,n C 3, n C 2, n C 1,2 2. For each pair of features 4. Code length matrix 5. Shortest path (dynamic programming) intensity frequency Fig. 7 our approach
15 Minimum Description Length Information Criterion – A model selection scheme – Common information criteria are Akaike Information Criterion(AIC), Bayesian Information Criterion (BIC), and Minimum Description Length (MDL) – MDL is to encode the model and the data given the model. The balance is achieved in terms of code length Fig. 6 1D piece-wise approximation
16 Encoding model and data given model For each pair (n*(n-1)/2 in total) of features – Encoding model Cluster boundary, Gaussian parameter (standard deviation) – Encoding data given model d Fig.8 encoding the data given model for each feature pair
17 Minimize the code length Code length matrix Shortest path – Recursive function – Dynamic programming 13 2 n-1 n C 1,2 C 2,3 … C 3,n-1 C 2,n-1 C 1,n-1 C n-1,n C 3,n C 2,n C 1,2 Fig. 9 alternative representation of matrix C Fig. 10 Recursive function for the shortest path
18 Results We test on simulated data. Fig. 11 the revised correlation matrix and the computed code length matrix