1 Neighboring Feature Clustering
Authors: Z. Wang, W. Zheng, Y. Wang, J. Ford, F. Makedon, J. Pearlman
Presenter: Prof. Fillia Makedon, Dartmouth College

2 What is Neighboring Feature Clustering?
Given an m × n matrix M, where m is the number of samples and n the number of (ordered) features, the goal is to find an intrinsic partition of the features, based on their characteristics, such that each cluster is a contiguous block of features.
– We assume a natural ordering of the features that is relevant to the problem being solved; in spectral datasets, for example, the relevant characteristic is the correlation between neighboring features.
– Contiguity means that if features 1 and 10 are assigned to one cluster, features 2 through 9 must belong to that cluster as well; a concrete sketch follows below.
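To make the contiguity constraint concrete, here is a minimal sketch (illustrative names, not from the paper): a contiguous clustering of ordered features is fully described by the breakpoints where one cluster ends and the next begins.

    def clusters_from_breakpoints(n, breakpoints):
        # A contiguous clustering of features 1..n, given the indices where
        # one cluster ends; breakpoints=[3, 7], n=10 -> [(1, 3), (4, 7), (8, 10)].
        starts = [1] + [b + 1 for b in breakpoints]
        ends = list(breakpoints) + [n]
        return list(zip(starts, ends))

    # Features 1 and 10 share a cluster only if features 2..9 do too, i.e.,
    # only in the single-cluster partition with no breakpoints:
    assert clusters_from_breakpoints(10, []) == [(1, 10)]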

3 MR spectral features and DNA copy number
MR spectral features are highly redundant: neighboring features are highly correlated, so the data effectively occupy far fewer dimensions than n (a low-dimensional space).
Using NFC, we can partition the features into clusters; each cluster can then be represented by a single feature, reducing the dimensionality.
The same idea applies, with some variation, to DNA copy number analysis: aCGH measurements are ordered along the genome and contiguous runs share a common level, so clustering neighboring features corresponds to segmentation (see slide 11).

4 Using MDL to solve NFC
Reduce NFC to a one-dimensional piecewise linear approximation problem: given a sequence of n one-dimensional points, find the optimal step-function-like line segments that fit the points.
Piecewise linear approximation [3][4] is usually posed in 2D; here we use the same concept in a 1D setting (Fig. 1).
We use the minimum description length (MDL) method [2] to solve this reduced problem (MDL is defined on slide 15).

5 Minimum Description Length (MDL): the trade-off
After the transformation of [1] (detailed on slide 7), the problem becomes 1D piecewise linear approximation; in the example figure, all the points are represented by two line segments.
There is a trade-off between approximation accuracy and the number of line segments: more segments fit the points more closely but make the model more complex.
MDL makes this compromise explicit by minimizing the total code length of the model plus the data given the model (slide 15).

6 Outline
The problem
– Spectral data
– The abstract problem
Related work
– HMM-based, partial-sum-based, and maximum-likelihood-based methods
Our approach
– Problem reduced to 1D piecewise linear approximation
– MDL approach

7 Reducing NFC to a 1D Piecewise Linear Approximation Problem
Let C denote the correlation coefficient matrix of M, and let C* be the strictly upper triangular matrix derived from 1 − |C| (entries near 0 imply high correlation between the corresponding two features).
For features i through j (1 ≤ i ≤ j ≤ n), the submatrix C*_{i:j,i:j} captures their pairwise correlations. We use its entries (excluding diagonal and lower entries) as the points to be explained by one segment in the 1D piecewise approximation problem.
The objective is to find the optimal piecewise segments fitting those points. Since points near 0 mean high correlation, and we want to enforce high correlation within each cluster, the points are always approximated by the value 0. A sketch of this transformation follows below.
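A minimal sketch of the transformation (function names are illustrative, not the authors' code):

    import numpy as np

    def dissimilarity_matrix(M):
        # C* = strictly upper triangular part of 1 - |C|, where C is the
        # correlation coefficient matrix of the n features (columns of M).
        C = np.corrcoef(M, rowvar=False)
        return np.triu(1.0 - np.abs(C), k=1)    # near 0 <=> highly correlated

    def cluster_points(C_star, i, j):
        # The 1D points generated by candidate cluster [i, j] (0-based,
        # inclusive): the strictly upper triangular entries of the submatrix.
        sub = C_star[i:j + 1, i:j + 1]
        r, c = np.triu_indices(j - i + 1, k=1)
        return sub[r, c]                        # these are approximated by 0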

8 Example
Suppose a candidate cluster generates points all around 0.3. In ordinary piecewise linear approximation, the best fit would be a segment at 0.3. In NFC, however, we must penalize points that stray from 0, so we still use 0 as the approximation; a small worked version follows below.
Unlike the usual 1D piecewise linear approximation problem, the reduced problem has dynamic points: they are created on the fly from the submatrix of each candidate cluster.
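A small worked version of this example, assuming a squared-error cost (the actual method uses an MDL code length, but the effect is the same):

    import numpy as np

    pts = np.array([0.28, 0.31, 0.30, 0.29, 0.32])  # points from one candidate cluster
    cost_at_mean = np.sum((pts - pts.mean()) ** 2)  # ordinary fit at 0.3: ~0.001
    cost_at_zero = np.sum(pts ** 2)                 # NFC fit forced to 0:  ~0.45

The large cost at 0 is the intended penalty: these points correspond to |correlation| around 0.7, which is too weak to justify merging the features into one cluster.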

9 Spectral data
MR spectral data
– High-dimensional data points
– Spectral features are highly redundant (highly correlated)
– Goal: find neighboring features with high correlation in a spectral dataset, such as an MR spectral dataset
[Fig. 1: high-dimensional data points (axes: frequency vs. intensity). Fig. 2: correlation coefficient matrix; both axes index the features.]

10 Problem
Finding a low-dimensional space
– The spectral features are highly correlated, so the data can be described by far fewer dimensions than n
– This mitigates the curse of dimensionality
We extract an abstract problem: Neighboring Feature Clustering (NFC)
– Features are ordered; each cluster contains only neighboring features
– Find an optimal clustering according to certain criteria

11 Another application (with variation)
Array Comparative Genomic Hybridization (aCGH) detects DNA copy number alterations.
aCGH data are noisy, so analysis requires
– Smoothing
– Segmentation
[Fig. 3: aCGH technology. Fig. 4: aCGH data (smoothed); the X axis is the log ratio. Fig. 5: aCGH data (segmented); the X axis is the log ratio.]

12 Related work
An algorithm addressing a similar problem
– Baumgartner et al., "Unsupervised feature dimension reduction for classification of MR spectra", Magnetic Resonance Imaging, 22:251-256, 2004
An extensive literature on the reduced problem
– Teh et al., "On the detection of dominant points on digital curves", IEEE PAMI, 11(8):859-872, 1989
– Statistical methods (next slide)
[Fig. 6: 1D piecewise approximation]

13 Related work: statistical methods
HMM-based
– Fridlyand et al., "Hidden Markov models approach to the analysis of array CGH data", J. Multivariate Anal., 90:132-153, 2004
Partial-sum-based
– Lipson et al., "Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis", RECOMB 2005
Maximum-likelihood-based
– Picard et al., "A statistical approach for array CGH data analysis", BMC Bioinformatics, 6:27, 2005

14 Framework of the proposed method
1. Compute the correlation coefficient matrix of the spectra
2. For each pair of features (i, j), generate the points of the candidate cluster
3. Compute the (revised) MDL code length for each candidate
4. Assemble the code length matrix, with entries C_{1,2}, C_{2,3}, ..., C_{n-1,n} over boundaries 1, 2, ..., n
5. Find the shortest path through the matrix (dynamic programming)
[Fig. 7: our approach; the spectra panels have frequency and intensity axes]

15 Minimum Description Length
Information criteria
– A model selection scheme
– Common information criteria include the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Minimum Description Length (MDL)
– MDL encodes the model and the data given the model; the balance between fit and complexity is struck in terms of total code length
[Fig. 6: 1D piecewise approximation]
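In its standard two-part form (general MDL, not specific to this paper), the criterion is

    L(D, H) = L(H) + L(D | H)

where L(H) is the number of bits needed to encode the model (here: the cluster boundaries and the Gaussian parameters) and L(D | H) is the number of bits needed to encode the data given that model; the selected model minimizes the total.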

16 Encoding the model and the data given the model
For each pair of features (n(n-1)/2 pairs in total)
– Encoding the model: the cluster boundary and the Gaussian parameter (standard deviation)
– Encoding the data given the model: the deviations d of the generated points from the segment at 0 (a sketch of such a code length follows below)
[Fig. 8: encoding the data given the model for each feature pair]
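A minimal sketch of such a code length, assuming the points are encoded with a zero-mean Gaussian whose standard deviation is estimated from the data (the paper's exact constants and quantization may differ):

    import numpy as np

    def pair_code_length(points, n_features, param_bits=32):
        # Two-part MDL code length for one candidate cluster.
        # Model: one boundary position (log2 n bits) + one Gaussian sigma.
        # Data given model: Gaussian negative log-likelihood, in bits,
        # using the zero-mean ML estimate of sigma.
        sigma = max(np.sqrt(np.mean(points ** 2)), 1e-12)
        model_bits = np.log2(n_features) + param_bits
        data_bits = 0.5 * len(points) * np.log2(2 * np.pi * np.e * sigma ** 2)
        return model_bits + data_bits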

17 Minimizing the code length
Build the code length matrix: entry C_{i,j} is the code length of the candidate cluster spanning features i through j.
Interpret the matrix as a graph whose nodes are the boundaries 1, 2, ..., n and whose edge (i, j) carries weight C_{i,j}; the optimal clustering is the shortest path from the first node to the last.
– Recursive formulation
– Solved by dynamic programming (a sketch follows below)
[Fig. 9: alternative (graph) representation of matrix C. Fig. 10: recursive function for the shortest path]
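A minimal sketch of the dynamic program (an assumed formulation of the shortest path, not the authors' exact recursion): let best[j] be the minimum total code length of clustering features 1..j; then best[j] = min over i < j of best[i] + L(i, j), where L(i, j) is the code length of making features i+1..j one cluster.

    import numpy as np

    def optimal_clustering(L):
        # L is (n+1) x (n+1); L[i, j] = code length of the cluster (i, j].
        n = L.shape[0] - 1
        best = np.full(n + 1, np.inf)
        prev = np.zeros(n + 1, dtype=int)
        best[0] = 0.0
        for j in range(1, n + 1):
            for i in range(j):
                if best[i] + L[i, j] < best[j]:
                    best[j], prev[j] = best[i] + L[i, j], i
        # Walk the back-pointers to recover the contiguous clusters.
        clusters, j = [], n
        while j > 0:
            clusters.append((prev[j] + 1, j))   # 1-based, inclusive
            j = prev[j]
        return best[n], clusters[::-1]

This is the standard O(n^2) shortest path in a DAG whose n+1 nodes are the cluster boundaries.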

18 Results
We tested the method on simulated data.
[Fig. 11: the revised correlation matrix and the computed code length matrix]

