Download presentation
Presentation is loading. Please wait.
1
An Unsupervised Learning Approach for Overlapping Co-clustering Machine Learning Project Presentation Rohit Gupta and Varun Chandola {rohit,chandola}@cs.umn.edu
2
Outline Introduction to Clustering Description of Application Domain From Traditional Clustering to Overlapping Co-clustering Current State of Art A Frequent Itemsets Based Solution An Alternate Minimization Based Solution Application to Gene Expression Data Experimental Results Conclusions and Future Directions
3
Clustering Clustering is an unsupervised machine learning technique Uses unlabeled samples In the simplest form – determine groups (clusters) of data objects such that the objects in one cluster are similar to each other and dissimilar to objects in other clusters Where each data object is a set of attributes (or features) with a definite notion of proximity Most traditional clustering algorithms Are partitional in nature. Assign a data object to exactly one cluster Perform clustering along one dimension
4
Application Domains Gene Expression Data Genes vs. Experimental Conditions Find similar genes based on their expression values for different experimental conditions Each cluster would represent a potential functional module in the organism Text Documents Data Documents vs. Words Movie Recommendation Systems Users vs. Movies
5
Overlapping Clustering Also known as soft clustering, fuzzy clustering A data object can be assigned to more than one cluster Motivation is that many real world data sets have inherently overlapping clusters A gene can be a part of multiple functional modules (clusters)
6
Co-clustering Co-clustering is the problem of simultaneously clustering rows and columns of a data matrix Also known as bi-clustering, subspace clustering, bi-dimensional clustering, simultaneous clustering, block clustering The resulting clusters are blocks in the input data matrix These blocks often represent more coherent and meaningful clusters Only a subset of genes participate in any cellular process of interest that is active for only a subset of conditions
7
Overlapping Co-clustering overlapsCo-clusters [Bergmann et al, 2003] Segal et al, 2003, Banerjee et al, 2005Dhillon et al, 2003, Cho et al 2004, Banerjee et al, 2005 Overlapping Co-clusters
8
Current State of Art Traditional Clustering – Numerous algorithms like k-means Overlapping Clustering – Probabilistic Relational Model Based Approach by Segal et al and Banerjee et al Co-clustering – Dhillon et al for gene expression data and document clustering. (Banerjee et al provided a general framework using a general class of Bregman distortion functions) Overlapping co-clustering Iterative Signature Algorithm (ISA) by Bergmann et al for gene expression data -Uses an Alternate Minimization technique -Involves thresholding after every iteration We propose a more formal framework based on the co-clustering approach by Dhillon et al and another simpler frequent itemsets based solution
9
Frequent Itemsets Based Approach Based on the concept of frequent itemsets from association analysis domain A frequent itemset is a set of items (features) which occur together more than a specified number of times (referred to as support threshold) in the data set The data has to be binary (only presence or absence is considered)
10
Frequent Itemsets Based Approach (2) Application to gene expression data: Normalization – first along columns (conditions) to remove scaling effects and then along rows (genes) Binarization -Values above a preset threshold λ are set to 1 and the rest to 0. -Values above a preset percentile are set to 1 and the rest to 0. -Split each gene column to three components g +, g 0 and g - signifying the up and down regulation of the gene's expression. This triples the number of items (or genes) Gene expression matrix converted to transaction format data – each experiment is a transaction and contains index values for the genes that were expressed in this experiment
11
Algorithm: Run closed frequent itemset algorithm to generate frequent closed itemsets with a specified support threshold σ Post-Processing: Prune frequent itemsets (set of genes) of length < α For each remaining itemset, scan the transaction data to record all the transactions (experiments) in which this itemset occurs (Note: The combination of these transactions (experiments) and the itemset (genes) will give the desired sets of genes with subsets of conditions they are most tightly co-expressed with) Frequent Itemsets Based Approach (3)
12
Binarization of the gene expression matrix may lose some of the patterns in the data Up-regulation and down-regulation of genes not directly taken into account Setting up right support threshold incorporating the domain knowledge is not trivial Large number of modules obtained – difficult to evaluate biologically Traditional association analysis based approaches only considers dense blocks, noise may break the actual module in this case – Error tolerant Itemsets (ETI) offers a potential solution though Limitations of Frequent Itemsets Based Approach
13
Extends the non-overlapping co-clustering approach by [Dhillon et al, 2003, Banerjee et al 2005] Algorithm Input: Data Matrix A (size: m x n) and k, l (# of row and column clusters) Initialize row and column cluster mappings, X (size: m x k) and Y (size: n x l) Random assignment of rows (or columns) to row (or column) clusters Any traditional one dimensional clustering can be used to initialize X and Y Objective function: ||A – Â|| 2, Â is matrix approximation of A computed as follows: Each element of a co-cluster (obtained using current X and Y) is replaced by mean of co-cluster (a I,J ) Each element of a co-cluster is replaced by (a i,J + a I,j – a I,J ) i.e row mean + column mean – overall mean Alternate Minimization (AM) Based Approach
14
While (converged) -Phase 1: Compute row cluster prototypes (based on current X and matrix A) Compute Bregman distance, d Φ (r i, R r ) - each row to each row cluster prototype Compute probability with which each of m rows fall into each of k row clusters Update row cluster X keeping column cluster Y same (some thresholding is required here to allow limited overlap) Alternate Minimization (AM) Based Approach(2) -Phase 2: Compute column cluster prototypes (based on current Y and matrix A) Compute Bregman distance, d Φ (c j, C c ) - each column to each column cluster prototype Compute probability with which each of n columns fall into each of l column clusters Update column cluster Y keeping row cluster X same -Compute objective function: ||A – Â|| 2 -Check convergence
15
Each row or column can be assigned to multiple row and column clusters respectively by certain probability based on their distances from respective cluster prototypes. This will produce overlapping co- clustering. Maximum overlapping co-clusters that could be obtained = k x l Initialization of X and Y can be done in multiple ways – two ways are explored in the experiments Thresholding to control percent overlap is tricky and requires domain knowledge Cluster Evaluation is important – internal and external SSE, Entropy of each co-cluster Biological evaluation using GO (Gene Ontology) for results on gene expression data Observations
16
Experimental Results (1) Frequent Itemsets Based Approach A synthetic data set (40 X 40) Total Number of co-clusters detected = 3
17
Experimental Results (2) Frequent Itemsets Based Approach Another synthetic data set (40 X 40) Total Number of co-clusters detected = 7 All 4 blocks (in the original data set) were detected Need post-processing to eliminate unwanted co-clusters
18
Experimental Results (3) AM Based Approach Synthetic data sets (20 X 20) Finds co-clusters for each case
19
Experimental Results (4) AM Based Approach on Gene Expression Dataset Human Lymphoma Microarray Data [Described in Cho et al, 2004] # genes = 854 # conditions = 96 k = 5, l = 5, one dimensional k-means for initialization of X and Y Total Number of co-clusters = 25 Input Data Objective Function vs. Iterations A preliminary analysis of the 25 co-clusters show that only one meaningful co-cluster is obtained
20
Conclusions Frequent Itemsets based approach is guaranteed to find dense overlapping co-clusters Error Tolerant Itemset Approach offers a potential solution to address the problem of noise AM based approach is a formal algorithm to find overlapping co-clusters Simultaneously performs clustering in both dimensions while minimizing a global objective function Results on synthetic data prove the correctness of the algorithm Preliminary results on gene expression data show promise and will be further evaluated A key insight here is that application of these techniques to gene expression data requires domain knowledge for pre-processing, initialization, thresholding as well as post-processing of the co- clusters obtained
21
[Bergmann et al, 2003] Sven Bergmann, Jan Ihmels and Naama Barkai, Iterative signature algorithm for the analysis of large-scale gene expression data, Phys. Rev. E 67, pp 31902, 2003 [Liu et al, 2004] Jinze Liu, Paulsen Susan, Wei Wang, Andrew Nobel and Jan Prins, Mining Approximate Frequent Itemset from Noisy Data, Proc. IEEE ICDM, pp. 463-466, 2004 [Cho et al, 2004] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, Minimum sum- squared residue co-clustering of gene expression data. In Proceedings of SIAM Data Mining Conference, pages 114-125, 2004 [Dhillon et al, 2003] Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha, Information-Theoretic Co-Clustering, Proc. ACM SIGKDD, pp. 89-98, 2003 [Banerjee et al, 2004] A generalized maximum entropy approach to bregman co- clustering and matrix approximation. In KDD '04: Proceedings of the 10 th ACM SIGKDD, pages 509-514, 2004 References
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.