The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Hierarchical Co-Clustering Based on Entropy Splitting
Wei Cheng, Xiang Zhang, Feng Pan, Wei Wang


Hierarchical Co-Clustering Based on Entropy Splitting
Wei Cheng [1], Xiang Zhang [2], Feng Pan [3], Wei Wang [4]
[1] University of North Carolina at Chapel Hill, [2] Case Western Reserve University, [3] Microsoft, [4] University of California, Los Angeles
Speaker: Wei Cheng
The 21st ACM Conference on Information and Knowledge Management (CIKM'12)

Idea of Co-Clustering
Co-clustering:
- Combine the row clustering and column clustering of a co-occurrence matrix so that each bootstraps the other.
- Simultaneously cluster the rows X and columns Y of the co-occurrence matrix.

Hierarchical Co-Clustering Based on Entropy Splitting
View the (scaled) co-occurrence matrix as a joint probability distribution between the row and column random variables.
Objective: find a hierarchical co-clustering containing a given number of clusters while preserving as much mutual information between the row and column clusters as possible.
[Example: a 4x4 co-occurrence matrix over rows r1-r4 and columns c1-c4]
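The objective on this slide can be made concrete with a small sketch (the function name and example matrices are illustrative, not from the paper): normalize the co-occurrence matrix into a joint distribution p(x, y), then measure the mutual information between the row and column variables.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information I(X; Y), in bits, of the joint distribution
    obtained by normalizing a co-occurrence matrix."""
    p = counts / counts.sum()          # joint p(x, y)
    px = p.sum(axis=1, keepdims=True)  # row marginals p(x)
    py = p.sum(axis=0, keepdims=True)  # column marginals p(y)
    nz = p > 0                         # skip zero cells: 0 * log 0 = 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

# A block-structured matrix carries high mutual information between
# its row and column variables; a uniform matrix carries none.
block = np.array([[5, 5, 0, 0],
                  [5, 5, 0, 0],
                  [0, 0, 5, 5],
                  [0, 0, 5, 5]], dtype=float)
uniform = np.ones((4, 4))
print(mutual_information(block))    # 1.0 bit
print(mutual_information(uniform))  # 0.0
```

A good co-clustering keeps this quantity high after rows and columns are merged into clusters, which is exactly what the splitting criterion below targets.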

Hierarchical Co-Clustering Based on Entropy Splitting
Co-occurrence matrices induce a joint probability distribution between the row-cluster and column-cluster random variables.

Hierarchical Co-Clustering Based on Entropy Splitting
Pipeline (recursive splitting):
While the termination condition is not met:
- Find the optimal row or column cluster split, i.e., the split that achieves the maximal mutual information between row and column clusters.
- Update the cluster indicators.
Termination condition: the given number of row and column clusters has been reached.
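The pipeline above can be sketched as a loop. This is a minimal sketch under stated assumptions: `best_split` stands in for the paper's entropy-based splitter, and all names are hypothetical.

```python
import numpy as np

def hierarchical_co_cluster(counts, k_rows, k_cols, best_split):
    """Sketch of the recursive-splitting pipeline. `best_split` is a
    caller-supplied function (the entropy-based splitter would plug in
    here) that picks one cluster on an allowed axis and returns
    (axis, cluster_index, (half1, half2))."""
    row_clusters = [list(range(counts.shape[0]))]
    col_clusters = [list(range(counts.shape[1]))]
    # Termination condition: the requested numbers of row and column
    # clusters have been produced.
    while len(row_clusters) < k_rows or len(col_clusters) < k_cols:
        allowed = []
        if len(row_clusters) < k_rows:
            allowed.append("row")
        if len(col_clusters) < k_cols:
            allowed.append("col")
        axis, i, (s1, s2) = best_split(counts, row_clusters, col_clusters, allowed)
        target = row_clusters if axis == "row" else col_clusters
        target[i:i + 1] = [s1, s2]  # replace the cluster with its two halves
    return row_clusters, col_clusters
```

Because each step replaces one cluster with two children, the sequence of splits directly records the hierarchy over clusters.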

Hierarchical Co-Clustering Based on Entropy Splitting
How do we find an optimal split at each step? An entropy-based splitting algorithm:
Input: cluster S.
- Randomly split cluster S into S1 and S2.
- Until convergence:
  - For every element x in S, re-assign it to S1 or S2 so as to minimize the entropy-based objective (the loss in mutual information).
  - Update the cluster indicators and probability values.
The algorithm converges to a local optimum.
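One way the splitting step could be instantiated is sketched below. The re-assignment rule, moving each row to the half whose pooled column distribution is closer in KL divergence, is an assumption in the spirit of information-theoretic co-clustering, not necessarily the paper's exact objective, and all names are illustrative.

```python
import numpy as np

def entropy_split(counts, S, seed=0):
    """Sketch: split row cluster S into (S1, S2) by random
    initialization followed by k-means-style re-assignment."""
    rng = np.random.default_rng(seed)

    def col_dist(rows):
        v = counts[rows].sum(axis=0) + 1e-12  # smooth to avoid log(0)
        return v / v.sum()

    def kl(p, q):
        return float((p * np.log(p / q)).sum())

    S = [S[i] for i in rng.permutation(len(S))]
    S1, S2 = S[:1], S[1:]                     # random initial split
    for _ in range(100):                      # converges to a local optimum
        q1, q2 = col_dist(S1), col_dist(S2)
        changed = False
        for x in S:
            p = col_dist([x])
            want = S1 if kl(p, q1) <= kl(p, q2) else S2
            have = S1 if x in S1 else S2
            if want is not have and len(have) > 1:
                have.remove(x)
                want.append(x)
                changed = True
        if not changed:                       # no re-assignment: converged
            break
    return S1, S2
```

On a matrix with two clear row blocks, the loop recovers the block structure after a couple of passes regardless of which row seeds S1.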

Hierarchical Co-Clustering Based on Entropy Splitting
Example: [4x4 co-occurrence matrix over rows X1-X4 and columns Y1-Y4]
S = {X1, X2, X3, X4}
Random split: S1 = {X1}, S2 = {X2, X3, X4}
Re-assign X4 to S1: S1 = {X1, X4}, S2 = {X2, X3}
A naive method would need to try 7 splits here; in general the number of candidate splits is exponential in the size of S.
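The count of 7 comes from enumerating every way to divide the 4 rows into two non-empty halves, 2^(n-1) - 1 of them, which is the exhaustive search the entropy-based splitter avoids (the helper name is illustrative):

```python
from itertools import combinations

def all_splits(S):
    """Every unordered split of S into two non-empty subsets."""
    S = list(S)
    rest = S[1:]
    splits = []
    # Fix S[0] in the first half so each split is counted once.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            s1 = [S[0], *combo]
            s2 = [x for x in rest if x not in combo]
            if s2:                      # skip the empty second half
                splits.append((s1, s2))
    return splits

print(len(all_splits(["X1", "X2", "X3", "X4"])))  # 7  (= 2**(4-1) - 1)
```

For a cluster of 30 rows this would already be over half a billion candidate splits, which is why the local re-assignment scheme is used instead.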

Experiments
Data sets:
- Synthetic data
- 20 Newsgroups data (20 classes of documents)

Results: Synthetic Data
[Figure: (a) a synthetic ...x1000 matrix; (b) noise added to (a) by flipping values with probability 0.3; (c) rows and columns of (b) randomly permuted; (d) the clustering result, with hierarchical structure recovered]

Results: 20 Newsgroups Data
Comparison with baselines:
[Table: micro-averaged precision (m-pre) and number of clusters for HICC, NVBD, ICC, and HCC on the Multi5, Multi5 subject, Multi10, and Multi10 subject datasets; some entries are N/A]
[Table: m-pre and number of clusters for HICC (merged) versus the hierarchical baselines Single-Link, UPGMA, WPGMA, and Complete-Link]
Micro-averaged precision: M/N, where M is the number of documents correctly clustered and N is the total number of documents.
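The micro-averaged precision defined on this slide can be sketched as follows; crediting each cluster with its majority label is a common convention and an assumption here, and the function name and example labels are illustrative.

```python
def micro_averaged_precision(clusters, labels):
    """m-pre = M / N: each cluster is credited with its majority label;
    M counts the documents matching their cluster's majority label."""
    N = sum(len(c) for c in clusters)
    M = 0
    for c in clusters:
        tally = {}
        for doc in c:
            tally[labels[doc]] = tally.get(labels[doc], 0) + 1
        M += max(tally.values())       # documents agreeing with the majority
    return M / N

# Two clusters over five documents: one document is misplaced.
labels = {0: "sci", 1: "sci", 2: "rec", 3: "rec", 4: "rec"}
print(micro_averaged_precision([[0, 1, 2], [3, 4]], labels))  # 0.8
```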

Thank You! Questions?