CS 485G: Special Topics in Data Mining


CS 485G: Special Topics in Data Mining
Bi-Clustering Analysis
Jinze Liu

http://www.onmyphd.com/?p=k-means.clustering&ckattempt=1
http://www.jstor.org/stable/2330417?seq=2#page_scan_tab_contents

Outline
- The Curse of Dimensionality
- Co-Clustering: partition-based hard clustering
- Subspace-Clustering: pattern-based

Clustering. K-means clustering minimizes the within-cluster sum of squared distances

$$\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the centroid of cluster $C_i$.
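
A minimal NumPy sketch of this objective and of the standard k-means iteration (illustrative only; no empty-cluster handling):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squared distances that k-means minimizes."""
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids
```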

The Curse of Dimensionality. The dimension of a problem refers to the number of input variables (more precisely, degrees of freedom). The curse of dimensionality: the amount of data required to densely populate the space grows exponentially with the dimension, and pairwise distances concentrate, so points become nearly equidistant in high-dimensional space.
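
A small illustrative experiment (NumPy/SciPy; sample sizes made up): for points drawn uniformly from $[0, 1]^d$, the ratio between the largest and smallest pairwise distance shrinks toward 1 as the dimension grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200
for d in (1, 2, 10, 100, 1000):
    X = rng.random((n, d))   # n points drawn uniformly from [0, 1]^d
    dists = pdist(X)         # all n*(n-1)/2 pairwise Euclidean distances
    print(f"d={d:4d}  max/min distance ratio = {dists.max() / dists.min():.2f}")
```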

Motivation
- Document clustering: define a similarity measure, then cluster the documents using e.g. k-means.
- Term clustering: symmetric with document clustering.

Motivation
- Hierarchical clustering of genes.
- Hierarchical clustering of patients.
(Figure: a genes-by-patients expression matrix clustered along both axes.)

Contingency Tables
- Let X and Y be discrete random variables taking values in {1, 2, …, m} and {1, 2, …, n}, respectively.
- p(X, Y) denotes their joint probability distribution; if it is not known, it is often estimated from co-occurrence data.
- Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
- Key obstacles in clustering contingency tables: high dimensionality, sparsity, and noise, hence the need for robust and scalable algorithms.
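
A minimal sketch of the estimation step (NumPy; the co-occurrence counts are made up for illustration): the empirical joint distribution is simply the count matrix normalized by its total.

```python
import numpy as np

# Hypothetical word-by-document co-occurrence counts (m = 3 words, n = 4 documents).
counts = np.array([[4, 0, 1, 2],
                   [0, 3, 0, 5],
                   [2, 1, 6, 0]], dtype=float)

p_xy = counts / counts.sum()   # empirical joint distribution p(X, Y)
p_x = p_xy.sum(axis=1)         # marginal p(X) over words
p_y = p_xy.sum(axis=0)         # marginal p(Y) over documents
```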

Co-Clustering
- Simultaneously cluster the rows of p(X, Y) into k disjoint groups and its columns into l disjoint groups.
- The key goal is to exploit the "duality" between row and column clustering to overcome sparsity and noise.

Co-Clustering Example for Text Data. Co-clustering groups words and documents simultaneously using the underlying word-document co-occurrence frequency matrix. (Figure: the co-occurrence matrix with rows grouped into word clusters and columns into document clusters.)

Result of Co-Clustering: http://adios.tau.ac.il/SpectralCoClustering/
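
scikit-learn ships a spectral co-clustering implementation in the same spirit as that demo; a minimal sketch on synthetic data (all parameters illustrative, not necessarily the demo's exact method):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# Synthetic matrix with 3 planted row/column blocks plus noise.
data, _, _ = make_biclusters(shape=(30, 20), n_clusters=3, noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)
print(model.row_labels_)      # cluster id for each row ("word")
print(model.column_labels_)   # cluster id for each column ("document")
```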

Clustering by Patterns

Clustering by Pattern Similarity (p-Clustering)
- The raw microarray data shows 3 genes and their values in a multi-dimensional space.
- Even in a parallel-coordinates plot, their common pattern is difficult to spot.
- Finding such patterns calls for "non-traditional" clustering.

Clusters Are Clear After Projection

Motivation. E-Commerce: collaborative filtering. (Table: a ratings matrix of Viewers 1-5 by Movies 1-7; the cell values and the highlighting used across the original sequence of slides did not survive extraction.)

Motivation. DNA microarray analysis: expression levels of ten yeast genes under five conditions; some entries are missing (rows whose gap positions are unclear are listed value-by-value).

Gene    CH1I  CH1B  CH1D  CH2I  CH2B
CTFC3   4392   284  4108   280   228
VPS8     401   281   120   275   298
EFB1     318     ?    37   277   215
SSA1    (292, 109, 580, 238; one entry missing)
FUN14   2857   285  2576   271   226
SP07    (290, 48, 224; two entries missing)
MDM10   (538, 272, 266, 236; one entry missing)
CYS3     322   288    41   278   219
DEP1    (312, 40, 273, 232; one entry missing)
NTG1    (329, 296, 33, 274; one entry missing)

Motivation. The selected objects exhibit strong coherence on the selected attributes: they are not necessarily close to each other, but their values differ by a constant shift (an object/attribute bias).

bi-cluster
- Consists of a (sub)set of objects and a (sub)set of attributes; corresponds to a submatrix.
- Occupancy threshold α: each object/attribute has to be filled to a certain percentage.
- Volume: the number of specified entries in the submatrix.
- Base: the average value of each object/attribute within the bi-cluster.

bi-cluster. Example: genes {VPS8, EFB1, CYS3} on conditions {CH1I, CH1D, CH2B} form a bi-cluster of the microarray table above.

           CH1I  CH1D  CH2B | Obj base
VPS8        401   120   298 |  273
EFB1        318    37   215 |  190
CYS3        322    41   219 |  194
Attr base   347    66   244 |

bi-cluster. For a bi-cluster on object set I and attribute set J, write $d_{ij}$ for an entry, $d_{iJ}$ for the base (mean) of row $i$, $d_{Ij}$ for the base of column $j$, and $d_{IJ}$ for the mean of the whole submatrix. The residue of entry $(i, j)$ is

$$r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}.$$

A perfect δ-cluster has zero residue at every entry; an imperfect δ-cluster is scored by its mean squared residue $H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} r_{ij}^2$.

bi-cluster. The smaller the average residue, the stronger the coherence. Objective: identify δ-clusters, i.e., bi-clusters whose mean squared residue is below a given threshold δ.
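
A minimal NumPy sketch of the mean squared residue, checked on the VPS8/EFB1/CYS3 bi-cluster above (which turns out to be a perfect bi-cluster with residue 0):

```python
import numpy as np

def mean_squared_residue(sub):
    """Mean squared residue H(I, J) of a submatrix, per the definition above."""
    row_base = sub.mean(axis=1, keepdims=True)   # d_iJ
    col_base = sub.mean(axis=0, keepdims=True)   # d_Ij
    all_base = sub.mean()                        # d_IJ
    residue = sub - row_base - col_base + all_base
    return (residue ** 2).mean()

bicluster = np.array([[401.0, 120.0, 298.0],    # VPS8
                      [318.0,  37.0, 215.0],    # EFB1
                      [322.0,  41.0, 219.0]])   # CYS3
print(mean_squared_residue(bicluster))          # 0.0 -- a perfect bi-cluster
```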

Cheng-Church Algorithm
1. Find one bi-cluster.
2. Replace its data with random values.
3. Find the next bi-cluster, and repeat.
The quality of later bi-clusters degrades (smaller volume, higher residue) because of the inserted random data.
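
A simplified sketch of this loop, reusing mean_squared_residue from above. The find-one-bi-cluster step here is single-node deletion only (the full algorithm also has multiple-node deletion and node-addition phases); delta and the masking range are illustrative:

```python
import numpy as np

def find_bicluster(A, delta):
    """Greedily drop the worst row or column until the residue falls below delta."""
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    while len(rows) > 1 and len(cols) > 1:
        sub = A[np.ix_(rows, cols)]
        if mean_squared_residue(sub) <= delta:
            break
        res = sub - sub.mean(1, keepdims=True) - sub.mean(0, keepdims=True) + sub.mean()
        row_scores = (res ** 2).mean(axis=1)   # each row's contribution to the MSR
        col_scores = (res ** 2).mean(axis=0)
        if row_scores.max() >= col_scores.max():
            rows.pop(int(row_scores.argmax()))
        else:
            cols.pop(int(col_scores.argmax()))
    return rows, cols

def cheng_church(A, delta, n_biclusters, seed=0):
    rng = np.random.default_rng(seed)
    A = A.astype(float).copy()
    found = []
    for _ in range(n_biclusters):
        rows, cols = find_bicluster(A, delta)
        found.append((rows, cols))
        # Mask the found bi-cluster with random values so it is not rediscovered.
        A[np.ix_(rows, cols)] = rng.uniform(A.min(), A.max(), (len(rows), len(cols)))
    return found
```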

The FLOC algorithm
1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. If the clustering improved, return to step 2; otherwise stop.

The FLOC algorithm. An action is the change of membership of a row (or column) with respect to a cluster; M + N actions, one per row and per column, are performed at each iteration. (Figure: an example grid with M = 4 columns and N = 3 rows of candidate actions.)

The FLOC algorithm
- Gain of an action: the residue reduction incurred by performing the action.
- Order of actions: fixed, random, or weighted random.
- Complexity: O((M + N)MNkp).
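
A sketch of the gain of one candidate action (toggling a row's membership in a cluster), again reusing mean_squared_residue; this is a hypothetical helper, not the paper's exact bookkeeping, and it does not guard against emptying the cluster:

```python
import numpy as np

def row_action_gain(A, rows, cols, i):
    """Residue reduction from toggling row i's membership in bi-cluster (rows, cols)."""
    before = mean_squared_residue(A[np.ix_(rows, cols)])
    new_rows = [r for r in rows if r != i] if i in rows else rows + [i]
    after = mean_squared_residue(A[np.ix_(new_rows, cols)])
    return before - after   # a positive gain means the action reduces the residue
```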

The FLOC algorithm. Additional features: a maximum allowed overlap among clusters, a minimum coverage of clusters, and a minimum volume per cluster. These can be enforced by "temporarily blocking" an action during the mining process if it would violate a constraint.

Performance. Microarray data: 2884 genes, 17 conditions. The 100 bi-clusters with the smallest residue were returned, with an average residue of 10.34, versus 12.54 for the state-of-the-art method in the computational biology field; the average volume is 25% larger, and the response time is an order of magnitude faster.

Concluding Remarks. The bi-cluster model, built on the notions of base and residue, is proposed to capture coherent objects in incomplete data sets. Many additional features can be accommodated, nearly for free.