The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP 790-90 Seminar Spring 2008.

Presentation transcript:


2. Data Mining: Clustering
K-means clustering minimizes: [formula missing from the transcript]
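For reference, the standard K-means objective is the within-cluster sum of squared distances. It is given here as a sketch and not as the slide's own notation; the symbols $K$ (number of clusters), $C_k$ (the $k$-th cluster), $x_i$ (a data point) and $\mu_k$ (the centroid of $C_k$) are assumptions.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Because every attribute contributes to the distance, this objective rewards objects that are close in the full space, which is the property the later slides relax.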

3. Clustering by Pattern Similarity (p-Clustering)
- The micro-array "raw" data shows 3 genes and their values in a multi-dimensional space
  - Parallel coordinates plots
  - Their patterns are difficult to find
- "Non-traditional" clustering

4. Clusters Are Clear After Projection

5. Motivation
- E-Commerce: collaborative filtering
[Table: ratings of Movie 1 through Movie 7 by five viewers; the rating values are not preserved in the transcript.]

6. Motivation

7. Motivation
[Table: ratings of Movie 1 through Movie 7 by five viewers; the rating values are not preserved in the transcript.]

8. Motivation

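9. Motivation
- DNA microarray analysis
[Table: expression values for the genes listed as CTFC, VPS, EFB, SSA, FUN, SP, MDM, CYS, DEP, NTG (rows) under the conditions CH1I, CH1B, CH1D, CH2I, CH2B (columns); the values are not preserved in the transcript.]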

10. Motivation

11. Motivation
- The selected objects exhibit strong coherence on the selected attributes.
  - They are not necessarily close to each other but rather bear a constant shift.
  - Object/attribute bias
- bi-cluster

12. Challenges
- The set of objects and the set of attributes are usually unknown.
- Different objects/attributes may possess different biases, and such biases
  - may be local to the set of selected objects/attributes
  - are usually unknown in advance
- The data may have many unspecified entries.

13. Previous Work
- Subspace clustering
  - Identifies a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by the attributes.
- Collaborative filtering: Pearson R
  - Only considers the global offset of each object/attribute.

14. bi-cluster
- Consists of a (sub)set of objects and a (sub)set of attributes
  - Corresponds to a submatrix
  - Occupancy threshold α
    - Each object/attribute has to be filled by a certain percentage.
  - Volume: number of specified entries in the submatrix
  - Base: average value of each object/attribute (in the bi-cluster)
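The definitions of volume, base and occupancy can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it assumes unspecified entries are stored as NaN, that the occupancy test is "fraction filled >= alpha", and the function names `bicluster_stats` and `meets_occupancy` are hypothetical.

```python
import numpy as np

def bicluster_stats(data, rows, cols):
    """Volume, object bases, attribute bases and overall base of a bi-cluster.

    Assumption: unspecified entries of `data` are encoded as NaN.
    """
    sub = data[np.ix_(rows, cols)]            # the submatrix
    specified = ~np.isnan(sub)
    volume = int(specified.sum())             # number of specified entries
    filled = np.where(specified, sub, 0.0)    # zero out gaps before summing
    with np.errstate(invalid="ignore", divide="ignore"):
        row_base = filled.sum(axis=1) / specified.sum(axis=1)  # per object
        col_base = filled.sum(axis=0) / specified.sum(axis=0)  # per attribute
    overall = filled.sum() / volume if volume else float("nan")
    return volume, row_base, col_base, overall

def meets_occupancy(data, rows, cols, alpha):
    """True if every object and attribute is filled in at least a fraction alpha.

    Assumption: the threshold is applied as '>= alpha'.
    """
    specified = ~np.isnan(data[np.ix_(rows, cols)])
    return bool((specified.mean(axis=1) >= alpha).all()
                and (specified.mean(axis=0) >= alpha).all())
```

For a 3x3 submatrix with one missing entry, the volume is 8, and the object and attribute that contain the gap are each two-thirds filled.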

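15. bi-cluster
[Table: a bi-cluster over the genes listed as CTFC3, VPS, EFB, SSA1, FUN14, SP07, MDM10, CYS, DEP1, NTG1 (rows) and the conditions CH1I, CH1B, CH1D, CH2I, CH2B (columns), with an "Obj base" column and an "Attr base" row; the values are not preserved in the transcript.]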

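16. bi-cluster
- Perfect δ-cluster: every specified entry satisfies $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$
- Imperfect δ-cluster
  - Residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$
  - Here $d_{ij}$ is the entry for object $i$ and attribute $j$, $d_{iJ}$ is the base of object $i$, $d_{Ij}$ is the base of attribute $j$, and $d_{IJ}$ is the base of the whole bi-cluster.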

17. bi-cluster
- The smaller the average residue, the stronger the coherence.
- Objective: identify δ-clusters with residue smaller than a given threshold.
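To make the average residue concrete, the sketch below computes a per-entry residue as entry minus object base minus attribute base plus bi-cluster base, then averages it. Several choices here are assumptions, not statements from the slides: unspecified entries are NaN and contribute a residue of 0, and the average is taken as the mean absolute residue over the specified entries (a mean squared residue is another common choice). The function names are hypothetical.

```python
import numpy as np

def residues(sub):
    """Per-entry residue of a submatrix; unspecified (NaN) entries get 0."""
    specified = ~np.isnan(sub)
    filled = np.where(specified, sub, 0.0)
    # Bases are averages over specified entries only; max(..., 1) avoids 0/0.
    row_base = filled.sum(axis=1, keepdims=True) / np.maximum(
        specified.sum(axis=1, keepdims=True), 1)
    col_base = filled.sum(axis=0, keepdims=True) / np.maximum(
        specified.sum(axis=0, keepdims=True), 1)
    base = filled.sum() / max(int(specified.sum()), 1)
    r = sub - row_base - col_base + base
    return np.where(specified, r, 0.0)

def average_residue(sub):
    """Mean absolute residue over the specified entries (assumed definition)."""
    volume = int((~np.isnan(sub)).sum())
    if volume == 0:
        return 0.0
    return float(np.abs(residues(sub)).sum() / volume)

# Three objects that differ only by a constant shift form a perfect cluster.
shifted = np.array([[1.0, 2.0, 4.0],
                    [3.0, 4.0, 6.0],
                    [11.0, 12.0, 14.0]])
perfect = average_residue(shifted)
```

The three rows of `shifted` are far apart in absolute value yet yield an average residue of zero (up to floating-point error), which is the "constant shift" coherence the model targets.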

18. Cheng-Church Algorithm
- Find one bi-cluster.
- Replace the data in the first bi-cluster with random data.
- Find the second bi-cluster, and so on.
- The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data.
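The find-then-mask loop described above can be sketched as follows. This shows only the outer loop, not the Cheng-Church search itself: `find_one` is a hypothetical caller-supplied function that returns the row and column indices of one bi-cluster, and drawing the replacement values uniformly from the data range is an assumption.

```python
import numpy as np

def mask_bicluster(data, rows, cols, rng):
    """Return a copy of `data` with the given submatrix replaced by random values.

    Assumption: replacements are uniform over the observed data range.
    """
    masked = data.copy()
    lo, hi = np.nanmin(data), np.nanmax(data)
    masked[np.ix_(rows, cols)] = rng.uniform(lo, hi, size=(len(rows), len(cols)))
    return masked

def mine_sequentially(data, k, find_one, rng):
    """Find k bi-clusters one at a time, masking each before the next search.

    `find_one(matrix) -> (rows, cols)` is a hypothetical search routine.
    """
    found = []
    work = data.copy()
    for _ in range(k):
        rows, cols = find_one(work)
        found.append((rows, cols))
        work = mask_bicluster(work, rows, cols, rng)  # noise accumulates here
    return found
```

Each pass searches a matrix that contains more random data than the one before, which is consistent with the degradation noted in the last bullet.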

19. The FLOC algorithm
[Flowchart]
- Generate initial clusters
- Determine the best action for each row and each column
- Perform the best action of each row and column sequentially
- Improved? (Y: iterate again; N: stop)

20. The FLOC algorithm
- Action: the change of membership of a row (or column) with respect to a cluster
- M+N actions are performed at each iteration
[Figure: example matrix with its rows and columns labeled, N=3, M=4]

21. The FLOC algorithm
- Gain of an action: the residue reduction incurred by performing the action
- Order of actions:
  - Fixed order
  - Random order
  - Weighted random order
- Complexity: O((M+N)MNkp)
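The gain of an action can be illustrated for the simplest case of a single cluster. The sketch below is not the FLOC implementation: it assumes a fully specified matrix, uses the mean absolute residue as the cluster residue, represents membership with boolean masks, and the names `avg_residue`, `toggle_gain` and `all_gains` are hypothetical.

```python
import numpy as np

def avg_residue(data, row_mask, col_mask):
    """Mean absolute residue of the submatrix selected by two boolean masks."""
    sub = data[np.ix_(row_mask, col_mask)]
    if sub.size == 0:
        return 0.0  # degenerate empty cluster
    r = (sub - sub.mean(axis=1, keepdims=True)
             - sub.mean(axis=0, keepdims=True) + sub.mean())
    return float(np.abs(r).mean())

def toggle_gain(data, row_mask, col_mask, axis, index):
    """Residue reduction from flipping the membership of one row or column.

    axis=0 toggles a row, axis=1 toggles a column. The input masks are not modified.
    """
    new_rows, new_cols = row_mask.copy(), col_mask.copy()
    target = new_rows if axis == 0 else new_cols
    target[index] = ~target[index]
    return (avg_residue(data, row_mask, col_mask)
            - avg_residue(data, new_rows, new_cols))

def all_gains(data, row_mask, col_mask):
    """Gains of the M+N candidate actions, as (gain, axis, index) tuples."""
    n_rows, n_cols = data.shape
    gains = [(toggle_gain(data, row_mask, col_mask, 0, i), 0, i)
             for i in range(n_rows)]
    gains += [(toggle_gain(data, row_mask, col_mask, 1, j), 1, j)
              for j in range(n_cols)]
    return gains

# Rows 0-2 follow a shift pattern; row 3 does not.
data = np.array([[1.0, 2.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [11.0, 12.0, 14.0],
                 [9.0, 0.0, 5.0]])
rows = np.ones(4, dtype=bool)
cols = np.ones(3, dtype=bool)
best = max(all_gains(data, rows, cols))
```

In this example the highest-gain action is removing the row that breaks the pattern. Tiny or empty clusters trivially have zero residue under this definition, so a pure gain-driven search would presumably need constraints such as a minimum volume to avoid shrinking toward them.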

22. The FLOC algorithm
- Additional features
  - Maximum allowed overlap among clusters
  - Minimum coverage of clusters
  - Minimum volume of each cluster
- These can be enforced by "temporarily blocking" certain actions during the mining process if such an action would violate some constraint.
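One way to picture "temporarily blocking" is a check that runs before an action is applied. The sketch below covers only the minimum-volume constraint and is purely illustrative: `specified` is an assumed boolean matrix marking which entries are present, membership is held in boolean masks, and `is_blocked` is a hypothetical name.

```python
import numpy as np

def volume(specified, row_mask, col_mask):
    """Number of specified entries inside the selected submatrix."""
    return int(specified[np.ix_(row_mask, col_mask)].sum())

def is_blocked(specified, row_mask, col_mask, axis, index, min_volume):
    """True if toggling one row (axis=0) or column (axis=1) would push the
    cluster volume below `min_volume`. The input masks are not modified."""
    new_rows, new_cols = row_mask.copy(), col_mask.copy()
    target = new_rows if axis == 0 else new_cols
    target[index] = ~target[index]
    return volume(specified, new_rows, new_cols) < min_volume
```

A blocked action is simply skipped for the current iteration; it may become admissible again once other actions have changed the cluster.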

23. Performance
- Microarray data: 2884 genes, 17 conditions
  - The 100 bi-clusters with the smallest residue were returned.
  - Average residue = [value missing from the transcript]
  - The average residue of clusters found via the state-of-the-art method in the computational biology field is [value missing from the transcript]
  - The average volume is 25% bigger
  - The response time is an order of magnitude faster

24. Concluding Remarks
- The bi-cluster model is proposed to capture coherent objects in an incomplete data set.
  - base
  - residue
- Many additional features can be accommodated (nearly for free).