University at Buffalo, The State University of New York: Pattern-based Clustering

Presentation transcript:

Pattern-based Clustering
How to cluster the five objects?
- It is hard to define a global similarity measure.

What Is Pattern-based Clustering?
A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al., 2002).

Challenges
- Most clustering approaches do not address the temporal variations in time-series gene expression data, which are an important feature and affect clustering performance.
- Previous approaches try to find coherent patterns and clusters with respect to the entire set of attributes.
- Patterns may be embedded in subspaces of the attributes:
  - Only a subset of genes participates in any cellular process of interest.
  - Any cellular process may take place only in a subset of experimental conditions.
[Figure: (a) raw data, (b) shifting patterns, (c) scaling patterns]

Gene-Sample-Time (GST) Microarray Data
- The GST microarray data consist of three dimensions.
- The samples often exhibit various phenotypes, e.g., cancer vs. control.
[Figure: 2D time-series data for a collection of samples, combined into 3D gene-sample-time data]

Challenges of Mining GST Data
Challenge        2D data                  3D data
Mining process   Partition genes          Partition genes and samples simultaneously
Cluster model    Two types of variables   Three types of variables
Most clustering algorithms were designed for 2D data and cannot be directly extended to 3D data.

Coherent Gene Cluster
- The group of samples (s_j1, s_j2, s_j3) may exhibit the same phenotype.
- The group of genes (g_i1, g_i2, g_i3) may be strongly correlated with the phenotype shared by (s_j1, s_j2, s_j3).
[Figure: a coherent gene cluster, shown in a 3D GST data set and in its 2D representation]

Results from a Real Data Set
The Multiple Sclerosis (MS) data consist of:
- 4324 genes
- 13 MS patients
- 10 time points before and after IFN-β treatment
25 coherent gene clusters were reported.
[Figure: an example of coherent gene clusters (107 genes, 8 samples), with one panel each for Sample A through Sample H]

Other Types of Coherent Clusters

Problem Definition
Given a GST microarray data matrix M, a maximal coherent gene cluster C = (G × S) is a combination of a subset of genes G and a subset of samples S such that:
- Coherent: the genes in G are coherent across the samples in S;
- Significant: |G| ≥ min_g and |S| ≥ min_s, where min_g and min_s are user-specified parameters;
- Maximal: inserting any g ∉ G or any s ∉ S makes C no longer coherent.
The problem of mining coherent gene clusters is to find the complete set of maximal coherent gene clusters in M.

Coherence Measure
- Various coherence measures exist; the choice of measure is application dependent.
- A general coherence model:
  - Given a coherence measure sim() and a user-specified threshold δ,
  - a gene g_a is coherent on samples s_i and s_j if sim(p_ai, p_aj) ≥ δ.
  - Coherent gene matrix (G_1, S_1): every gene g_i ∈ G_1 is coherent across the samples in S_1.
  - Trivial coherent gene matrices: ({g_i}, {s_j}) and (G, {s_j}).
- We choose Pearson's correlation coefficient; other coherence measures are also applicable.
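The coherence test above can be sketched in a few lines of Python. This is only an illustration: it assumes p_ai denotes the time-series profile of gene g_a in sample s_i, the profiles below are made-up numbers, and the function names `pearson` and `is_coherent` are hypothetical. It also shows why Pearson's correlation suits the shifting and scaling patterns mentioned earlier: a shifted or scaled copy of a profile still correlates perfectly with the original.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Flat profiles have no defined correlation; returning 0.0 is a
        # convention chosen for this sketch, not something from the slides.
        return 0.0
    return cov / sqrt(var_x * var_y)


def is_coherent(profile_i, profile_j, delta):
    """A gene is coherent on two samples if sim(p_ai, p_aj) >= delta."""
    return pearson(profile_i, profile_j) >= delta


# Made-up time series of one gene in four samples.
p_a1 = [1.0, 2.0, 4.0, 3.0]
p_a2 = [11.0, 12.0, 14.0, 13.0]  # p_a1 shifted by +10
p_a3 = [2.0, 4.0, 8.0, 6.0]      # p_a1 scaled by 2
p_a4 = [4.0, 3.0, 1.0, 2.0]      # opposite trend

print(is_coherent(p_a1, p_a2, delta=0.8))  # shifted pattern
print(is_coherent(p_a1, p_a3, delta=0.8))  # scaled pattern
print(is_coherent(p_a1, p_a4, delta=0.8))  # opposite trend
```

The first two checks pass and the third fails, since the opposite trend has correlation -1.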

Related Work
- Clustering algorithms on Gene-Sample or Gene-Time microarray data
  - The cluster model is completely different.
- Subspace clustering
  - Finds subsets of objects that are coherent on subsets of attributes.
- Frequent pattern mining
  - Finds subsets of items that appear frequently in transaction databases.

Algorithm Outline
- Phase 1 (pre-processing): for each gene g, find the complete set of maximal coherent sample sets of g.
- Phase 2: compute the complete set of maximal coherent gene clusters from the pre-processing results.

Coherent Sample Sets
Given a gene g, a maximal coherent sample set of g is a subset of samples S_i such that:
- Coherent: g is coherent across S_i;
- Significant: |S_i| ≥ min_s;
- Maximal: there exists no superset S' ⊃ S_i such that g is also coherent across S'.
(g × S_i) is a building block for the coherent gene clusters that include g.

Preprocessing Phase
Suppose min_s = 3.
[Figure: the coherence matrix of gene g over samples s1 to s6 (entries lost in extraction) and the coherence graph of gene g]
{s3, s4, s5, s6} is a coherent sample set of gene g.
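One way to picture the pre-processing step is as maximal clique enumeration on the coherence graph. The sketch below assumes that "coherent across a sample set" means pairwise coherent, so that a coherent sample set is a clique; that reading, the use of the Bron-Kerbosch procedure, and the edge list are all assumptions for illustration. The original matrix entries are not recoverable, so the edges are made up to contain the clique {s3, s4, s5, s6} from the slide.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting).

    adj maps each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already explored
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical coherence graph of one gene: an edge means the gene is
# coherent on that pair of samples (similarity >= delta).
edges = [
    ("s1", "s2"), ("s2", "s3"),
    ("s3", "s4"), ("s3", "s5"), ("s3", "s6"),
    ("s4", "s5"), ("s4", "s6"), ("s5", "s6"),
]
adj = {s: set() for s in ("s1", "s2", "s3", "s4", "s5", "s6")}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

min_s = 3
# Keep only the significant cliques, i.e. those with at least min_s samples.
coherent_sample_sets = [c for c in maximal_cliques(adj) if len(c) >= min_s]
print([sorted(c) for c in coherent_sample_sets])
```

With this made-up graph the only clique of size at least 3 is {s3, s4, s5, s6}; the smaller cliques {s1, s2} and {s2, s3} are dropped by the significance filter.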

Sample-gene Search
- Set enumeration tree
  - Enumerate all subsets of samples systematically.
  - Each node of the tree corresponds to a subset of samples.
- For each node S
  - Find the maximal set of genes G_S that is coherent with S.

Set Enumeration Tree
The set enumeration tree for {a,b,c,d}, level by level:
{}
{a} {b} {c} {d}
{a,b} {a,c} {a,d} {b,c} {b,d} {c,d}
{a,b,c} {a,b,d} {a,c,d} {b,c,d}
{a,b,c,d}
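A short recursive generator reproduces this tree in depth-first order. The function name `enumerate_subsets` is hypothetical; the only idea it relies on is that a child extends its parent with items that come later in a fixed order, which is what makes every subset appear exactly once.

```python
def enumerate_subsets(items, prefix=()):
    """Yield every subset of `items` in depth-first set-enumeration order."""
    yield prefix
    for i, item in enumerate(items):
        # A child only appends items that come after `item` in the fixed
        # order, so no subset is generated twice.
        yield from enumerate_subsets(items[i + 1:], prefix + (item,))


order = list(enumerate_subsets(("a", "b", "c", "d")))
for node in order:
    print("{" + ",".join(node) + "}")
```

The traversal visits {}, {a}, {a,b}, {a,b,c}, {a,b,c,d}, {a,b,d}, and so on, covering all 16 subsets.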

Find the Maximal Coherent Subset of Genes
After the pre-processing phase: given a subset of samples S, how do we find the maximal coherent set of genes G_S?
- Expensive approach: scan the table once. For each S, G_S can be derived by a single scan of the maximal coherent sample sets of all genes: if S ⊆ S_j for some maximal coherent sample set S_j of gene g, then g ∈ G_S.
- Efficient approach: use the inverted list.
Gene   Maximal coherent sample sets
g1     {s1, s2, s3, s4, s5}
g2     {s1, s2, s4}, {s1, s5}
g3     {s1, s2, s3, s4, s5}
g4     {s1, s2, s3}, {s5, s6}
g5     {s1, s5, s6}
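The scanning approach can be written directly from the table on this slide. In this minimal sketch the dictionary layout and the function name `genes_by_scan` are illustrative choices; each call walks over every gene's maximal coherent sample sets, which is why the approach is expensive when repeated for many subsets S.

```python
# The table of maximal coherent sample sets from the slide.
MAXIMAL_SAMPLE_SETS = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}


def genes_by_scan(table, samples):
    """Return G_S by scanning every gene's maximal coherent sample sets.

    A gene belongs to G_S if S is contained in at least one of its sets.
    """
    return sorted(
        gene
        for gene, sample_sets in table.items()
        if any(samples <= sample_set for sample_set in sample_sets)
    )


print(genes_by_scan(MAXIMAL_SAMPLE_SETS, {"s1", "s2", "s3"}))
print(genes_by_scan(MAXIMAL_SAMPLE_SETS, {"s1", "s5"}))
```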

The Inverted List
The table of maximal coherent sample sets for genes (the k-th set of gene g is block g.bk; for example, g2.b1 = {s1, s2, s4} and g2.b2 = {s1, s5}):
Gene   Maximal coherent sample sets
g1     {s1, s2, s3, s4, s5}
g2     {s1, s2, s4}, {s1, s5}
g3     {s1, s2, s3, s4, s5}
g4     {s1, s2, s3}, {s5, s6}
g5     {s1, s5, s6}
The table of inverted lists for samples:
Sample   The inverted list
s1       {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2       {g1.b1, g2.b1, g3.b1, g4.b1}
s3       {g1.b1, g3.b1, g4.b1}
s4       {g1.b1, g2.b1, g3.b1}
s5       {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6       {g4.b2, g5.b1}
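The inverted lists can be derived mechanically from the gene table. The sketch below is one possible construction, with a hypothetical function name `build_inverted_lists`; it numbers each gene's sample sets as blocks b1, b2, ... in the order they are listed, which matches the labels on the slide.

```python
# The table of maximal coherent sample sets from the slide.
MAXIMAL_SAMPLE_SETS = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}


def build_inverted_lists(table):
    """Map each sample to the set of blocks (gene.bk) that contain it."""
    inverted = {}
    for gene, sample_sets in table.items():
        for k, sample_set in enumerate(sample_sets, start=1):
            block = f"{gene}.b{k}"
            for sample in sample_set:
                inverted.setdefault(sample, set()).add(block)
    return inverted


inverted = build_inverted_lists(MAXIMAL_SAMPLE_SETS)
for sample in sorted(inverted):
    print(sample, sorted(inverted[sample]))
```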

Intersection Instead of Scanning
Given a subset of samples S = {s_i1, …, s_ik}, intersect the inverted lists of s_i1, …, s_ik.
- For example, given S = {s1, s2, s3}, L_s1 ∩ L_s2 ∩ L_s3 = {g1.b1, g3.b1, g4.b1}, so G_S = {g1, g3, g4}.
- Suppose the parent of S is S' = {s_i1, …, s_ik-1}; then L_S = L_S' ∩ L_s_ik.
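The example on this slide can be checked with plain set intersection. The sketch below hard-codes the inverted lists from the previous slide; the helper names `blocks_for` and `genes_of` are hypothetical. The last lines illustrate the incremental form, where a child node reuses its parent's list instead of intersecting from scratch.

```python
# Inverted lists copied from the slide.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}


def blocks_for(samples):
    """L_S: intersect the inverted lists of all samples in a non-empty S."""
    return set.intersection(*(INVERTED[s] for s in samples))


def genes_of(blocks):
    """G_S: the genes that own at least one surviving block."""
    return sorted({block.split(".")[0] for block in blocks})


l_s = blocks_for(["s1", "s2", "s3"])
print(sorted(l_s), genes_of(l_s))

# Incremental form: L_S = L_S' intersected with the list of the new sample.
l_parent = blocks_for(["s1", "s2"])
l_child = l_parent & INVERTED["s3"]
print(l_child == l_s)
```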

Anti-monotonic Property
Given a combination (G × S):
- if G is not coherent on S,
- then for any superset S' ⊃ S, G cannot be coherent on S'.
For any descendant S' of S on the tree:
- let G_S be the maximal coherent gene set of S,
- let G_S' be the maximal coherent gene set of S',
- since S' ⊃ S, we have G_S' ⊆ G_S.

Pruning Irrelevant Samples
Given a subset of samples S = {s_i1, …, s_ik}, a sample s_j ∈ tail_S if:
- j > i_k, and
- there exist at least min_g genes g such that g is coherent with S ∪ {s_j}.
Samples s_l ∉ tail_S (irrelevant samples) cannot be used to extend S.

Pruning Unpromising Nodes
Given a subset of samples S = {s_i1, …, s_ik}:
- if |S| + |tail_S| < min_s, then prune the subtree of S;
- let the maximal coherent subset of genes of S be G_S:
  - if there exists (G' × S') such that (S ∪ tail_S) ⊆ S' and G_S ⊆ G',
  - then prune the subtree of S.
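The tail computation and the first pruning rule can be sketched on the running example. The inverted lists are copied from the earlier slide; the sample order s1 to s6, the helper names, and the choice min_g = 3 and min_s = 3 are assumptions made for illustration (min_g = 3 happens to reproduce the tails shown on the example slide). The second rule, pruning against an already found cluster (G' × S'), is not covered here.

```python
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}
ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]  # assumed enumeration order


def blocks_for(samples):
    """L_S for a non-empty list of samples."""
    return set.intersection(*(INVERTED[s] for s in samples))


def gene_count(blocks):
    """Number of distinct genes that own at least one block."""
    return len({block.split(".")[0] for block in blocks})


def tail(samples, min_g):
    """Samples after the last one in S that keep at least min_g coherent genes."""
    last = ORDER.index(samples[-1])
    base = blocks_for(samples)
    return [s for s in ORDER[last + 1:] if gene_count(base & INVERTED[s]) >= min_g]


def is_unpromising(samples, min_g, min_s):
    """First pruning rule: S can never grow to min_s samples."""
    return len(samples) + len(tail(samples, min_g)) < min_s


print(tail(["s1"], min_g=3))
print(tail(["s2"], min_g=3))
print(is_unpromising(["s3"], min_g=3, min_s=3))
```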

Determination of Maximal Coherent Gene Clusters
The depth-first search strategy:
- For any superset S' of S, S' is
  - visited before S,
  - or a child of S.
To determine whether a coherent gene cluster (G_S × S) is maximal:
- check (G_S × S) after visiting all its children,
- report (G_S × S) if it is not subsumed.

[Figure: example search tree; each node S is shown with its tail and, where given, the intersection of its inverted lists]
S              tail_S               L_S
{}
{s1}           {s2, s3, s4, s5}
{s2}           {s3, s4}
{s3}           {}
{s4}           {}
{s1, s2}       {s3, s4}             {g1.b1, g2.b1, g3.b1, g4.b1}
{s1, s3}       {}                   {g1.b1, g3.b1, g4.b1}
{s1, s4}       {}                   {g1.b1, g2.b1, g3.b1}
{s2, s3}       {}                   {g1.b1, g3.b1, g4.b1}
{s2, s4}       {}                   {g1.b1, g2.b1, g3.b1}
{s1, s2, s3}   {}                   {g1.b1, g3.b1, g4.b1}
{s1, s2, s4}   {}                   {g1.b1, g2.b1, g3.b1}
Sample   The inverted list
s1       {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2       {g1.b1, g2.b1, g3.b1, g4.b1}
s3       {g1.b1, g3.b1, g4.b1}
s4       {g1.b1, g2.b1, g3.b1}
s5       {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6       {g4.b2, g5.b1}
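Putting the pieces together, the following is a compact sketch of a sample-gene style search on the inverted lists above. It is not the authors' implementation: it assumes min_g = 3 and min_s = 3, collects all significant candidates during a depth-first traversal, and then filters out subsumed candidates in a separate pass, which is a simplification of the on-the-fly maximality check described on the previous slide. The function names are hypothetical.

```python
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}
ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]  # assumed enumeration order


def genes(blocks):
    """Distinct genes that own at least one block."""
    return frozenset(block.split(".")[0] for block in blocks)


def mine(inverted, order, min_g, min_s):
    """Depth-first search over sample subsets with tail-based pruning."""
    candidates = []

    def visit(samples, blocks, start):
        # tail_S: later samples that still leave at least min_g genes.
        tail = []
        for j in range(start, len(order)):
            child_blocks = blocks & inverted[order[j]]
            if len(genes(child_blocks)) >= min_g:
                tail.append((j, child_blocks))
        if len(samples) + len(tail) < min_s:
            return  # unpromising node: it cannot reach min_s samples
        if len(samples) >= min_s and len(genes(blocks)) >= min_g:
            candidates.append((genes(blocks), frozenset(samples)))
        for j, child_blocks in tail:
            visit(samples + [order[j]], child_blocks, j + 1)

    visit([], set().union(*inverted.values()), 0)
    # Simplified maximality check: drop candidates subsumed by another one.
    return [
        (g, s)
        for g, s in candidates
        if not any((g2, s2) != (g, s) and g <= g2 and s <= s2 for g2, s2 in candidates)
    ]


clusters = mine(INVERTED, ORDER, min_g=3, min_s=3)
for g, s in clusters:
    print(sorted(g), sorted(s))
```

Under these assumed thresholds the search reports two clusters, ({g1, g3, g4} × {s1, s2, s3}) and ({g1, g2, g3} × {s1, s2, s4}), which correspond to the two leaf nodes in the example tree.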

Mining Coherent Gene Clusters
- Systematic enumeration of genes and samples
  - Sample-Gene Search
  - Gene-Sample Search
- Pruning rules
- Determination of whether a coherent gene cluster (G × S) is maximal

Gene-sample Search
                                  Sample-Gene Search                     Gene-Sample Search
Subjects to enumerate             samples                                genes
Number of subjects to enumerate   10^1 ~                                 ~ 10^4
Coherent objects                  Single set of maximal coherent genes   Single or multiple sets of maximal coherent samples
Efficiency on GST data            High                                   Low

Experiment Data Sets
Real-world gene expression data:
- 4324 genes
- 13 multiple sclerosis (MS) patients
- measured before and at 1, 2, 4, 8, 24, 48, 120 and 168 hours after IFN-β treatment
Synthetic data:
- Given the number of genes N_G, samples N_S and coherent gene clusters N_C
- Simulate the pre-processing results
- Embed N_C maximal coherent gene clusters (G × S)

A Coherent Gene Cluster from Real Data

Effect of Parameters
[Figures: number of clusters vs. min_g (min_s = 3, δ = 0.8); number of clusters vs. min_s (min_g = 10, δ = 0.8); number of clusters vs. δ (min_g = 10, min_s = 3)]

Scalability
[Figures: scalability of phase 1; scalability w.r.t. number of genes (number of samples: 30); scalability w.r.t. number of samples (number of genes: 3,000)]

Conclusion
- We define the new problem of mining coherent gene clusters from the novel gene-sample-time microarray data.
- We propose two approaches: the sample-gene search and the gene-sample search.
- We conduct an extensive empirical evaluation on both real and synthetic data sets.

Future Work
New problems from the gene-sample-time microarray data:
- Coherent sample clusters (G × S): for each s ∈ S, any pair of genes g_i, g_j ∈ G has coherent patterns.
- Coherent gene-sample clusters (G × S): both a coherent gene cluster and a coherent sample cluster.