A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets Ashok Sharma, Robert Podolsky, Jieping Zhao and Richard A. McIndoe BIOINFORMATICS, 2009

Outline Introduction Methods Results Discussion

Introduction Three common clustering algorithms: hierarchical clustering, K-means, and self-organizing maps (SOM).

Introduction With technological advances, the size of the dataset has become an issue in cluster analysis. Performing cluster analysis on large datasets requires substantial memory and CPU time.

Introduction A two-stage algorithm, HPCluster, is proposed: The first stage reduces the data complexity. The second stage clusters the resulting partitions from the first stage. Pipeline: gene expression profiles → reduce the data complexity → cluster the resulting partitions → clustering results.

Methods Algorithm: HPCluster Estimation of the maximum diameter of a CF Initial centroid calculation schemes

Algorithm: HPCluster HPCluster works in two stages: (1) reduction of the dataset using cluster features (CFs); (2) conventional K-means clustering on the CFs. How is the scale of the dataset reduced? Pipeline: gene expression profiles → reduce the data complexity (cluster features) → cluster the resulting partitions (K-means) → clustering results.

How to reduce the scale of dataset? Cluster features (CFs): a CF summarizes a group of closely related data points and is defined as CF = (N, LS, SS), where N is the number of data points in the CF, LS is the linear sum of the data point vectors, and SS is the sum of the squares of the data points.

How to reduce the scale of dataset? LS = Σ_{i=1}^{N} x_i and SS = Σ_{i=1}^{N} ||x_i||², where x_i is the d-dimensional data point vector.

How to reduce the scale of dataset? The centroid, radius and diameter of a CF can be calculated from (N, LS, SS): centroid c = LS / N; radius R = sqrt(SS/N − ||c||²), the average distance of the points to the centroid; diameter D = sqrt((2N·SS − 2||LS||²) / (N(N − 1))), the average pairwise distance within the CF.
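The CF statistics above can be maintained incrementally without keeping any raw points. A minimal sketch (the `ClusterFeature` name and method layout are illustrative, not the paper's C# implementation):

```python
import numpy as np

class ClusterFeature:
    """A cluster feature (N, LS, SS) as defined on the previous slides."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1                 # N: number of data points in the CF
        self.ls = p.copy()         # LS: linear sum of the point vectors
        self.ss = float(p @ p)     # SS: sum of squared point norms

    def absorb(self, point):
        """Add one point; CFs are additive, so only (N, LS, SS) change."""
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R = sqrt(SS/N - ||c||^2): average distance of points to centroid
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))

    def diameter(self):
        # D = sqrt((2*N*SS - 2*||LS||^2) / (N*(N-1))): average pairwise distance
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * float(self.ls @ self.ls)
        return np.sqrt(max(num / (self.n * (self.n - 1)), 0.0))
```

For two points at distance 2, the formulas give a diameter of 2 and a radius of 1, matching their geometric meaning.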

HPCluster: The first stage
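The flowchart for this slide is not reproduced in the transcript. A minimal sketch of a first-stage single scan, under the simplified assumption that a point is absorbed into the nearest CF when it lies within the maximum diameter of that CF's centroid (the paper's exact absorption test may differ):

```python
import numpy as np

def build_cfs(data, d_max):
    """One pass over the data: absorb each point into the nearest CF
    if its centroid is within d_max; otherwise start a new CF.
    Each CF is stored as [n, linear_sum, sum_of_squares]."""
    cfs = []
    for p in np.asarray(data, dtype=float):
        best, best_d = None, np.inf
        for cf in cfs:
            d = np.linalg.norm(cf[1] / cf[0] - p)   # distance to CF centroid
            if d < best_d:
                best, best_d = cf, d
        if best is not None and best_d <= d_max:
            best[0] += 1
            best[1] = best[1] + p
            best[2] += float(p @ p)
        else:
            cfs.append([1, p.copy(), float(p @ p)])
    return cfs
```

On data with two well-separated groups and a small d_max, this scan collapses each group into a single CF, which is the data reduction the second stage relies on.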

Estimation of the maximum diameter of a CF The degree of data reduction depends upon the maximum allowable size of the CFs. How do we estimate the initial maximum diameter?

Estimation of the maximum diameter of a CF We use a 90-10 rule to assess the similarity of the data between nodes of a cluster, based on the observation that ∼90% of the agglomerations in a hierarchical clustering of non-uniform data initially have very small distances compared with the maximum closest-pair distances (the last 10% of agglomerations).

Estimation of the maximum diameter of a CF We randomly choose 10% of the genes and perform hierarchical clustering on them. The inflection point of the distance plot marks the sharp increase from the smallest distance pairs to the largest distance pairs (the last 10% of agglomerations).
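The sampling-plus-inflection-point idea above can be sketched as follows. This assumes SciPy's hierarchical clustering and locates the inflection as the largest jump between consecutive merge distances; the function name and the exact inflection criterion are assumptions, not the paper's procedure verbatim:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def estimate_max_diameter(data, sample_frac=0.1, rng=None):
    """Estimate an initial maximum CF diameter: hierarchically cluster a
    random sample and return the merge distance just before the largest
    jump in the sorted agglomeration-distance curve."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    n = max(2, int(len(data) * sample_frac))
    sample = data[rng.choice(len(data), size=n, replace=False)]
    merge_d = linkage(sample, method="single")[:, 2]  # agglomeration distances
    jumps = np.diff(merge_d)                          # gaps between merges
    return merge_d[int(np.argmax(jumps))]
```

On data with tight, widely separated groups, the estimate stays near the small within-group distances rather than the large between-group merges.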

HPCluster: The second stage

Initial centroid calculation schemes Because the second stage uses the K-means algorithm, the initial centroids must be set. Two schemes: Dense regions from data (DEN) Random initial assignments (RIA)

Dense regions from data (DEN) After completing the first stage, we sort the CFs by N (the number of data points in each CF) and use the centroids of the top k CFs as the initial centroids.

Random initial assignments (RIA) The initial centroids are calculated by randomly assigning each gene to a cluster and then computing the mean of each random cluster. For example, if k = 3, each gene is randomly assigned to one of three clusters (C1, C2, C3), and the three cluster means become the three initial centroids.
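The two initialization schemes can be sketched as below. Function names are illustrative; the empty-cluster fallback in the RIA sketch is an added safeguard, not something the slides specify:

```python
import numpy as np

def den_init(cf_counts, cf_centroids, k):
    """DEN: sort CFs by N (point count, densest first) and take the
    centroids of the top k CFs as initial centroids."""
    order = np.argsort(cf_counts)[::-1]
    return np.asarray(cf_centroids)[order[:k]]

def ria_init(data, k, rng=None):
    """RIA: randomly assign each gene to one of k clusters and use each
    random cluster's mean as an initial centroid."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    labels = rng.integers(0, k, size=len(data))
    centroids = []
    for j in range(k):
        members = data[labels == j]
        # assumed guard: fall back to a random point if a cluster got no genes
        centroids.append(members.mean(axis=0) if len(members)
                         else data[rng.integers(len(data))])
    return np.array(centroids)
```

DEN is deterministic given the first-stage CFs, while RIA's centroids all start near the global mean, which explains why the two schemes can behave differently on centered versus un-centered data.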

Results Experimental setting Significantly reduced time to completion Evaluation of cluster quality Effect of incorrect cluster number

Experimental setting The machine: a Dell PowerEdge 2650 with dual 3.06 GHz/512 KB cache Xeon processors and 8.0 GB DDR 266 MHz RAM. The K-means and SOM algorithms were run in R. Eisen Cluster software versions 2.0 and 3.0 were also compared.

Experimental setting HPCluster was written using the .NET Framework v1.1 and C#.

Experimental setting Data standardization for centering: z_ij = (x_ij − μ_i) / σ_i, where x_ij is the expression value of the i-th gene on the j-th array, μ_i is the mean expression of the i-th gene, and σ_i is the standard deviation of the i-th gene.
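The standardization formula above, applied gene-by-gene across arrays, can be written as a short NumPy sketch (the zero-variance guard is an added assumption for constant genes):

```python
import numpy as np

def standardize_genes(x):
    """Center and scale each gene (row) across arrays (columns):
    z_ij = (x_ij - mu_i) / sigma_i."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=1, keepdims=True)       # mu_i per gene
    sigma = x.std(axis=1, keepdims=True)     # sigma_i per gene
    sigma[sigma == 0] = 1.0                  # leave constant genes centered only
    return (x - mu) / sigma
```

After this transform every non-constant gene has mean 0 and standard deviation 1 across the arrays.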

Significantly reduced time to completion We clustered both the simulated and real microarray datasets, for both centered and un-centered data, and recorded the time to completion. Each dataset had 100 arrays, with varying numbers of genes (10 000, 22 283 and 44 760) and clusters (4, 10 and 20); each configuration was run 12 times.

Significantly reduced time to completion The comparison result for centered data:

Significantly reduced time to completion The comparison result for un-centered data:

Evaluation of cluster quality To evaluate the quality of the cluster assignments produced by HPCluster, we use the adjusted Rand index (ARI). Accuracy was evaluated on the simulated datasets using the ARI calculated between the output cluster partitions and the true partitions.
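The ARI is a standard measure; a self-contained NumPy sketch of the usual formula (contingency table plus pair counts) is below. This is the textbook definition, not code from the paper:

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, C(x, 2), elementwise."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(true_labels, pred_labels):
    """ARI between two partitions of the same items."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    classes, t_idx = np.unique(t, return_inverse=True)
    clusters, p_idx = np.unique(p, return_inverse=True)
    # contingency table: co-occurrence counts of (true class, found cluster)
    cont = np.zeros((len(classes), len(clusters)))
    np.add.at(cont, (t_idx, p_idx), 1)
    index = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()    # pairs within true classes
    sum_b = comb2(cont.sum(axis=0)).sum()    # pairs within found clusters
    expected = sum_a * sum_b / comb2(len(t))
    max_index = (sum_a + sum_b) / 2.0
    return (index - expected) / (max_index - expected)
```

ARI is 1 for partitions identical up to relabeling and near 0 for agreement no better than chance, which is why it suits comparing output partitions to the true simulated partitions.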

Evaluation of cluster quality The result: the average ARI for the accuracy of the gene assignments, for 10 000 genes and 20 true clusters, for both the centered and un-centered data, after running each program 12 times.

Evaluation of cluster quality We also examined cluster stability for analyses involving the experimental data: the stability plots for both the centered and un-centered 22 283-gene experimental dataset using k = 10 clusters.

Effect of incorrect cluster number Determining the appropriate number of clusters to use can be difficult and has an impact on the accuracy and usefulness of the resulting cluster analysis. To determine the effect of using an incorrect number of clusters, we ran HPCluster under varying numbers of clusters.

Effect of incorrect cluster number The result for centered data:

Effect of incorrect cluster number The result for un-centered data:

Discussion Partitioning methods such as K-means or SOM are more suitable for clustering larger datasets. What factors affect the speed and accuracy of clustering methods? The size and complexity of the data, and how the data are processed before analysis. In this study, we investigated the effect of the number of genes, clusters and arrays on the time to completion, as well as the effect of preprocessing the data.

Discussion Depending on the design of the microarray experiment, it may be advantageous to center your data before beginning a cluster analysis. All the programs we tested were sensitive to whether or not the data was centered prior to cluster analysis. From the results: when the data is centered, use the DEN initialization; when the data is un-centered, use the RIA initialization.

Discussion The GEO at NCBI currently has 706 Homo sapiens datasets and over 200 000 gene expression measurements. As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical.

Gene Expression Omnibus (GEO) The website: