Sequential Hierarchical Clustering

Author 1, Author 2, and Author 3
Address of the institution
Authors' email address

Introduction
Clustering is a widely used unsupervised data analysis technique in machine learning. However, many existing clustering methods require all pairwise distances between patterns to be computed in advance, which makes them computationally expensive and difficult to apply to the large-scale data arising in applications such as bioinformatics.

Motivation
"Can a formulation for hierarchical clustering that processes the data sequentially, in a one-pass setting, be designed?"
In this work we propose a novel sequential hierarchical clustering technique that initially builds a hierarchical tree from a small fraction of the entire data; the remaining data are processed sequentially and the tree is adapted constructively. Preliminary results show that the quality of the clusters obtained does not degrade, while the computational needs are reduced.

Methodology
Our approach involves:
- constructing an initial hierarchical tree from a small data sample (e.g., with MC-UPGMA [3]);
- adaptively changing the structure of the tree in an online fashion as each remaining pattern arrives.
This reduces the computational needs while maintaining cluster quality. The novelty threshold θ used in the experiments was determined from the data sample used to construct the initial tree: θ was set to the sum of the pairwise Euclidean distances between the patterns in this sample. Figure 1 shows the flow diagram of the insertion step; sketches of the threshold computation and the insertion procedure are given below.
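The poster defines θ from the initial sample alone. A minimal sketch of that computation, assuming patterns are real-valued vectors; the function name novelty_threshold is ours, not the authors':

```python
import numpy as np

def novelty_threshold(sample):
    """Theta as defined on the poster: the sum of the pairwise Euclidean
    distances between the patterns of the initial data sample."""
    s = np.asarray(sample, dtype=float)   # shape (n_patterns, n_features)
    gaps = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return gaps[np.triu_indices(len(s), k=1)].sum()  # count each pair once
```

Note that a sum (rather than a mean) of pairwise distances grows with the sample size; we follow the poster's wording here.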
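Figure 1 specifies the insertion step only as a flow diagram. The sketch below is one possible reading of it, assuming each internal node stores the arithmetic mean of its leaves (as the poster's remark on categorical features implies); the names Node, make_sibling and insert_point are ours, so treat this as an illustration of the scheme rather than the authors' implementation:

```python
import numpy as np

class Node:
    """Tree node; `vector` holds the arithmetic mean of the node's leaves."""
    def __init__(self, vector, children=()):
        self.vector = np.asarray(vector, dtype=float)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def leaves(node):
    """All leaf nodes in the subtree rooted at `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def make_sibling(x, node):
    """Splice a new internal node above `node`, pairing `node` with a new
    leaf for x; plays the role of new_ROOT when `node` has no parent."""
    old_parent = node.parent
    merged = Node(np.zeros_like(node.vector), children=[node, Node(x)])
    if old_parent is not None:
        old_parent.children[old_parent.children.index(node)] = merged
        merged.parent = old_parent
    ancestor = merged          # refresh the means on the path back to the root
    while ancestor is not None:
        ancestor.vector = np.mean([l.vector for l in leaves(ancestor)], axis=0)
        ancestor = ancestor.parent
    return merged

def insert_point(root, x, theta):
    """One-pass insertion of a new pattern x, following Figure 1."""
    x = np.asarray(x, dtype=float)
    cur = root
    while True:
        if np.linalg.norm(x - cur.vector) > theta:
            merged = make_sibling(x, cur)      # new_ROOT when cur is the root
            return merged if cur is root else root
        if not cur.children:                   # Children = {Φ}: reached a leaf
            merged = make_sibling(x, cur)
            return merged if cur is root else root
        cur = min(cur.children,                # descend towards the closest child
                  key=lambda c: np.linalg.norm(x - c.vector))
```

A new pattern descends towards its closest child until it is either farther than θ from the current node or reaches a leaf; it is then spliced in as a sibling (make_Sibling), or paired with the old root under a new root (new_ROOT), and the ancestor means are refreshed.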
Dataset
To evaluate our algorithm we used two bioinformatics datasets. The first consists of the ten gene expression clusters formed by the clustering algorithm of Eisen et al. [2]. The second comes from a protein fold classification problem constructed by Ding & Dubchak [1] on a subset of the Structural Classification of Proteins (SCOP) database [4]. The latter is essentially a classification problem, but we apply clustering to it (without using the class labels) and evaluate how well the resulting cluster membership matches the clusters returned by Single-Round-MC-UPGMA applied to the entire subset. We combined the training and testing sets provided in [1].

Experiments
Evaluation criteria: Figure 2 shows the tree constructed by the Single-Round-MC-UPGMA scheme on Eisen's clusters labelled B and D in [2]. Figure 3 shows the tree obtained by using 4 points for the initial tree, with the remaining 16 points inserted using our approach; cutting at depth 1 to obtain two clusters yields exactly the same two clusters as in Figure 2. Figure 4 shows the trees obtained on Eisen's clusters (C, B, I and C, B, D, I) and the cuts that return the desired clusters.

Results
The results given in Table 1 are the average of 10 independent runs in which we randomised both the initial subset and the order of presentation of the remaining data.

Table 1: Preliminary results of the hierarchical clustering performed on a subset of SCOP and on Eisen's data. SCOP classes: (1) Beta: ConA-like lectins; (2) Alpha: four-helical up-and-down bundle; (3) Beta: immunoglobulin-like beta-sandwich; (4) A/B: beta/alpha (TIM)-barrel.

Dataset               #data   #data used for initial tree   F1-measure
SCOP (1) & (2)        28      20                            1.0
SCOP (3) & (4)        151     100                           0.9921 ± 0.0086
Eisen (B & D)         20      4
Eisen (C & I)         104     26
Eisen (C, B & I)      113     25
Eisen (C, B, D & I)   124     30                            0.9247 ± 0.0652

Dealing with Categorical/Symbolic Features
Instead of updating the parent nodes with the arithmetic means of their leaf nodes, the most informative child node can be selected to act as the parent.

Discussion & Conclusion
- The proposed technique depends on the quality of the initial tree.
- The approach can be improved by an appropriate choice of the novelty threshold θ.
- The greatest benefit of this technique lies in its application to very large datasets.

Figure 1: Flow diagram of the proposed method. Starting at the root, a new point Xi descends to its closest child while dist(Xi, Cur_Node) ≤ θ; when the threshold is exceeded, Xi is attached by new_ROOT if Cur_Node is the root and by make_Sibling otherwise, and when a leaf is reached (Children = {Φ}) Xi is attached by make_Sibling. The ancestor nodes are then updated.

Figure 2: Hierarchical tree constructed by the Single-Round-MC-UPGMA scheme on Eisen's data clusters labelled B and D [2]. The tree was constructed using the whole data of the two selected clusters. The dendrogram was cut at the root node (dotted line) to obtain two clusters.

Figure 3: Tree constructed by the proposed approach from an initial tree built by Single-Round-MC-UPGMA on Eisen's clusters labelled B and D [2]. The initial tree was constructed with 20% of the data and the rest was clustered using Algorithm 1. The dendrogram was cut at the root node (dotted line) to obtain two clusters.

Figure 4: Trees constructed by our approach on selected clusters of Eisen's data. (a) Clusters B, C, D and I; 30 of the 124 data points were used to construct the initial tree, and the tree was cut at the level indicated by the dotted line to yield four perfect clusters. (b) Clusters B, C and I; 25 of the 113 data points were used to construct the initial tree, and the tree was cut at the level indicated by the dotted line to yield three perfect clusters.

References
[1] Ding, C.H., Dubchak, I. (2001). Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17(4), 349–358.
[2] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA 95(25), 14863–14868.
[3] Loewenstein, Y., Portugaly, E., Fromer, M., Linial, M. (2008). Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 24(13), i41–i49.
[4] Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G., Chothia, C. (2000). SCOP: a structural classification of proteins database. Nucleic Acids Research 28(1), 257–259.