An efficient hybrid clustering algorithm for molecular sequences classification, by Wei-Bang Chen

Why classification? To classify biological sequences (DNA, RNA, and protein) into categories based on their characteristics and functions; this information can be used for disease treatment and drug design. To determine which family an unknown biological sequence belongs to; this can then be used to predict the characteristics and functions of the unknown sequence.

Agglomerative and partition algorithms. Agglomerative algorithm: hierarchical agglomerative algorithms find the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a stopping criterion is met. Partition algorithm: partition clustering algorithms compute a k-way clustering of a set of objects either directly or via a sequence of repeated bisections; examples are k-means and k-medoids.

Agglomerative algorithms [animated merging example in the original slides]

Agglomerative clustering. Use a substitution matrix to measure the similarity of two sequences; a higher score means the two sequences are more similar.
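Concretely, the column-by-column scoring might look like the following minimal Python sketch. The tiny substitution matrix here is illustrative only (a real implementation would use a standard matrix such as BLOSUM62), and the sequences are assumed to be pre-aligned:

```python
# Minimal sketch of substitution-matrix scoring for two pre-aligned sequences.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "K"): -1,
    ("G", "G"): 6, ("G", "K"): -2,
    ("K", "K"): 5,
}

def substitution_score(a, b):
    """Look up the score for a residue pair, in either order."""
    return TOY_MATRIX.get((a, b), TOY_MATRIX.get((b, a), -4))

def similarity(seq1, seq2, gap_penalty=-8):
    """Score two equal-length aligned sequences column by column.

    A higher total score means the two sequences are more similar.
    """
    return sum(gap_penalty if "-" in (a, b) else substitution_score(a, b)
               for a, b in zip(seq1, seq2))

print(similarity("AGK", "AGK"))  # 15: identical sequences score highest
print(similarity("AGK", "GKA"))  # -3: mismatched columns score lower
```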

Agglomerative clustering. Use the similarity scores to build the distance matrix. Merge the two clusters with the highest similarity, then compute the similarity between the new cluster and the remaining clusters; repeat until the desired number of clusters remains.
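A minimal sketch of this merge loop, assuming a precomputed pairwise similarity matrix (for instance from the scoring function above); the use of average-link similarity between clusters and the name `agglomerate` are assumptions of the sketch, not taken from the paper:

```python
# Start with singleton clusters and repeatedly merge the two most similar
# clusters until only k remain. similarity_matrix[i][j] holds the pairwise
# score of sequences i and j.
from itertools import combinations

def agglomerate(similarity_matrix, n, k):
    """Merge n singleton clusters down to k; returns lists of sequence indices."""
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best_pair, best_score = None, float("-inf")
        for a, b in combinations(range(len(clusters)), 2):
            # Average similarity over all cross-cluster sequence pairs.
            score = sum(similarity_matrix[i][j]
                        for i in clusters[a] for j in clusters[b])
            score /= len(clusters[a]) * len(clusters[b])
            if score > best_score:
                best_pair, best_score = (a, b), score
        # Merging and re-averaging plays the role of updating the matrix.
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```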

k-means partition algorithm [animated example in the original slides, with K = 3]

k-means clustering based on HMM. k-means starts by arbitrarily assigning the sequences to clusters.

k-means clustering based on HMM. Calculate the probability model (HMM profile) of each cluster.

k-means clustering based on HMM. Calculate the probability of each sequence under every cluster model, and assign each sequence to the cluster whose model gives it the highest probability. Repeat these two steps until the cluster assignments no longer change.
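The loop can be sketched as below. A per-column residue-frequency profile with add-one smoothing stands in for the full HMM profile used in the actual algorithm, and the sequences are assumed pre-aligned to equal length; `build_profile`, `log_prob`, and `kmeans_profiles` are hypothetical names for this sketch:

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def build_profile(seqs):
    """Per-column residue frequencies with add-one smoothing.

    This simple profile stands in for the HMM profile of a cluster and
    assumes the sequences are pre-aligned to equal length.
    """
    profile = []
    for col in range(len(seqs[0])):
        counts = Counter(s[col] for s in seqs)
        total = len(seqs) + len(ALPHABET)
        profile.append({c: (counts[c] + 1) / total for c in ALPHABET})
    return profile

def log_prob(profile, seq):
    """Log-probability of a sequence under a column-wise profile."""
    return sum(math.log(col[ch]) for col, ch in zip(profile, seq))

def kmeans_profiles(seqs, assignment, k, max_iter=100):
    """Alternate between rebuilding profiles and reassigning sequences."""
    for _ in range(max_iter):
        profiles = []
        for j in range(k):
            members = [s for s, c in zip(seqs, assignment) if c == j]
            if members:
                profiles.append(build_profile(members))
            else:  # an empty cluster keeps a uniform profile in this sketch
                uniform = {c: 1.0 / len(ALPHABET) for c in ALPHABET}
                profiles.append([uniform] * len(seqs[0]))
        new_assignment = [max(range(k), key=lambda c: log_prob(profiles[c], s))
                          for s in seqs]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
    return assignment
```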

Example of protein clustering. The test data set is a multiple alignment of several dozen cytochrome c family protein sequences (entries labeled CCPC, CCRF2S, CCRF2C, ..., CCDV); the full alignment is reproduced in the original slides.

Problem of k-means clustering. The original slides show two example runs as traces of per-sequence reassignments (old cluster -> new cluster) over the iterations. Started from different arbitrary initial assignments, the two runs converge to different final clusterings, so the quality of plain k-means depends heavily on its initialization.

Flowchart of the hybrid algorithm [figure in the original slides: Phase I (agglomerative clustering) feeds its k clusters into Phase II (HMM-based k-means)].
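In code, the two phases chain together roughly as follows, reusing the hypothetical `agglomerate` and `kmeans_profiles` helpers sketched earlier:

```python
def hybrid_cluster(seqs, similarity_matrix, k):
    """Two-phase hybrid sketch: agglomerative seeding, then profile k-means."""
    # Phase I: agglomerative clustering down to k clusters.
    clusters = agglomerate(similarity_matrix, len(seqs), k)
    # Turn the cluster lists into an initial assignment vector for Phase II.
    assignment = [0] * len(seqs)
    for label, members in enumerate(clusters):
        for i in members:
            assignment[i] = label
    # Phase II: profile-based k-means refines the seed clustering.
    return kmeans_profiles(seqs, assignment, k)
```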

Phase I: agglomerative clustering. The original slides trace the run as a sequence of cluster listings: starting from 37 singleton clusters (Current_cluster_number: 37), each step merges the two most similar clusters and decrements the cluster count (36, 35, ...), until only k = 7 clusters remain.

Phase II: k-means clustering based on HMM profiles. Starting from the 7 clusters produced in Phase I, the algorithm builds an HMM profile for each cluster, reassigns every sequence to the profile that scores it highest, and repeats until convergence. The original slides show the intermediate cluster listings and compare the final result against the known family assignment (the "Answer").

Summary. The agglomerative clustering algorithm can seed (train) the partition clustering algorithm to increase its accuracy. Once we have built an HMM profile for every cluster, we can determine which family an unknown sequence belongs to by computing the score of the unknown sequence against all HMM profiles.
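Classification of a new sequence then reduces to scoring it against every cluster profile and taking the best match, as in this sketch (reusing the hypothetical `log_prob` helper from the k-means sketch above):

```python
def classify(unknown_seq, profiles):
    """Assign an unknown sequence to the family whose profile scores it best.

    `profiles` holds one profile per final cluster; a higher log-probability
    means a better match to that family.
    """
    scores = [log_prob(p, unknown_seq) for p in profiles]
    best = max(range(len(profiles)), key=lambda i: scores[i])
    return best, scores[best]
```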

Future work. We want to use the partition clustering algorithm in Phase II to correct inaccurate output obtained from Phase I.

THE END Thank you !!