Published by Delphia Gregory. Modified over 9 years ago.
1
An efficient hybrid clustering algorithm for molecular sequences classification Wei-Bang Chen
2
Why classification? Biological sequences (DNA, RNA, and protein) are classified into categories based on their characteristics and functions. This information can be used for disease treatment and drug design. Classification also determines which family an unknown biological sequence belongs to, which in turn can be used to predict the characteristics and functions of that sequence.
3
Agglomerative and Partition Algorithms Agglomerative algorithm: hierarchical agglomerative algorithms find clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a stopping criterion is met. Partition algorithm: partition clustering algorithms compute a k-way clustering of a set of objects either directly or via a sequence of repeated bisections. Examples: k-means, k-medoids.
4
Agglomerative algorithms [Figure: animation of six objects, numbered 1 to 6, merged step by step until done]
5
Agglomerative clustering A substitution matrix is used to measure the similarity of two sequences. A higher score means the two sequences are more similar.
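The scoring idea on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the scoring values (+2 match, -1 mismatch, -2 gap) are a toy stand-in for a real substitution matrix, and the three short aligned fragments and the names `substitution_score` and `similarity` are hypothetical.

```python
from itertools import combinations

# Toy substitution scores (assumption, not a real matrix such as BLOSUM):
# +2 for identical residues, -1 for a mismatch,
# -2 when exactly one side is a gap, 0 when both sides are gaps.
def substitution_score(a: str, b: str) -> int:
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 2 if a == b else -1

def similarity(seq_a: str, seq_b: str) -> int:
    """Sum the column scores of two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(substitution_score(a, b) for a, b in zip(seq_a, seq_b))

# Hypothetical aligned fragments, used only to exercise the functions.
aligned = {
    "s0": "GDAAKGEK",
    "s1": "GDVAKGKK",
    "s2": "-DGEALFK",
}

# Pairwise similarity scores; a higher score means a more similar pair.
scores = {
    (p, q): similarity(aligned[p], aligned[q])
    for p, q in combinations(aligned, 2)
}
print(scores)
```

With these toy scores, the pair (s0, s1) scores highest, so it would be the first pair merged by an agglomerative step.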
6
Agglomerative clustering Use the similarity scores to build the distance matrix. Merge the two clusters that have the highest similarity. Compute the similarity between the new cluster and the other clusters.
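The merge loop described on this slide might look like the sketch below. The slide does not say how the similarity between a merged cluster and the others is computed, so average linkage is assumed here; the function names and the 5 x 5 similarity matrix are made up for illustration.

```python
def average_similarity(sim, cluster_a, cluster_b):
    """Average pairwise similarity between two clusters (assumed linkage)."""
    total = sum(sim[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(sim, k):
    """Merge the most similar pair of clusters until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Start with every object in its own cluster.
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = average_similarity(sim, clusters[a], clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        # Merge cluster b into cluster a; similarities to the merged
        # cluster are recomputed on the next pass of the loop.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical symmetric similarity matrix for five objects.
sim = [
    [0, 9, 8, 1, 2],
    [9, 0, 7, 2, 1],
    [8, 7, 0, 1, 1],
    [1, 2, 1, 0, 6],
    [2, 1, 1, 6, 0],
]
result = agglomerate(sim, 2)
print(result)
```

This naive version rescans all cluster pairs on every merge, which is fine for a few dozen sequences but would need a cached similarity matrix for larger inputs.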
7
k-means partition algorithms [Figure: animated example with K = 3]
8
k-means clustering based on HMM k-means starts by arbitrarily assigning the sequences to clusters.
9
k-means clustering based on HMM Calculate the probability model (HMM profile) of each cluster.
10
k-means clustering based on HMM Calculate the probability of each sequence under every cluster model. Assign each sequence to the cluster with the highest probability. Repeat the model-building and assignment steps until the clusters no longer change.
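The build-model and reassign loop can be sketched as below. A full profile HMM is beyond a short example, so this sketch swaps in a much simpler model: a per-column residue frequency profile with pseudocounts, scored by log-probability. The function names, the toy sequences, and the starting assignment are all hypothetical, and sequences are assumed to be pre-aligned and to use only the symbols in `ALPHABET`.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def build_profile(seqs, length, pseudo=1.0):
    """Per-column log-probabilities with pseudocounts.

    Simplified stand-in for a profile HMM (assumption): columns are
    independent and there are no insert or delete states.
    """
    total = len(seqs) + pseudo * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        profile.append(
            {c: math.log((counts[c] + pseudo) / total) for c in ALPHABET}
        )
    return profile

def log_prob(seq, profile):
    """Log-probability of an aligned sequence under a column profile."""
    return sum(column[c] for c, column in zip(seq, profile))

def profile_kmeans(seqs, labels, k, max_iter=100):
    """Alternate model building and reassignment until labels are stable."""
    length = len(seqs[0])
    labels = list(labels)
    for _ in range(max_iter):
        profiles = [
            build_profile([s for s, l in zip(seqs, labels) if l == c], length)
            for c in range(k)
        ]
        new_labels = [
            max(range(k), key=lambda c: log_prob(s, profiles[c])) for s in seqs
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical aligned toy sequences and an arbitrary starting assignment.
seqs = ["ACDE", "ACDE", "ACDF", "WYWY", "WYWF", "WYWY"]
start = [0, 1, 0, 1, 0, 1]
final_labels = profile_kmeans(seqs, start, k=2)
print(final_labels)
```

On this easy toy input the loop recovers the two obvious groups from a mixed start. The result of such a procedure generally depends on the starting assignment.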
11
Example of protein clustering
0>CCPC50 --------QDGDAAKGEKEFN-K-CKACHMIQAPDGTDII-KGGKTGPNLYGVVGRKIASEEGFK-YGEGILEVAEKNPDLTWTEADLIEYVTDPKPWLVKMTDDK------GAKTKM--TFK---MGKNQA--DVVAFLAQNSPDAGGDGEAA---
1>CCRF2S --------QEGDPEAGAKAFN-Q-CQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGEGMKEAGA--KGLAWDEEHFVQYVQDPTKFLKEYTGDA------KAKGKM--TFK---LKKEADAHNIWAYLQQVAVRP----------
2>CCRF2C ----------GDAAKGEKEFN-K-CKTCHSIIAPDGTEIV-KGAKTGPNLYGVVGRTAGTYPEFK-YKDSIVALGA--SGFAWTEEDIATYVKDPGAFLKEKLDDK------KAKTGM--AFK---LAKGGE--DVAAYLASVVK------------
3>CCQF2R ---------EGDAAAGEKVSK-K-CLACHTFDQGGAN-------KVGPNLFGVFENTAAHKDNYA-YSESYTEMKA--KGLTWTEANLAAYVKNPKAFVLEKSGDP------KAKSKM--TFK---LTKDDEIENVIAYLKTLK-------------
4>CCQF2P ---------AGDAAVGEKIAKAK-CTACHDLNKGGPI-------KVGPPLFGVFGRTTGTFAGYS-YSPGYTVMGQ--KGHTWDDNALKAYLLDPKGYVQAKSGDP------KANSKM--IFR---LEKDDDVANVIAYLHTMK-------------
5>CCRF2P ----------QDAAKGEAVFK-Q-CMTCHRADKN----------MVGPALGGVVGRKAGTAAGFT-YSPLNHNSGE--AGLVWTQENIIAYLPDPNAYLKKFLTDKGQADKATGSTKM--TFK---LANDQQRKDVAAYLATLK-------------
6>CCRF2A --------AGDPDAGQKVFLK--CAACHKIGPGAKN-------GVGPSLNGVANRKAGQAEGFA-YSDANKN-----SGLTWDEATFKEYITAPQKKV--------------PGTKM--TFPG--LPNEADRDNIWAYLSQFKADGSK---------
7>CCRF2V ---------QDAASGEQVFKQ--CLVCHSIGPGAKN-------KVGPVLNGLFGRHSGTIEGFS-YSDANKN-----SGITWTEEVFREYIRDPKAKI--------------PGTKM--IFAG--IKDEQKVSDLIAYLKQFNADGSKK--------
8>CCHU ---------GDVEKGKKIFIMK-CSQCHTVEKGGKH-------KTGPNLHGLFGRKTGQAPGYS-YTAANKN-----KGIIWGEDTLMEYLENPKKYI--------------PGTKM--IFVG--IKKKEERADLIAYLKKATNE------------
9>CCCH ---------GDIEKGKKIFVQK-CSQCHTVEKGGKH-------KTGPNLHGLFGRKTGQAEGFS-YTDANKN-----KGITWGEDTLMEYLENPKKYI--------------PGTKM--IFAG--IKKKSERVDLIAYLKDATSK------------
10>CCBN ---------GDVAKGKKTFVQK-CAQCHTVENGGKH-------KVGPNLWGLFGRKTGQAEGYS-YTDANKS-----KGIVWNENTLMEYLENPKKYI--------------PGTKM--IFAG--IKKKGERQDLVAYLKSATS-------------
11>CCRD2 --------AGDPVKGEQVFKQ--CKICHQVGPTAKN-------GVGPEQNDVFGQKAGARPGFN-YSDAMKN-----SGLTWDEATLDKYLENPKAVV--------------PGTKM--VFVG--LKNPQDRADVIAYLKQLSGK------------
12>CCQF2M -----------ADAPPPAFNQ--CKACHSID-AGKN-------GVGPSLSGAYGRKVGLAPNYK-YSPAHLA-----SGMTIDDAMLTKYLANPKETI--------------PGNKMGAAFGG--LKNPADVAAVIAYLKTVK--------------
13>CCQF2F ------------ADAPTAFNQ--CKACHSIE-AGKN-------GVGPSLSGAYGRKVGLAPNYK-YSAAHLA-----SGMTIDEAMLTNYLANPKATI--------------PGNKMGASFGG--LKKPEDVKAVIEYLKTVK--------------
14>CCQFM2 ------------ADAPAGFTL--CKACHSVE-AGKN-------GVGPSLAGVYGRKAGTISGFK-FSDPHIK-----SGLTWDEPTLTKYLADPKTVI--------------PGNKM--VFAG--LKNPDDVKAVIEYLKTLK--------------
15>CCQFF2 ------------ADAPPAFGM--CKACHSVE-AGKN-------GVGPSLAGVYGRKAGTLAGFK-FSDPHAK-----SGLTWDEPTLTKYLADPKGVI--------------PGNKM--VFAG--LKNPADVAAVIAYLKSL---------------
16>CCRF2G ----GSAPPGDPVEGKHLFHTI-CILCHT-DIKGRN-------KVGPSLYGVVGRHSGIEPGYN-YSEANIK-----SGIVWTPDVLFKYIEHPQKIV--------------PGTKM--GYPG--QPDPQKRADIIAYLETLK--------------
17>CCQF2T ------------ADESALAQTKGCLACHNPEKKV----------VGPAYGWVAKKYAGQAGA-----------------EAKLVAKVMAGGQGVWAKQLG-----------AEIPM---PAN--NVTKEEATRLVKWVLSLKQIDYK----------
18>CCRFG2 ------------ATPAELATKAGCAVCHQPTAKG----------LGPSYQEIAKKYKGQAGA-----------------PALMAERVRKGSVGIFG----------------KLPMTPTPPA--RISDADLKLVIDWILKTP---------------
19>CCPS5A ------------EDPEVLFKNKGCVACHAIDTKM----------VGPAYKDVAAKFAGQAGA-----------------EAELAQRIKNGSQGVWG----------------PIPM---PPN--AVSDDEAQTLAKWVLSQK---------------
20>CCPS5F ------------EDGAALFKSKPCAACHTIDSKM----------VGPALKEVAAKNAGVKDA-----------------DKTLAGHIKNGTQGNWG----------------PIPM---PPN--QVTDAEALTLAQWVLSLK---------------
21>CCPS5S ------------QDGEALFKSKPCAACHSIDAKL----------VGPAFKEVAAKYAGQDGA-----------------ADLLAGHIKNGSQGVWG----------------PIPM---PPN--PVTEEEAKILAEWILSQK---------------
22>CCPS5M ------------ASGEELFKSKPCGACHSVQAKL----------VGPALKDVAAKNAGVDGA-----------------ADVLAGHIKNGSTGVWG----------------AMPM---PPN--PVTEEEAKTLAEWVLTLK---------------
23>CCPS5D ------------STGEELFKAKACVACHSVDKKL----------VGPAFHDVAAKYGAQGDG-----------------VAHITNSIKTGSKGNWG----------------PIPM---PPN--AVSPEEAKTLAEWIVTLK---------------
24>CCAV5 ------------ETGEELYKTKGCTVCHAIDSKL----------VGPSFKEVTAKYAGQAGI-----------------ADTLAAKIKAGGSGNWG----------------QIPM---PPN--PVSEAEAKTLAEWVLTHK---------------
25>CCSG6 ---------GDVAAGASVFSAN-CAACHMGGRNV----------IVAN--KTLSKSD--LAKYL-------------KGFDDDAVAAVAYQV--TN---------------GKNAM-PGFNG--RLSPKQIEDVAAYVVDQAEKGW-----------
26>CCAA6 ---------ADLAHGGQVFSAN-CASCHLGGRNV----------VNPA--KTLEKAD--LDEY-----------------GMASIEAITTQV--TN---------------GKGAM-PAFGA--KLSADDIEGVASYALDQSGKEW-----------
27>CCAU6 ---------IDINNGENIFTAN-CSACHAGGNNV----------IMPE--KTLKKDA--LADN-----------------KMVSVNAITYQV--TN---------------GKNAM-PAFGS--RLAETDIEDVANFVLTQSDKGWD----------
28>CCPR6 ---------ADLDNGEKVFSAN-CAACHAGGNNA----------IMPD--KTLKKDV--LEAN-----------------SMNTIDAITYQV--QN---------------GKNAM-PAFGG--RLVDEDIEDAANYVLSQSEKGW-----------
29>CCBF6 ---------ADIENGERIFTAN-CAACHAGGNNV----------IMPE--KTLKKDA--LEAN-----------------GMNAVSAITYQV--TN---------------GKNAM-PAFGG--RLSDSDIEDVANYVLSQSEQGWD----------
30>CCEG6 -------------GGADVFADN-CSTCHVNGGNV----------ISAG--KVLSKTA--IEEYL-------------D--GGYTKEAIEYQV--RN---------------GKGPM-PAWEG--VLSEDEIVAVTDYVYTQAGGAWANVS-------
31>CCPC54 --------AGDAAAGEDKIGT--CVACHGTDGQG----------LAPI----YPNLTGQSATYL-------------------ESSIKAYRDGQRKGG-------------NAALMTPMAQ---GLSDEDIADIAAYYSSQE---------------
32>CCPS4A2 --------LFRGGKIAEGMPA--CTGCHGSSPVG----------IATA---GFPHLGGQHATYV-------------------AKQLTDFREGTRNDD-------------GTKIMQSIAAI--KLSNKDIAAISSYIQGLH---------------
33>CCPSVM ----AASAGGGARSADDIIAKH-CNACHGAGVLG----------APKI--GDTAAWKERADHQG-------------GLDGILAKAISGI-----------------------NAM-PPKGTCADCSDDELREAIQKMSGL----------------
34>CCCF55 ---------YDAAAGKATYDAS-CAMCHKTGMMG----------APKV--GDKAAWAPHIAK---------------GMNVMVANSIKGY----KG---------------TKGMMPAKGGNPKLTDAQVGNAVAYMVGQSK---------------
35>CCPH55 AVTKADVEQYDLANGKTVYDAN-CASCHAAGIMG----------APKT--GTARKWNSRLPQ---------------GLATMIEKSVAGYEGEYRG---------------SKTFMPAKGGNPDLTDKQVGDAVAYMVNEVL---------------
36>CCDV53 ------------ADGAALYKS--CIGCHSADGG-----------KAMMTNAVKGKYSDEELK----------------ALADYMKAAMGSAKPVKGQ--------------GAEELYKMKG----YADGSYGGERKAMSKL----------------
12
Problem of k-means clustering [Figure: two example runs, labelled "First example" and "Second example", each listing per-sequence cluster reassignments (old -> new) over successive iterations. Label string shown under each example: 0000001111111111122222222333333444556]
13
Flowchart of the hybrid algorithm
14
Phase I: agglomerative clustering
Current_cluster_number:37
Cluster 0 : 0
Cluster 1 : 1
Cluster 2 : 2
Cluster 3 : 3
Cluster 4 : 4
Cluster 5 : 5
Cluster 6 : 6
Cluster 7 : 7
Cluster 8 : 8
Cluster 9 : 9
Cluster 10 : 10
Cluster 11 : 11
Cluster 12 : 12
Cluster 13 : 13
Cluster 14 : 14
Cluster 15 : 15
Cluster 16 : 16
Cluster 17 : 17
Cluster 18 : 18
Cluster 19 : 19
Cluster 20 : 20
Cluster 21 : 21
Cluster 22 : 22
Cluster 23 : 23
Cluster 24 : 24
Cluster 25 : 25
Cluster 26 : 26
Cluster 27 : 27
Cluster 28 : 28
Cluster 29 : 29
Cluster 30 : 30
Cluster 31 : 31
Cluster 32 : 32
Cluster 33 : 33
Cluster 34 : 34
Cluster 35 : 35
Cluster 36 : 36
Current_cluster_number:36
Cluster 0 : 0
Cluster 1 : 1
Cluster 2 : 2
Cluster 3 : 3
Cluster 4 : 4
Cluster 5 : 5
Cluster 6 : 6
Cluster 7 : 7
Cluster 8 : 8 9
Cluster 9 : 10
Cluster 10 : 11
Cluster 11 : 12
Cluster 12 : 13
Cluster 13 : 14
Cluster 14 : 15
Cluster 15 : 16
Cluster 16 : 17
Cluster 17 : 18
Cluster 18 : 19
Cluster 19 : 20
Cluster 20 : 21
Cluster 21 : 22
Cluster 22 : 23
Cluster 23 : 24
Cluster 24 : 25
Cluster 25 : 26
Cluster 26 : 27
Cluster 27 : 28
Cluster 28 : 29
Cluster 29 : 30
Cluster 30 : 31
Cluster 31 : 32
Cluster 32 : 33
Cluster 33 : 34
Cluster 34 : 35
Cluster 35 : 36
Current_cluster_number:35
Cluster 0 : 0
Cluster 1 : 1
Cluster 2 : 2
Cluster 3 : 3
Cluster 4 : 4
Cluster 5 : 5
Cluster 6 : 6
Cluster 7 : 7
Cluster 8 : 8 9 10
Cluster 9 : 11
Cluster 10 : 12
Cluster 11 : 13
Cluster 12 : 14
Cluster 13 : 15
Cluster 14 : 16
Cluster 15 : 17
Cluster 16 : 18
Cluster 17 : 19
Cluster 18 : 20
Cluster 19 : 21
Cluster 20 : 22
Cluster 21 : 23
Cluster 22 : 24
Cluster 23 : 25
Cluster 24 : 26
Cluster 25 : 27
Cluster 26 : 28
Cluster 27 : 29
Cluster 28 : 30
Cluster 29 : 31
Cluster 30 : 32
Cluster 31 : 33
Cluster 32 : 34
Cluster 33 : 35
Cluster 34 : 36
Merging repeats until k clusters remain (here k = 7).
Current_cluster_number:7
Cluster 0 : 0 1 2 3 4 5
Cluster 1 : 6 7 8 9 10 11 12 13 14 15 16
Cluster 2 : 17 18 19 20 21 22 23 24 36
Cluster 3 : 25 26 27 28 29 30
Cluster 4 : 31 32
Cluster 5 : 33
Cluster 6 : 34 35
15
Phase II: k-means clustering based on HMM profiles
Current_cluster_number:7
Cluster 0 : 0 1 2 3 4 5
Cluster 1 : 6 7 8 9 10 11 12 13 14 15 16
Cluster 2 : 17 18 19 20 21 22 23 24 36
Cluster 3 : 25 26 27 28 29 30
Cluster 4 : 31 32
Cluster 5 : 33
Cluster 6 : 34 35
Building HMM profile
Cluster 0 : 0 1 2 3 4 5
Cluster 1 : 6 7 8 9 10 11 12 13 14 15 16
Cluster 2 : 17 18 19 20 21 22 23 24 36
Cluster 3 : 25 26 27 28 29 30
Cluster 4 : 31 32
Cluster 5 : 33
Cluster 6 : 34 35
Result
Cluster 0 : 0 1 2 3 4 5
Cluster 1 : 6 7 8 9 10 11 12 13 14 15 16
Cluster 2 : 17 18 19 20 21 22 23 24 36
Cluster 3 : 25 26 27 28 29 30
Cluster 4 : 31 32
Cluster 5 : 33
Cluster 6 : 34 35
Answer:
Cluster 0 : 0 1 2 3 4 5
Cluster 1 : 6 7 8 9 10 11 12 13 14 15 16
Cluster 2 : 17 18 19 20 21 22 23 24
Cluster 3 : 25 26 27 28 29 30
Cluster 4 : 31 32 33
Cluster 5 : 34 35
Cluster 6 : 36
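One way to quantify how far the Result listing is from the Answer listing is to count the sequence pairs that the two clusterings treat differently. The sketch below uses the Rand index for this. That measure is not mentioned on the slides and is introduced here only as an illustration; the function names are hypothetical, and the cluster memberships are copied from the slide.

```python
from itertools import combinations

def labels_of(clusters):
    """Map each member to the index of the cluster that contains it."""
    return {m: idx for idx, cluster in enumerate(clusters) for m in cluster}

def disagreeing_pairs(clusters_a, clusters_b):
    """Pairs that are together in one clustering but apart in the other."""
    la, lb = labels_of(clusters_a), labels_of(clusters_b)
    return [
        (i, j)
        for i, j in combinations(sorted(la), 2)
        if (la[i] == la[j]) != (lb[i] == lb[j])
    ]

# Cluster memberships as listed under "Result" and "Answer" on the slide.
result = [
    list(range(0, 6)), list(range(6, 17)), list(range(17, 25)) + [36],
    list(range(25, 31)), [31, 32], [33], [34, 35],
]
answer = [
    list(range(0, 6)), list(range(6, 17)), list(range(17, 25)),
    list(range(25, 31)), [31, 32, 33], [34, 35], [36],
]

pairs = disagreeing_pairs(result, answer)
total_pairs = 37 * 36 // 2
rand_index = 1 - len(pairs) / total_pairs
print(len(pairs), round(rand_index, 4))
```

Every disagreeing pair involves sequence 33 or sequence 36, the two sequences that are placed differently in the Result and Answer listings.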
16
Summary The agglomerative clustering algorithm can be used to initialize (train) the partition clustering algorithm, which increases accuracy. Once HMM profiles have been built for all clusters, we can determine which family an unknown sequence belongs to by computing the score of the unknown sequence against every HMM profile.
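The classification step in the summary can be sketched as scoring an unknown sequence against one model per family and picking the best score. As a loudly labelled simplification, the models here are per-column frequency profiles with pseudocounts rather than real profile HMMs; the family names, sequences, and function names are hypothetical, and sequences are assumed to be aligned to the same columns and to use only the symbols in `ALPHABET`.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def column_log_probs(members, pseudo=1.0):
    """Per-column log-probabilities for one family (simplified profile)."""
    total = len(members) + pseudo * len(ALPHABET)
    profile = []
    for col in range(len(members[0])):
        counts = Counter(s[col] for s in members)
        profile.append(
            {c: math.log((counts[c] + pseudo) / total) for c in ALPHABET}
        )
    return profile

def score(seq, profile):
    """Log-probability of an aligned sequence under a family profile."""
    return sum(column[c] for c, column in zip(seq, profile))

def classify(unknown, families):
    """Return the best-scoring family name and all family scores."""
    scores = {
        name: score(unknown, column_log_probs(members))
        for name, members in families.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical families of aligned toy sequences.
families = {
    "family_a": ["ACDE", "ACDF"],
    "family_b": ["WYWY", "WYWF"],
}
best, all_scores = classify("ACDY", families)
print(best, all_scores)
```

The unknown sequence shares three of four columns with the first family, so that family scores highest.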
17
Future work We want to use the partition clustering algorithm in Phase II to correct the inaccurate output obtained from Phase I.
18
THE END Thank you !!