An efficient hybrid clustering algorithm for molecular sequences classification Wei-Bang Chen
Why classification? To classify biological sequences (DNA, RNA, and protein) into categories based on their characteristics and functions; this information can be used for disease treatment and drug design. To determine which family an unknown biological sequence belongs to; this information can be used to predict the characteristics and functions of the unknown sequence.
Agglomerative and Partition Algorithms
Agglomerative algorithm: hierarchical agglomerative algorithms find the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a certain stopping criterion is met.
Partition algorithm: partition clustering algorithms compute a k-way clustering of a set of objects either directly or via a sequence of repeated bisections (e.g., k-means, k-medoids).
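For orientation, here is a minimal sketch contrasting the two families on toy numeric data, assuming scikit-learn is available; the sequence-specific variants on the following slides replace Euclidean distance with substitution-matrix alignment scores and HMM-profile probabilities.

```python
# Minimal sketch: agglomerative vs. partition (k-means) clustering on toy
# numeric data with scikit-learn. Orientation only; the algorithm in this
# work operates on sequences, not feature vectors.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
# three well-separated point clouds
points = np.vstack([rng.normal(center, 0.3, size=(20, 2)) for center in (0.0, 3.0, 6.0)])

agglomerative_labels = AgglomerativeClustering(n_clusters=3).fit_predict(points)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

print(agglomerative_labels)
print(kmeans_labels)
```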
Agglomerative algorithms [figure: animation of the step-by-step merging of clusters, ending when the clustering is done]
Agglomerative clustering: use a substitution matrix to measure the similarity of two sequences. A higher score means the two sequences are more similar.
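A minimal sketch of substitution-matrix scoring, using a tiny made-up matrix (not real BLOSUM/PAM values) and assuming the two sequences are already aligned and gap-free; real scoring would also handle gaps.

```python
# Toy substitution matrix (illustrative values only, not BLOSUM/PAM).
TOY_MATRIX = {
    ("A", "A"): 4, ("C", "C"): 9, ("G", "G"): 6,
    ("A", "C"): -1, ("C", "A"): -1,
    ("A", "G"): 0, ("G", "A"): 0,
    ("C", "G"): -3, ("G", "C"): -3,
}

def similarity(seq1, seq2, matrix=TOY_MATRIX):
    # Sum the substitution score at every aligned position;
    # a higher total score means the two sequences are more similar.
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

print(similarity("ACG", "ACG"))  # 4 + 9 + 6 = 19
print(similarity("ACG", "GCA"))  # 0 + 9 + 0 = 9
```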
Agglomerative clustering: use the similarity scores to build the distance matrix. Merge the two clusters that have the highest similarity, then compute the similarity between the new cluster and the other clusters. Repeat until the stopping criterion is met.
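A minimal sketch of the agglomerative phase, assuming the pairwise similarity scores have already been computed (e.g. with the substitution-matrix scoring above) and using average linkage as an illustrative choice; the actual linkage rule used in this work may differ.

```python
# Agglomerative merging sketch: repeatedly merge the two most similar
# clusters until only k clusters remain.
from itertools import combinations

def agglomerate(similarity, n_sequences, k):
    clusters = [[i] for i in range(n_sequences)]

    def cluster_similarity(a, b):
        # Average pairwise similarity between members of the two clusters
        # (average linkage; an illustrative choice).
        pairs = [(i, j) for i in a for j in b]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > k:
        # Find the pair of clusters with the highest similarity ...
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]]))
        # ... and merge them into one cluster.
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Toy usage: 4 sequences with made-up pairwise similarity scores.
sim = {frozenset(p): s for p, s in {(0, 1): 9.0, (0, 2): 2.0, (0, 3): 1.0,
                                    (1, 2): 2.5, (1, 3): 1.5, (2, 3): 8.0}.items()}
print(agglomerate(sim, 4, 2))  # [[0, 1], [2, 3]]
```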
k-means partition algorithm [figure: example partition with K = 3]
k-means clustering based on HMM: k-means starts by arbitrarily assigning the sequences to clusters.
k-means clustering based on HMM: calculate the probability model (HMM profile) of each cluster.
k-means clustering based on HMM: calculate the probability of each sequence under every cluster model. Assign each sequence to the cluster whose model gives the highest probability. Repeat these two steps until the cluster memberships no longer change.
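A minimal sketch of the HMM-based k-means loop; build_profile() and log_prob() are hypothetical placeholders standing in for a real profile-HMM toolkit (such as HMMER), and the optional seeded start is an assumption used again in the hybrid sketch later.

```python
# HMM-based k-means sketch. build_profile(seqs) and log_prob(model, seq)
# are hypothetical helpers standing in for a real profile-HMM toolkit.
import random

def hmm_kmeans(sequences, k, build_profile, log_prob,
               initial_assignment=None, max_iter=100):
    # Start from a given assignment, or an arbitrary (random) one.
    if initial_assignment is None:
        assignment = [random.randrange(k) for _ in sequences]
    else:
        assignment = list(initial_assignment)

    for _ in range(max_iter):
        # 1) Build the probability model (HMM profile) of each cluster
        #    (empty clusters are not handled in this sketch).
        models = [build_profile([s for s, c in zip(sequences, assignment) if c == cid])
                  for cid in range(k)]
        # 2) Reassign every sequence to the model that scores it highest.
        new_assignment = [max(range(k), key=lambda cid: log_prob(models[cid], s))
                          for s in sequences]
        # Stop when the cluster memberships no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment
```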
Example of protein clustering [data: 37 aligned protein sequences with identifiers CCPC, CCRF2S, CCRF2C, CCQF2R, CCQF2P, CCRF2P, CCRF2A, CCRF2V, CCHU, CCCH, CCBN, CCRD, CCQF2M, CCQF2F, CCQFM, CCQFF, CCRF2G, CCQF2T, CCRFG, CCPS5A, CCPS5F, CCPS5S, CCPS5M, CCPS5D, CCAV, CCSG, CCAA, CCAU, CCPR, CCBF, CCEG, CCPC, CCPS4A, CCPSVM, CCCF, CCPH55, and CCDV; the full alignment is omitted here.]
Problem of k-means clustering: because the initial assignment is random, different runs can converge to different final clusterings. [Program traces of the cluster reassignments for two example runs (first example, second example) are omitted here.]
Flowchart of the hybrid algorithm
Phase I: agglomerative clustering. Each of the 37 sequences starts in its own cluster; at every step the two most similar clusters are merged (37 → 36 → 35 → ...) until k clusters remain (here k = 7). [Program trace of the intermediate cluster memberships is omitted.]
Phase II: k-means clustering based on HMM profiles. The 7 clusters from Phase I form the initial partition; an HMM profile is built for each cluster, every sequence is reassigned to the profile that scores it highest, and the final result is compared with the reference answer. [Program trace of the cluster memberships is omitted.]
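A minimal sketch of how the two phases fit together, reusing the agglomerate() and hmm_kmeans() sketches above; seeding the k-means phase with the Phase I partition, rather than a random start, is the hybrid idea shown in the flowchart.

```python
# Hybrid clustering sketch: Phase I (agglomerative) produces the initial
# partition that seeds Phase II (HMM-based k-means refinement).
def hybrid_cluster(sequences, similarity, k, build_profile, log_prob):
    # Phase I: merge clusters until k clusters remain.
    clusters = agglomerate(similarity, len(sequences), k)

    # Turn the list of clusters into a per-sequence assignment vector.
    assignment = [0] * len(sequences)
    for cluster_id, members in enumerate(clusters):
        for index in members:
            assignment[index] = cluster_id

    # Phase II: refine with HMM-based k-means, seeded by the Phase I result.
    return hmm_kmeans(sequences, k, build_profile, log_prob,
                      initial_assignment=assignment)
```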
Summary: the agglomerative clustering algorithm can be used to train (initialize) the partition clustering algorithm and thereby increase its accuracy. Once we have built an HMM profile for every cluster, we can determine which family an unknown sequence belongs to by computing the score of the unknown sequence against all HMM profiles.
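A minimal sketch of classifying an unknown sequence once the per-cluster HMM profiles exist; log_prob() is the same hypothetical scoring helper assumed above.

```python
# Classify an unknown sequence by scoring it against every cluster's
# HMM profile and returning the index of the best-scoring profile.
def classify(unknown_sequence, models, log_prob):
    return max(range(len(models)),
               key=lambda cid: log_prob(models[cid], unknown_sequence))
```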
Future work: we want to use the partition clustering algorithm in Phase II to correct inaccurate cluster assignments obtained from Phase I.
THE END Thank you !!