Published by Kaylyn Ridgely. Modified over 9 years ago.
1
March 2, 2004, BMI 731 - Biomedical Data Management
Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments
Use of Inexpensive Storage as Grid Cache
Umit Catalyurek, Mike Gray, Eric Stahlberg, Renato Ferreira, Tahsin Kurc, Joel Saltz
Department of Biomedical Informatics, The Ohio State University
Ohio Supercomputer Center
2
Outline
Multiple Sequence Alignment
CLUSTAL W
Sequence Analysis in Multiple Client Environment
–Caching Intermediate Results
–Deployment on SMP Machine
–Deployment on Distributed Memory Machine
Experimental Results
Conclusion
3
Sequence Alignment
An alignment is a mutual arrangement of two sequences
–It shows where the two sequences are similar, and where they differ

Sequence s:    AAT   AGCAA   AGCACACA
Sequence t:    TAA   ACATA   ACACACTA
Hamming Dist:  2     3       6
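The Hamming distances on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is an illustrative choice and does not come from the slides.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


# The three sequence pairs shown on the slide.
pairs = [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]
for s, t in pairs:
    print(s, t, hamming_distance(s, t))
```

Because the Hamming distance compares position by position, it cannot account for insertions or deletions, which is the motivation for the edit distance on the next slide.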
4
Edit Distance
Unit Cost:

s:  AGCACAC-A         AG-CACACA
t:  A-CACACTA   or    ACACACT-A
    cost 2            cost 4

distance(s, t) = 2
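The unit-cost edit distance is the minimum cost over all such alignments, and it can be computed with the standard dynamic programming recurrence. The sketch below is one common formulation (two rolling rows); the function name is illustrative and the slides do not show an implementation.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance via dynamic programming with two rolling rows."""
    prev = list(range(len(t) + 1))           # distance from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete a from s (gap in t)
                curr[j - 1] + 1,             # insert b into s (gap in s)
                prev[j - 1] + (a != b),      # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AGCACACA", "ACACACTA"))
```

For the slide's pair, the minimum corresponds to the first alignment (two gaps), while the second alignment is a valid but more expensive arrangement.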
5
Multiple Sequence Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

or

VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG

Optimal: $O(2^n \prod_i |s_i|)$
–6 sequences of length 100, if the constant is $10^{-9}$ seconds: running time $6.4 \times 10^4$ seconds
–Add 2 sequences: running time $2.6 \times 10^9$ seconds
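The two running-time figures follow directly from the bound when all sequences have the same length, since the product of the lengths becomes length^n. The helper below is a small arithmetic sketch; its name and parameters are illustrative, not from the slides.

```python
def optimal_msa_seconds(n: int, length: int, constant: float = 1e-9) -> float:
    """Estimate constant * 2^n * length^n seconds for n sequences of equal length."""
    return constant * (2 ** n) * (length ** n)


six = optimal_msa_seconds(6, 100)     # 1e-9 * 64 * 1e12
eight = optimal_msa_seconds(8, 100)   # 1e-9 * 256 * 1e16
print(f"{six:.1e} s, {eight:.2e} s")
```

With these inputs the estimates come to about 6.4 x 10^4 seconds (under 18 hours) for 6 sequences and about 2.56 x 10^9 seconds (roughly 81 years) for 8, which is why heuristic methods such as progressive alignment are used in practice.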
6
CLUSTAL W
Based on Higgins & Sharp CLUSTAL [Gene88]
Progressive alignment-based strategy
–Pairwise Alignment, $O(n^2 l^2)$
  A distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower)
–Computation of Guide Tree, $O(n^3)$: phylogenetic tree
  Computed from the distance matrix
  Built by iteratively selecting aligned pairs and linking them
–Progressive Alignment, $O(n l^2)$
  A series of pairwise alignments is computed using full dynamic programming to align larger and larger groups of sequences
  The order in the Guide Tree determines the ordering of sequence alignments
  At each step, either two sequences are aligned, or a new sequence is aligned with a group, or two groups are aligned
n: number of sequences in the query
l: average sequence length
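To make the guide-tree step concrete, the sketch below builds a tree from a distance matrix by repeatedly linking the closest pair of clusters. It swaps in plain average-linkage clustering for illustration and is not claimed to be the tree-building method CLUSTAL W itself uses; the function name, the dictionary layout of `dist`, and the example distances are all assumptions.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Build a nested-tuple guide tree by repeatedly linking the closest pair.

    names: list of unique sequence labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    # Each cluster is (tree, members); a leaf's tree is just its label.
    clusters = [(name, [name]) for name in names]

    def avg_dist(c1, c2):
        # Average-linkage distance between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Made-up distances for four hypothetical sequences A, B, C, D.
dist = {
    frozenset("AB"): 1, frozenset("AC"): 4, frozenset("AD"): 6,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 2,
}
print(guide_tree(["A", "B", "C", "D"], dist))
```

Reading the resulting tree from the leaves upward gives the order of the progressive alignment: first two sequence-to-sequence alignments (A with B, C with D), then one group-to-group alignment.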
7
Sequence Analysis in Multiple Client Environment
Many gene and protein databases can be accessed over the Internet
–Multiple requests by multiple clients
Data Caching
–Cache pairwise alignments
  Most expensive phase
  Computations are independent
8
Data Caching
Low-cost, high-performance, high-capacity commodity hardware
–Disks are cheap: 100GB EIDE disks cost around $250
–A PC costs around $700-$1000 (no monitor, no high-end graphics card, moderate size memory of 128MB-512MB)
–Switched fast ethernet
  Better performance with channel bonding
–In 2001: 6 Pentium III PCs, 1 TB of disk storage < $10,000
–In 2002: 5 Pentium 4 PCs, 2.5TB of disk storage < $9,000
–BMI Storage Cluster: 7.2TB, 24 PCs = $50,000-$55,000
–UMD Storage Cluster: 9.5 TB, 50 PCs
9
Caching Pairwise Alignment Scores
Sequence -> Unique ID (UID):
–Use a hash (10 hash functions were tested, including MD5; 4 of them give results similar to MD5)
–Resolve collisions and assign a UID to each sequence
  For more than 1 million sequences from GenBank, the maximum number of collisions per hash value was 3, so collision resolution takes constant time
For each pairwise alignment, store two UIDs and a float score
–B-Tree: the GiST B-Tree implementation was used
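The two ideas on this slide, mapping each sequence to a UID through a hash with collision resolution and storing one score per UID pair, can be sketched as follows. This is an illustration under stated assumptions: the class names are hypothetical, MD5 from Python's `hashlib` stands in for whichever hash function was finally chosen, and a plain dictionary stands in for the B-tree.

```python
import hashlib


class SequenceRegistry:
    """Assign a stable unique ID (UID) to each distinct sequence."""

    def __init__(self):
        self._buckets = {}    # hash value -> list of (sequence, uid)
        self._next_uid = 0

    def uid(self, sequence: str) -> int:
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        bucket = self._buckets.setdefault(digest, [])
        # Resolve collisions by comparing full sequences within the bucket.
        for stored, uid in bucket:
            if stored == sequence:
                return uid
        uid = self._next_uid
        self._next_uid += 1
        bucket.append((sequence, uid))
        return uid


class ScoreCache:
    """One float score per unordered UID pair (a dict stands in for the B-tree)."""

    def __init__(self):
        self._scores = {}

    @staticmethod
    def _key(uid1: int, uid2: int):
        # Order the pair so (a, b) and (b, a) share one entry.
        return (min(uid1, uid2), max(uid1, uid2))

    def get(self, uid1: int, uid2: int):
        return self._scores.get(self._key(uid1, uid2))

    def put(self, uid1: int, uid2: int, score: float) -> None:
        self._scores[self._key(uid1, uid2)] = float(score)
```

Keying the cache on small integer UIDs rather than on the sequences keeps each stored record at a fixed size of two integers and one float, regardless of sequence length.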
10
Sequence -> Unique ID (UID):
11
Deployment on SMP Machine
A hash table is used to associate a sequence with a unique integer ID (UID)
A partitioned B-tree stores pairwise alignment results
The cache partition is chosen by min(UID1, UID2) % #Partitions
Multiple threads are used for pairwise alignment computation
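The partition rule is simple enough to show directly. In the sketch below, `cache_partition` implements the formula from the slide, while `group_pairs_by_partition` is a hypothetical helper added only to show how the pairs of one query spread across partitions.

```python
from collections import defaultdict
from itertools import combinations


def cache_partition(uid1: int, uid2: int, num_partitions: int) -> int:
    """Pick the cache partition for a sequence pair: min(UID1, UID2) % #Partitions."""
    return min(uid1, uid2) % num_partitions


def group_pairs_by_partition(uids, num_partitions):
    """Group every unordered UID pair of a query by the partition that holds it."""
    groups = defaultdict(list)
    for u, v in combinations(sorted(uids), 2):
        groups[cache_partition(u, v, num_partitions)].append((u, v))
    return dict(groups)


print(cache_partition(12, 7, 4))
print(group_pairs_by_partition([0, 1, 2, 3], 2))
```

Because the rule uses the smaller UID of the pair, the result does not depend on the order in which the two sequences are given, so a lookup always goes to the same partition.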
12
DataCutter
Component Framework for Combined Task/Data Parallelism
Core Services
–Indexing Service: multilevel hierarchical indexes based on the R-tree indexing method
–Filtering Service: distributed C++ component framework
User defines a sequence of pipelined components (filters and filter groups)
–Pleasingly Parallel
–Generalized Reduction
A user directive tells the preprocessor/runtime system to generate and instantiate copies of filters
Stream-based communication
Multiple filter groups can be active simultaneously
Flow control between transparent filter copies
–Replicated individual filters
–Transparent: single stream illusion
http://www.datacutter.org
13
Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v1
Hash Filter
–Stores/computes the sequence to unique ID mapping
–Partitioned (declustered) hash
Cache Filter
–Partitioned (declustered) cache
–Computes a pairwise alignment if it doesn't exist in the cache
  Owner computes: computational imbalance
CLUSTALW Filter
–Performs guide tree generation and progressive alignment
Diagram labels: CLUSTALW, Hash (UniqueID), Cache & Compute
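The compute-on-miss behaviour of the Cache Filter can be sketched in plain Python. This is not DataCutter code: the class name, the `align` callback, and the toy scoring function are hypothetical stand-ins used only to show the lookup-then-compute pattern.

```python
class CacheFilter:
    """Return a cached pairwise score, computing and storing it on a miss."""

    def __init__(self, align):
        self._align = align    # function (seq1, seq2) -> float score
        self._scores = {}      # (low uid, high uid) -> score
        self.misses = 0

    def score(self, uid1, seq1, uid2, seq2):
        key = (min(uid1, uid2), max(uid1, uid2))
        if key not in self._scores:
            # Miss: in v1 the owner of the cache entry does the computation itself.
            self.misses += 1
            self._scores[key] = self._align(seq1, seq2)
        return self._scores[key]


def toy_align(a, b):
    """Stand-in scoring function (mismatch count), not a real pairwise alignment."""
    return float(sum(x != y for x, y in zip(a, b)))


cache = CacheFilter(toy_align)
first = cache.score(0, "AGCAA", 1, "ACATA")    # miss: computed and stored
second = cache.score(1, "ACATA", 0, "AGCAA")   # hit: same pair, reversed order
print(first, second, cache.misses)
```

Since the node that owns an entry also computes it, a query whose pairs map mostly to one partition keeps that node busy while others idle, which is the computational imbalance noted on the slide.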
14
Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2
DC-ClustalW-v1 + Separate Pairwise Alignment Filter
–Cache misses computed in Pairwise Align
–Balanced computation
Handles multiple queries
–Multiple copies of CLUSTALW filter
Diagram labels: CLUSTALW, Hash (UniqueID), Cache, Pairwise Align
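The difference between owner-computes (v1) and a separate Pairwise Alignment filter (v2) can be illustrated by counting how many cache misses each node would have to compute. The sketch rests on two assumptions that the slides do not state: that v1 places pairs with the min(UID1, UID2) % #nodes rule given for the SMP deployment, and that v2 hands misses to the alignment filter copies in round-robin order. The function names and the example UIDs are made up.

```python
from collections import Counter
from itertools import combinations


def owner_computes_load(pairs, num_nodes):
    """v1-style: the node owning the cache partition also computes each miss."""
    return Counter(min(u, v) % num_nodes for u, v in pairs)


def separate_aligner_load(pairs, num_nodes):
    """v2-style (assumed round-robin): misses are spread over alignment filter copies."""
    return Counter(i % num_nodes for i, _ in enumerate(pairs))


# Hypothetical query whose UIDs all happen to be multiples of 4.
pairs = list(combinations([0, 4, 8, 12, 16], 2))    # 10 cache misses
print(owner_computes_load(pairs, 4))
print(separate_aligner_load(pairs, 4))
```

In this contrived case every miss falls on node 0 under owner-computes, while the round-robin hand-off spreads the same ten alignments over all four nodes.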
15
Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2
Multiple Query Processing
–QM: QueryManager Filter
–CW: ClustalW Filter
–H: Hash Filter
–C: Cache Filter
–P: Pairwise Alignment Filter
Diagram: QM on Host-0; CW on Host-1 through Host-n; H, C, P on Host-n+1 through Host-2n
16
Experimental Setup
1. Pentium III 650 MHz, 768MB Memory
   1000 random sequences from GPCR
   Average length 450 amino acids per sequence
2. 24-Processor Sun Fire 6800, 750MHz, 24GB Memory
   350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query
3. 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
   64 queries, each consisting of 40 unique protein sequences from GPCR
   Average length 450 amino acids per sequence
17
Experiment 1 – Execution Time of CLUSTAL W
Pentium III 650 MHz, 768MB Memory
1000 random sequences from GPCR
Average length 450 amino acids per sequence
18
Experiment 2 - SMP Results
24-Processor Sun Fire 6800, 750MHz, 24GB Memory
350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query
19
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v1
16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
64 queries, each consisting of 40 unique protein sequences from GPCR
Average length 450 amino acids per sequence
20
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v1
16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
64 queries, each consisting of 40 unique protein sequences from GPCR
Average length 450 amino acids per sequence
21
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v2
1 ClustalW filter: intra-query parallelization
16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
64 queries, each consisting of 40 unique protein sequences from GPCR
Average length 450 amino acids per sequence
22
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v2
Multiple ClustalW filters: inter-query parallelization
16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
–8 running a copy of Hash, Cache and PairAlign, 8 running ClustalW
64 queries, each consisting of 40 unique protein sequences from GPCR
Average length 450 amino acids per sequence
23
Conclusion
Caching intermediate results
–Computationally intensive application -> data intensive application
SMP
Distributed Memory implementation with DataCutter