Improving performance of Multiple Sequence Alignment in Multi-client Environments
Aaron Zollman
CMSC 838T Presentation

Overview

- Overview of talk
  - CLUSTALW algorithm and speedup opportunities
  - Problems with caching
  - Parallelizing technique
  - Weaknesses
  - Applying the technique to other bioinformatics problems

Motivation

- Overlap among queries submitted to MSA tools
  - Single researcher: new sequences vs. a database
  - Multiple researchers: similar subsets
- CLUSTALW: a progressive algorithm
  - Three steps
  - Progressive refinement
- Opportunities for speedup
  - Caching
  - Query ordering

CLUSTALW: Progressive global alignment

- Step 1: Pairwise alignment, distance matrix (sketch below)
  - A fast technique calculates the distance between two sequences
  - Calculated for all sequence pairs
  - Cost: O(q²l²)
- Step 2: Guide tree
  - Group nearest sequences first
  - Build the tree sequentially
  - Cost: O(q³)
- Step 3: Progressive alignment
  - Align, starting at the leaves of the tree
  - Cost: O(ql²)

* q sequences of mean length l
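
To make the costs concrete, here is a minimal Python sketch of step 1 (not from the paper): a plain edit-distance dynamic program stands in for CLUSTALW's fast pairwise scoring, and the loop over all pairs makes the O(q²) × O(l²) = O(q²l²) structure explicit.

    from itertools import combinations

    def pairwise_distance(a: str, b: str) -> float:
        # Classic O(l^2) edit-distance dynamic program, used here only as a
        # simplified stand-in for CLUSTALW's fast pairwise scoring.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution / match
            prev = curr
        return prev[-1] / max(len(a), len(b), 1)            # normalize to [0, 1]

    def distance_matrix(seqs):
        # Step 1: O(q^2) pairs, each costing O(l^2), giving O(q^2 l^2) overall.
        return {(i, j): pairwise_distance(seqs[i], seqs[j])
                for i, j in combinations(range(len(seqs)), 2)}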

Optimization: Query caching

- Step 1: Pairwise alignment, building the distance matrix
  - Many requests are partially duplicated
  - An individual distance calculation does not depend on the rest of the query
  - Observation: the dominant step in execution time
- Steps 2, 3: Output depends on the results of the entire query
  - Results less reusable
- Technique: cache the output of step 1 (individual distances; sketch below)

[Slide diagram: Query 1 and Query 2 share sequences (MLI…, GIS…), so their cached pairwise distances can be reused across queries.]
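
A minimal sketch of the caching idea, reusing the pairwise_distance stand-in from the previous sketch (an in-memory dict here; the paper's cache is persistent and disk-backed):

    from itertools import combinations

    class DistanceCache:
        # Maps an unordered pair of sequences to its cached distance.
        def __init__(self):
            self._store = {}

        def distance(self, a: str, b: str) -> float:
            key = frozenset((a, b))          # order-independent pair key
            if key not in self._store:       # compute only on a cache miss
                self._store[key] = pairwise_distance(a, b)
            return self._store[key]

    def cached_distance_matrix(seqs, cache):
        # Step 1 with caching: pairs already seen in earlier queries are reused.
        return {(i, j): cache.distance(seqs[i], seqs[j])
                for i, j in combinations(range(len(seqs)), 2)}

Because each pair's distance is independent of the rest of the query, a cache hit is valid for any later query containing the same two sequences; the guide tree and progressive alignment (steps 2 and 3) depend on the whole query and offer no such per-item reuse.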

Challenges to cache implementation

- I/O and filesystem overhead
  - Large cache vs. the 2GB file-size limit
  - High seek times within a single file
- Search and insertion overhead
  - Sequences make lengthy keys
  - Keyed on each pair of sequences

Technique: 2-level B-Tree cache

- Level 1: Map sequence text to a sequence ID (sketch below)
  - Hash of the sequence? A sequentially assigned number
  - Cache size: O(ql)
- Level 2: Map ID pairs to the calculated distance
  - Concatenate the IDs from level 1
  - The lower level-1 ID forms the upper half of the level-2 key
  - Cache size: O(q²)
- Distribute the level-2 cache across bins
  - Round-robin or block allocated
  - Distribute bins across machines

* q sequences of mean length l
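
A minimal in-memory sketch of the two-level scheme. Assumptions not in the paper: 32-bit sequence IDs, sequential ID assignment, and a simple modulo bin assignment standing in for round-robin or block allocation; plain dicts stand in for the on-disk B-Trees.

    class TwoLevelCache:
        def __init__(self, n_bins: int = 4):
            self.level1 = {}                             # sequence text -> small integer ID
            self.bins = [dict() for _ in range(n_bins)]  # level 2: packed ID pair -> distance

        def _seq_id(self, seq: str) -> int:
            # Level 1 lookup; an unseen sequence gets the next sequential ID.
            return self.level1.setdefault(seq, len(self.level1))

        def _pair_key(self, a: str, b: str) -> int:
            lo, hi = sorted((self._seq_id(a), self._seq_id(b)))
            return (lo << 32) | hi                       # lower ID -> upper half of the key

        def _bin(self, key: int) -> dict:
            return self.bins[key % len(self.bins)]       # modulo bin assignment

        def get(self, a: str, b: str):
            key = self._pair_key(a, b)
            return self._bin(key).get(key)               # None on a miss

        def put(self, a: str, b: str, dist: float) -> None:
            key = self._pair_key(a, b)
            self._bin(key)[key] = dist

Keying level 2 on short fixed-width integer pairs rather than full sequence text is what keeps the B-Tree keys small, addressing the lengthy-key problem from the previous slide.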

SMP

- Parallelizable: pairwise searches are performed independently (sketch below)
  - Farmed out to query threads

[Slide diagram: a web server dispatches work to query threads, which use per-machine level-1 maps and distributed level-2 bins.]
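
A minimal sketch of farming the independent pairwise lookups out to a pool of query threads, reusing the DistanceCache sketch above (cache coherence is deliberately ignored here; the next slide discusses it):

    from concurrent.futures import ThreadPoolExecutor
    from itertools import combinations

    def parallel_distance_matrix(seqs, cache, n_threads: int = 4):
        # Each pairwise distance is independent, so pairs can simply be
        # handed to worker threads in any order.
        pairs = list(combinations(range(len(seqs)), 2))
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            dists = list(pool.map(lambda p: cache.distance(seqs[p[0]], seqs[p[1]]), pairs))
        return dict(zip(pairs, dists))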

SMP

- Challenge: cache coherence
  - Read-only cache?
    - Requires advance knowledge of query details
  - Online update and serialization?
    - Locking, duplicate updates
  - Offline updates? (sketch below)
    - Per-thread list of cache changes
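
A minimal sketch of the offline-update option (names are illustrative, not from the paper): during a batch each worker reads the shared cache freely but records new distances only in a private change log, so the hot path needs no locks; the logs are merged afterwards, and duplicate updates simply overwrite each other with the same value.

    from concurrent.futures import ThreadPoolExecutor
    from itertools import combinations

    def _worker(chunk, seqs, shared, compute):
        # Read the shared cache, but write only to a private per-thread log.
        log = {}
        for i, j in chunk:
            if (i, j) not in shared and (i, j) not in log:
                log[(i, j)] = compute(seqs[i], seqs[j])
        return log

    def batch_with_offline_updates(seqs, shared, compute, n_threads: int = 4):
        pairs = list(combinations(range(len(seqs)), 2))
        chunks = [pairs[t::n_threads] for t in range(n_threads)]
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            logs = list(pool.map(lambda c: _worker(c, seqs, shared, compute), chunks))
        for log in logs:            # offline step: fold per-thread logs into the shared cache
            shared.update(log)
        return shared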

Evaluation: Implementation

- Public B-Tree implementation: the GiST library
- First evaluation on an Intel PC (Pentium III 650 MHz, 75GB disks)
  - q = … sequences; l = 450 amino acids per sequence
- Second evaluation on a Sun Fire 6800 (48 × 750MHz CPUs, 48GB main memory)
  - q = … sequences; l = 417 amino acids per sequence
  - Seeded the cache with dummy values
- Future work: architectural impact

Evaluation: Results

Observations

- Simple technique
  - Cheap and easy to implement
  - Cheap and easy to deploy
  - Unsupported claim: are the queries really similar?
- Concerns about distribution across processors
  - The paper mentions latency and workload balancing
  - Also the reliability of distributed bins
  - Cache lifetimes?
- Proposed solution: a "component-based system"
  - "Hand-wavy"; would like to see more