Improving performance of Multiple Sequence Alignment in Multi-client Environments Aaron Zollman CMSC 838 Presentation.


1 Improving performance of Multiple Sequence Alignment in Multi-client Environments Aaron Zollman CMSC 838 Presentation

2 CMSC 838T – Presentation Overview
- Overview of talk
  - CLUSTALW algorithm, speedup opportunities
  - Problems with caching
  - Parallelizing technique
  - Weaknesses
  - Applying technique to other bioinformatics problems

3 Motivation
- Overlap among queries submitted to MSA tools
  - Single researcher: new sequences vs. database
  - Multiple researchers: similar subsets
- CLUSTALW: Progressive algorithm
  - Three steps
  - Progressive refinement
- Opportunities for speedup
  - Caching
  - Query ordering

4 CLUSTALW: Progressive global alignment
- Step 1: Pairwise alignment, distance matrix
  - Fast technique calculates distance between two sequences
  - Calculated for all sequence pairs
  - Cost: O(q^2 l^2)
- Step 2: Guide tree
  - Group nearest first
  - Build tree sequentially
  - Cost: O(q^3)
- Step 3: Progressive alignment
  - Align, starting at leaves of tree
  - Cost: O(q l^2)
* q = number of sequences, l = mean sequence length
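Step 1 can be sketched as an all-pairs loop. This is a minimal illustration, not CLUSTALW's code: `toy_distance` is a hypothetical stand-in for the fast pairwise technique (here, the fraction of mismatched positions over the shorter sequence), and the sequences are made up. The point is only that every one of the q(q-1)/2 pairs is scored independently of the others.

```python
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for the pairwise scoring step:
    fraction of mismatched positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / n


def distance_matrix(seqs):
    """Score every unordered pair once; q sequences give q*(q-1)/2 pairs."""
    q = len(seqs)
    dist = [[0.0] * q for _ in range(q)]
    for i, j in combinations(range(q), 2):
        d = toy_distance(seqs[i], seqs[j])
        dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


# Made-up toy sequences, not taken from the talk.
seqs = ["MLISH", "MLISR", "GISRE"]
matrix = distance_matrix(seqs)
```

The guide tree (step 2) and progressive alignment (step 3) would consume `matrix`; they are omitted here because they depend on the whole query.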

5 Optimization: Query caching
- Step 1: Pairwise alignment, building distance matrix
  - Many requests partially duplicated
  - Individual distance calculation not dependent on rest of query
  - Observation: Dominant step in execution time
- Steps 2, 3:
  - Output dependent on results of entire query
  - Results less reusable
- Technique: cache output of step 1
  - Individual distances
[Figure: distance tables for two queries. Query 1: MLISHSDLNQ…, GISRETSS…, QPAKKTYTW…; Query 2: MLISHSDLNQ…, GISRETSS…, MSTVTKYFYKGE…. Only 0.0 entries are recoverable.]
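The caching idea can be sketched with an in-memory dictionary keyed on the unordered sequence pair. This is an assumption-laden toy, not the implementation from the talk: `DistanceCache`, `toy_distance` (absolute length difference), and the query strings are all hypothetical. The strings reuse only the truncated prefixes visible on the slide and are not full sequences.

```python
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical placeholder for the real pairwise distance."""
    return float(abs(len(a) - len(b)))


class DistanceCache:
    """Caches step-1 output: one distance per unordered sequence pair."""

    def __init__(self, distance_fn):
        self._distance_fn = distance_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, a: str, b: str) -> float:
        # Order the pair so (a, b) and (b, a) share one entry.
        key = (a, b) if a <= b else (b, a)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._distance_fn(a, b)
        return self._store[key]


def pairwise_distances(seqs, cache):
    """Step 1 for one query, answered from the cache where possible."""
    return {(a, b): cache.get(a, b) for a, b in combinations(seqs, 2)}


cache = DistanceCache(toy_distance)
query1 = ["MLISHSDLNQ", "GISRETSS", "QPAKKTYTW"]     # toy strings
query2 = ["MLISHSDLNQ", "GISRETSS", "MSTVTKYFYKGE"]  # shares one pair with query1
pairwise_distances(query1, cache)
pairwise_distances(query2, cache)
print(cache.hits, cache.misses)
```

Because each distance depends only on its own pair, the pair shared by the two queries is computed once and served from the cache the second time.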

6 Challenges to cache implementation
- I/O and filesystem overhead
  - Large cache vs. 2GB file size limit
  - High seek times within single file
- Search and insertion overhead
  - Sequence: lengthy key
  - Keyed on each pair of sequences

7 Technique: 2-level B-Tree cache
- Level 1: Map sequence text to sequence ID
  - Hash of sequence?
  - Sequentially assigned number
  - Cache size: O(q l)
- Level 2: Map ID pairs to calculated distance
  - Concatenate IDs from level 1
  - Lower Level 1 ID -> upper half of Level 2 key
  - Cache size: O(q^2)
- Distribute level 2 cache across bins
  - Round robin or block allocated
  - Distribute bins across machines
* q = number of sequences, l = mean sequence length
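The two-level key scheme can be sketched as follows. Several things here are assumptions made for illustration: plain dictionaries stand in for the B-Trees, `ID_BITS = 32` is an arbitrary ID width, and bin selection by `key % num_bins` is only one possible reading of "round robin". The class and method names are hypothetical.

```python
ID_BITS = 32  # assumed width of a level-1 ID; the slides do not specify one


class TwoLevelCache:
    """Dict-based stand-in for the 2-level B-Tree cache."""

    def __init__(self, num_bins: int = 4):
        self.seq_to_id = {}                            # level 1: text -> ID
        self.bins = [dict() for _ in range(num_bins)]  # level 2, split in bins

    def seq_id(self, seq: str) -> int:
        """Assign sequential IDs the first time a sequence is seen."""
        if seq not in self.seq_to_id:
            self.seq_to_id[seq] = len(self.seq_to_id)
        return self.seq_to_id[seq]

    def pair_key(self, a: str, b: str) -> int:
        """Concatenate IDs: the lower ID fills the upper half of the key."""
        lo, hi = sorted((self.seq_id(a), self.seq_id(b)))
        return (lo << ID_BITS) | hi

    def _bin_for(self, key: int) -> dict:
        # Assumed allocation rule: spread keys over bins by remainder.
        return self.bins[key % len(self.bins)]

    def put(self, a: str, b: str, distance: float) -> None:
        key = self.pair_key(a, b)
        self._bin_for(key)[key] = distance

    def get(self, a: str, b: str):
        """Return the cached distance, or None on a miss.
        Note: a lookup also assigns IDs to unseen sequences."""
        key = self.pair_key(a, b)
        return self._bin_for(key).get(key)


two_level = TwoLevelCache(num_bins=4)
two_level.put("AAA", "CCC", 0.5)          # toy sequences get IDs 0 and 1
print(two_level.get("CCC", "AAA"))        # same entry regardless of order
```

Keying level 2 on short integer IDs rather than on sequence text is what keeps the pair key small; in a distributed setting each entry of `bins` would live on a different machine.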

8 SMP
- Parallelizable:
  - Pairwise searches performed independently
  - Farmed out to query threads
[Diagram labels: Web server; four Query Threads; Level 1 maps (per-machine); Level 2 bins (distributed)]
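Farming independent pairwise computations out to worker threads might look like the sketch below. It uses Python's standard `concurrent.futures.ThreadPoolExecutor`; `toy_distance` and `parallel_distances` are hypothetical names, and the distance is a placeholder. In CPython, threads would not speed up a CPU-bound distance function, so this only illustrates the work split.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(pair):
    """Hypothetical placeholder distance for one (a, b) pair."""
    a, b = pair
    return float(abs(len(a) - len(b)))


def parallel_distances(seqs, workers: int = 4):
    """Each pair is independent, so pairs can go to any worker."""
    pairs = list(combinations(seqs, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input pairs.
        results = list(pool.map(toy_distance, pairs))
    return dict(zip(pairs, results))


distances = parallel_distances(["AA", "CCCC", "G"])  # toy sequences
```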

9 SMP
- Challenge: Cache coherence
  - Read-only?
    - Requires advance knowledge of query details
  - Online update and serialization?
    - Locking, duplicate updates
  - Offline updates?
    - Per-thread list of cache changes
[Diagram labels: four Query Threads; Level 1 maps (per-machine); Level 2 bins (distributed)]
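The "offline updates" option can be sketched like this: while queries run, the shared cache is only read, and each query thread records its misses in a private change list; the lists are merged after the threads finish. All names (`run_query`, `merge_offline`, `toy_distance`) and the toy sequences are hypothetical, and a dictionary stands in for the real cache.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical placeholder for the real pairwise distance."""
    return float(abs(len(a) - len(b)))


def run_query(seqs, shared_cache):
    """Read the shared cache without locking; keep misses in a local list."""
    pending = {}
    for a, b in combinations(sorted(seqs), 2):  # sorted, so a <= b in each key
        key = (a, b)
        if key not in shared_cache and key not in pending:
            pending[key] = toy_distance(a, b)
    return pending


def merge_offline(shared_cache, change_lists):
    """Apply per-thread change lists once no query is running.
    Entries already present (duplicate updates) are skipped."""
    added = 0
    for changes in change_lists:
        for key, value in changes.items():
            if key not in shared_cache:
                shared_cache[key] = value
                added += 1
    return added


shared_cache = {}
queries = [["AAA", "CCCC", "GGGGG"], ["AAA", "CCCC", "TTTTTT"]]  # toy data
with ThreadPoolExecutor(max_workers=2) as pool:
    change_lists = list(pool.map(lambda q: run_query(q, shared_cache), queries))
added = merge_offline(shared_cache, change_lists)
```

The trade-off is visible in the toy run: both threads compute the pair they have in common, because neither sees the other's pending changes, but no locking is needed while queries execute.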

10 Evaluation: Implementation
- Public B-Tree implementation: GIST library
- First evaluation on Intel PC
  - (Pentium III 650, 75GB disks)
  - q = 25-1000 sequences
  - l = 450 amino acids per sequence
- Second evaluation on Sun Fire
  - (Sun Fire 6800, 48 x 750MHz CPUs, 48GB main memory)
  - l = 417 amino acids per sequence
  - q = 2-200 sequences
  - Seeded cache with dummy values
- Future work: architectural impact

11 Evaluation: Results

12 Observations
- Simple technique
  - Cheap and easy to implement
  - Cheap and easy to deploy
  - Unsupported claim: Are queries really similar?
- Concern about distribution across processors
  - Paper mentions latency, workload balancing
  - Also reliability of distributed bins
  - Cache lifetimes?
- Proposed solution: “component-based system”
  - “Hand-wavey”; would like to see more.

