Encyclopedia of Life as a Target VGrADS Application

1 Encyclopedia of Life as a Target VGrADS Application
A. Chien, H. Casanova, Y.-S. Kee, R. Huang, F. Berman
VGrADS Workshop, Sept. 2004

2 Outline
Description of EOL
What we can use for VGrADS
Current VGrADS work on EOL

3 The EOL Application: Goal
For each of 800+ complete or partial publicly available genomes, define protein annotation and model 3D structure.
Discover the what and how of the proteins associated with each genome.
Use public domain software for the core computation.
Web-accessible archival data.

4 EOL Overview
Diagram labels: NCBI, Fold Library, structure analysis tools, EOL.

5 EOL Overview
Fold Library: a library of known protein structures, based on lab work. Info from PDB, SCOP, PDP. Calculated using CE (Ilya Shindyalov, SDSC).
Diagram labels: NCBI, Fold Library, structure analysis tools, EOL.

6 EOL Overview
NCBI Non-Redundant DB: ~1.5 M sequences, ~1000 species, doubling every 2 years.
Diagram labels: NCBI, Fold Library, structure analysis tools, EOL.

7 EOL Overview
Structure analysis tools: structure annotation tools (wublast, psiblast, 123D+) and filters (tmhmm, psort, signalp, coils).
Diagram labels: NCBI, sequences, Fold Library, possible folds, structure analysis tools, predicted structures, EOL.

8 EOL Overview
Structure analysis tools: structure annotation tools (wublast, psiblast, 123D+) and filters (tmhmm, psort, signalp, coils).
Diagram labels: NCBI, sequences, Fold Library, possible folds, iGAP, structure analysis tools, predicted structures, EOL.

9 EOL: core computation
psiblast and 123D are the two computations that take relatively significant time.
Pipeline, as timed on a 2 GHz Pentium: a few KB in, psiblast run 3 times (8 minutes), a few KB passed to 123D (45 minutes), a few KB out.

10 EOL: core computation
Mostly considered as a single task because the software can run everywhere and the databases are everywhere (although there are a few nuances).
Diagram: a single EOL task with a few KB of data.

11 EOL: core computation
Mostly considered as a single task because the software can run everywhere and the databases are everywhere (although there are a few nuances).
Diagram: many copies of the EOL task, labeled Maximum Throughput.
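The "single task, maximum throughput" view can be sketched as a bag of independent tasks, one per sequence. This is only an illustration: `annotate` and `run_bag_of_tasks` are hypothetical names, and the two string-building steps are stand-ins for the real psiblast and 123D invocations, which are not shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(sequence: str) -> dict:
    """Fused per-sequence task (hypothetical stand-ins, not the real tools)."""
    profile = f"profile({sequence})"  # stand-in for the psiblast step
    fold = f"fold({profile})"         # stand-in for the 123D step
    return {"sequence": sequence, "fold": fold}


def run_bag_of_tasks(sequences, workers=4):
    """Run every sequence as an independent task on a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order; tasks never communicate with each other
        return list(pool.map(annotate, sequences))


if __name__ == "__main__":
    for result in run_bag_of_tasks(["MKV", "GAT", "LLS"]):
        print(result)
```

Because no task depends on another, throughput is limited only by how many workers are available, which is the point the slide makes.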

12 Computational Requirements
To process the current NCBI NR database:
1.5 M sequences (and growing)
7 structure annotation programs (with more to come)
=> estimated 1.8 M CPU hours (~200 years)
Must handle quarterly NR database releases: changed sequences, new species
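As a rough check of the estimate above, the conversion from CPU hours to years can be worked out directly. The helper names and the 1000-CPU pool below are assumptions for illustration; only the 1.8 M CPU hours and 1.5 M sequences come from the slide.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years


def cpu_years(cpu_hours: float) -> float:
    """Time needed if a single CPU did all the work."""
    return cpu_hours / HOURS_PER_YEAR


def wall_clock_days(cpu_hours: float, n_cpus: int) -> float:
    """Ideal wall-clock time with perfect parallelism over n_cpus."""
    return cpu_hours / n_cpus / 24


TOTAL_CPU_HOURS = 1.8e6  # from the slide
N_SEQUENCES = 1.5e6      # from the slide

print(f"{cpu_years(TOTAL_CPU_HOURS):.0f} CPU years")
print(f"{TOTAL_CPU_HOURS / N_SEQUENCES * 60:.0f} CPU minutes per sequence on average")
print(f"{wall_clock_days(TOTAL_CPU_HOURS, 1000):.0f} days on a hypothetical 1000-CPU pool")
```

The single-CPU figure comes out at about 205 years, consistent with the "~200 years" on the slide.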

13 Available Resources (SC2003)
Host; OS/Scheduler; Location
BII Viper Cluster; Linux/PBS; The BioInformatics Institute, Singapore
EOL Cluster (saxicolous); San Diego Supercomputer Center
NCSA Teragrid Cluster; National Center for Supercomputing Applications, University of Illinois
Morpheus Cluster; The University of Michigan
SDSC Teragrid Cluster
NBCR workstation; Solaris/none
Titech Condor pool; Linux/Condor; Tokyo Institute of Technology, Japan
Monash University, Australia; Linux/none
UFCG workstation; Universidade Federal de Campina Grande, Brazil
BeSc workstation; Belfast eScience Center, Ireland

14 Challenges faced by EOL today
Resource heterogeneity / task launching: batch schedulers, Globus.
Logistics over long runs.
Current solution: use APST underneath, and write a “workflow management system” on top to handle app-specific things.
Currently deployed and working.

15 Outline
Description of EOL
What we can use for VGrADS
Current VGrADS work on EOL

16 Would VGrADS Need EOL Software?
Status of EOL software: it’s “ad hoc”.
iGAP is nearly impossible to modify or understand.
The underlying software is worse, and yet the principle is very simple.
The EOL team is aware of this, and there may be a refactoring of the whole thing anyway.
Current consensus: working with the EOL software directly would not be productive.

17 Would VGrADS Need EOL Software?
But EOL uses public domain software that’s well documented; it’s only the glue that’s a mess.
It is straightforward to work with that software in a way that is representative of what EOL or other applications want to do.
Strategy:
Target generic applications based on public domain software.
Use EOL just as one example of what somebody may do.
Not use the EOL software at all.
Problem: is it politically correct?

18 Why would EOL need VGrADS?
EOL needs to run millions of small identical DAGs that can each be fused into one task, replicates all DBs everywhere, and its data is “small”.
Why do you need a fancy virtual grid?
Why do you need scheduling?
it’s basically in spirit (although there are technical difficulties)
Not that there is anything wrong with this, but still, it would be nice to have something more interesting.

19 Why would EOL need VGrADS?
But EOL researchers are really planning to move to:
More complex and user-customizable DAGs (basically a general workflow system).
High throughput will not be the (only) goal.
Databases may be distributed and not everywhere.
None of this is in place, and it is unclear when it will be (although there are tons of workflow efforts around, in particular at SDSC).
With this in view, we developed several virtual grid specs for future EOL or other applications like it that may have a stronger need for a VGrid.

20 Outline
Description of EOL
What we can use for VGrADS
Current VGrADS work on EOL

21 Running “EOL” on a VGrid?
Apps like psiblast or 123D can benefit from non-dedicated resources because they are embarrassingly parallel.
One can request a pool of desktop resources via the VGrid interface.
How does the freshness of dynamic VGrid resource information affect application performance?
How to schedule the application on the VGrid?
Specifically, what about:
Quality of host availability data
Scheduling policies
Resource management policies

22 Experimental Framework
Trace-driven simulation based on Entropia DCGrid™ software set up at SDSC.
Max-min scheduling algorithm (using prediction) to schedule sequences as independent tasks running PSI-BLAST and 123D.
Concept of epochs as the time periods for running the Max-min algorithm.
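For readers unfamiliar with it, the textbook Max-min heuristic can be sketched as follows. This is a minimal sketch and not the simulator used in the experiments: `max_min_schedule` is a hypothetical name, task sizes are given in abstract work units, and the per-host speeds stand in for whatever prediction the scheduler has at the start of an epoch.

```python
def max_min_schedule(task_work, host_speed):
    """Textbook Max-min: repeatedly pick the task whose best (minimum)
    completion time is the largest, and place it on the host giving that time.

    task_work  -- work per task, in arbitrary units (assumed known)
    host_speed -- predicted work units per minute for each host (assumed)
    Returns (assignment dict task -> host, makespan in minutes).
    """
    ready = [0.0] * len(host_speed)  # time at which each host becomes free
    assignment = {}
    unscheduled = set(range(len(task_work)))

    while unscheduled:
        best = None  # (completion_time, task, host)
        for task in sorted(unscheduled):
            # earliest completion time of this task over all hosts
            ct, host = min(
                (ready[h] + task_work[task] / host_speed[h], h)
                for h in range(len(host_speed))
            )
            if best is None or ct > best[0]:
                best = (ct, task, host)
        ct, task, host = best
        assignment[task] = host
        ready[host] = ct
        unscheduled.remove(task)

    return assignment, max(ready)


if __name__ == "__main__":
    # Two long and two short tasks on a slow host (0) and a fast host (1)
    plan, makespan = max_min_schedule([45, 8, 45, 8], [1.0, 2.0])
    print(plan, makespan)
```

Max-min places the long tasks first, so short tasks fill in the remaining gaps; in an epoch-based setting the function would be called once per epoch with the tasks still pending and the latest speed predictions.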

23 Resource Information and Scheduling
We looked at:
Quality of host availability data: dynamic (from 1, 2, or 3 epochs ago) or static (only CPU speed).
Scheduling policies: pessimistic or optimistic.
Resource management policies (handling incomplete tasks): rescheduled or continued.
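One way to picture the difference between static and dynamic availability data is shown below. This is an interpretation for illustration only: the slides do not define how availability enters the prediction, so the function name, the per-epoch availability trace, and the multiplicative model are all assumptions.

```python
def predicted_speed(cpu_speed, availability_trace, epoch, lag=None):
    """Effective speed the scheduler assumes for one host in a given epoch.

    lag=None models static information (only CPU speed is known).
    lag=k models dynamic information observed k epochs ago.
    availability_trace[i] is the fraction of the CPU available in epoch i
    (a hypothetical trace format).
    """
    if lag is None:
        return cpu_speed
    observed_epoch = max(epoch - lag, 0)  # no data before the first epoch
    return cpu_speed * availability_trace[observed_epoch]


if __name__ == "__main__":
    trace = [1.0, 0.5, 0.25, 1.0]
    print(predicted_speed(2.0, trace, epoch=3, lag=1))  # data from epoch 2
    print(predicted_speed(2.0, trace, epoch=3, lag=3))  # data from epoch 0
    print(predicted_speed(2.0, trace, epoch=3))         # static: CPU speed only
```

The larger the lag, the more the scheduler's view can differ from the host's actual state in the current epoch.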

24 Results: Makespan
Not surprisingly, across most of our results, fresher availability data leads to better makespan.
Our results quantify the expected improvements, and thus the trade-off with the cost of obtaining fresh information.
In some cases there are interesting synergies between information and scheduling policies (e.g., Pessimistic + Continued + Static).

25 Results: CPU Usage and Waste
Fresher data leads to higher CPU utilization.
Fresher data makes it possible to better take advantage of faster CPUs.
Fresher data leads to lower CPU waste.

26 Tasks-to-Resources Ratio
Quality of availability data has a greater effect on makespan for smaller tasks-to-resources ratios

27 Conclusion
Working with EOL software isn’t viable.
We should just view EOL as a tech transfer target that, along with other applications, may benefit from VGrADS in the future.
At the moment we are doing work with the principle of EOL in mind but without the actual software.
Looking at EOL helped us define and harden the VGrid abstraction.
Work on scheduling on desktop resources.
