Encyclopedia of Life as a Target VGrADS Application

A. Chien, H. Casanova, Y.-S. Kee, R. Huang, F. Berman
VGrADS Workshop, Sept. 2004

Outline
Description of EOL
What we can use for VGrADS
Current VGrADS work on EOL

The EOL Application: Goal
For each of 800+ complete or partial publicly available genomes, define protein annotation and model 3D structure
Discover the what and how of the proteins associated with each genome
Use public domain software for the core computation
Web-accessible archival data

EOL Overview
NCBI Non-Redundant DB: ~1.5 M sequences, ~1000 species, doubling every 2 yrs
Fold Library: library of known protein structures, based on lab work
  Info from PDB, SCOP, PDP
  Calculated using CE (Ilya Shindyalov, SDSC)
iGAP
  Structure annotation tools (wublast, psiblast, 123D+)
  Filters (tmhmm, psort, signalp, coils)
Data flow: sequences, possible folds, predicted structures
Structure analysis tools
EOL

EOL: core computation
psiblast and 123D are the two computations that take a relatively significant amount of time:
  psiblast, 3 times (8 minutes)
  123D (45 minutes)
Data in and out of each step: a few KB
Timings on a 2 GHz Pentium

EOL: core computation
Mostly considered as a single task (a few KB of data) because software can run everywhere and databases are everywhere (although there are a few nuances)
Maximum Throughput: many such EOL tasks run independently

Computational Requirements
To process the current NCBI NR database:
  1.5 M sequences (and growing)
  7 structure annotation programs (w/ more to come)
  => estimated 1.8 M CPU hours (~200 years)
Must handle quarterly NR database releases:
  Changed sequences
  New species
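The slide's estimate can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted above (1.8 M CPU hours, 1.5 M sequences). The function names are hypothetical. The last calculation rests on an assumption the slide does not make: that a full recomputation must finish within one quarterly release cycle, with perfect parallelism.

```python
import math

# Figures quoted on the slide.
TOTAL_CPU_HOURS = 1.8e6   # estimated cost of annotating the NR database
NUM_SEQUENCES = 1.5e6     # sequences in the NR database

HOURS_PER_YEAR = 365 * 24                # 8760
HOURS_PER_QUARTER = HOURS_PER_YEAR / 4   # assumption: one release every 3 months


def cpu_years(cpu_hours: float) -> float:
    """Convert CPU hours to years of work on a single CPU."""
    return cpu_hours / HOURS_PER_YEAR


def hours_per_sequence(cpu_hours: float, num_sequences: float) -> float:
    """Average CPU hours spent on one sequence."""
    return cpu_hours / num_sequences


def cpus_needed(cpu_hours: float, wall_clock_hours: float) -> int:
    """Dedicated CPUs needed to finish within a wall-clock budget,
    assuming perfect parallelism (an idealization)."""
    return math.ceil(cpu_hours / wall_clock_hours)


if __name__ == "__main__":
    print(f"{cpu_years(TOTAL_CPU_HOURS):.0f} CPU years")          # about 205
    print(f"{hours_per_sequence(TOTAL_CPU_HOURS, NUM_SEQUENCES):.1f} CPU hours per sequence")
    print(f"{cpus_needed(TOTAL_CPU_HOURS, HOURS_PER_QUARTER)} CPUs for one quarter")
```

The single-CPU figure comes out at roughly 205 years, consistent with the "~200 years" on the slide, and the average works out to 1.2 CPU hours per sequence.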

Available Resources (SC2003)
Host: OS/Scheduler; Location (where given)
BII Viper Cluster: Linux/PBS; The BioInformatics Institute, Singapore
EOL Cluster (saxicolous): San Diego Supercomputer Center
NCSA Teragrid Cluster: National Center for Supercomputing Applications, University of Illinois
Morpheus Cluster: The University of Michigan
SDSC Teragrid Cluster
NBCR workstation: Solaris/none
Titech Condor pool: Linux/Condor; Tokyo Institute of Technology, Japan
Monash University, Australia: Linux/none
UFCG workstation: Universidade Federal de Campina Grande, Brazil
BeSc workstation: Belfast eScience Center, Ireland

Challenges faced by EOL today
Resource heterogeneity / task launching
  Batch schedulers
  Globus
Logistics over long runs
Current solution:
  Use APST underneath
  Write a “workflow management system” on top to handle app-specific things
  Currently deployed and working

Would VGrADS Need EOL Software?
Status of EOL software
  it’s “ad-hoc”
  iGAP is nearly impossible to modify/understand
  Underlying software is worse
  and yet the principle is very simple
  EOL team aware of the fact and there may be a refactoring of the whole thing anyway
Current consensus
  working with the EOL software directly would not be productive

Would VGrADS Need EOL Software?
But
  EOL uses public domain software that’s well documented; it’s only the glue that’s a mess
  It is straightforward to work with that software in a way that is representative of what EOL or other applications want to do
Strategy:
  Target generic applications based on public domain software
  Use EOL just as one example of what somebody may do
  Not use the EOL software at all
Problem: is it politically correct?

Why would EOL need VGrADS?
EOL
  needs to do millions of small identical DAGs that can be fused as one task,
  replicates all DBs everywhere,
  and data is “small”
why do you need a fancy virtual grid?
why do you need scheduling?
it’s basically in spirit (although there are technical difficulties)
Not that there is anything wrong with this, but still, it would be nice to have something more interesting

Why would EOL need VGrADS?
But EOL researchers are really planning to move to:
  more complex and user-customizable DAGs (basically a general workflow system)
  High throughput will not be the (only) goal
  Databases may be distributed and not everywhere
none of this is in place and it is unclear when it will be (although there are tons of workflow efforts around, and in particular at SDSC)
With this in view, we developed several virtual grid specs for future EOL or other applications like it that may have a stronger need for a VGrid

Running “EOL” on a VGrid?
Apps like psiblast or 123D can benefit from non-dedicated resources because they are embarrassingly parallel
One can request a pool of desktop resources via the VGrid interface
How does the freshness of dynamic VGrid resource information affect application performance?
How to schedule the application on the VGrid?
Specifically, what about:
  Quality of Host Availability Data
  Scheduling Policies
  Resource Management Policies

Experimental Framework
Trace-driven simulation based on Entropia DCGrid™ software set up at SDSC
Max-min scheduling algorithm (using prediction) to schedule sequences as independent tasks running PSI-BLAST and 123D
Concept of epochs as the time periods for running the Max-min algorithm
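For readers unfamiliar with Max-min, the following is a minimal sketch of the textbook heuristic for independent tasks. It is not the simulator used in this work. The function name, the task sizes, and the host speeds are all hypothetical, and `host_speeds` stands in for whatever predicted speed the scheduler has for each host.

```python
from typing import Dict, List, Tuple


def max_min_schedule(
    task_sizes: Dict[str, float],
    host_speeds: Dict[str, float],
) -> Tuple[List[Tuple[str, str]], float]:
    """Textbook Max-min: repeatedly pick the task whose best (minimum)
    completion time is the largest, and assign it to the host that
    achieves that minimum. Returns the (task, host) assignments in
    scheduling order, plus the resulting makespan."""
    ready = {host: 0.0 for host in host_speeds}  # time each host becomes free
    unscheduled = dict(task_sizes)
    assignment: List[Tuple[str, str]] = []

    while unscheduled:
        # Minimum completion time, and the host achieving it, for every task.
        best = {}
        for task, size in unscheduled.items():
            host = min(host_speeds, key=lambda h: ready[h] + size / host_speeds[h])
            best[task] = (ready[host] + size / host_speeds[host], host)

        # Max-min step: schedule the task with the largest such minimum.
        task = max(best, key=lambda t: best[t][0])
        finish, host = best[task]
        ready[host] = finish
        assignment.append((task, host))
        del unscheduled[task]

    return assignment, max(ready.values())


if __name__ == "__main__":
    # Hypothetical workload: sizes in minutes on a speed-1.0 host.
    tasks = {"t1": 60.0, "t2": 30.0, "t3": 30.0}
    hosts = {"fast": 2.0, "slow": 1.0}
    plan, makespan = max_min_schedule(tasks, hosts)
    print(plan, makespan)
```

Scheduling the largest task first keeps it from landing at the end of the run on whichever host happens to be free, which is why this heuristic is commonly used when makespan is the target.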

Resource Information and Scheduling
We looked at
Quality of Host Availability Data
  Dynamic (from 1, 2, or 3 epochs ago)
  Static (only CPU speed)
Scheduling Policies
  Pessimistic
  Optimistic
Resource Management Policies
  Handling Incomplete Tasks
    Rescheduled
    Continued
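One way to picture the difference between static and dynamic availability data is as two ways of predicting a host's speed for the coming epoch. The sketch below is illustrative only. It assumes availability is recorded once per epoch as the fraction of the CPU the host delivered, and that effective speed is CPU speed times that fraction. Neither the model nor the name `predicted_speed` comes from the slides.

```python
from typing import List, Optional


def predicted_speed(
    cpu_speed: float,
    availability_trace: List[float],
    epoch: int,
    staleness: Optional[int],
) -> float:
    """Predict a host's effective speed for the given epoch.

    staleness=None  -> static data: only the CPU speed is known.
    staleness=k     -> dynamic data: use the availability observed
                       k epochs ago (hypothetical model).
    """
    if staleness is None:
        return cpu_speed
    observed_epoch = max(0, epoch - staleness)  # clamp at the start of the trace
    return cpu_speed * availability_trace[observed_epoch]


if __name__ == "__main__":
    # Hypothetical host that becomes steadily less available.
    trace = [1.0, 0.5, 0.25, 0.0]
    for k in (None, 1, 2, 3):
        print(k, predicted_speed(2.0, trace, epoch=3, staleness=k))
```

In this toy trace the host is gone by epoch 3, yet the static prediction and the 3-epoch-old prediction both still report full speed, while fresher data reports progressively lower speeds.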

Results: Makespan
Not surprisingly, across most of our results, fresher availability data leads to better makespan
Our results quantify the expected improvements, and thus the trade-off with the cost of obtaining fresh information
In some cases there are interesting synergies between information and scheduling policies (e.g., Pessimistic + Continued + Static)

Results: CPU Usage and Waste
Fresher data leads to higher CPU utilization
Fresher data makes it possible to better take advantage of faster CPUs
Fresher data leads to lower CPU waste

Tasks-to-Resources Ratio
Quality of availability data has a greater effect on makespan for smaller tasks-to-resources ratios

Conclusion
Working with EOL software isn’t viable
We should just view EOL as a tech transfer target that, along with other applications, may benefit from VGrADS in the future.
At the moment we are doing work with the principle of EOL in mind but without the actual software
  Looking at EOL helped us define and harden the VGrid abstraction
  Work on scheduling on desktop resources
