Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees
Mark A. Miller, San Diego Supercomputer Center
Systematics is the study of the diversification of life on the planet Earth, both past and present, and the relationships among living things through time.
Evolutionary relationships can (for the most part) be represented as a rooted graph.
Originally, evolutionary relationships were inferred from morphology alone:
Morphological characters are scored “by hand” to create matrices of characters. Scoring occurs via low volume/low throughput methodologies. Even though tree inference is NP-hard, matrices created using morphological characters alone are typically relatively small, so computations are relatively tractable (with heuristics developed by the community).
Evolutionary relationships are also inferred from DNA sequence comparisons:
Unlike the scoring of morphological characters, DNA sequence determination is now fully automated. The increase in DNA sequences with time is faster than Moore’s law. There are at least 10^7 species, each with ,000 genes, so the need for computational power and new tools will continue to grow. Tree inference is NP-hard, so even with heuristics, computational power frequently limits the analysis. Analyses often involve 1000s of species and 1000s of characters, creating very large matrices.
The CIPRES Project was created to support this new age of large phylogenetic data sets. The project had as its principal goals: 1. Developing heuristics and tools for analyzing the large DNA data sets that are available. 2. Improving researcher access to computational resources.
The CIPRES Portal was created as part of Goal 2, improving researcher access to computational resources. It was designed to be a flexible web application that allows users to run analyses of large sequence data sets using community codes on a significant computational resource.
User requirements: Provide login-protected personal user space for storing results indefinitely. Provide access to most or all native command line options for each code. Support addition of new tools and new versions as needed.
The CIPRES Portal was built on a generic portal software package called The Workbench Framework
To expose command line tools quickly, the Workbench Framework uses the PISE XML standard. Example shown: the PISE XML description of TFASTY 34t10d3 (Compare PS to Translated NS Or NS-DB; W. Pearson), including its literature references and a Protein Sequence parameter.
All command line parameters can be set.
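As a rough illustration of how an XML tool description can drive a command line, the sketch below parses a small description and assembles an argument list from user-supplied values. The element and attribute names (`tool`, `parameter`, `flag`, `type`, `default`) and the tool `exampletool` are hypothetical and chosen for readability; this is not the actual PISE DTD, and it is not how the Workbench Framework itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tool description (NOT the real PISE DTD).
TOOL_XML = """
<tool command="exampletool">
  <parameter name="threshold" flag="-t" type="float" default="0.5"/>
  <parameter name="verbose" flag="-v" type="switch" default="false"/>
  <parameter name="infile" flag="" type="file"/>
</tool>
"""


def build_command(xml_text, values):
    """Assemble an argument list from the description and user-supplied values."""
    root = ET.fromstring(xml_text)
    argv = [root.attrib["command"]]
    for param in root.findall("parameter"):
        name = param.attrib["name"]
        flag = param.attrib.get("flag", "")
        kind = param.attrib.get("type", "string")
        # Fall back to the declared default when the user gave no value.
        value = values.get(name, param.attrib.get("default"))
        if value is None:
            raise ValueError(f"missing required parameter: {name}")
        if kind == "switch":
            # A switch contributes only its flag, and only when turned on.
            if str(value).lower() == "true":
                argv.append(flag)
            continue
        if flag:
            argv.append(flag)
        argv.append(str(value))
    return argv


print(build_command(TOOL_XML, {"threshold": 0.1, "infile": "seqs.fasta"}))
```

Keeping the option list in data rather than in code is what would let a new tool, or a new version of one, be added by writing a description file instead of a new web form.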
Usage Statistics for CIPRES Portal 5/2007 – 11/ ,500 total jobs in 30 months
Limitations of the original CIPRES Portal: all jobs were run serially (efficient, but no gain in wall time); the cluster was modest (16 X 8-way dual core nodes); runs were limited to 72 hours; the cluster was at the end of its useful lifetime; funding for the project was ending; demand for job runs was increasing. This is not a scalable, sustainable solution!
The solution: make community codes available on scalable, sustainable resources (e.g. TeraGrid). (Diagram labels: Workbench Framework; TeraGrid, CIPRES Cluster, Triton; parallel codes, serial codes.)
Greater than 90% of all computational time is used for three tree inference codes: MrBayes, RAxML, and GARLI. Implement parallel versions of these codes on TeraGrid machines Abe and Lonestar, using Globus/GRAM. Work with community developers to improve the speed-up available through the parallel codes offered by the CIPRES Science Gateway (CSG). Keep other serial codes on local SDSC resources that provide the project with fee-for-service cycles.
The Workbench Framework design made it possible to deploy jobs on TeraGrid resources fairly easily. A Science Gateway development allocation allowed us to accomplish the initial setup.
CIPRES Science Gateway parallel code profiles
Code | Type | Max cores | Speed-up | Efficiency
MrBayes | Hybrid MPI/OpenMP | 32 | 2.4 X (4 nodes) | ~60%
RAxML | Hybrid MPI/OpenMP | 40 | 3.0 X (5 nodes) | ~60%
GARLI | MPI | 100 | 77 X (100 nodes) | 77-94%
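The efficiency figures in the profile above appear consistent with the usual definition of parallel efficiency, speed-up divided by the number of parallel units, taking the unit to be a node. That reading is an inference from the numbers and not something the slide states. A minimal sketch under that assumption:

```python
def parallel_efficiency(speedup, units):
    """Parallel efficiency: achieved speed-up divided by the ideal speed-up (units)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return speedup / units


# (code, speed-up, nodes) as quoted in the profile table.
profiles = [("MrBayes", 2.4, 4), ("RAxML", 3.0, 5), ("GARLI", 77.0, 100)]

for code, speedup, nodes in profiles:
    print(f"{code}: {parallel_efficiency(speedup, nodes):.0%}")
```

The upper end of the 77-94% range quoted for GARLI cannot be reproduced from the single speed-up value shown, so it presumably refers to other run sizes.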
CIPRES Science Gateway Usage, Dec 2009 – Oct 2010 (chart series: all users, new users, repeat users, new TG users)
CIPRES Science Gateway Usage
Intellectual Merit: 90+ publications enabled by the CIPRES Science Gateway. Broad Impact: CIPRES Science Gateway used to deliver curriculum by at least 21 instructors.
What happens if you build it and too many people come??? ?!?
Make an explicit fair resource use policy; make sure resource use delivers impact; make sure resource use is efficient; then expand the resource base as required.
Users by monthly SU consumption (table columns: SUs/month, Number of Users, % total SU, % per user). Share of users per usage band, from lowest to highest: 75.3%, 12.1%, 7.5%, 3.4%, 1.3% (9 users), and 0.4% (3 users, > 10,000 SUs/month). This level of resource use requires additional justification…
CIPRES SG Fair Use Policy
Tier | % Community allocation used | Account Status
1 | < 2% of monthly allocation | Open access
2 | 2% of monthly allocation | Request personal TG allocation
3 | 3% of monthly allocation | Use of community allocation blocked
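The tiers of the fair use policy can be expressed as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and because the comparison symbols for tiers 2 and 3 did not survive on the slide, it assumes tier 2 covers usage from 2% up to (but not including) 3% of the monthly community allocation and tier 3 covers 3% and above.

```python
# Account status per tier, as worded in the policy.
ACCOUNT_STATUS = {
    1: "Open access",
    2: "Request personal TG allocation",
    3: "Use of community allocation blocked",
}


def fair_use_tier(sus_used, monthly_allocation):
    """Return the policy tier for one account's SU usage in a month.

    Assumed boundaries: tier 1 below 2%, tier 2 from 2% to below 3%,
    tier 3 at 3% or more of the monthly community allocation.
    """
    if monthly_allocation <= 0:
        raise ValueError("monthly_allocation must be positive")
    # Compare in integer-friendly form to avoid floating-point edge cases.
    if sus_used * 100 < 2 * monthly_allocation:
        return 1
    if sus_used * 100 < 3 * monthly_allocation:
        return 2
    return 3


tier = fair_use_tier(sus_used=25_000, monthly_allocation=1_000_000)
print(tier, ACCOUNT_STATUS[tier])
```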
High end users must acquire a personal allocation from the TRAC committee. This will ensure that very heavy users of the resource are supporting peer-reviewed research (and have a US institutional affiliation).
Tools required to implement the CIPRES SG Fair Use Policy: the ability to halt submissions from a given user account; the ability to charge to a user’s personal TG allocation; the ability to monitor usage by each account.
Job Attrition on the CIPRES Science Gateway
Error Impact analysis (columns: CPU time | User | Staff)
Input error | low | high | low
Machine error | 0 | high | low
Communication error | high
Unknown error | high low
Communication errors occur when a break in communication between the web application and the TG resource kills the job monitoring process for all running jobs. Jobs continue to execute and consume SU, but the web application can no longer return job results. The user must ask for the results, and a staff member must fetch them manually.
CONCLUSION: Time to refactor the job monitoring system!
Normal Operation (diagram labels: User Submission; command line and files; Globus “gsissh”; TG Machine; Running Task Table; User DB). The LoadResults daemon detects results, fetches them via GridFTP, and puts the results in the user DB.
Abnormal Operation (diagram labels: User Submission; checkJobs Daemon; Globus “gsissh”; TG Machine; Running Task Table; User DB). This path applies if normal notification fails because of a machine outage, a job timeout, or the CIPRES application being down. The LoadResults daemon detects results with a delivery error, fetches them via GridFTP, and puts the results in the user DB.
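One way to picture the fallback path is a periodic pass over the running task table that asks the remote machine about every job whose completion notice never arrived. The sketch below is an assumption about the general pattern, not the gateway's code: the `Task` fields, the state names, and the `remote_status` callable (a stand-in for whatever query would be issued over gsissh) are all hypothetical, and no Globus call is made.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    job_id: str
    notified: bool = False  # True once the normal completion notice arrived
    state: str = "RUNNING"  # hypothetical states: RUNNING, DONE_NEEDS_FETCH


def check_jobs(table: Dict[str, Task],
               remote_status: Callable[[str], str]) -> List[str]:
    """Find finished jobs whose completion notice was never delivered.

    Each such task is marked so that a results loader can fetch its output.
    """
    recovered = []
    for task in table.values():
        if task.state != "RUNNING" or task.notified:
            continue  # already handled through the normal path
        if remote_status(task.job_id) == "finished":
            task.state = "DONE_NEEDS_FETCH"
            recovered.append(task.job_id)
    return recovered


# Toy data standing in for the running task table and the remote machine.
table = {
    "job-1": Task("job-1", notified=True),
    "job-2": Task("job-2"),
    "job-3": Task("job-3"),
}
fake_remote = {"job-2": "finished", "job-3": "running"}
print(check_jobs(table, lambda job_id: fake_remote.get(job_id, "unknown")))
```

Because the table, not a long-lived monitoring process, holds the record of what is running, a restart of the web application does not lose track of jobs that are still consuming SUs.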
JOBS SAVED BY THE GSISSH / TASK TABLE SYSTEM (rows: MrBayes, RAxML, GARLI, Total; columns: SEPT, OCT). Total: 159* (SEPT), 266* (OCT). * 7% of all submitted jobs
Future plans to address other sources of attrition: User errors (15%): pre-check uploaded files for valid format. Machine errors (8%): establish redundancy in where codes can run; use tools to check machine availability and queue depth. Unknown errors (3%): TBD.
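A pre-check of uploaded files could be as simple as a structural scan run before a job is queued. The sketch below uses FASTA purely as an illustration, since the formats the gateway accepts are not listed here; the function name and the specific checks are assumptions, and a production validator would need to be stricter and format-aware.

```python
def check_fasta(text):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    seen_header = False
    residues_since_header = 0
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if seen_header and residues_since_header == 0:
                problems.append(f"line {lineno}: previous record has no sequence")
            if len(line) == 1:
                problems.append(f"line {lineno}: empty sequence name")
            seen_header = True
            residues_since_header = 0
        elif not seen_header:
            problems.append(f"line {lineno}: sequence data before first '>' header")
        else:
            # Allow letters plus gap ('-') and stop ('*') symbols.
            letters = line.replace("-", "").replace("*", "")
            if letters and not letters.isalpha():
                problems.append(f"line {lineno}: unexpected characters in sequence")
            residues_since_header += len(line)
    if not seen_header:
        problems.append("no '>' header found")
    elif residues_since_header == 0:
        problems.append("last record has no sequence")
    return problems


print(check_fasta(">taxon1\nACGT-ACGT\n>taxon2\nAC1T\n"))
```

Rejecting a malformed file at upload time costs no SUs, whereas the same error discovered by the analysis code is only found after the job has waited in the queue.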
Other possible resource opportunities: new TG machines (e.g. the Trestles machine at SDSC); the Open Science Grid (OSG); the NSF FutureGrid project (adapting these HPC applications to a cloud environment).
Acknowledgements: CIPRES Science Gateway: Terri Liebowitz. TeraGrid Hybrid Code Development: Wayne Pfeiffer, Alexandros Stamatakis. TeraGrid Implementation Support: Nancy Wilkins-Diehr, Doru Marcusiu. Workbench Framework: Paul Hoover, Lucie Chan.