Evolution of the CIPRES Science Gateway, a Public Resource for Phylogenetics
Mark A. Miller, San Diego Supercomputer Center
How to fail at creating a Gateway
- Allow the development staff to focus on their personal goal: creating the coolest, most generic software package ever.
- Ignore new researchers in your community and focus on an existing user base.
- Focus on updating an existing Gateway’s capabilities.
- Focus on low-end computational use cases and classroom use.
- Fail to anticipate the emerging needs of biologists for genomics tools.
- Fail to grasp the importance of access to parallel codes for compute-intensive jobs.
Case study: The Next Generation Biology Workbench (NGBW)
The NGBW closed. It targeted low-end use cases and, in the end, supported primarily advanced high school students.
A better path for Gateway development
- Engage only in user- and use-case-driven development.
- Listen to user requests for new features.
- Expand capacity to meet growing user demands.
- Be driven by the high-end users; help with one-off solutions when necessary.
- Refactor infrastructure as use cases drive the need for changes.
- Build features only in response to user requests, or when usage patterns break the existing infrastructure.
Case study: The CIPRES Science Gateway
The current CIPRES Science Gateway is built on the same software as the NGBW, but this project has always held to user-driven development.
CIPRES has been successful:
- Over 15,000 users submitted 550,000+ TeraGrid/XSEDE jobs since Dec,
- An average of ~350 new XSEDE users registered in each of the last 12 months.
- 100 million core hours of TeraGrid/XSEDE time distributed to scientists.
- Supported at least 1800 publications.
- Used for curriculum delivery by at least 76 instructors.
Tactics for Gateway Success: Step 1: identify a user population in need
Phylogenetics is the study of the diversification of life on the planet Earth, both past and present, and the relationships among living things through time.
Evolutionary relationships can be inferred from DNA sequence comparisons:
1. Align sequences to determine evolutionary equivalence.
2. Infer evolutionary relationships based on some set of assumptions.
Biology in the new world of abundant DNA sequence data requires a new kind of cyberinfrastructure!
- Sequence alignment and tree inference are NP-hard.
- Even with heuristics, community codes scale exponentially with the number of species and columns.
- Phylogenetics codes that were historically run in desktop environments must be moved to high performance computing resources.
- The need for access to HPC resources will increase for the foreseeable future.
- Scientists who do not have HPC access will have to tailor their questions to available resources, and risk being left out of the discovery process.
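To make the scaling concrete: a standard combinatorial result is that the number of distinct unrooted, fully resolved (binary) trees on n labeled taxa is (2n − 5)!!, the product of the odd numbers from 3 to 2n − 5. This is one reason exhaustive tree search is out of reach and heuristics are used instead. The sketch below only counts trees; the function name is illustrative and is not taken from any CIPRES code.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Count unrooted, fully resolved trees on n_taxa labeled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2n - 5); the range is empty for n = 3.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, "taxa:", unrooted_tree_count(n), "possible trees")
```

Ten taxa already give 2,027,025 candidate trees, and the count grows faster than exponentially from there.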
Community pressure causes the CIPRES project to provide public access to its compute engine via a Portal. Construction begins…
Workflow for the CIPRES Gateway (diagram labels): Assemble Sequences, Upload to Portal, Run Alignment, Run Tree Inference, Download, Post-Tree Analysis, Store.
Tactics for Gateway Success: Step 2: commit to responding to users’ needs
Usage Epochs in CIPRES History
Original architecture. Restricted command line set
Make all command line options available: the generic software package from the failed NGBW project allowed us to expose all command line options to users in about 3 months.
Make parallel codes available: the generic software package from the failed NGBW project allowed us to submit jobs “easily” to TeraGrid/XSEDE resources, and to local HPC resources.
Linear growth in usage has continued every month since… It has just been a matter of helping the software keep up with the changing use cases.
Tactics for Gateway Success: Step 3: let user behavior and usage-created needs drive improvements
Motivation: Too many users.
Create a tool set that gives:
- the ability to halt submissions from a given user account
- the ability to monitor usage by each account automatically
- the ability for users to track their SU consumption
- the ability to forecast the SU cost of a job for users
- the ability to charge to a user’s personal XSEDE allocation
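Three of these abilities (halting an account, tracking SU consumption, and forecasting the SU cost of a job) can be illustrated with a small Python sketch. It is not the CIPRES implementation: the `Account` fields, the per-account `su_limit`, and the rule that a job's worst-case cost is cores × requested hours × a resource charge factor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Account:
    user: str
    su_used: float        # SUs consumed so far (assumed to be tracked elsewhere)
    su_limit: float       # hypothetical per-account cap
    halted: bool = False  # set by staff to stop submissions from this account


def forecast_su(cores: int, max_hours: float, charge_factor: float = 1.0) -> float:
    """Worst-case cost: cores x requested wall-clock hours x charge factor (assumed rule)."""
    return cores * max_hours * charge_factor


def may_submit(account: Account, cores: int, max_hours: float,
               charge_factor: float = 1.0) -> Tuple[bool, str]:
    """Decide whether a job may be submitted, and explain the decision to the user."""
    if account.halted:
        return False, "submissions are halted for this account"
    cost = forecast_su(cores, max_hours, charge_factor)
    if account.su_used + cost > account.su_limit:
        return False, f"job may cost up to {cost:.0f} SU, which exceeds the remaining balance"
    return True, f"job may cost up to {cost:.0f} SU"


if __name__ == "__main__":
    alice = Account(user="alice", su_used=900.0, su_limit=1000.0)
    print(may_submit(alice, cores=64, max_hours=2.0))
    print(may_submit(alice, cores=8, max_hours=2.0))
```

Forecasting from the requested wall-clock limit rather than the actual run time is deliberately pessimistic: the user sees the most a job could cost before it is submitted.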
Help users track their resource consumption: Notify users of their usage level
Motivation: users running 2-week jobs
Issue: During service interruptions, the app lost track of the job, and results had to be fetched manually.
Response: Create a system of daemons that return results robustly even with system outages.
Job submission and results retrieval are managed by daemons
submitJobsD
1. Find all “new” tasks
2. Submit to correct execution host
3. Set status to “submitted”
checkJobsD
1. Find all “submitted” tasks
2. Ask execution host if job is done
3. If yes, set status to “done”
loadResultsD
1. Find all “done” tasks
2. Transfer results to CIPRES DB
3. Remove job from “WorkQ”
Diagram labels: CIPRES DB (Tasks and Running tasks tables), Execution Hosts. On completion, a curl call reports “task is done”, and the status in the Running tasks table is changed to “done”.
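The three daemons can be read as independent passes over a shared work queue, each moving tasks one status forward. The Python sketch below models that idea in memory. It is an illustration only: `Task`, `FakeHost`, and the function names are hypothetical, and the real daemons work against the CIPRES DB and remote execution hosts.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    task_id: int
    host: str
    status: str = "new"  # new -> submitted -> done


class FakeHost:
    """Stand-in for an execution host; a real one would be a remote cluster."""

    def __init__(self) -> None:
        self.finished: Dict[int, str] = {}  # task_id -> results

    def submit(self, task_id: int) -> None:
        pass  # a real host would queue the job here

    def finish(self, task_id: int, results: str) -> None:
        self.finished[task_id] = results  # simulates the job completing

    def is_done(self, task_id: int) -> bool:
        return task_id in self.finished

    def fetch(self, task_id: int) -> str:
        return self.finished[task_id]


def submit_jobs_pass(work_q: List[Task], hosts: Dict[str, FakeHost]) -> None:
    for task in work_q:
        if task.status == "new":
            hosts[task.host].submit(task.task_id)
            task.status = "submitted"


def check_jobs_pass(work_q: List[Task], hosts: Dict[str, FakeHost]) -> None:
    for task in work_q:
        if task.status == "submitted" and hosts[task.host].is_done(task.task_id):
            task.status = "done"


def load_results_pass(work_q: List[Task], hosts: Dict[str, FakeHost],
                      db_results: Dict[int, str]) -> None:
    for task in list(work_q):  # iterate over a copy, since tasks are removed
        if task.status == "done":
            db_results[task.task_id] = hosts[task.host].fetch(task.task_id)
            work_q.remove(task)
```

Because each pass looks only at the stored status, a daemon that restarts after an outage simply picks up whatever tasks are waiting in its state; nothing depends on the web application having stayed up.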
Motivation: Users’ input file sizes grew from KB to MB, and output from MB to GB, stressing the system.
Software improvement was required to:
- Keep large files from being read into memory multiple times.
- Point to files instead of storing them in the DB.
- Store identical files in the DB only once.
- Sunset accounts that have been inactive for more than 1 year.
- Move GB+ files outside the web application/database system.
In addition, users are limited to 150 GB of data storage.
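Two of the improvements listed above, pointing to files instead of storing them in the DB and storing identical files only once, are commonly achieved together with content-addressed storage. The Python sketch below shows that technique under stated assumptions: the `FileStore` class, the dict standing in for a DB table, and the choice of SHA-256 are all illustrative, and nothing in the source says CIPRES does it this way.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class FileStore:
    """Keep file bytes on disk once per unique content; the 'DB' holds only pointers."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.db = {}  # stand-in for a DB table: (user, filename) -> content digest

    @staticmethod
    def _digest(path: Path, chunk_size: int = 1 << 20) -> str:
        # Hash in chunks so a large file is never held in memory all at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def add(self, user: str, path: Path) -> str:
        path = Path(path)
        digest = self._digest(path)
        target = self.root / digest
        if not target.exists():  # identical content is copied only the first time
            shutil.copyfile(path, target)
        self.db[(user, path.name)] = digest
        return digest

    def path_for(self, user: str, filename: str) -> Path:
        return self.root / self.db[(user, filename)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "a.fasta").write_bytes(b">s1\nACGT\n")
        (tmp / "b.fasta").write_bytes(b">s1\nACGT\n")
        store = FileStore(tmp / "store")
        store.add("user1", tmp / "a.fasta")
        store.add("user2", tmp / "b.fasta")
        print(len(list((tmp / "store").iterdir())), "file(s) stored")
```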
What happens when job output is GB in size?
After 5 minutes, the transfer is still in progress, and the job is still in the WorkQ and still marked “done”. loadResultsD finds it again and starts the transfer again… Soon multiple transfers are in progress, and the system chokes.
Solution: Compress and move large files to cloud storage for direct return to the user via hyperlink.
loadResultsD
1. Find all “done” tasks
2. Ask how big the results are
3. Move large results out of the system, transfer all others
4. Remove job from “WorkQ”
500+ users have required file downloads by this transfer mechanism…
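The size check in the revised loadResultsD can be sketched in Python as a routing decision made before any transfer starts. This is a minimal illustration, not the production daemon: the 1 GB cutoff, the `DoneTask` type, and the `to_db` / `to_cloud` callables are hypothetical (the source says only that GB-scale results are moved out of the system and returned by hyperlink).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

LARGE_RESULT_BYTES = 1024 ** 3  # hypothetical cutoff; the source says only "GB"


@dataclass
class DoneTask:
    task_id: int
    result_bytes: int  # size reported by the execution host


def load_results_by_size(
    work_q: List[DoneTask],
    to_db: Callable[[DoneTask], None],
    to_cloud: Callable[[DoneTask], str],
    threshold: int = LARGE_RESULT_BYTES,
) -> Dict[int, str]:
    """Route each finished task by result size; return download links for large ones."""
    links: Dict[int, str] = {}
    for task in list(work_q):  # iterate over a copy, since tasks are removed
        if task.result_bytes >= threshold:
            links[task.task_id] = to_cloud(task)  # compress, upload, return a hyperlink
        else:
            to_db(task)  # small results go into the DB as before
        work_q.remove(task)
    return links
```

Asking for the size first keeps multi-GB transfers away from the web application and database, which is where the repeated transfers were piling up.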
Tactics for Gateway Success: Step 4: manage challenges that threaten the productivity of high-end users
Other issues also arose
- GridFTP proved unreliable at high load. Response: move to local Lustre file systems.
- Under load, a MySQL bug prevented DB connections from being released, choking the web app. Response: refactor how the DB manages files.
- The Lustre file system is not good for many biology codes, so we moved to NFS…
- Lustre failures on long jobs cause a surge in resource use.
The issue with issues:
- Dealing with these issues occurred in fire-drill mode; users were stymied and frustrated.
- On average, 30-45% of developer time is spent dealing with these issues.
- Some days or weeks, all forward progress is halted.
- But on the other hand, making your existing users happy is the first priority…
Tactics for Gateway Success: Step 5: stay in touch with your community
Provide many points of contact
When a project belongs to the community…
Tactics for Gateway Success: Step 6: embrace customer service
Set aside time for user issues
The goals are:
- No more than 24 h response time
- Foster a supportive and helpful culture
- Make it clear that trouble reports are a gift to CIPRES, not an annoyance
Tactics for Gateway Success: Step 7: innovate as funds permit
There are highly evolved legacy desktop/browser applications that help with matrix assembly, but have no tree inference tools or are underpowered: raxmlGUI
There are projects that offer powerful and distinct user experiences, and are interested in incorporating powerful tree inference tools into an existing application:
We received funding to create a public CIPRES RESTful API (CRA) to help with these use cases… (Diagram labels: raxmlGUI, CSG, XSEDE, Parallel codes.)
Use Cases: MorphoBank and REST Services
MorphoBank (with its database, MB-DB) supports: Character Recording, Character Matrix Assembly, Team Data Sharing, Character Quantification, Character Visualization, Character Matrix Publication.
MorphoBank provides powerful visual tools for creating and sharing data matrices among large teams…
But it has no concept of trees or tree inference…
The CIPRES RESTful API (CRA), backed by parallel codes on XSEDE, allows users to proceed with their workflow within the MorphoBank environment…
Use Cases: Mesquite and REST Services
Mesquite (desktop) supports: Tree Display, Tree Editing, Tree Reconciliation, Sequence Editing, Sequence Assembly, Tree Analysis.
Mesquite provides powerful visual tools for pre- and post-tree tasks on the desktop…
But its tree inference is limited by the desktop hardware…
The CIPRES RESTful API (CRA), backed by parallel codes on XSEDE, provides the needed compute power without leaving the app…
Many advanced developers find the workflow supported by the CIPRES browser too restrictive!
Use Cases: Individual developers and REST Services
Advanced phylogenetic researchers want:
- to run many jobs simultaneously
- to create ad hoc workflows
Advanced phylogenetic researchers don’t want:
- to assemble and click each job one at a time
- to manually port the output of one job to the subsequent job in their workflow
Assuming modest scripting skills, an advanced researcher can accomplish this goal using the CIPRES RESTful API to avoid the clumsy browser interface. (Diagram labels: Scripting Tools, CRA, XSEDE, Parallel codes.)
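As a rough picture of what such a script looks like, the Python sketch below chains an alignment into a tree inference and runs several datasets at once. It makes no real API calls: `run_job` is a hypothetical stand-in for "submit a job through the REST API, wait, and return its output", and the tool names `ALIGNER` and `TREE_BUILDER` are placeholders rather than real CIPRES tool identifiers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical job runner: (tool name, input data) -> output data.
RunJob = Callable[[str, str], str]


def align_then_infer(run_job: RunJob, sequences: str,
                     align_tool: str = "ALIGNER",
                     tree_tool: str = "TREE_BUILDER") -> str:
    """Ad hoc two-step workflow: the alignment output feeds tree inference."""
    alignment = run_job(align_tool, sequences)
    return run_job(tree_tool, alignment)


def run_batch(run_job: RunJob, datasets: Dict[str, str],
              max_parallel: int = 4) -> Dict[str, str]:
    """Run the workflow for many datasets simultaneously."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            name: pool.submit(align_then_infer, run_job, seqs)
            for name, seqs in datasets.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


def fake_run_job(tool: str, data: str) -> str:
    """Offline stand-in so the sketch runs without any network access."""
    return f"{tool}({data})"


if __name__ == "__main__":
    print(run_batch(fake_run_job, {"geneA": "seqsA", "geneB": "seqsB"}))
```

Threads fit here because each worker would spend nearly all of its time waiting on a remote job rather than computing locally.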
The REST API was released in October 2014, and announced formally in January.
It is available through:
- MorphoBank
- Influenza Research Database
- Virus Pathogen Resource (ViPR)
- Tree-Based Alignment Selector (TBAS)
- raxmlGUI
Coming soon:
- Mesquite
- siMBa
- BioKepler
Advantages of offering REST services:
- Preserves the investment in creating and learning to use complex software environments.
- Makes interaction with the application more flexible for individuals with scripting skills.
But where are the individual scripters we expected?
Perhaps the REST API has too high a barrier to entry.
Diagram labels: Web Form, Parameter map, Front-end validation (JavaScript; Struts), Backend validation, Tool XML, REST client, Parameter map, Backend validation, Command line.
What next?
The same diagram, with a JavaScript GUI added.
Jupyter notebook screenshot (labels): Descriptive text, Code cells, Cell controls.
The Jupyter notebook has the following properties:
- Interleaving text and live code makes it easy to modify and share workflows.
- The information is stored as an easily sharable file that can be used in any Jupyter implementation with the proper software installed.
- Many scripting languages are supported.
- It supports interactive creation and modification of figures, and GUI interactions.
Create a CIPRES Notebook environment where:
- Notebooks in R and Python are supported (at least).
- A standard collection of phylogenetics scripting packages is available in each language.
- A forum is provided for notebook storage, exchange, and publishing.
- Users can submit to virtual HPC clusters on XSEDE resources.
Challenges:
- How to allow users to submit command lines without major security issues.
- How to make sure jobs are configured correctly and efficiently.
Workflow for the CIPRES Notebook Environment (diagram labels): Assemble Sequences, Upload to Portal, Run Alignment, Run Tree Inference, Download, Post-Tree Analysis, Store.
The expanded workflow becomes more tractable in the Notebook Environment because users have the ability to recruit tools, and design their own workflows. Will the barrier to entry be too high?
How will SciGaP help us? 7/13/2014
For all apps: As we delve into providing access via the CIPRES Notebook, CIPRES job submissions and middleware can be taken over by SciGaP. This would allow all Gateway developers (Terri and Kenneth, for example) to focus primarily on creating the new interface, while the heavy lifting required of the production application is taken over by SciGaP. Recall that in our team, 30-45% of developer time is spent on putting out fires in the middleware. We would love to give those issues to SciGaP…
Acknowledgements
- Terri Schwartz: Lead developer
- Wayne Pfeiffer: HPC expertise
- Paul Hoover: Database/Backend
- Mona Wong: Interface