Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees
Mark A. Miller, San Diego Supercomputer Center

Systematics is the study of the diversification of life on the planet Earth, both past and present, and the relationships among living things through time.

Evolutionary relationships can (for the most part) be represented as a rooted graph.

Originally, evolutionary relationships were inferred from morphology alone:

Morphological characters are scored “by hand” to create matrices of characters. Scoring occurs via low-volume, low-throughput methodologies. Even though tree inference is NP-hard, matrices created using morphological characters alone are typically relatively small, so the computations are relatively tractable (with heuristics developed by the community).

Evolutionary relationships are also inferred from DNA sequence comparisons:

Unlike morphological characters, DNA sequence determination is now fully automated. The number of DNA sequences is increasing faster than Moore’s law. There are at least 10^7 species, each with thousands of genes, so the need for computational power and new tools will continue to grow. Tree inference is NP-hard, so even with heuristics, computational power frequently limits the analysis. Analyses often involve thousands of species and thousands of characters, creating very large matrices.

The CIPRES Project was created to support this new age of large phylogenetic data sets. The project had as its principal goals:
1. Developing heuristics and tools for analyzing the large DNA data sets that are available.
2. Improving researcher access to computational resources.

The CIPRES Portal was created as part of Goal 2, improving researcher access to computational resources. It was designed to be a flexible web application that allows users to run analyses of large sequence data sets using community codes on a significant computational resource.

User requirements:
- Provide login-protected personal user space for storing results indefinitely.
- Provide access to most or all native command-line options for each code.
- Support addition of new tools and new versions as needed.

The CIPRES Portal was built on a generic portal software package called the Workbench Framework.

To expose command-line tools quickly, the Workbench Framework uses the PISE XML standard. [Slide shows an excerpt of the PISE XML description for TFASTY (“Compare PS to Translated NS or NS-DB”, W. Pearson), including the DOCTYPE declaration, the protein-sequence input parameter, and the tool’s literature references.]
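The idea behind this approach is that each command-line tool is described declaratively in XML, and the framework generates both the web form and the final command line from that description. The Python sketch below is only a minimal illustration of that idea: the XML schema and the tool shown are invented for this example and do not reproduce the actual PISE DTD.

```python
# Minimal sketch: turn a PISE-style XML tool description into a command line.
# The XML below is NOT the real PISE DTD and "example_tool" is not a real
# program -- element names, attributes, and flags are invented for this sketch.
import xml.etree.ElementTree as ET

TOOL_XML = """
<tool name="example_tool" command="example_tool">
  <parameter name="infile" format="-i {value}" required="true"/>
  <parameter name="model"  format="-m {value}" default="GTR"/>
  <parameter name="seed"   format="-s {value}"/>
</tool>
"""

def build_command(xml_text, user_values):
    """Assemble a command line from the tool description plus user inputs."""
    tool = ET.fromstring(xml_text)
    parts = [tool.get("command")]
    for param in tool.findall("parameter"):
        name = param.get("name")
        value = user_values.get(name, param.get("default"))
        if value is None:
            if param.get("required") == "true":
                raise ValueError(f"missing required parameter: {name}")
            continue  # optional parameter with no value: omit it
    # build the flag/value text for each parameter that has a value
        parts.append(param.get("format").format(value=value))
    return " ".join(parts)

if __name__ == "__main__":
    print(build_command(TOOL_XML, {"infile": "alignment.phy"}))
    # -> example_tool -i alignment.phy -m GTR
```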

All command-line parameters can be set.

Usage statistics for the CIPRES Portal, 5/2007 – 11/2009: …,500 total jobs in 30 months.

Limitations of the original CIPRES Portal:
- All jobs were run serially (efficient, but no gain in wall time).
- The cluster was modest (16 x 8-way dual-core nodes).
- Runs were limited to 72 hours.
- The cluster was at the end of its useful lifetime.
- Funding for the project was ending.
- Demand for job runs was increasing.
This is not a scalable, sustainable solution!

The solution: make community codes available on scalable, sustainable resources (e.g., TeraGrid). [Diagram: the Workbench Framework routes parallel codes to TeraGrid and serial codes to the CIPRES cluster and Triton.]

Greater than 90% of all computational time is used by three tree-inference codes: MrBayes, RAxML, and GARLI.
- Implement parallel versions of these codes on the TeraGrid machines Abe and Lonestar, using Globus/GRAM.
- Work with community developers to improve the speed-up available through the parallel codes offered by the CIPRES Science Gateway (CSG).
- Keep other serial codes on local SDSC resources that provide the project with fee-for-service cycles.

The Workbench Framework design made it possible to deploy jobs on TeraGrid resources fairly easily. A Science Gateway development allocation allowed us to accomplish the initial setup.
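The deck mentions both Globus/GRAM and gsissh for reaching TeraGrid machines. As a rough, hypothetical illustration of remote submission in this style (not the actual CIPRES submission layer), the Python sketch below hands a prepared batch script to a remote scheduler over gsissh, the GSI-enabled SSH client used with Globus credentials. The host name, remote directory, and use of qsub are placeholders.

```python
# Illustrative sketch only: submit a prepared batch script on a remote
# TeraGrid-style login node over gsissh (GSI-enabled SSH). The host name,
# remote working directory, and use of qsub are hypothetical placeholders;
# the real gateway's submission layer is more elaborate.
import subprocess

def submit_remote_job(host, remote_dir, script_name):
    """Run qsub on the remote machine and return the scheduler's job id."""
    remote_cmd = f"cd {remote_dir} && qsub {script_name}"
    result = subprocess.run(
        ["gsissh", host, remote_cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # scheduler prints the new job id

if __name__ == "__main__":
    job_id = submit_remote_job("login.example.teragrid.org",
                               "/scratch/username/raxml_run_001",
                               "raxml.pbs")
    print("submitted:", job_id)
```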

CIPRES Science Gateway parallel code profiles:

Code      Type                Max cores   Speed-up          Efficiency
MrBayes   Hybrid MPI/OpenMP   32          2.4x (4 nodes)    ~60%
RAxML     Hybrid MPI/OpenMP   40          3.0x (5 nodes)    ~60%
GARLI     MPI                 100         77x (100 nodes)   77-94%

[Chart: CIPRES Science Gateway usage, Dec 2009 – Oct 2010, broken out by all users, new users, and repeat users.]

[Chart: CIPRES Science Gateway usage, Dec 2009 – Oct 2010, broken out by all users, new users, repeat users, and new TG users.]

CIPRES Science Gateway Usage

Intellectual Merit: 90+ publications enabled by the CIPRES Science Gateway. Broad Impact: the CIPRES Science Gateway was used to deliver curriculum by at least 21 instructors.

What happens if you build it and too many people come??? ?!?

What happens if you build it and too many people come???
- Make an explicit fair resource use policy.
- Make sure resource use delivers impact.
- Make sure resource use is efficient.
- Then expand the resource base as required.

What happens if you build it and too many people come???

[Table: distribution of monthly SU usage per user account — 75.3% of users fall in the lowest usage band, with 12.1%, 7.5%, 3.4%, and 1.3% (9 users) in successively higher bands; 3 users (0.4%) used more than 10,000 SUs per month.]

This level of resource use requires additional justification…

What happens if you build it and too many people come???

CIPRES SG Fair Use Policy:

Tier   % of community allocation used   Account status
1      < 2% of monthly allocation       Open access
2      > 2% of monthly allocation       Request personal TG allocation
3      > 3% of monthly allocation       Use of community allocation blocked
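A minimal sketch of how such a policy check might be wired into the submission path, assuming the gateway can query each account's month-to-date SU usage and the size of the monthly community allocation (the function and thresholds below simply restate the tiers on the slide; nothing about the real enforcement mechanism is shown):

```python
# Hypothetical sketch of the fair-use tier check described above. The tier
# thresholds follow the slide (2% and 3% of the monthly community allocation);
# how usage is actually measured and enforced in CIPRES is not shown here.

def fair_use_status(user_sus_this_month, monthly_community_allocation):
    """Classify an account against the fair-use tiers."""
    fraction = user_sus_this_month / monthly_community_allocation
    if fraction < 0.02:
        return 1, "open access"
    elif fraction < 0.03:
        return 2, "request a personal TG allocation"
    else:
        return 3, "use of community allocation blocked"

if __name__ == "__main__":
    # Example: a user who has consumed 2,500 SUs against a 100,000 SU/month pool.
    tier, action = fair_use_status(2_500, 100_000)
    print(f"tier {tier}: {action}")
```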

What happens if you build it and too many people come???
- Make an explicit fair resource use policy.
- Make sure resource use delivers impact.
- Make sure resource use is efficient.
- Then expand the resource base as required.

What happens if you build it and too many people come???
High-end users must acquire a personal allocation from the TRAC committee. This ensures that very heavy users of the resource are supporting peer-reviewed research (and have a US institutional affiliation).

What happens if you build it and too many people come???
Tools required to implement the CIPRES SG Fair Use Policy:
- the ability to halt submissions from a given user account
- the ability to charge to a user's personal TG allocation
- the ability to monitor usage by each account

What happens if you build it and too many people come???
- Make an explicit fair resource use policy.
- Make sure resource use delivers impact.
- Make sure resource use is efficient.
- Then expand the resource base as required.

Job Attrition on the CIPRES Science Gateway

Error impact analysis:

Error type            CPU time   User   Staff
Input error           low        high   low
Machine error         0          high   low
Communication error   high       high   high
Unknown error         high       …      low

Communication errors occur when there is a break in communication between the web application and the TG resource; this kills the job monitoring process for all running jobs. The jobs continue to execute and consume SUs, but the web application can no longer return job results. The user must ask for the results, and a staff member must fetch them manually.

CONCLUSION: Time to refactor the job monitoring system!

Normal operation: the user submission (command line and files) is sent to the TG machine via Globus “gsissh”, and the job is recorded in a Running Task Table. The LoadResults daemon detects results, fetches them via GridFTP, and puts them in the user DB.

Abnormal operation: if normal notification fails (machine outage, job timeout, or the CIPRES application being down), the checkJobs daemon polls the TG machine over Globus “gsissh” using the Running Task Table. The LoadResults daemon detects results with a delivery error, fetches them via GridFTP, and puts them in the user DB.
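The sketch below is a rough Python illustration of this recovery loop, under the assumption that the gateway keeps a local table of running tasks and can reach the remote machine with gsissh and globus-url-copy (GridFTP). The table schema, marker file, host names, and paths are invented for the example; the real checkJobs and LoadResults daemons are not reproduced here.

```python
# Rough illustration of a checkJobs-style recovery loop (not the actual
# CIPRES daemons). Assumes a local SQLite "running task table" and that
# gsissh / globus-url-copy are available; the table schema, done-marker file,
# host names, and paths are invented for this sketch.
import sqlite3
import subprocess
import time

POLL_SECONDS = 600  # check running jobs every 10 minutes

def remote_job_finished(host, remote_dir):
    """Use gsissh to test for a marker file the job writes on completion."""
    probe = subprocess.run(["gsissh", host, f"test -e {remote_dir}/done.flag"])
    return probe.returncode == 0

def fetch_results(host, remote_dir, local_dir):
    """Pull the results directory back over GridFTP."""
    subprocess.run(
        ["globus-url-copy", "-r",
         f"gsiftp://{host}{remote_dir}/", f"file://{local_dir}/"],
        check=True,
    )

def check_jobs(db_path):
    """Scan the running-task table and recover any finished jobs."""
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "SELECT task_id, host, remote_dir, local_dir "
        "FROM running_tasks WHERE status = 'RUNNING'"
    ).fetchall()
    for task_id, host, remote_dir, local_dir in rows:
        if remote_job_finished(host, remote_dir):
            fetch_results(host, remote_dir, local_dir)
            db.execute("UPDATE running_tasks SET status = 'DONE' "
                       "WHERE task_id = ?", (task_id,))
            db.commit()
    db.close()

if __name__ == "__main__":
    while True:
        check_jobs("running_tasks.db")
        time.sleep(POLL_SECONDS)
```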

Jobs saved by the gsissh / task table system (MrBayes, RAxML, and GARLI jobs): 159* in September and 266* in October. (* 7% of all submitted jobs)

Future plans to address other sources of attrition:
- User errors (15%): pre-check uploaded files for valid format (see the sketch below).
- Machine errors (8%): establish redundancy in where codes can run; use tools to check machine availability and queue depth.
- Unknown errors (3%): TBD
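As a simple illustration of what such a pre-check might look like (a sketch only; the formats and heuristics CIPRES actually validates are not specified in the slides), the Python function below does a cheap sanity check on an uploaded file before any SUs are spent:

```python
# Illustrative pre-check for uploaded sequence files, run before any compute
# time is spent. The two formats sniffed here (FASTA and NEXUS) and the
# heuristics used are examples; a production gateway would validate far more.

def sniff_format(path):
    """Return 'fasta', 'nexus', or None if the file looks like neither."""
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue                      # skip leading blank lines
            if stripped.startswith(">"):
                return "fasta"                # FASTA header line
            if stripped.upper().startswith("#NEXUS"):
                return "nexus"                # NEXUS magic string
            return None                       # first non-blank line is neither
    return None

def precheck_upload(path):
    """Raise a user-facing error if the file is not a recognized format."""
    fmt = sniff_format(path)
    if fmt is None:
        raise ValueError(
            f"{path}: not recognized as FASTA or NEXUS; "
            "please check the file before submitting."
        )
    return fmt

if __name__ == "__main__":
    import sys
    print(precheck_upload(sys.argv[1]))
```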

What happens if you build it and too many people come???
- Make an explicit fair resource use policy.
- Make sure resource use delivers impact.
- Make sure resource use is efficient.
- Then expand the resource base as required.

What happens if you build it and too many people come??? ?!?

What happens if you build it and too many people come???

Other possible resource opportunities:
- New TG machines (e.g., the Trestles machine at SDSC)
- The Open Science Grid (OSG)
- The NSF FutureGrid project (adapting these HPC applications to a cloud environment)

Acknowledgements:
- CIPRES Science Gateway: Terri Liebowitz
- TeraGrid Hybrid Code Development: Wayne Pfeiffer, Alexandros Stamatakis
- TeraGrid Implementation Support: Nancy Wilkins-Diehr, Doru Marcusiu
- Workbench Framework: Paul Hoover, Lucie Chan