From Grid to Global Computing: Deploying Parameter Sweep Applications. Henri Casanova, Grid Research And Innovation Laboratory (GRAIL), San Diego Supercomputer Center (SDSC), Computer Science and Engineering Dept. (CSE), University of California, San Diego (UCSD)

Parameter Sweep Applications: many compute tasks; no or simple dependencies; several output post-processing stages; potentially large datasets. [Diagram: input data feeds the tasks, whose raw output is post-processed into the final output.]
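To make this structure concrete, here is a minimal sketch, in Python, of a parameter sweep driver: the Cartesian product of parameter values yields many independent tasks, and a single post-processing stage reduces their raw outputs. The parameter names and the `simulate`/`post_process` executables are illustrative assumptions, not part of APST or MCell.

```python
import itertools
import subprocess

# Hypothetical parameter space: every combination is one independent task.
PARAMS = {
    "diffusion_rate": [0.1, 0.2, 0.4],
    "n_molecules": [1000, 5000, 10000],
    "seed": list(range(10)),
}

def run_task(index, point):
    """Run one simulation instance; tasks are independent of each other."""
    out_file = f"raw_{index}.dat"
    args = [f"--{name}={value}" for name, value in point.items()]
    # 'simulate' stands in for the real application executable.
    subprocess.run(["simulate", *args, "--out", out_file], check=True)
    return out_file

def sweep():
    names = list(PARAMS)
    raw_outputs = []
    for i, values in enumerate(itertools.product(*PARAMS.values())):
        point = dict(zip(names, values))
        # In a Grid deployment each call becomes a remote job; here it runs sequentially.
        raw_outputs.append(run_task(i, point))
    # Single post-processing stage reduces raw outputs to the final result.
    subprocess.run(["post_process", "--final", "final.dat", *raw_outputs], check=True)

if __name__ == "__main__":
    sweep()
```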

Relevance. PSAs arise in virtually every field of science and engineering: Monte Carlo methods, parameter space searches, parameter studies, etc., in biology, astrophysics, physics, bioinformatics, economics, and more. They are primary candidates for Grid computing: latency-tolerant, amenable to simple fault-tolerance, and in need of huge amounts of resources.

Outline of the Presentation: Parameter Sweep Applications (PSAs); APST; The Virtual Instrument.

Scheduling of PSAs?

Grid Scheduling Practice. Ad-hoc solutions: specific to one application, hand-tuned to the environment (e.g., the SF-Express demo). There is a large body of work on scheduling; what can we re-use on the Grid, given heterogeneous resources, dynamic performance characteristics, resource downtimes, complex network topologies, and performance prediction errors?

“DataGrid” Scheduling. Goal: co-locate/replicate data and computation. Approach: dynamic-priority list-scheduling, built on heuristics described in [Ibarra77, Siegel99], with added adaptivity. Simulation results: list-scheduling works, and adaptivity should make it practical. Experimental results: demos at SC’00 and SC’01. [HCW’00] H. Casanova, A. Legrand, et al.
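The following is a simplified sketch of a sufferage-style list-scheduling heuristic in the spirit of [Ibarra77, Siegel99]; the adaptive, data-placement-aware parts of the actual scheduler are omitted, and the `ect` estimates are assumed to come from some performance predictor.

```python
def sufferage_schedule(tasks, hosts, ect):
    """
    Simplified sufferage-style list scheduling.
    ect(task, host, ready_time) -> estimated completion time of task on host,
    given the host's current ready time (a prediction, possibly inaccurate).
    Returns a dict mapping each task to the host it was assigned to.
    """
    ready = {h: 0.0 for h in hosts}          # time at which each host becomes free
    unscheduled = set(tasks)
    assignment = {}
    while unscheduled:
        best = None                          # (sufferage, task, host, completion_time)
        for t in unscheduled:
            times = sorted((ect(t, h, ready[h]), h) for h in hosts)
            best_time, best_host = times[0]
            second_best = times[1][0] if len(times) > 1 else best_time
            sufferage = second_best - best_time   # penalty if t loses its best host
            if best is None or sufferage > best[0]:
                best = (sufferage, t, best_host, best_time)
        _, task, host, completion = best
        assignment[task] = host
        ready[host] = completion
        unscheduled.remove(task)
    return assignment
```

With dynamic priorities, an assignment like this is recomputed periodically as predictions and resource states change, which is where the adaptivity mentioned above comes in.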

Lessons. There is much scheduling work to re-use, and list-scheduling with dynamic priorities seems effective, both in simulation and in experiments. Let’s build software that uses it, and let’s target scientific communities.

Motivation for APST. Started as scheduling research, evolved into a tool that provides transparency of Grid execution (data movements, remote job management, multiple Grid middleware back-ends) and scheduling (self-scheduling, list-scheduling with dynamic priorities).

APST Design. The AppLeS Parameter Sweep Template: an application execution environment, driven by XML application and resource descriptions. [Architecture diagram: an APST client interacts with APST, whose Scheduler uses information and metadata kept by a Bookkeeper to make decisions, and whose Transport and Compute modules carry out those decisions as actions on Grid services.]

APST: Lessons. The Grid is difficult to use; APST provides a simple software layer that does one thing well, with a minimal user interface (XML, command line), and serves as a building block for domain-specific applications (e.g., multi-cluster bioinformatics in Singapore). Why Ssh as the default mechanism? It is critical for gaining user buy-in and a natural way to lead users to the Grid.

APST Status. Version 1.1 released two weeks ago, available for public download. Used for 10+ applications: bioinformatics (BLAST, HMM, …), computational neuroscience. Middleware and services: Globus, NetSolve, Ssh, Condor, GASS, IBP, Scp, GridFTP, SRB, NWS, MDS, Ganglia, …

APST Research Directions. APST is a research platform, maintained by one staff member with several graduate student contributors. Partitionable workloads, e.g., bioinformatics database splitting: factoring (decrease chunk size), pipelining (increase chunk size), or a combination of the two? Create APST-BLAST (Mario Lauria, OSU; Yang Yang, UCSD).
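The sketch below illustrates the two chunking strategies mentioned (it is not the APST-BLAST implementation): factoring shrinks chunk sizes over time so that late, straggler-prone chunks stay small, while pipelining grows chunk sizes to amortize per-chunk dispatch overhead once workers are busy.

```python
def factoring_chunks(total_items, n_workers, factor=2):
    """Factoring: each round hands out a fraction of the remaining work,
    so chunk sizes shrink over time and load stays balanced at the end."""
    remaining = total_items
    while remaining > 0:
        chunk = max(1, remaining // (factor * n_workers))
        for _ in range(n_workers):
            if remaining <= 0:
                break
            size = min(chunk, remaining)
            remaining -= size
            yield size

def pipelining_chunks(total_items, start=1, growth=2, cap=1024):
    """Pipelining: start with small chunks so all workers get busy quickly,
    then grow chunk sizes to amortize per-chunk overhead."""
    remaining, size = total_items, start
    while remaining > 0:
        s = min(size, remaining, cap)
        remaining -= s
        yield s
        size = min(size * growth, cap)
```

For a BLAST-style database split, each yielded size would be the number of database fragments bundled into one work unit; whether and how to combine both strategies (grow early, shrink near the end) is the open question raised on the slide.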

Outline of the Presentation: Parameter Sweep Applications (PSAs); APST; Virtual Instrument.

Computational Neuroscience. MCell: a Monte Carlo cell simulator developed at Salk and PSC to gain knowledge about neurotransmission mechanisms, which is fundamental for drug design (psychiatry). Large user base (yearly MCell workshop). Parallel Monte Carlo simulations at the molecular level.

Traditional MCell usage: “by hand”. No automatic project management, no transparent resource access, no automated data management. Consequences: no interactive simulations, no fault-tolerance or scheduling, and MCell is limited to resources in the lab.

MCell and APST. APST alleviates some of the limitations: large-scale simulations, fault-tolerance and scheduling, data retrieval from distributed storage, XML application descriptions. But there is no interactivity, and MCell is exploratory: user interaction is fundamental for many users.

The Virtual Instrument. $2.5M in funding from the NSF (Salk, PSC, UCSB, UTK, UCSD). A running MCell simulation should behave like a lab instrument: computational steering for MCell. Involves user interface, Grid software, application software, and scheduling research (how does one schedule an application that is being steered interactively?).

[VI architecture diagram: the VI User, VI Interface (OpenDX), VI Daemon, and VI Database (the VI software) exchange control and data with Grid services, which manage processes on Grid storage and compute resources.]

Scheduling Goals. Reduce the “search” time: let the user assign levels of importance to regions of the parameter space, assign fractions of the resources according to those importance levels, and thereby assign priorities to tasks. Interesting questions: job control is limited on Grid resources, so exact fractions cannot be enforced, and there are interesting trade-offs between control overhead and accuracy of priorities.
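A minimal sketch of the priority idea follows: the user tags regions of the parameter space with importance weights, and tasks are dispatched so that each region's share of dispatched work roughly tracks its weight. Since job control on Grid resources is limited, this only biases dispatch order rather than enforcing exact resource fractions; all names are illustrative.

```python
import heapq

def dispatch_order(tasks_by_region, importance):
    """
    tasks_by_region: region -> list of task ids (regions of the parameter space)
    importance:      region -> positive weight chosen by the user
    Returns tasks interleaved so that regions are served roughly in proportion
    to their weights (stride-scheduling style weighted interleaving).
    """
    total = float(sum(importance.values()))
    heap = []                                       # (virtual_time, region, iterator)
    for region, tasks in tasks_by_region.items():
        if tasks:
            stride = total / importance[region]     # smaller stride = served more often
            heapq.heappush(heap, (stride, region, iter(tasks)))
    order = []
    while heap:
        vtime, region, it = heapq.heappop(heap)
        try:
            order.append(next(it))
        except StopIteration:
            continue                                # region exhausted; drop it
        heapq.heappush(heap, (vtime + total / importance[region], region, it))
    return order
```

For example, `dispatch_order({"hot": [1, 2, 3, 4], "cold": [5, 6]}, {"hot": 3, "cold": 1})` dispatches mostly "hot" tasks early on, reflecting a 3:1 importance ratio.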

Current Status. First software prototype released in February 2002: Globus and Ssh, MySQL, OpenDX, priority-based scheduling; about 20,000 lines of C++. Upcoming papers: a JPDC submission and a scheduling paper (SC submission).

Outline of the Presentation: Parameter Sweep Applications (PSAs); PSAs on the Grid with APST; MCell; Virtual Instrument; Global Computing.

Global computing at scale (SETI@home): over 500,000 active participants, most of whom run a screensaver on a home PC. Over a cumulative 20 TeraFlop/sec, versus 12.3 TeraFlop/sec for IBM’s ASCI White. Cost: $500,000 plus $200,000 in donated hardware, less than 1% of the $110 million required for ASCI White.

Global vs. Grid Computing. Nature of resources: home desktops that run Windows and are completely autonomous, machines powered on and off by their users, behind firewalls, with dynamic IPs and transient network connections. Programming model: the server cannot “push” tasks to clients, has little means for remote job control, and has incomplete information about resources and availability.
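This implies a pull model, sketched below: since the server cannot reach clients or control their jobs, each client polls for a work unit when idle, computes, and reports the result. The server URL and endpoints here are hypothetical, not an actual SETI@home or XtremWeb protocol.

```python
import json
import time
import urllib.request

SERVER = "http://example.org/api"        # hypothetical work server URL

def fetch_work():
    """Pull model: the client initiates every connection, since the server cannot
    reach machines behind firewalls, NAT, or dynamic IPs."""
    try:
        with urllib.request.urlopen(f"{SERVER}/workunit", timeout=30) as resp:
            return json.load(resp)
    except Exception:
        return None                      # no work available or server unreachable

def report_result(result):
    data = json.dumps(result).encode()
    req = urllib.request.Request(f"{SERVER}/result", data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=30)

def client_loop(compute):
    """compute(work_unit) -> result; run whenever the host is idle."""
    while True:
        wu = fetch_work()
        if wu is None:
            time.sleep(60)
            continue
        try:
            report_result({"id": wu["id"], "output": compute(wu)})
        except Exception:
            # The client may be shut down or lose connectivity at any time; the
            # server cannot tell, and must rely on timeouts and duplication.
            pass
```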

Goal. Current limitations: embarrassingly parallel applications, an effectively infinite amount of input data, pure throughput. Can we do something more? Short-lived applications? Parallel applications? A compute service? Smith-Waterman for short/long sequences? No real software yet (build on XtremWeb?).

Scheduling? Sophisticated scheduling algorithms need information and control. At the moment: simple mechanisms. 1. Work unit duplication: specifies the maximum number of times a work unit can be resent. 2. Timeouts: the time that must elapse before a work unit is resent.
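A sketch of these two server-side mechanisms, assuming nothing about any particular platform: a work unit is resent only after its timeout expires, and only up to a maximum number of copies.

```python
import time

class WorkUnitPool:
    """Server-side bookkeeping for the two knobs: duplication limit and timeout."""

    def __init__(self, work_units, max_copies=3, timeout=6 * 3600):
        self.max_copies = max_copies     # max times a unit may be (re)sent
        self.timeout = timeout           # seconds before an outstanding copy is given up on
        self.state = {wu: {"sent": 0, "last_sent": None, "done": False}
                      for wu in work_units}

    def next_to_send(self, now=None):
        """Return a work unit eligible for (re)sending, or None."""
        now = time.time() if now is None else now
        for wu, s in self.state.items():
            if s["done"] or s["sent"] >= self.max_copies:
                continue
            never_sent = s["last_sent"] is None
            timed_out = (not never_sent) and (now - s["last_sent"] >= self.timeout)
            if never_sent or timed_out:
                s["sent"] += 1
                s["last_sent"] = now
                return wu
        return None                      # nothing eligible right now

    def complete(self, wu):
        self.state[wu]["done"] = True    # late duplicate results are simply ignored
```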

Simulation. Built a simulation model using statistics, surveys, and extrapolations; next step: logs from real systems (XtremWeb? Entropia?). Evaluated the impact of both mechanisms on performance and throughput.

Early Lessons. There is a trade-off between throughput and turn-around time. Duplication aggressively decreases turn-around time but wastes resources; there is an optimal value. Timeouts moderately lower turn-around times while preserving good throughput; an infinite timeout is of course not a good idea.

Future Work. Two knobs (work unit duplication and timeouts). Question: a compute service? A mix of applications (SETI, short-lived, …); the Singapore Bioinformatics Institute; a notion of fairness? How do we implement policy with many volatile resources? Software: re-use existing platforms (XtremWeb, Entropia).

Conclusion. APST and the Virtual Instrument. Other GRAIL activities I didn’t talk about: scientific computing, simulation, adaptive scheduling, networking.

[Experimental results chart: runs across UTK, UCSD, and TITECH (Tokyo), comparing self-scheduling with XSufferage.]