SPRUCE Special PRiority and Urgent Computing Environment Nick Trebon Argonne National Laboratory University of Chicago

Slides:

Advertisements

Similar presentations

LEAD Portal: a TeraGrid Gateway and Application Service Architecture Marcus Christie and Suresh Marru Indiana University LEAD Project (

Advertisements

TeraGrid Deployment Test of Grid Software JP Navarro TeraGrid Software Integration University of Chicago OGF 21 October 19, 2007.

SSRS 2008 Architecture Improvements Scale-out SSRS 2008 Report Engine Scalability Improvements.

Agreement-based Distributed Resource Management Alain Andrieux Karl Czajkowski.

Power BI Sites and Mobile BI. What You Will Learn Sharing and Collaboration Introducing Power BI Exploring Power BI Features and Services Partner Opportunities.

1 OBJECTIVES To generate a web-based system enables to assemble model configurations. to submit these configurations on different.

1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.

Presented by Scalable Systems Software Project Al Geist Computer Science Research Group Computer Science and Mathematics Division Research supported by.

Workload Management Workpackage Massimo Sgaravatto INFN Padova.

MCell Usage Scenario Project #7 CSE 260 UCSD Nadya Williams

High Throughput Urgent Computing Jason Cope Condor Week 2008.

Workload Management Massimo Sgaravatto INFN Padova.

Is Your IT Out of Alignment? Chargeback and Billing with Parallels Automation Brian Shellabarger, Chief Architect - SaaS.

Understanding and Managing WebSphere V5

Manage Engine: Q Engine. What is it?  Tool developed by Manage Engine that allows one to test web applications using a variety of different tests to.

TeraGrid Information Services December 1, 2006 JP Navarro GIG Software Integration.

KARMA with ProActive Parallel Suite 12/01/2009 Air France, Sophia Antipolis Solutions and Services for Accelerating your Applications.

The Glidein Service Gideon Juve What are glideins? A technique for creating temporary, user- controlled Condor pools using resources from.

PCGRID ‘08 Workshop, Miami, FL April 18, 2008 Preston Smith Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University.

DISTRIBUTED COMPUTING

CoG Kit Overview Gregor von Laszewski Keith Jackson.

Gayathri Namala Center for Computation & Technology Louisiana State University Representing the SURA Coastal Ocean Observing and Prediction Program (SCOOP)

Flexibility and user-friendliness of grid portals: the PROGRESS approach Michal Kosiedowski

ANSTO E-Science workshop Romain Quilici University of Sydney CIMA CIMA Instrument Remote Control Instrument Remote Control Integration with GridSphere.

INFSO-RI Module 01 ETICS Overview Alberto Di Meglio.

COMP3019 Coursework: Introduction to GridSAM Steve Crouch School of Electronics and Computer Science.

QCDGrid Progress James Perry, Andrew Jackson, Stephen Booth, Lorna Smith EPCC, The University Of Edinburgh.

Grid Workload Management & Condor Massimo Sgaravatto INFN Padova.

Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting October 10-11, 2002.

GRAM5 - A sustainable, scalable, reliable GRAM service Stuart Martin - UC/ANL.

1 Overview of the Application Hosting Environment Stefan Zasada University College London.

INFSO-RI Module 01 ETICS Overview Etics Online Tutorial Marian ŻUREK Baltic Grid II Summer School Vilnius, 2-3 July 2009.

Deadline-based Grid Resource Selection for Urgent Computing Nick Trebon 6/18/08 University of Chicago.

Cracow Grid Workshop October 2009 Dipl.-Ing. (M.Sc.) Marcus Hilbrich Center for Information Services and High Performance.

Interoperability Grids, Clouds and Collaboratories Ruth Pordes Executive Director Open Science Grid, Fermilab.

Grid Architecture William E. Johnston Lawrence Berkeley National Lab and NASA Ames Research Center (These slides are available at grid.lbl.gov/~wej/Grids)

Operating Systems David Goldschmidt, Ph.D. Computer Science The College of Saint Rose CIS 432.

SPRUCE Special PRiority and Urgent Computing Environment Pete Beckman Argonne National Laboratory University of Chicago

09/02 ID099-1 September 9, 2002Grid Technology Panel Patrick Dreher Technical Panel Discussion: Progress in Developing a Web Services Data Analysis Grid.

Advisor: Resource Selection 11/15/2007 Nick Trebon University of Chicago.

 Apache Airavata Architecture Overview Shameera Rathnayaka Graduate Assistant Science Gateways Group Indiana University 07/27/2015.

Ames Research CenterDivision 1 Information Power Grid (IPG) Overview Anthony Lisotta Computer Sciences Corporation NASA Ames May 2,

Tarball server (for Condor installation) Site Headnode Worker Nodes Schedd glidein - special purpose Condor pool master DB Panda Server Pilot Factory -

TeraGrid Advanced Scheduling Tools Warren Smith Texas Advanced Computing Center wsmith at tacc.utexas.edu.

NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.

SPRUCE Special PRiority and Urgent Computing Environment Advisor Demo Nick Trebon University of Chicago Argonne National Laboratory

Ruth Pordes November 2004TeraGrid GIG Site Review1 TeraGrid and Open Science Grid Ruth Pordes, Fermilab representing the Open Science.

Microsoft Management Seminar Series SMS 2003 Change Management.

Grid Enabled Neurosurgical Imaging Using Simulation

SAN DIEGO SUPERCOMPUTER CENTER Inca Control Infrastructure Shava Smallen Inca Workshop September 4, 2008.

SPRUCE Special PRiority and Urgent Computing Environment User Demo Nick Trebon Argonne National Laboratory University of Chicago

Data, Visualization and Scheduling (DVS) TeraGrid Annual Meeting, April 2008 Kelly Gaither, GIG Area Director DVS.

Network, Operations and Security Area Tony Rimovsky NOS Area Director

Ensieea Rizwani An energy-efficient management mechanism for large-scale server clusters By: Zhenghua Xue, Dong, Ma, Fan, Mei 1.

OGCE Workflow and LEAD Overview Suresh Marru, Marlon Pierce September 2009.

© 2009 IBM Corporation IBM Cloud Computing Tivoli Service Automation Manager V7.2 The Core of the Service Management System for Cloud Computing.

AT LOUISIANA STATE UNIVERSITY CCT: Center for Computation & LSU Condor in Louisiana Tevfik Kosar Center for Computation & Technology Louisiana.

V7 Foundation Series Vignette Education Services.

 Cloud Computing technology basics Platform Evolution Advantages  Microsoft Windows Azure technology basics Windows Azure – A Lap around the platform.

Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.

Probabilistic Upper Bounds for Urgent Applications Nick Trebon and Pete Beckman University of Chicago and Argonne National Lab, USA Case Study Elevated.

INTRODUCTION TO XSEDE. INTRODUCTION  Extreme Science and Engineering Discovery Environment (XSEDE)  “most advanced, powerful, and robust collection.

Workload Management Workpackage

GPIR GridPort Information Repository

Introduction to Cloud Computing

Module 01 ETICS Overview ETICS Online Tutorials

Building and running HPC apps in Windows Azure

Probabilistic Upper Bounds

Presentation transcript:

SPRUCE Special PRiority and Urgent Computing Environment Nick Trebon Argonne National Laboratory University of Chicago

2 University of ChicagoArgonne National Lab Urgent Computing - Modeling and Simulation Play a Critical Role in Decision Making

3 University of ChicagoArgonne National Lab Urgent Computing - I Need it Now! Applications with dynamic data and result deadlines are being deployed Late results are useless  Wildfire path prediction  Storm/Flood prediction  Influenza modeling Some jobs need priority access “Right-of-Way Token”

Severe weather example Source: Kelvin Droegemeier, Center for Analysis and Prediction of Storms (CAPS), University of Oklahoma. Collaboration with LEAD Science Gateway project. Example 1: Severe Weather Predictive Simulation from Real-Time Sensor Input

Example 2: Real Time Neurosurgical Imaging Using Simulation (GENIUS project - HemeLB) Source: Peter Coveney, GENIUS Project University College London x 682 cubic voxels, res: mm pixels x 100 slices, res: mm x 0.8 mm trilinear interpolation Our graphical-editing tool Data is generated by MRA scanners at the National Hospital for Neurosurgery and Neurology Getting the Patient Specific Data HemeLB

Flood modeling example Example 3: SURA Coastal Ocean Observing Program (SCOOP) Source: Center for Computation and Technology, Louisiana State University Integrating data from regional observing systems for real-time coastal forecasts in SE Coastal modelers working closely with computer scientists to couple models, provide data solutions, deploy ensembles of models on the Grid, assemble realtime results with GIS technologies. Three scenarios: event-driven ensemble prediction, retrospective analysis, 24/7 forecasts

7 University of ChicagoArgonne National Lab Urgent Computing - Build supercomputers for the application  Pros: Resource is ALWAYS available  Cons: Incredibly costly (99% idle)  Example: Coast Guard rescue boats Share public infrastructure  Pros: low cost  Cons: Requires complex system for authorization, resource management, and control  Examples: school buses for evacuation, cruise ships for temporary housing How can we get cycles?

8 University of ChicagoArgonne National Lab Urgent Computing - Introducing SPRUCE The Vision:  Build cohesive infrastructure that can provide urgent computing cycles for emergencies Technical Challenges:  Provide high degree of reliability  Elevated priority mechanisms  Resource selection, data movement Social Challenges:  Who? When? What?  How will emergency use impact regular use?  Decision-making, workflow, and interpretation

9 University of ChicagoArgonne National Lab Urgent Computing - GETS USER GETS USER ORGANIZATION GETS priority is invoked “call-by-call” Calling cards are in widespread use and easily understood by the NS/EP User, simplifying GETS usage. GETS is a ‘ubiquitous’ service in the Public Switched Telephone Network. If you can get a DIAL TONE, you can make a GETS call. Existing “Digital Right-of-Way” Emergency Phone System

10 University of ChicagoArgonne National Lab Urgent Computing - Automated Trigger Human Trigger Right-of-Way Token 2 1 Event First Responder Right-of-Way Token SPRUCE Gateway / Web Services SPRUCE Architecture Overview (1/2) Right-of-Way Tokens

11 University of ChicagoArgonne National Lab Urgent Computing - Conventional Job Submission Parameters ! Urgent Computing Parameters Urgent Computing Job Submission SPRUCE Job Manager Supercomputer Resource Local Site Policies Priority Job Queue User TeamAuthentication 3 Choose a Resource 4 5 SPRUCE Architecture Overview (2/2) Submitting Urgent Jobs

12 University of ChicagoArgonne National Lab Urgent Computing - Summary of Components Token & session management  admin, user, job manager Priority queue and local policies Authorization & management for job submission and queuing Central Distributed, Site-local [configuration & policy] [installed software] [SPRUCE Server (WS interface, Web portal)]

13 University of ChicagoArgonne National Lab Urgent Computing - Internal Architecture Central SPRUCE Server SPRUCE User Services || Validation Services Axis 2 Web Service Stack Tomcat Java Servlet Container Apache Web Server MySQL JDBC Mirror Client Interfaces Web Portal or Workflow Tools PHP / Perl AJAX Client-Side Job Tools Computing Resource: Job Manager & Scripts Axis2 Java PHP / Perl SOAP Request Future work

14 University of ChicagoArgonne National Lab Urgent Computing - Site-Local Response Policies: How to Handle Urgent Computing Requests? “Next-to-run” status for priority queue  wait for running jobs to complete Force checkpoint of existing jobs; run urgent job Suspend current job in memory (kill -STOP); run urgent job Preempt normal or lower priority jobs; run urgent job Provide differentiated CPU accounting  “Jobs that can be killed because they maintain their own checkpoints will be charged 20% less” Other incentives  One time use token

15 University of ChicagoArgonne National Lab Urgent Computing - LEAD with SPRUCE Integration

16 University of ChicagoArgonne National Lab Urgent Computing - Emergency Preparedness Testing: “Warm Standby” For urgent computation, there is no time to port code  Applications must be in “warm standby”  Verification and validation runs test readiness periodically (Inca, MDS)  Reliability calculations  Only verified apps participate in urgent computing Grid-wide Information Catalog  Application was last tested & validated on  Also provides key success/failure history logs

17 University of ChicagoArgonne National Lab Urgent Computing - Urgent Computing Resource Selection: Advisor Given a set of resources (each of which has their own urgent computing policies) and a deadline how does a user select the “best” resource?  Advisor approach: Analyze historical and live data to determine the likelihood of meeting a deadline for each possible configuration.  Configuration is a specification of the resource, policy and runtime parameters (e.g., application performance parameters, nodes/ppn, etc.)

18 University of ChicagoArgonne National Lab Urgent Computing - Selecting a Resource Historical Data Network bandwidth Queue wait times Warm standby validation Local site policies Live Data Network status Job/Queue data Urgent Severe Weather Job Deadline: 90 Min Advisor “Best” HPC Resource Trigger Application-Specific Data Performance Model Verified Resources Data Repositories !

19 University of ChicagoArgonne National Lab Urgent Computing - Probabilistic Upper Bounds on Total Turnaround Time The delay consists of three phases:  File staging (both input and output)  Allocation (e.g., queue delay)  Execution If phases are independent a conservative composite bound can be created by combining phase bounds  C B = F B + A B + E B  C Q ≥F Q * A Q * E Q

20 University of ChicagoArgonne National Lab Urgent Computing - Highly Available Resource Co-allocator (HARC) Some scenarios have definite early notice  SCOOP gets hurricane warnings a few hours to days in advance  No need for drastic steps like killing jobs HARC with SPRUCE for reservations  Reservation made via portal will be associated to a token  Any user can use the reservation if added onto that active token  Can bypass local access control lists

21 University of ChicagoArgonne National Lab Urgent Computing - Bandwidth Tokens

22 University of ChicagoArgonne National Lab Urgent Computing - Condor Integration Goals  -Support Condor-managed resources  Traditional Condor pools  Condor-G  Support Condor applications Supported Policies  Elevated priority  Preemption Advantage  Expressive and extensive configuration language enables creation of complex scheduling rules based on criteria that may not be available to other resource managers

23 University of ChicagoArgonne National Lab Urgent Computing - Roadmap Ongoing work  Policy mapping for resources and applications  Notification system with triggers on token use  INCA Q/A monitoring system for SPRUCE services Future Work  Automatic restart tokens  Aggregation, Extension of tokens, ‘start_by’ deadlines  Encode (and probe for) local site policies  Warm standby integration  Failover & redundancy of SPRUCE server

24 University of ChicagoArgonne National Lab Urgent Computing - Deployment Status Deployed and Available on TeraGrid -  UC/ANL  NCSA  SDSC  NCAR  Purdue  TACC  LSU  Indiana Other sites  Virginia Tech  LONI

25 University of ChicagoArgonne National Lab Urgent Computing - Imagine… A world-wide system for supporting urgent computing on supercomputers Slow, patient growth of large-scale urgent apps Expanding our notion of priority queuing, checkpoint/restart, CPU pricing, etc For Capability:10 to 20 supercomputers available on demand For Capacity: Condor Flocks & Clouds provide availability of 250K “node instances”

26 University of ChicagoArgonne National Lab Urgent Computing - Partners

27 University of ChicagoArgonne National Lab Urgent Computing - SC08 Schedule Scheduled presentations at the Argonne National Lab booth:  Tuesday 3:00 - 3:30: SPRUCE overview  Wednesday 1:00 - 2:00: An hour demo session featuring:  User interactions with SPRUCE  How LEAD utilizes SPRUCE in their workflows  Resource selection  Condor with SPRUCE

28 University of ChicagoArgonne National Lab Urgent Computing - Questions? Ready to Join?