SCEC CyberShake on TG & OSG: Options and Experiments
Allan Espinosa*°, Daniel S. Katz*, Michael Wilde*, Ian Foster*°, Scott Callaghan§, Phil Maechling§
* Computation Institute (University of Chicago & Argonne National Laboratory)
° Computer Science (University of Chicago)
§ SCEC, University of Southern California

2 SCEC CyberShake on TG & OSG
SCEC CyberShake
Part of SCEC (PI: Tom Jordan, USC)
Using large-scale simulation data, estimate probabilistic seismic hazard analysis (PSHA) curves for sites in southern California (the probability that ground motion will exceed some threshold over a given time period)
Used by hospitals, power plants, schools, etc. as part of their risk assessment
Based on the Rupture Variation (RupVar) set – ~14,000 potential ruptures, with likelihoods derived from an earthquake rupture forecast (ERF)
Use cases:
1. Build a map of the area (run 1k-10k locations) based on the latest RupVar set
2. Run one location, based on the latest RupVar set (~270 locations run to date by SCEC, all on USC or TG)
Managing these requires effective grid workflow tools for job submission, data management, and error recovery
Currently running production on TG using Pegasus (ISI) and DAGMan (Wisconsin)
Image courtesy of Philip Maechling
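
To make the PSHA idea above concrete, the sketch below (Python, illustration only, not SCEC code) shows one simple way per-rupture probabilities and per-variation peak motions could be combined into a hazard curve for a single site; the function names, toy numbers, and the independence assumption across ruptures are all assumptions made for clarity.

# Illustrative sketch only, assuming the model stated on this slide:
# each rupture has a probability from the ERF, and all of its variations
# are equally likely.
def hazard_curve(ruptures, thresholds):
    """ruptures: list of (rupture_probability, [peak motion per variation]).
    Returns, for each threshold, the probability that ground motion at the
    site exceeds it, treating ruptures as independent."""
    curve = []
    for x in thresholds:
        p_none = 1.0
        for prob, peaks in ruptures:
            frac = sum(1 for pk in peaks if pk >= x) / float(len(peaks))
            p_none *= (1.0 - prob * frac)   # chance this rupture causes no exceedance
        curve.append(1.0 - p_none)          # chance at least one rupture exceeds x
    return curve

example = [(0.02, [0.1, 0.3, 0.5]), (0.01, [0.6, 0.8])]   # toy inputs
print(hazard_curve(example, [0.2, 0.4, 0.7]))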

3 SCEC CyberShake on TG & OSG
SCEC CyberShake
For each location, need a CyberShake run followed by roughly 840,000 parallel short jobs (420,000 synthetic seismograms, 420,000 extractions of peak ground motion)
– Parallelize across locations, not individual workflows
RupVar set – ~14,000 potential ruptures, with likelihoods derived from an earthquake rupture forecast (ERF)
o ~7,000 will affect any one site
– Variations (45 on average) per rupture, all equally likely (~600k rupture variations total)
– Data size: 2.2 TB total
o 146 MB per rupture (average)
o 3.3 MB per rupture variation (average)
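
As a quick back-of-envelope check (illustration only, using just the figures on this slide), the per-rupture and per-variation numbers are roughly consistent with one another:

# Rough consistency check of the slide's numbers (illustration only).
ruptures = 14000
mb_per_rupture = 146.0              # average
mb_per_variation = 3.3              # average

print(ruptures * mb_per_rupture / 1e6, "TB total (slide: 2.2 TB)")
print(mb_per_rupture / mb_per_variation, "variations per rupture (slide: 45 average)")
print(ruptures * 45, "rupture variations (slide: ~600k total)")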

4 SCEC CyberShake on TG & OSG
TeraGrid baseline
Pre-stage RupVar data to TG sites X, Y, Z
For each location:
– Run SGT (strain Green tensor) calculation on TG site X, Y, or Z (Pegasus workflow)
– Run PP (post-processing) on TG site X, Y, or Z (Pegasus workflow)
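
For illustration, here is a minimal sketch of what the per-location two-stage (SGT then PP) abstract workflow could look like with the Pegasus DAX3 Python API; the transformation names, file names, and arguments are placeholders, not the actual SCEC workflow definitions.

# Hedged sketch: per-location SGT -> PP dependency expressed as a Pegasus DAX.
from Pegasus.DAX3 import ADAG, Job, File, Link

def site_dax(site_name):
    dax = ADAG("cybershake_" + site_name)

    sgt_out = File(site_name + ".sgt")       # placeholder SGT output file

    sgt = Job(name="SGT_Generate")           # placeholder transformation name
    sgt.addArguments(site_name)
    sgt.uses(sgt_out, link=Link.OUTPUT)
    dax.addJob(sgt)

    pp = Job(name="PostProcess")             # placeholder transformation name
    pp.addArguments(site_name)
    pp.uses(sgt_out, link=Link.INPUT)
    dax.addJob(pp)

    dax.depends(parent=sgt, child=pp)        # PP runs only after the SGT stage
    return dax

site_dax("LGU").writeXML(open("LGU.dax", "w"))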

5 SCEC CyberShake on TG & OSG
ExTENCI project
ExTENCI funding:
– 11 months of a CS grad student in year 1 (with the expectation that this work may continue under other funding and take advantage of ExTENCI developments)
– 0.4 months of Dan in year 1, 0.5 months in year 2 (includes time as co-PI of the project and lead PI for U. Chicago)

6 SCEC CyberShake on TG & OSG
Initial attempt to use OSG
Pre-stage RupVar data to OSG sites A, B, C
For each location:
– Run SGT on TG site X, Y, or Z
– Run PP on OSG site A, B, or C
Done by SCEC/ISI team – works
Redone by UC team – works
Problems:
– Pre-staging data is painful – it takes 2 days, and data may be deleted from OSG, after which it needs to be re-staged
o Could be solved with motivated sites and storage leases
– Opportunistic processors on OSG are insufficient – run time is long (no single OSG site has enough free resources)
o Could be solved with motivated sites and computing leases
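
The pre-staging step above amounts to replicating the RupVar archive onto each OSG storage element; one way to drive that with GridFTP is sketched below (endpoint URLs are placeholders, and the recursive flag is an assumption about the installed globus-url-copy).

# Sketch only: replicate RupVar data to OSG storage elements over GridFTP.
import subprocess

SOURCE = "gsiftp://gridftp.example.org/scec/rupvar/"        # placeholder source
OSG_DESTINATIONS = [
    "gsiftp://se.site-a.example.org/scec/rupvar/",          # placeholder OSG SE
    "gsiftp://se.site-b.example.org/scec/rupvar/",          # placeholder OSG SE
]

for dest in OSG_DESTINATIONS:
    # -p 8: parallel TCP streams; -r: recursive directory copy (assumed available)
    subprocess.check_call(["globus-url-copy", "-p", "8", "-r", SOURCE, dest])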

7 SCEC CyberShake on TG & OSG
Globus.org data transfer
Idea: use the globus.org data transfer service to move RupVar data
Results:
– RupVar data coming from CI-PADS
– Transfer time to LIGO_UWM_NEMO: ~1 day
– Progress: numbers currently being measured on other OSG resources as well
Simple – the user only instructs Globus.org what to do, then waits until that work is done, or a script can query for completion
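
The submit-then-poll pattern described above looks roughly like the sketch below; the 2010-era globus.org service was driven through its own interfaces, so today's Globus Python SDK is used here purely as a stand-in, and the token and endpoint IDs are placeholders.

# Stand-in sketch of "tell Globus what to do, then wait or poll for completion".
import globus_sdk

TOKEN = "..."                                # placeholder access token
SRC_ENDPOINT = "source-endpoint-id"          # placeholder (e.g., CI-PADS)
DST_ENDPOINT = "dest-endpoint-id"            # placeholder (e.g., LIGO_UWM_NEMO)

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="RupVar pre-stage")
tdata.add_item("/scec/rupvar/", "/scratch/rupvar/", recursive=True)

task_id = tc.submit_transfer(tdata)["task_id"]
while not tc.task_wait(task_id, timeout=600, polling_interval=60):
    print("transfer still running...")
print("transfer complete")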

8 SCEC CyberShake on TG & OSG
Swift
Idea: use Swift instead of Pegasus/DAGMan
Because:
– Adaptive scheduler (no advance planning)
– May better suit the opportunistic nature of resources on OSG
o The Engage VO doesn't have GlideinWMS
– More control – Pegasus/DAGMan is a black box
– We can change Swift as needed
Use Coasters – multi-level scheduling
Compute results (all data pre-staged):
– A single site (LGU) takes 30k CPU-hours (on PADS and Firefly)
In progress by UC team – but we need access to large numbers of CPUs to make this reasonable

9 SCEC CyberShake on TG & OSG
Swift implementation

Site site = "LGU";                         // target CyberShake site
Sgt sgt;                                   // full strain Green tensor data for the site
Rupture rups[] = get_rupture(sgt);         // ruptures affecting this site

foreach rup in rups {                      // iterations are independent, so they run in parallel
  Sgt sub;
  sub = extract(sgt, rup);                 // SGT subset for this rupture
  Variations vars = get_variations(site, rup);

  Seismogram seis[];
  PeakValue peak[];
  foreach var, i in vars {                 // also independent, run in parallel
    seis[i] = seismogram(sub, var);        // synthesize one seismogram per variation
    peak[i] = peak_calc(seis[i], var);     // extract peak ground motion from it
  } // end foreach over vars
} // end foreach over rups

SCEC CyberShake on TG & OSG
Second attempt to use OSG
Problems:
– Not enough opportunistic OSG resources available
o Ideally want a large number of slots on a single resource
– Not enough data storage available on OSG resources for pre-staging
Idea:
– Change the PP workflow to work in smaller chunks and make use of whatever resources we can get (see the sketch below)
In progress by UC team
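
A minimal sketch of the chunking idea follows; the chunk size and the submission helper are assumptions, not the team's code.

# Illustration only: split the post-processing work list into small batches
# so that whatever OSG slots appear can make progress independently.
def submit_pp_job(batch):
    # Stub: in practice this would generate and submit one PP job or sub-workflow.
    print("would submit a PP job covering %d rupture variations" % len(batch))

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

work_list = list(range(600000))      # stand-in for the rupture-variation list
for batch in chunks(work_list, 5000):
    submit_pp_job(batch)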

SCEC CyberShake on TG & OSG
Testing OSG-MM (matchmaking on OSG)
Small PP chunks
– Sample times:
o Data transfer: s
o Compute: s
Test jobs are 1 node, 1-hour sleep
Trying to see what throughput we can obtain for SCEC PP jobs, and how to maximize it
– Running on OSG sites w/ >= 1000 CPUs (10 sites), could reach a throughput of 226 running jobs
– Running on OSG sites w/ >= 100 CPUs (22 sites), could reach a throughput of 80 running jobs
Options:
– OSG-MM is not scaling well
– We don't understand how to use it correctly
The second option seems most likely – working with Mats now…
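
Below is a sketch of the kind of throughput probe described above: submit many one-hour sleep jobs and watch how many run at once. The vanilla-universe submit file here is a simplified stand-in; the actual tests went through Condor-G with OSG-MM matchmaking, and the probe size is an assumption.

# Simplified stand-in for the 1-node, 1-hour sleep throughput probe.
import subprocess

N_JOBS = 500                          # assumed probe size

submit = """\
universe   = vanilla
executable = /bin/sleep
arguments  = 3600
log        = probe.log
output     = probe.$(Cluster).$(Process).out
error      = probe.$(Cluster).$(Process).err
queue %d
""" % N_JOBS

with open("probe.sub", "w") as f:
    f.write(submit)

subprocess.check_call(["condor_submit", "probe.sub"])
# Throughput = peak count of jobs simultaneously in the Running state (condor_q).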

SCEC CyberShake on TG & OSG
Future explorations
Understand the value of OSG & TG together for SCEC, compared to TG alone
Examine the use of Lustre-WAN in place of one-time data transfers
– Still use the globus.org data transfer service for transfers that should be cached
Examine the use of job overlay scheduling tools