Seismic Hazard Analysis Using Distributed Workflows


Seismic Hazard Analysis Using Distributed Workflows
Scott Callaghan, Southern California Earthquake Center, University of Southern California
SC15

Outline
- Seismic Hazard Analysis
- SCEC CyberShake
  - Overview
  - Technical Requirements
- Scientific Workflows
  - Advantages of workflow tools
- CyberShake Study 15.4
  - Challenges
  - Results & Performance
- Future Directions

Seismic Hazard Analysis
- What will peak ground motion be over the next 50 years?
- Used in building codes, insurance, government, planning
- Answered via Probabilistic Seismic Hazard Analysis (PSHA)
- Communicated with hazard curves and maps
[Figure: hazard curve for downtown LA, annotated with a 2%-in-50-years probability of exceeding 0.25 g, and the probability of exceeding 0.1 g in 50 years]

PSHA Process
1. Pick a location of interest.
2. Determine what future earthquakes might happen which could affect that location.
3. Estimate the magnitude and probability for each earthquake.

PSHA Process, cont.
4. Determine the shaking caused by each earthquake at the site of interest (we use physics-based simulation).
5. Combine the levels of shaking with the probabilities from (3) to produce a hazard curve (toy example below).
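To make step 5 concrete, here is a toy Python sketch of how per-earthquake probabilities and simulated shaking levels can be combined into exceedance probabilities, assuming independent ruptures; the rupture probabilities and ground-motion values are invented for illustration and are not CyberShake data.

```python
# Toy hazard-curve combination (PSHA step 5), assuming independent ruptures.
# All numbers below are invented for illustration.
from math import prod

# Hypothetical ruptures: (probability of occurring in 50 years,
#                         simulated peak ground motions at the site, in g)
ruptures = [
    (0.02, [0.35, 0.28, 0.41]),
    (0.10, [0.12, 0.09, 0.15]),
    (0.05, [0.22, 0.30, 0.18]),
]

def prob_exceed(level_g):
    """P(shaking exceeds level_g at least once in 50 years)."""
    def p_no_exceed(p_occur, motions):
        # Fraction of simulated motions exceeding the level, given the rupture occurs.
        p_exc = sum(m > level_g for m in motions) / len(motions)
        return 1.0 - p_occur * p_exc
    return 1.0 - prod(p_no_exceed(p, ms) for p, ms in ruptures)

# Evaluating this over a range of shaking levels traces out a hazard curve.
for level in (0.1, 0.25, 0.5):
    print(f"P(> {level} g in 50 yrs) = {prob_exceed(level):.4f}")
```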

SCEC CyberShake Project
- Physics-based seismic hazard analysis
- Began project in 2008
- Continuous improvement in realism

| Date       | Study | Number of sites | Systems                                  |
|------------|-------|-----------------|------------------------------------------|
| 6/08-7/08  | 0.5   | 9               | USC HPC, NCSA Mercury                    |
| 4/09-6/09  | 1.0   | 223             | USC HPC, TACC Ranger                     |
| 11/11-3/12 | 1.4   | 46              |                                          |
| 10/12-3/13 | 2.2   | 268             | USC HPC, NICS Kraken                     |
| 4/13-6/13  | 13.4  | 1144            | USC HPC, NCSA Blue Waters, TACC Stampede |
| 2/14-3/14  | 14.2  |                 | USC HPC, NCSA Blue Waters                |

Demand for Higher Frequencies
- CyberShake runs to date with f = 0.5 Hz
- Buildings most affected by frequency = 10 / (height in floors) (worked example below)
- Demand by building engineers for higher frequencies
- CyberShake advanced to 1 Hz, ~16x computation
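A quick worked example of the rule of thumb above (illustrative building heights only), showing why 1 Hz results start to matter for common mid-rise buildings:

```python
# Rule of thumb from the slide: dominant frequency ~ 10 / (height in floors).
# A 0.5 Hz simulation covers roughly 20-story and taller buildings;
# moving to 1 Hz brings ~10-story buildings into range.
for floors in (2, 10, 20, 40):
    print(f"{floors:2d} floors -> ~{10 / floors:.2f} Hz")
```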

Computational Requirements

| Component                             | Data   | Executions | Cores/exec | Node-hrs |
|---------------------------------------|--------|------------|------------|----------|
| Mesh generation                       | 120 GB | 1          | 3840 CPUs  | 45       |
| Tensor simulation                     | 1.5 TB | 2          | 800 GPUs   | 1485     |
| Seismogram synthesis (master/worker)  | 30 GB  |            | 3712 CPUs  | 1220     |
| Curve generation                      | 1 MB   |            |            | < 1      |
| Total                                 | 1.6 TB | 5          |            | 2750     |

Mesh generation and tensor simulation make up the Tensor Creation phase; seismogram synthesis and curve generation make up the Post Processing phase. This is for one location of interest; we want to run many locations for a hazard map.

System Requirements
- Need to be able to utilize large HPC systems: 2750 node-hrs/site x 300 sites ~ 1M node-hrs
- Must have GPUs for tensor calculation
- Coordinate jobs across multiple systems: USC HPC, NCSA Blue Waters, OLCF Titan
- Data management
- Automated for around-the-clock execution
- We use scientific workflows

Workflow Tools
- We use the Pegasus-WMS, HTCondor, Globus stack
- Pegasus-WMS (USC/ISI)
  - Use API to describe workflow as tasks with dependencies (example sketch below)
  - Plan workflow for execution on a specific resource
  - Adds data transfer jobs when needed
  - See Gideon Juve's talk for more details
- HTCondor (U of Wisconsin)
  - Manages runtime execution of jobs
- Globus (U of Chicago/ANL)
  - GRAM: protocol for remote job submission
  - GridFTP: fast transfer between sites
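As an illustration of "describe workflow as tasks with dependencies", below is a minimal sketch using the Pegasus DAX3 Python API (the API generation in use around the time of this talk). The job names, file names, and arguments are hypothetical placeholders, not CyberShake's actual executables.

```python
# Minimal abstract-workflow sketch with the Pegasus DAX3 Python API.
# Job and file names here are hypothetical placeholders.
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("hazard-site")

sgt = File("site.sgt")             # intermediate tensor output
seis = File("site.seismograms")    # final seismogram output

tensor = Job(name="tensor_sim")
tensor.addArguments("--site", "SITE01")
tensor.uses(sgt, link=Link.OUTPUT)
dax.addJob(tensor)

synth = Job(name="seis_synth")
synth.addArguments("--site", "SITE01")
synth.uses(sgt, link=Link.INPUT)
synth.uses(seis, link=Link.OUTPUT, register=True)
dax.addJob(synth)

# Declare the dependency; Pegasus adds the needed data-transfer jobs
# when it plans this abstract workflow onto a concrete execution site.
dax.depends(parent=tensor, child=synth)

with open("hazard-site.dax", "w") as f:
    dax.writeXML(f)
```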

Why these workflow tools?
- Abstract vs. concrete workflow
  - Abstract workflow: algorithmic description of tasks with dependencies
  - Concrete workflow: jobs and files are bound to paths on a specific system
  - Since the abstract workflow is the same between systems, only the mappings need to change for new systems
- Data management
  - Transfer intermediate data between systems
  - Transfer output data to USC HPC filesystems
  - Data products registered for discovery

Why these workflow tools? (cont.)
- Remote job submission
  - Need to submit jobs to HPC systems automatically
  - The HTCondor and GRAM interface provides that
  - Pilot jobs for systems which don't support GRAM
- Pegasus-MPI-Cluster (PMC)
  - Tool provided with Pegasus
  - MPI wrapper around tasks
  - Master-worker paradigm
  - Converts many small jobs into 1 large, efficient one (sketched below)
  - Can run an entire workflow this way
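As a rough sketch of the PMC idea: tasks and their dependencies go into a simple text file, and pegasus-mpi-cluster runs them as a single MPI job under a master-worker model. The exact input syntax should be checked against the pegasus-mpi-cluster documentation; the task names, paths, and arguments below are made up.

```
# Hypothetical pegasus-mpi-cluster input: many small tasks become one MPI job.
TASK synth_0001 /path/to/seis_synth --rupture 1
TASK synth_0002 /path/to/seis_synth --rupture 2
TASK curve_gen  /path/to/curve_gen --site SITE01
EDGE synth_0001 curve_gen
EDGE synth_0002 curve_gen
```

Such a file would be launched under the system's MPI launcher (e.g. mpiexec or aprun), with one rank acting as the master and the remaining ranks executing tasks.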

CyberShake Study 15.4
- CyberShake study at 1 Hz
- Included new locations, for 336 in total
- New, more complex descriptions of earthquakes
- Run on USC HPC, NCSA Blue Waters, and OLCF Titan

Workflows on Blue Waters
- Want to run both GPU and CPU jobs: 22,500 CPU nodes, 4,700 GPU nodes
- Supports GRAM job submission
- HTCondor on the submit host at USC submits jobs to Blue Waters via GRAM (submit-file sketch below)
- If a job fails, retry
- Advantages: all the logic is on the submit host; little overhead
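For illustration, a GRAM grid-universe submission from the HTCondor submit host looks roughly like the sketch below; the hostname, jobmanager contact, and script names are hypothetical, and in practice Pegasus generates these submit files during planning rather than anyone writing them by hand.

```
# Sketch of an HTCondor grid-universe submit file targeting GRAM (illustrative only)
universe      = grid
grid_resource = gt5 gram.bluewaters.example.edu/jobmanager-pbs
executable    = run_tensor_sim.sh
arguments     = --site SITE01
output        = tensor_sim.out
error         = tensor_sim.err
log           = tensor_sim.log
queue
```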

Workflows on Titan
- Want to run GPU jobs: 18,700 GPU nodes
- No support for GRAM
- Use "pilot jobs":
  - Daemon on Titan monitors the HTCondor queue at USC
  - When tasks appear, the daemon submits a pilot job to Titan
  - When the pilot job starts, it calls back to the HTCondor collector
  - Work is assigned to the pilot job
- Must be able to connect to the collector from the pilot job
- Titan's scheduler only knows about the pilot job, not the workflow jobs

Workflows on Titan (cont.)
- Used multiple pilot jobs with different sizes and lengths (schematic daemon sketch below)
- Can define a job type to match with one kind of job
- Dependencies enforced with qsub
- Can vary the number of "slots" (tasks per pilot job): balance runtime between efficiency and failure risk
- Pilot jobs will terminate after idle time, and at the end of their runtime regardless of running tasks
- Pilot jobs have complexity and overhead
- Can wrap jobs across multiple workflows
- Submitted pilot jobs of up to 8000 nodes (43% of Titan)
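A schematic sketch of the pilot pattern described on these two slides, not the actual production daemon: it watches the HTCondor queue on the submit host and submits a pilot job to the local batch scheduler when idle work appears. The hostname, PBS script name, and polling interval are assumptions for illustration.

```python
# Schematic pilot-job daemon (illustrative only; not the production code).
# Polls the HTCondor queue at the submit host over SSH and submits a pilot
# job to the local batch scheduler when idle workflow tasks are waiting.
import subprocess
import time

SUBMIT_HOST = "workflow-submit.example.edu"   # hypothetical HTCondor submit host
PILOT_SCRIPT = "pilot_gpu.pbs"                # hypothetical PBS script that starts
                                              # workers which call back to the collector

def idle_task_count():
    """Number of idle (JobStatus == 1) jobs in the remote HTCondor queue."""
    cmd = "condor_q -constraint 'JobStatus == 1' -af ClusterId"
    out = subprocess.run(["ssh", SUBMIT_HOST, cmd],
                         capture_output=True, text=True, check=True).stdout
    return len(out.split())

def submit_pilot():
    """Submit one pilot job; the local batch scheduler only ever sees this job."""
    subprocess.run(["qsub", PILOT_SCRIPT], check=True)

if __name__ == "__main__":
    while True:
        if idle_task_count() > 0:
            submit_pilot()
        time.sleep(300)   # poll every five minutes
```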

Execution Schematic
[Diagram: the Pegasus workflow description feeds the HTCondor queue at USC; workflow tasks reach Blue Waters via GRAM (GRAM to PBS, into the Blue Waters queue) and reach Titan via the pilot daemon (pilot jobs in the Titan queue); transfer jobs move data products between the systems.]

Study 15.4 Performance Metrics
- Makespan: 914.2 hrs (38.1 days)
- 1.4 million node-hours
- 4372 jobs submitted
  - 3933 to Blue Waters via GRAM
  - 439 pilot jobs on Titan
- Average of 1,962 nodes in use, with a max of 17,986 (20% of Blue Waters, 80% of Titan)
- On average, 53 workflows running concurrently
- Managed 550 TB of data
  - 197 TB transferred from Titan to Blue Waters
  - 9.8 TB staged back to USC HPC disk (~7M files)
- 36.5% more efficient than 0.5 Hz runs

Results
- Hazard curves for 336 sites
- For each site:
  - 500,000 seismograms
  - Maximum shaking measures for each seismogram
  - List of which ruptures contribute the most
- Useful dataset for wide-ranging applications

Future Directions: Broadband CyberShake
- Building engineers looking for higher frequencies
- Prohibitive to simulate to 10 Hz
- Use a stochastic technique to calculate 1-10 Hz seismograms and combine with Study 15.4 results
- Requires 21,000 additional tasks per site, with runtimes of <1 sec to 30 min
- Using Pegasus-mpi-cluster to manage execution

Future Directions
- Larger geographic region
  - Good models exist for the Bay Area
  - Working to improve models for Central California
- New rupture forecast
  - 25x as many earthquakes to consider
  - Some are very large; may need to approximate
- Earthquake forecasting
  - Use CyberShake ground motions and long-term simulators to make forecasts
  - Test forecasts in the SCEC CSEP testing center

Questions?