CyberShake Study 16.9 Discussion

CyberShake Study 16.9 Discussion
Scott Callaghan
Southern California Earthquake Center

CyberShake Background
- Physics-based probabilistic seismic hazard analysis
- Considers ~500,000 earthquakes for each site of interest
- Combines the shaking from each earthquake with the probability of that earthquake to produce hazard curves (illustrated in the sketch below)
- Hazard curves from multiple sites are interpolated to produce a hazard map
- Figure: hazard curve for downtown LA, showing the level of shaking with a 2% chance in 50 years
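To make the curve-building step concrete, here is a minimal, illustrative sketch of combining per-earthquake shaking with earthquake probabilities into a hazard curve. This is not CyberShake's actual code: the rupture list, intensity levels, and the assumption of a single deterministic shaking value per rupture are simplifications for illustration only.

```python
# Illustrative toy hazard-curve combination (not CyberShake's actual code).
# Each rupture contributes its annual rate whenever its simulated shaking reaches
# a given intensity level; a Poisson assumption converts the per-level rate into
# a probability of exceedance over a time window.
import math

def hazard_curve(ruptures, intensity_levels, years=50.0):
    """ruptures: list of (annual_rate, simulated_shaking) pairs for one site."""
    curve = []
    for level in intensity_levels:
        # Sum the annual rates of all ruptures whose simulated shaking reaches this level.
        rate = sum(r for r, shaking in ruptures if shaking >= level)
        # Poisson assumption: probability of at least one exceedance in `years` years.
        curve.append((level, 1.0 - math.exp(-rate * years)))
    return curve

# Example: one site with three hypothetical ruptures (annual rate, peak shaking in g).
site_ruptures = [(0.01, 0.15), (0.002, 0.40), (0.0005, 0.80)]
for level, prob in hazard_curve(site_ruptures, [0.1, 0.2, 0.4, 0.8]):
    print(f"{level:.2f} g: {prob * 100:.2f}% chance of exceedance in 50 years")
```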

Past Study: Study 15.4 (April 16 – May 24, 2015)
- Performed hazard calculations for 336 locations using Titan and Blue Waters
- Used 12.9M SUs
- Generated about 415 TB of data on Titan
- Used pilot jobs for resource provisioning (see the sketch below)
  - Daemon running on the head node submitted pilot jobs when jobs were waiting in the workflow queue
  - Extra overhead: idle time
  - Extra complexity: pilot job dependencies
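For context, the pilot-job pattern described above can be sketched as a simple polling loop: a daemon watches the workflow queue and submits a pilot job to the batch system whenever work is waiting. This is a hypothetical illustration; the queue constraint, batch script path, and polling interval are placeholders, not the daemon actually used in Study 15.4.

```python
# Hypothetical sketch of a pilot-job daemon: poll the HTCondor workflow queue and
# submit a pilot job to the HPC batch system whenever workflow jobs are idle.
# The constraint, script path, and interval are placeholders, not the real daemon.
import subprocess
import time

POLL_INTERVAL = 300          # seconds between queue checks
PILOT_SCRIPT = "pilot.pbs"   # placeholder batch script that launches a pilot

def idle_workflow_jobs():
    """Count idle jobs (JobStatus == 1) in the local HTCondor queue."""
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 1", "-af", "ClusterId"],
        capture_output=True, text=True)
    return len(out.stdout.split())

while True:
    if idle_workflow_jobs() > 0:
        # Ask the batch scheduler for resources; the pilot pulls workflow jobs
        # once it starts running on compute nodes.
        subprocess.run(["qsub", PILOT_SCRIPT], check=False)
    time.sleep(POLL_INTERVAL)
```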

Study 16.9: Science Goals
- Expand CyberShake to Central California
  - 438 new CyberShake sites
- Generate hazard results using two different velocity (earth) models
  - 1D 'baseline' model
  - 3D model constructed using tomography

Study 16.9: Computational Plan
- 438 sites x 2 velocity models = 876 runs
- Using both Titan and Blue Waters
  - End-to-end workflows on both systems
  - No intermediate data transfer to Blue Waters
  - Transfer of final data products back to USC for post-processing and archival storage
- Each run is a separate workflow
- Workflows will be assigned to systems dynamically (a toy sketch of the idea follows below)
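As one possible illustration of "assign workflows to systems dynamically", the toy sketch below hands each run to whichever system currently has the fewest runs outstanding. This is only an illustrative policy, not the scheduling logic actually used in the study.

```python
# Toy illustration of dynamic workflow-to-system assignment: give each new run to
# whichever system currently has the fewest runs outstanding. Illustrative only;
# not the study's actual assignment logic.
outstanding = {"titan": 0, "bluewaters": 0}

def assign_run(run_id):
    """Pick the least-loaded system and record the assignment."""
    system = min(outstanding, key=outstanding.get)
    outstanding[system] += 1
    print(f"run {run_id} -> {system}")
    return system

def run_finished(system):
    """Release a slot when a run's workflow completes."""
    outstanding[system] -= 1

# Example: assign the first few of the 876 runs.
for run_id in range(1, 7):
    assign_run(run_id)
```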

Workflows on Titan
- Using a new workflow approach developed by the Pegasus-WMS group: rvgahp (reverse GAHP)
  - Server daemon runs on a Titan login node
  - It makes an SSH connection back to the workflow submission host and starts a proxy process there (see the sketch below)
  - When workflow jobs are submitted to the Condor-G queue, each job is handed to the proxy, which forwards it to the server for execution
- Push paradigm, so it avoids the overhead and complexity of the pilot job approach
- Successfully tested on Titan
- Investigating Rhea for small processing jobs
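A minimal sketch of the reverse-connection idea: a daemon on the login node opens an SSH connection out to the submission host and starts the proxy there, so all job traffic flows over a connection initiated from the HPC side. The host name and proxy command are placeholders, and this is not the actual rvgahp implementation.

```python
# Minimal sketch of the reverse-connection idea behind rvgahp: the server daemon on
# the HPC login node connects out via SSH to the workflow submission host and starts
# a proxy process there; jobs from the Condor-G queue are then forwarded back over
# that connection. Host name and proxy command are placeholders, not real rvgahp code.
import subprocess
import time

SUBMIT_HOST = "workflow.example.edu"   # hypothetical submission host
PROXY_CMD = "start_proxy"              # placeholder for whatever starts the proxy remotely

while True:
    # ssh launches the proxy on the submission host; the remote process's stdin/stdout
    # become the channel over which job requests are forwarded back for execution here.
    proc = subprocess.Popen(["ssh", SUBMIT_HOST, PROXY_CMD],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    proc.wait()       # if the connection drops, pause briefly and reconnect
    time.sleep(30)
```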

Study 16.9: Technical Requirements
- Estimated runtime: 5 weeks, based on Study 15.4
- Compute time (the arithmetic is spelled out in the sketch below)
  - Per site: 1940 node-hrs
    - Strain Green Tensors: 400 node-hrs (2 GPU jobs x 200 nodes x 1 hour)
    - Seismogram synthesis: 1440 node-hrs (1 CPU job x 240 nodes x 6 hours)
    - Other jobs: ~100 node-hrs
  - Total time: 15.9M SUs (~70% of allocation)
- Storage
  - Purged: 400 TB of SGTs + 3.3 TB of output data products
  - Will clean up as we go
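The per-site budget can be sanity-checked with quick arithmetic, as in the sketch below. Converting node-hours to SUs depends on each machine's charging policy, so the sketch stops at node-hours rather than reproducing the 15.9M SU total, and the final product is only a rough figure that assumes the per-site cost applies to each of the 876 runs.

```python
# Quick sanity check of the compute budget quoted on the slide (node-hours only;
# the node-hour to SU conversion is machine-specific and is not reproduced here).
sgt_node_hrs   = 2 * 200 * 1   # Strain Green Tensors: 2 GPU jobs x 200 nodes x 1 hr = 400
synth_node_hrs = 1 * 240 * 6   # Seismogram synthesis: 1 CPU job x 240 nodes x 6 hrs = 1440
other_node_hrs = 100           # remaining jobs, roughly 100 node-hrs

per_site = sgt_node_hrs + synth_node_hrs + other_node_hrs
runs = 438 * 2                 # 438 sites x 2 velocity models

print(per_site)            # 1940 node-hrs, matching the slide
print(runs)                # 876 runs
print(runs * per_site)     # ~1.7M node-hrs, assuming the per-site cost applies to every run
```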

Topics for Discussion
- Proxy certificate issues
  - Change in OLCF certificate policy since the last study
  - No longer able to authenticate remotely to OLCF data transfer nodes
  - Access to those nodes is required to transfer files and make directories
- Quota increase
  - Plan to clean up as we go, but unsure what the high-water mark will be
  - Request an increase of the project quota to 300 TB
- Priority bump
  - Possible to get a boost in the queue for our jobs?