CyberShake Study 14.2 Technical Readiness Review

Study 14.2 Scientific Goals
- Compare the impact of velocity models on Los Angeles-area hazard maps
  - Run with various velocity models: CVM-S4.26, BBP 1D, CVM-H 11.9 with no GTL
  - Compare to CVM-S and CVM-H 11.9 with GTL
- Investigate the impact of the GTL
- Compare a 1D reference model
- Compare tomographic inversion results
- 286 sites (10 km mesh + points of interest)

Study 14.2 Technical Goals
- Run both SGT and post-processing workflows on Blue Waters
- Plan to measure CyberShake application makespan
  - Equivalent to the makespan of all of the workflows: (all jobs complete) minus (first workflow submitted)
  - Includes hazard curve calculation time
  - Includes system downtime and workflow stoppages
  - Will estimate time-to-solution by adding estimates of setup time and analysis time
- Compare performance, queue times, and results of the GPU and CPU AWP-ODC-SGT codes

Performance Enhancements
- New version of the seismogram synthesis code to reduce read I/O
  - Reads in a set of extracted SGTs
  - Synthesizes multiple rupture variations per invocation (using 5 in production)
- Reduced the number of subworkflows to 6 (from 8)
  - Fewer jobs, less queuing time
- For CPU SGTs, increased the core count
  - Each processor handles a chunk of roughly 64 x 50 x 50 grid points
- For GPU SGTs, decreased the processor count
  - Volume must be a multiple of 20 grid points in X and Y
  - 10 x 10 x 1 GPUs, regardless of volume (a sizing sketch follows below)
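
A rough illustration of the processor-count sizing above; the chunk dimensions and the multiple-of-20 constraint come from the slide, while the function names and the example volume are hypothetical:

```python
import math

def cpu_core_count(nx, ny, nz, chunk=(64, 50, 50)):
    """Estimate the CPU SGT core count by tiling the volume into ~64x50x50 chunks."""
    cx, cy, cz = chunk
    return math.ceil(nx / cx) * math.ceil(ny / cy) * math.ceil(nz / cz)

def gpu_decomposition(nx, ny):
    """GPU SGT runs use a fixed 10 x 10 x 1 decomposition; X and Y must be multiples of 20."""
    if nx % 20 or ny % 20:
        raise ValueError("volume X and Y dimensions must be multiples of 20")
    return (10, 10, 1)

# Hypothetical 1000 x 1000 x 400 grid-point volume
print(cpu_core_count(1000, 1000, 400))  # 16 * 20 * 8 = 2560 cores
print(gpu_decomposition(1000, 1000))    # (10, 10, 1) -> 100 GPUs
```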

Proposed Study sites (286)

Study 14.2 Data Products
- 2 CVM-S4.26 Los Angeles-area hazard models
- 1 BBP 1D Los Angeles-area hazard model
- 1 CVM-H 11.9 (no GTL) Los Angeles-area hazard model
- Hazard curves for 286 sites x 4 conditions, at 3 s, 5 s, and 10 s
- 1144 sets of 2-component SGTs
- Seismograms for all ruptures (~470M)
- Peak amplitudes in the database for 3 s, 5 s, and 10 s

Study 14.2 Notables
- First CVM-S4.26 hazard models
- First CVM-H (no GTL) hazard model
- First 1D hazard model
- First study using AWP-SGT-GPU
- First CyberShake study using a single workflow on one system (Blue Waters)

Study 14.2 Parameters
- 0.5 Hz, deterministic
- 200 m grid spacing
- Minimum Vs in the CVMs: 500 m/s
- UCERF 2
- Graves & Pitarka (2010) rupture variations
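
For context, the 200 m spacing is consistent with the 0.5 Hz limit and the 500 m/s Vs floor if one assumes roughly 5 grid points per minimum S-wavelength, a common rule of thumb that is not stated on the slide:

```python
f_max = 0.5                # Hz, deterministic frequency limit
vs_min = 500.0             # m/s, minimum Vs imposed on the CVMs
points_per_wavelength = 5  # assumed rule of thumb, not from the slide

min_wavelength = vs_min / f_max               # 1000 m
dx = min_wavelength / points_per_wavelength   # 200 m, matching the slide
print(min_wavelength, dx)
```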

Verification
- 4 sites (USC, PAS, WNGC, SBSM)
  - AWP-SGT-CPU, CVM-S4.26
  - AWP-SGT-GPU, CVM-S4.26
  - AWP-SGT-CPU, BBP 1D
  - AWP-SGT-GPU, CVM-H 11.9 (no GTL)
- Plotted with previously calculated curves

CVM-S4.26 (CPU)

CVM-H, no GTL (CPU)

Changes to SGT Software Stack
- Velocity mesh generation
  - Switched from 2 jobs (create, then merge) to 1 job
- SGTs
  - AWP-ODC-SGT CPU v14.2: has a wrapper because of an issue with getting the exit code back
  - AWP-ODC-SGT GPU v14.2: has a wrapper to read in a parameter file and construct command-line arguments
- NaN check
  - Always had a NaN check for RWG SGTs; now for AWP SGTs as well (sketch below)
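
A minimal sketch of what the new NaN check over AWP SGT output might look like; the flat float32 file layout and the function name are assumptions for illustration, not the actual CyberShake utility:

```python
import numpy as np

def sgt_has_nans(path, dtype=np.float32, chunk_elems=10_000_000):
    """Stream through a raw binary SGT file and report whether any NaNs are present."""
    with open(path, "rb") as f:
        while True:
            block = np.fromfile(f, dtype=dtype, count=chunk_elems)
            if block.size == 0:
                return False      # reached end of file with no NaNs found
            if np.isnan(block).any():
                return True

# Hypothetical usage:
# if sgt_has_nans("USC_fx_14.2.sgt"):
#     raise RuntimeError("NaNs found in SGT output; rerun the SGT job")
```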

Changes to PP Software Stack
- Seismogram synthesis / PSA calculation
  - Modified to synthesize multiple seismograms per invocation
  - Will use 5 rupture variations per invocation
  - Reduces read I/O by a factor of 5 (sketch below)
  - Needed to avoid congestion protection events
- All codes tagged in SVN before the study begins
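
A minimal sketch of the restructuring: read the extracted SGTs once and reuse them for a batch of rupture variations. The names and stub bodies are illustrative stand-ins, not the real synthesis code.

```python
RVS_PER_INVOCATION = 5  # production value from the slide

def read_extracted_sgts(rupture):
    """Stand-in for the expensive read of the extracted SGTs for one rupture."""
    return {"rupture": rupture, "data": [0.1] * 1000}

def synthesize(sgts, rv):
    """Stand-in for synthesizing one seismogram from in-memory SGTs."""
    return [x * rv for x in sgts["data"]]

def synthesize_batch(rupture, rupture_variations):
    # Old behavior: one SGT read per rupture variation.
    # New behavior: one SGT read per invocation, reused for up to 5 variations,
    # cutting read I/O by the batch size.
    sgts = read_extracted_sgts(rupture)
    return [synthesize(sgts, rv) for rv in rupture_variations[:RVS_PER_INVOCATION]]

seismograms = synthesize_batch(rupture=1234, rupture_variations=[1, 2, 3, 4, 5])
print(len(seismograms))  # 5 seismograms from a single SGT read
```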

Changes to Workflows
- Changed the workflow hierarchy
  - 1 integrated workflow per site, per velocity model
  - Added the ability to select the SGT core count dynamically
  - Put the volume creation job into the top-level workflow to reduce the hierarchy to 2 levels
- Reduced the number of post-processing sub-workflows to 6
  - Fewer jobs in the queue
- Will not keep job output if the job succeeds
  - Reduces the size of the workflow logs

Workflow Hierarchy
- Integrated Workflow (1 per model per site), containing:
  - PreCVM (creates volume)
  - Generate SGT Workflow
  - SGT Workflow
  - PP Pre Workflow
  - PP subwf 0, PP subwf 1, ..., PP subwf 5
  - DB workflow
- More details on the next slide
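
A minimal sketch of how this two-level hierarchy could be expressed with the Pegasus DAX3 Python API (the study uses Pegasus 4.4); the file names, job names, and exact dependency edges are illustrative assumptions rather than the actual CyberShake workflow generators:

```python
from Pegasus.DAX3 import ADAG, Job, DAX

top = ADAG("CyberShake_Integrated")        # 1 per model per site

pre_cvm = Job(name="PreCVM")               # volume creation now in the top-level workflow
top.addJob(pre_cvm)

sgt_wf = DAX("sgt_workflow.dax")           # sub-workflow nodes, planned at runtime
pp_pre = DAX("pp_pre_workflow.dax")
db_wf  = DAX("db_workflow.dax")
for wf in (sgt_wf, pp_pre, db_wf):
    top.addDAX(wf)

top.depends(parent=pre_cvm, child=sgt_wf)
top.depends(parent=sgt_wf, child=pp_pre)

for i in range(6):                         # PP subwf 0 .. 5
    pp = DAX("pp_subwf_%d.dax" % i)
    top.addDAX(pp)
    top.depends(parent=pp_pre, child=pp)
    top.depends(parent=pp, child=db_wf)

with open("integrated.dax", "w") as f:
    top.writeXML(f)
```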

Distributed Processing
- Cron job on shock.usc.edu creates, plans, and runs the full workflows
  - Software stack: Pegasus 4.4 (from the Git repository), Condor, Globus
- Jobs submitted to Blue Waters via GRAM
- Results staged back to shock; database populated; curves generated
- Alternate CPU and GPU workflows for best queue performance (sketch below)
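
A minimal sketch of how the cron-driven driver might alternate CPU and GPU workflow submissions; plan_and_submit, the site list, and the throttle are hypothetical stand-ins for the real scripts that generate and plan the workflows:

```python
from itertools import cycle

# Alternate CPU (XE) and GPU (XK) SGT workflows to balance the two queues.
sgt_types = cycle(["cpu", "gpu"])

def plan_and_submit(site, sgt_type):
    """Stand-in for workflow generation plus planning/submission via Pegasus."""
    print("submitting %s workflow for site %s" % (sgt_type, site))

def submit_pending(pending_sites, max_new=2):
    """Called from cron on shock.usc.edu: submit up to max_new new workflows."""
    for _ in range(min(max_new, len(pending_sites))):
        plan_and_submit(pending_sites.pop(0), next(sgt_types))

sites = ["USC", "PAS", "WNGC", "SBSM"]
submit_pending(sites)   # USC as a CPU run, PAS as a GPU run
```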

Computational Requirements
- Total computational time: 275K node-hours (a check follows below)
- SGT computational time: 180K node-hours
  - CPU: 150 node-hrs/site x 286 sites x 2 models = 86K node-hrs (XE, 32 cores/node)
  - GPU: 90 node-hrs/site x 286 sites x 2 models = 52K node-hrs (XK)
  - Study 13.4 had a 29% overrun on SGTs
- PP computational time: 95K node-hours
  - 60 node-hrs/site x 286 sites x 4 models = 70K node-hrs (XE, 32 cores/node)
  - Study 13.4 had a 35% overrun on PP
- Current allocation has 3.0M node-hours remaining
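
The budget lines can be reproduced from the per-site estimates if the Study 13.4 overruns are read as contingency margins; that reading, and the rounding up to 180K/95K/275K, are my interpretation:

```python
sites = 286

cpu_sgt = 150 * sites * 2    # 85,800 node-hrs (~86K, XE)
gpu_sgt =  90 * sites * 2    # 51,480 node-hrs (~52K, XK)
pp      =  60 * sites * 4    # 68,640 node-hrs (~70K, XE)

sgt_budget = (cpu_sgt + gpu_sgt) * 1.29   # ~177K, budgeted as 180K
pp_budget  = pp * 1.35                    # ~93K,  budgeted as 95K

print(round(sgt_budget), round(pp_budget))   # 177091 92664
print(round(sgt_budget + pp_budget))         # ~270K, budgeted as 275K total
```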

Blue Waters Storage Requirements
- Planned unpurged disk usage: 45 TB
  - SGTs: 40 GB/site x 286 sites x 4 models = 45 TB, archived on Blue Waters
- Planned purged disk usage: 783 TB
  - Seismograms: 11 GB/site x 286 sites x 4 models = 12.3 TB, staged back to SCEC
  - PSA files: 0.2 GB/site x 286 sites x 4 models = 0.2 TB, staged back to SCEC
  - Temporary files: 690 GB/site x 286 sites x 4 models = 771 TB

SCEC Storage Requirements
- Planned archival disk usage: 12.5 TB
  - Seismograms: 12.3 TB (scec-04 has 19 TB)
  - PSA files: 0.2 TB (scec-04)
  - Curves, disaggregations, reports: 93 GB (99% reports)
- Planned database usage: 210 GB (a check follows below)
  - 3 rows/rupture variation x 410K rupture variations/site x 286 sites x 4 models = 1.4B rows
  - 1.4B rows x 151 bytes/row = 210 GB (880 GB free)
- Planned temporary disk usage: 5.5 TB
  - Workflow logs: 5.5 TB, possibly smaller now that not all job output is saved (scec-02 has 12 TB free)
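
A quick check of the database estimate (the 3 rows per rupture variation presumably correspond to the 3 s, 5 s, and 10 s peak amplitudes):

```python
rows_per_rv = 3                      # per the slide; presumably one row per period
rvs_per_site = 410_000
sites, models = 286, 4
bytes_per_row = 151

rows = rows_per_rv * rvs_per_site * sites * models
print(rows)                          # 1,407,120,000 -> ~1.4B rows
print(rows * bytes_per_row / 1e9)    # ~212 GB -> ~210 GB as planned (880 GB free)
```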

Metrics Gathering
- monitord for workflow metrics
  - Will run after the workflows have completed
- Python scripts
  - Used to obtain some of the standard CyberShake metrics for comparison
- Cron job on Blue Waters
  - Core usage over time
  - Running and idle job counts
- Will use the start and end of the workflow logs to perform the makespan measurement (sketch below)
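
A minimal sketch of the makespan measurement: (last job completion) minus (first workflow submission), taken from workflow log timestamps. The log line format here is a hypothetical example; the real measurement reads the Pegasus/Condor workflow logs.

```python
from datetime import datetime

def parse_ts(line):
    # Assumed log prefix format: "YYYY-MM-DD HH:MM:SS ..."
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def makespan(log_lines):
    stamps = [parse_ts(l) for l in log_lines if l[:4].isdigit()]
    return max(stamps) - min(stamps)

logs = [
    "2014-02-18 09:30:00 first workflow submitted",
    "2014-03-17 21:15:00 last DB workflow finished",
]
print(makespan(logs))   # includes queue time, downtime, and workflow stoppages
```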

Estimated Duration
- Limiting factors:
  - Queue time, especially for XK nodes, could be a substantial percentage of the run time
  - Blue Waters -> SCEC transfer: if Blue Waters throughput is very high, the transfer could be the bottleneck
- With queues, estimated completion is 4 weeks
  - 1 hazard map per week
  - Requires an average of 410 nodes (603 nodes were averaged during Study 13.4)
- With a reservation, completion depends on the reservation size
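
The 410-node average follows directly from the 275K node-hour total spread over 4 weeks:

```python
total_node_hours = 275_000      # from the computational requirements slide
wall_clock_hours = 4 * 7 * 24   # 672 hours in 4 weeks

print(total_node_hours / wall_clock_hours)   # ~409 nodes sustained, i.e. the ~410-node average
```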

Personnel Support
- Scientists: Tom Jordan, Kim Olsen, Rob Graves
- Technical lead: Scott Callaghan
- SGT code support: Efecan Poyraz, Yifeng Cui
- Job submission / run monitoring: Scott Callaghan, David Gill, Heming Xu, Phil Maechling
- NCSA support: Omar Padron, Tim Bouvet
- Workflow support: Karan Vahi, Gideon Juve

Risks
- Queue times on Blue Waters
  - In tests, GPU queue times have at times been > 1 day
- Congestion protection events
  - If triggered consistently, will either need to throttle post-processing or suspend the run until improvements are developed

Thanks for your time!