Slide 1: SCEC CyberShake on TG & OSG: Options and Experiments
Allan Espinosa*°, Daniel S. Katz*, Michael Wilde*, Ian Foster*°, Scott Callaghan§, Phil Maechling§
*Computation Institute (University of Chicago & Argonne National Laboratory)
°Computer Science (University of Chicago)
§SCEC, University of Southern California
Slide 2: SCEC CyberShake
Part of SCEC (PI: Tom Jordan, USC)
Uses large-scale simulation data to estimate probabilistic seismic hazard analysis (PSHA) curves for sites in southern California: the probability that ground motion will exceed some threshold over a given time period (formalized below)
Used by hospitals, power plants, schools, etc. as part of their risk assessments
Based on the Rupture Variation (RupVar) set: ~14,000 potential ruptures, with likelihoods derived from an earthquake rupture forecast (ERF)
Use cases:
1. Build a hazard map of the region (run 1k-10k locations) based on the latest RupVar set
2. Run one location based on the latest RupVar set (~270 locations run to date by SCEC, all on USC or TG resources)
Managing these requires effective grid workflow tools for job submission, data management, and error recovery
Currently running production on TG using Pegasus (ISI) and DAGMan (Wisconsin)
(Image courtesy of Philip Maechling)
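For reference, the hazard-curve computation described above is conventionally written as follows. This is a sketch of the standard PSHA formulation, not taken from the slides; here ν_i is the ERF rate of rupture i, and the n_i variations of rupture i are treated as equally likely, as stated on slide 3:

    \lambda(\mathrm{IM} > x) = \sum_{i} \nu_i \sum_{j=1}^{n_i} \frac{1}{n_i}\, P(\mathrm{IM} > x \mid \mathrm{rup}_i, \mathrm{var}_{ij})

    P(\mathrm{IM} > x \text{ within } t \text{ years}) = 1 - e^{-\lambda(\mathrm{IM} > x)\, t}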
Slide 3: SCEC CyberShake
For each location, need a CyberShake run followed by roughly 840,000 short parallel jobs (420,000 synthetic seismograms, 420,000 peak-ground-motion extractions)
– Parallelize across locations, not within individual workflows
RupVar set: ~14,000 potential ruptures, with likelihoods derived from the earthquake rupture forecast (ERF)
– ~7,000 will affect any one site
– 2-1,000 variations per rupture (45 on average), all equally likely (~600k rupture variations total)
– Data size: 2.2 TB total
  o 146 MB per rupture (average)
  o 3.3 MB per rupture variation (average)
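As a consistency check on these figures (derived, not on the slide): 14,000 ruptures × 146 MB ≈ 2.0 TB, and 600,000 variations × 3.3 MB ≈ 2.0 TB, both in line with the ~2.2 TB total; likewise 14,000 ruptures × 45 variations on average ≈ 630,000, matching the ~600k variation count.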
Slide 4: TeraGrid baseline
Pre-stage RupVar data to TG sites X, Y, and Z
For each location (sketched below):
– Run the SGT (strain Green tensor) stage on TG site X, Y, or Z (Pegasus workflow)
– Run the PP (post-processing) stage on the same TG site (Pegasus workflow)
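Purely illustrative: a minimal Python sketch of the baseline driver loop. run_pegasus_workflow and the locations list are invented names; the real baseline is expressed as Pegasus/DAGMan workflows, not Python.

    # Illustrative only; names are hypothetical, the real driver is Pegasus/DAGMan.
    TG_SITES = ["X", "Y", "Z"]

    def run_pegasus_workflow(stage, location, site):
        """Placeholder: in reality this plans and submits a Pegasus workflow."""
        print(f"submitting {stage} for {location} on TG site {site}")

    def run_baseline(locations):
        for n, location in enumerate(locations):
            site = TG_SITES[n % len(TG_SITES)]            # spread locations across TG sites
            run_pegasus_workflow("SGT", location, site)   # strain Green tensor stage
            run_pegasus_workflow("PP", location, site)    # ~840k-job post-processing stage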
Slide 5: ExTENCI project
ExTENCI funding:
– 11 months of a CS grad student in year 1 (with the expectation that this work may continue under other funding and take advantage of ExTENCI developments)
– 0.4 months of Dan in year 1 and 0.5 months in year 2 (includes time as co-PI of the project and lead PI for U. Chicago)
Slide 6: Initial attempt to use OSG
Pre-stage RupVar data to OSG sites A, B, C
For each location:
– Run SGT on TG site X, Y, or Z
– Run PP on OSG site A, B, or C
Done by the SCEC/ISI team: works. Redone by the UC team: works.
Problems:
– Pre-staging data is painful: it takes ~2 days, and data may be deleted from OSG and then need to be re-staged
  o Could be solved with motivated sites and storage leases
– Opportunistic processors on OSG are insufficient, so run time is long (no single OSG site has enough free resources)
  o Could be solved with motivated sites and computing leases
Slide 7: Globus.org data transfer
Idea: use the globus.org data transfer service to move RupVar data
Results:
– RupVar data staged from CI-PADS
– Transfer time to LIGO_UWM_NEMO: ~1 day
– In progress: numbers currently being measured on other OSG resources as well
Simple for the user: instruct globus.org what to do, then either wait for an email that the work is done or have a script query for completion (see the sketch below)
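A minimal sketch of scripted transfer plus completion polling, assuming the globus_sdk Python package (which postdates this talk); the endpoint UUIDs, token, and paths are placeholders, not values from the deck.

    import globus_sdk

    TRANSFER_TOKEN = "..."               # obtained out of band via a Globus Auth flow
    SRC = "source-endpoint-uuid"         # e.g., the CI-PADS endpoint (placeholder)
    DST = "destination-endpoint-uuid"    # e.g., the LIGO_UWM_NEMO endpoint (placeholder)

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    # Describe a recursive transfer of the RupVar tree.
    tdata = globus_sdk.TransferData(tc, SRC, DST, label="RupVar pre-stage")
    tdata.add_item("/rupvar/", "/scratch/rupvar/", recursive=True)

    task_id = tc.submit_transfer(tdata)["task_id"]

    # Poll for completion instead of waiting for the notification email;
    # task_wait returns True once the task reaches a terminal state.
    while not tc.task_wait(task_id, timeout=600, polling_interval=60):
        print("transfer still running...")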
Slide 8: Swift
Idea: use Swift instead of Pegasus/DAGMan
Because:
– Adaptive scheduler (no advance planning)
– May better suit the opportunistic nature of resources on OSG
  o The Engage VO doesn't have glideinWMS
– More control: Pegasus/DAGMan is a black box, while we can change Swift as needed
Use Coasters (multi-level scheduling)
Compute cost (all data pre-staged):
– A single site (LGU) takes 30k CPU-hours (on PADS and Firefly)
In progress by the UC team, but we need access to large numbers of CPUs to make this reasonable
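For scale (a derived estimate, not on the slide): at the ~226 concurrently running jobs later observed with OSG-MM (slide 11), 30k CPU-hours corresponds to roughly 30,000 / 226 ≈ 130 hours, i.e. about 5.5 days of wall time for a single site.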
Slide 9: Swift implementation

Site site = "LGU";
Sgt sgt;  // in the full script, sgt is mapped to the pre-staged SGT data for this site
Rupture rups[] = get_rupture(sgt);

foreach rup in rups {
    Sgt sub = extract(sgt, rup);                    // subset of the SGT for this rupture
    Variations vars[] = get_variations(site, rup);
    Seismogram seis[];
    PeakValue peak[];
    foreach var, i in vars {
        seis[i] = seismogram(sub, var);     // synthetic seismogram for this variation
        peak[i] = peak_calc(seis[i], var);  // peak ground motion extracted from it
    } // end foreach over vars
} // end foreach over rups
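A point worth noting about this script (Swift semantics, not stated on the slide): both foreach loops are implicitly parallel, so each seismogram/peak_calc pair can run as soon as its extracted SGT and variation inputs exist. This is what exposes the ~840,000-way task parallelism described on slide 3 without any advance DAG planning.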
Slide 10: Second attempt to use OSG
Problems:
– Not enough opportunistic OSG resources available
  o Ideally want a large number of slots on a single resource
– Not enough data storage available on OSG resources for pre-staging
Idea:
– Change the PP workflow to work in smaller chunks and make use of whatever resources we can get (see the sketch below)
In progress by the UC team
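A minimal sketch of the chunking idea, under the assumption that the post-processing work can be treated as a flat task list; pp_tasks and submit_to_available_site are invented names, not part of the actual workflow code.

    # Hypothetical sketch: split the ~600k variation tasks into fixed-size
    # chunks so each chunk can be matched to whatever OSG slots are free.
    def chunks(tasks, size):
        """Yield successive fixed-size slices of the task list."""
        for i in range(0, len(tasks), size):
            yield tasks[i:i + size]

    def submit_to_available_site(chunk):
        """Placeholder: match a chunk to whatever OSG slots are currently free."""
        print(f"submitting chunk of {len(chunk)} PP tasks")

    pp_tasks = list(range(600_000))       # stand-in for the ~600k variation tasks
    for chunk in chunks(pp_tasks, 1000):  # chunk size is arbitrary in this sketch
        submit_to_available_site(chunk)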
Slide 11: Testing OSG-MM (matchmaking on OSG)
Small PP chunks, sample times:
  o Data transfer: 0.329 s
  o Compute: 821.377 s
Test jobs are 1 node, sleeping for 1 hour
Trying to see what throughput we can obtain for SCEC PP jobs, and how to maximize it:
– Running on OSG sites with >= 1,000 CPUs (10 sites), we could reach a throughput of 226 concurrently running jobs
– Running on OSG sites with >= 100 CPUs (22 sites), we could reach a throughput of only 80 concurrently running jobs
Possible explanations:
– OSG-MM is not scaling well
– We don't understand how to use it correctly
The second explanation seems most likely – working with Mats now…
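A derived observation from these samples (not on the slide): per-job data transfer is ~0.33 s against ~821 s of compute, i.e. roughly 0.04% overhead, so small PP chunks remain compute-bound and throughput is limited by slot availability rather than by data movement.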
Slide 12: Future Explorations
Understand the value of OSG and TG together for SCEC, compared with TG alone
Examine use of Lustre-WAN in place of one-time data transfers
– Still use the globus.org data transfer service for transfers that should be cached
Examine use of job-overlay scheduling tools