1
Application-Aware Management of Parallel Simulation Collections
Siu-Man Yau (smyau@cs.nyu.edu), New York University
Steven G. Parker (sparker@cs.utah.edu), University of Utah
Kostadin Damevski (damevski@cs.utah.edu), University of Utah
Vijay Karamcheti (vijayk@cs.nyu.edu), New York University
Denis Zorin (dzorin@cs.nyu.edu), New York University
2
Multi-Experiment Studies
Computational studies require multiple runs of simulation software
3
Multi-Experiment Studies
Existing (batch-based) systems treat each execution as a ‘black box’:
–Issue one simulation at a time
Application-aware system:
–Schedule the collection of simulations as a whole
–Use application-specific knowledge for scheduling and resource allocation decisions
Application-awareness brings a 4X improvement in response time
4
Outline
Example MES: Helium Model Validation
Evaluation platform: SimX
System Application-specific considerations
–Parallel overhead, Sampling, Result reuse, Malleability
Application-Driven Scheduling and Resource Allocation Strategies
Conclusion
5
Helium Model Validation
Gas mixing model for fire simulation
“Knobs” on the model:
–Prandtl number
–Smagorinsky constant
–Grid resolution
–Inlet velocity
–etc.
To validate: compare against a real-life experiment
6
Helium Model Validation
Measure the velocity profile from a real-life experiment
Pick two “knobs”
–Prandtl number
–Inlet velocity
Run simulated experiments
Find the combinations that match the profile at both heights
7
Helium Model Validation
Pareto frontier: the set of inputs for which no other input does better in all objectives
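The Pareto frontier can be illustrated with a minimal sketch. It assumes the two objectives are the mismatches against the measured velocity profile at the two heights (lower is better) and uses the standard dominance test (no worse in every objective, strictly better in at least one), which is slightly stricter than the wording on the slide. The error values and the names `dominates` and `pareto_frontier` are hypothetical, not taken from SimX.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.
    Objectives are errors, so lower is better (an assumption of this sketch)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_frontier(errors):
    """Return the indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(errors)
        if not any(dominates(q, p) for j, q in enumerate(errors) if j != i)
    ]


# Hypothetical (error at height 1, error at height 2) for five simulated runs.
errors = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.8), (0.6, 0.6), (0.9, 0.9)]
frontier = pareto_frontier(errors)
print(frontier)
```

Runs 3 and 4 are both beaten by run 1 in each objective, so only the first three runs remain on the frontier.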
8
Evaluation platform: SimX
System support for Interactive Multi-Experiment Studies (SIMECS)
View the computational study as a whole
For parallel, distributed clusters
–Workers (Simulation code & Evaluation code)
–Manager (UI, Sampler, Resource Allocator)
–Spatially-Indexed Shared Object Layer (SISOL)
9
Evaluation platform: SimX
[Architecture diagram. Labels: SISOL API; Front-end Manager Process (User Interface: Visualisation & Interaction, Sampler, Resource Allocator, FUEL Interface); Worker Process Pool (Simulation code, FUEL Interface, Evaluation code); SISOL Server Pool; Data Server; Dir Server; Task Queue]
10
Application-Awareness
Decision: How many processes for each task?
Application-specific considerations
–Minimize parallelization overhead: concurrent tasks, low parallelism
–Sampling strategy (task dependency): serial tasks, high parallelism
–Reuse opportunities (maximize “reusable” work): serial tasks, high parallelism
–Malleability: claim idle resources when beneficial
These considerations work against each other
11
Consideration: Parallel Overhead
Parallel overhead from communications, load imbalance, etc.
Minimize per-task parallelism
Many concurrent tasks, each using a small number of processes
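A toy model can make the trade-off concrete. The sketch below is not the Helium code's measured behaviour: it assumes a task's run time is its serial time divided by the process count plus a fixed overhead per extra process, and that tasks run in rounds of equal-sized groups. All numbers and the names `task_time` and `makespan` are hypothetical.

```python
import math


def task_time(t_serial, p, overhead_per_proc):
    """Toy model: ideal speedup plus a cost that grows with the process count."""
    return t_serial / p + overhead_per_proc * (p - 1)


def makespan(n_tasks, n_workers, p, t_serial, overhead_per_proc):
    """Total time when tasks run in rounds of (n_workers // p) concurrent groups."""
    concurrent = n_workers // p
    rounds = math.ceil(n_tasks / concurrent)
    return rounds * task_time(t_serial, p, overhead_per_proc)


# Hypothetical study: 16 tasks, 16 workers, 60 min serial time,
# 1 min of overhead for each additional process in a group.
results = {p: makespan(16, 16, p, 60.0, 1.0) for p in (1, 4, 16)}
print(results)
```

Under these assumptions the total time grows as the group size grows, which is the argument for many concurrent tasks with low per-task parallelism when overhead is the only consideration.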
12
Consideration: Sampling
Active sampling: incorporates a search algorithm
Introduces data dependencies between runs
Schedule runs from coarse to fine grid; use coarse-level results to identify promising regions
[Figure: 1st level, 2nd level, 3rd level]
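The coarse-to-fine idea, and the data dependency it introduces, can be sketched as follows. This is an illustrative stand-in, not the SimX sampler: the score function, the starting grid, and the names `refine` and `coarse_to_fine` are hypothetical, and each level simply keeps the best-scoring points and samples around them at half the spacing.

```python
def refine(points, spacing):
    """Neighbours of each promising point on a grid with half the spacing."""
    half = spacing / 2
    return {(x + dx, y + dy)
            for x, y in points
            for dx in (-half, 0, half)
            for dy in (-half, 0, half)}


def coarse_to_fine(score, start, spacing, levels, keep):
    """Evaluate level by level; each level depends on the previous level's results."""
    evaluated = {}
    candidates = set(start)
    for _ in range(levels):
        for p in candidates:
            if p not in evaluated:          # never re-run a finished experiment
                evaluated[p] = score(p)
        best = sorted(candidates, key=lambda p: evaluated[p])[:keep]
        candidates = refine(best, spacing)  # finer grid around promising points
        spacing /= 2
    return evaluated


# Hypothetical objective: squared distance from an unknown best setting.
def score(p):
    return (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2


corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
evaluated = coarse_to_fine(score, corners, spacing=1.0, levels=3, keep=1)
best_point = min(evaluated, key=evaluated.get)
print(best_point, len(evaluated))
```

Runs within one level are independent of each other, but a level cannot start until the previous one has finished, which is why sampling alone favours few tasks at a time with high parallelism.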
13
Consideration: Result Reuse
Helium code terminates when KE stabilizes
Start from another run's checkpoint
–Stabilizes in half the time
Must have the same inlet velocity (reuse classes)
14
Application-awareness
Naïve approach: Assign one worker per task
–Eliminates per-task parallelization overhead
–Does not maximize reuse and sampling efficiency
–Leaves “holes” (idle workers)
Naïve approach: Assign one task at a time to all workers
–Maximizes reuse potential and sampling efficiency
–Maximizes parallelization overhead
Application-aware approach: Batching
–Groups of tasks that are allowed to execute concurrently
15
Solution: Application-awareness
[Architecture diagram, same components as the earlier SimX platform diagram, plus a Simulation Container. Annotated API calls:]
TaskQueue::AddTask(Experiment)
TaskQueue::CreateBatch(set &)
TaskQueue::GetIdealGroupSize()
Reconfigure(const int* assignment)
16
Naïve Approach
Response time = 12 hr 35 min
[Figure annotation: Idle workers]
17
Batch for Sampling
Identify independent experiments in the sampler
Max. parallelism while allowing active sampling
[Figure axes: Prandtl Number, Inlet Velocity]
1st batch: 1st Pareto-optimal
2nd batch: 1st & 2nd Pareto-optimal
3rd batch: 1st to 3rd Pareto-optimal
4th batch: Pareto frontier
18
Batch for Sampling
Response time = 6 hr 10 min
[Figure labels: 1st Batch, 2nd Batch, 3rd Batch, 4th Batch]
19
Batch for Result Reuse
Sub-divide each batch into 2 smaller batches:
–1st sub-batch: first in reuse class; no two belong to the same reuse class
–No two concurrent from-scratch experiments can reuse each other’s checkpoints (max. reuse potential)
–Experiments in the same batch have comparable run times (reduces holes)
[Figure axes: Prandtl Number, Inlet Velocity]
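The sub-division rule can be sketched in a few lines. The sketch assumes, as the Result Reuse slide states, that a reuse class is defined by inlet velocity. The dictionary layout, the `classes_with_checkpoint` parameter, and the name `split_for_reuse` are hypothetical and do not reflect the actual Task Queue API.

```python
def split_for_reuse(batch, classes_with_checkpoint=frozenset()):
    """Split a batch into (from-scratch runs, runs that can reuse a checkpoint).

    The first sub-batch holds at most one experiment per reuse class, so no two
    concurrent from-scratch runs could have shared a checkpoint.
    """
    first, second = [], []
    seen = set(classes_with_checkpoint)
    for exp in batch:
        reuse_class = exp["inlet_velocity"]  # reuse class = same inlet velocity
        if reuse_class in seen:
            second.append(exp)               # a checkpoint will be available
        else:
            seen.add(reuse_class)
            first.append(exp)                # first in its class: run from scratch
    return first, second


# Hypothetical batch of four experiments over two inlet velocities.
batch = [
    {"prandtl": 0.4, "inlet_velocity": 0.3},
    {"prandtl": 0.6, "inlet_velocity": 0.3},
    {"prandtl": 0.4, "inlet_velocity": 0.5},
    {"prandtl": 0.6, "inlet_velocity": 0.5},
]
first, second = split_for_reuse(batch)
print(len(first), len(second))
```

Because the runs in the second sub-batch all start from a checkpoint, they should finish in comparable (shorter) times, which fits the slide's point about reducing holes.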
20
Batch for Result Reuse
Total time: 5 hr 10 min
[Figure labels: 1st Batch, 2nd Batch, 3rd Batch, 4th Batch, 5th Batch, 6th Batch]
21
Preemption
Helium code is malleable:
–Restart a checkpointed run on a different number of workers
Preemption system:
–Manager stores a database of idle workers in SISOL
–Workers use application knowledge to determine whether they should claim idle workers
–Manager creates a new worker group by adding idle workers to the group
–Manager restarts the simulation on the new group
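The slides do not say what application knowledge a worker uses when deciding whether to claim idle workers. One plausible rule, sketched here purely as an assumption, is to claim them only when the restart cost plus the remaining time on the larger group beats the remaining time on the current group. The efficiency model, the numbers, and the name `should_claim` are hypothetical.

```python
def should_claim(remaining_work, group_size, idle, restart_cost, efficiency):
    """Decide whether restarting on a larger group is expected to finish sooner.

    remaining_work: remaining serial work, in worker-minutes
    efficiency:     function p -> parallel efficiency in (0, 1] for p workers
    """
    def time_on(p):
        return remaining_work / (p * efficiency(p))

    stay = time_on(group_size)
    grow = restart_cost + time_on(group_size + idle)
    return grow < stay


# Hypothetical efficiency curve: 5% loss for every additional worker.
def efficiency(p):
    return 1.0 / (1.0 + 0.05 * (p - 1))


long_run = should_claim(120.0, group_size=2, idle=2, restart_cost=5.0, efficiency=efficiency)
short_run = should_claim(10.0, group_size=2, idle=2, restart_cost=5.0, efficiency=efficiency)
print(long_run, short_run)
```

With these numbers a run with a lot of work left benefits from growing, while a run that is nearly finished does not recoup the restart cost.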
22
Preemption
Total time: 4 hr 30 min
[Figure labels: 1st Batch, 2nd Batch, 3rd Batch, 4th Batch, 5th Batch, 6th Batch]
23
Evaluation: Resource Allocation

| Knowledge used | Total time | Utilization rate | Avg. time per run | Improvement |
| None (run on 1 worker) | 12 hr 35 min | 56.3% | 6 hr 17 min | N/A |
| None (run 1 experiment) | 20 hr 35 min | 100% | 34.3 min | N/A |
| + Active Sampling | 6 hr 10 min | 71.1% | 63.4 min | 51% / 70% |
| + Reuse classes | 5 hr 10 min | 71.3% | 39.7 min | 59% / 75% |
| + Preemption | 4 hr 30 min | 91.8% | 34.5 min | 64% / 78% |
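The two figures in the Improvement column appear to be the reduction in total time relative to each of the two naïve baselines (one worker per task, then one experiment at a time). That reading is an inference from the numbers rather than something the slide states, and the short check below reproduces the column under that assumption. The helper names are hypothetical.

```python
def minutes(hours, mins):
    return 60 * hours + mins


# Total times as reported in the evaluation table.
one_worker_per_task = minutes(12, 35)
one_task_at_a_time = minutes(20, 35)
strategies = {
    "+ Active Sampling": minutes(6, 10),
    "+ Reuse classes": minutes(5, 10),
    "+ Preemption": minutes(4, 30),
}


def improvement(total, baseline):
    """Percentage reduction in total time relative to a baseline, rounded."""
    return round(100 * (1 - total / baseline))


table = {
    name: (improvement(t, one_worker_per_task), improvement(t, one_task_at_a_time))
    for name, t in strategies.items()
}
speedup = one_task_at_a_time / strategies["+ Preemption"]
print(table, round(speedup, 2))
```

The same numbers give a speedup of a little over 4.5 for the full strategy against the slower baseline, which is consistent with the 4X claim elsewhere in the deck.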
24
Related Work
Scheduling policies on traditional batch systems:
–Fair Share
–Dynamic Re-partitioning
–Affinity Scheduling
Multi-Processor Scheduling (MPS) Problem
–Theoretical results for various heuristics
Grid-based parameter sweep infrastructures
–Nimrod, Condor, Globus, NetSolve, Virtual Instrument
25
Conclusion
Application-awareness yields a 4X+ improvement in response time
Conclusions:
–View from the application level is important
–Domain knowledge is important
–System API and infrastructure to exploit domain knowledge are important:
Task Queue API for batching; SISOL & Resource Allocator API for pre-emption