Application-Aware Management of Parallel Simulation Collections Siu-Man Yau, New York University Steven G. Parker

Application-Aware Management of Parallel Simulation Collections
Siu-Man Yau (smyau@cs.nyu.edu), New York University
Steven G. Parker (sparker@cs.utah.edu), University of Utah
Kostadin Damevski (damevski@cs.utah.edu), University of Utah
Vijay Karamcheti (vijayk@cs.nyu.edu), New York University
Denis Zorin (dzorin@cs.nyu.edu), New York University

Multi-Experiment Studies
Computational studies require multiple runs of simulation software

Multi-Experiment Studies
Existing (batch-based) systems treat each execution as a ‘black box’:
–Issue one simulation at a time
Application-aware system:
–Schedule the collection of simulations as a whole
–Use application-specific knowledge for scheduling and resource allocation decisions
Application-awareness brings a 4X improvement in response time

Outline
Example MES: Helium Model Validation
Evaluation platform: SimX System
Application-specific considerations
–Parallel overhead, Sampling, Result reuse, Malleability
Application-Driven Scheduling and Resource Allocation Strategies
Conclusion

Helium Model Validation
Gas mixing model for fire simulation
“Knobs” on the model:
–Prandtl number
–Smagorinsky constant
–Grid resolution
–Inlet velocity
–etc.
To validate: compare against a real-life experiment

Helium Model Validation
Measure the velocity profile from the real-life experiment
Pick two “knobs”:
–Prandtl number
–Inlet velocity
Run simulated experiments
Find the combination that matches the profile at both heights

Helium Model Validation
Pareto frontier: the set of inputs for which no other input is better in all objectives at once
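The Pareto frontier on this slide can be illustrated with a simple dominance check. The sketch below is not taken from SimX; it assumes two objectives (the mismatch against the measured profile at each of the two heights, lower is better), and the error values are made up for illustration.

```python
from typing import List, Tuple

# (mismatch at height 1, mismatch at height 2); lower is better in both.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical mismatch values for four (Prandtl number, inlet velocity) settings.
errors = [(0.10, 0.40), (0.20, 0.20), (0.40, 0.10), (0.30, 0.30)]
frontier = pareto_frontier(errors)
```

Here `(0.30, 0.30)` drops out because `(0.20, 0.20)` is better at both heights; the other three settings each trade one height against the other and stay on the frontier.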

Evaluation platform: SimX
System support for Interactive Multi-Experiment Studies (SIMECS)
View the computational study as a whole
For parallel, distributed clusters:
–Workers (Simulation code & Evaluation code)
–Manager (UI, Sampler, Resource Allocator)
–Spatially-Indexed Shared Object Layer (SISOL)

Evaluation platform: SimX
[Architecture diagram. Labels: SISOL API; Front-end Manager Process (User Interface: Visualisation & Interaction, Sampler, Resource Allocator, FUEL Interface); SISOL Server Pool (Data Server, Dir Server); Task Queue; Worker Process Pool (Simulation code, FUEL Interface, Evaluation code)]

Application-Awareness
Decision: how many processes for each task?
Application-specific considerations:
–Minimize parallelization overhead: concurrent tasks, low parallelism
–Sampling strategy (task dependency): serial tasks, high parallelism
–Reuse opportunities (maximize “reusable” work): serial tasks, high parallelism
–Malleability: claim idle resources when beneficial
These considerations work against each other

Consideration: Parallel Overhead
Parallel overhead comes from communication, load imbalance, etc.
Minimize per-task parallelism
Run many concurrent tasks, each using a small number of processes

Consideration: Sampling
Active sampling: incorporates a search algorithm
Introduces data dependencies
Schedule runs from coarse to fine grid; use coarse-level results to identify promising regions
[Figure: 1st level, 2nd level, 3rd level]
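The coarse-to-fine idea can be sketched as follows. This is an illustrative simplification, not the SimX sampler: it ranks points by a single hypothetical mismatch score instead of Pareto optimality, and the names `refine`, `coarse_to_fine`, `mismatch` and `target` are assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Params = Tuple[float, float]  # (Prandtl number, inlet velocity)


def refine(center: Params, step: float) -> List[Params]:
    """Neighbours of a promising point on a grid with half the spacing."""
    half = step / 2
    return [(center[0] + dx, center[1] + dy)
            for dx, dy in product((-half, 0.0, half), repeat=2)
            if (dx, dy) != (0.0, 0.0)]


def coarse_to_fine(score: Callable[[Params], float], coarse: List[Params],
                   step: float, levels: int, keep: int) -> Params:
    """Evaluate one level, keep the `keep` best points, refine only around them."""
    evaluated: Dict[Params, float] = {}
    current = coarse
    for _ in range(levels):
        for p in current:  # runs within a level do not depend on each other
            if p not in evaluated:
                evaluated[p] = score(p)
        best = sorted(current, key=lambda p: evaluated[p])[:keep]
        # The next level depends on this level's results (the data dependency).
        current = [q for c in best for q in refine(c, step)]
        step /= 2
    return min(evaluated, key=lambda p: evaluated[p])


# Hypothetical stand-in for "run the simulation and compare with the experiment".
target = (0.7, 1.2)


def mismatch(p: Params) -> float:
    return (p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2


coarse_grid = [(float(x), float(y)) for x, y in product(range(3), repeat=2)]
best_point = coarse_to_fine(mismatch, coarse_grid, step=1.0, levels=3, keep=1)
```

The inner loop over `current` is where a scheduler is free to run experiments concurrently; the refinement step is where it has to wait for results.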

Consideration: Result Reuse
Helium code terminates when KE stabilizes
Starting from another run's checkpoint:
–stabilizes in half the time
–the runs must have the same inlet velocity (reuse classes)

Application-awareness
Naïve approach: assign one worker per task
–Eliminates per-task parallelization overhead
–Does not maximize reuse and sampling efficiency
–Leaves “holes”
Naïve approach: assign one task at a time to all workers
–Maximizes reuse potential and sampling efficiency
–Maximizes parallelization overhead
Application-aware approach: batching
–Groups of tasks that are allowed to execute concurrently

Solution: Application-awareness
[Architecture diagram as on the SimX slide, with a Simulation Container added on the worker side and the following calls annotated:]
–TaskQueue::AddTask(Experiment)
–TaskQueue::CreateBatch(set &)
–TaskQueue::GetIdealGroupSize()
–Reconfigure(const int* assignment)

Naïve Approach
Response time = 12 hr 35 min
[Schedule chart; idle workers are marked]

Batch for Sampling
Identify independent experiments in the sampler
Maximize parallelism while still allowing active sampling
[Figure: sampled points in the Prandtl Number vs. Inlet Velocity plane. First Batch: 1st Pareto-Optimal; Second Batch: 1st & 2nd Pareto Opt.; 3rd Batch: 1st to 3rd Pareto Opt.; 4th Batch: Pareto Frontier]

Batch for Sampling
Response time = 6 hr 10 min
[Schedule chart: 1st to 4th batches]

Batch for Result Reuse
Sub-divide each batch into 2 smaller batches:
–1st sub-batch: the first experiment in each reuse class; no two belong to the same reuse class
–No two concurrent from-scratch experiments can reuse each other’s checkpoints (max. reuse potential)
–Experiments in the same batch have comparable run times (reduces holes)
[Figure axes: Prandtl Number, Inlet Velocity]
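A minimal sketch of the sub-batch split, assuming a reuse class is identified by inlet velocity alone and that no checkpoints exist from earlier batches. The `Experiment` type and `split_for_reuse` function are hypothetical names for illustration, not part of the Task Queue API.

```python
from typing import List, NamedTuple, Set, Tuple


class Experiment(NamedTuple):
    prandtl: float
    inlet_velocity: float  # equal inlet velocity => same reuse class


def split_for_reuse(batch: List[Experiment]) -> Tuple[List[Experiment], List[Experiment]]:
    """Split a batch into from-scratch runs and runs that can start from a checkpoint."""
    seen: Set[float] = set()
    first: List[Experiment] = []
    rest: List[Experiment] = []
    for e in batch:
        if e.inlet_velocity in seen:
            rest.append(e)   # a same-class checkpoint will exist once `first` finishes
        else:
            seen.add(e.inlet_velocity)
            first.append(e)  # first of its class: has to run from scratch
    return first, rest


batch = [Experiment(0.5, 1.0), Experiment(0.7, 1.0),
         Experiment(0.5, 2.0), Experiment(0.9, 1.0)]
from_scratch, from_checkpoint = split_for_reuse(batch)
```

Running `from_scratch` to completion before `from_checkpoint` means every run in the second sub-batch has a checkpoint to start from, and runs within each sub-batch have similar expected durations.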

Batch for Result Reuse
Total time: 5 hr 10 min
[Schedule chart: 1st to 6th batches]

Preemption
Helium code is malleable:
–A checkpointed run can be restarted on a different number of workers
Preemption system:
–Manager stores a database of idle workers in SISOL
–Workers use application knowledge to determine whether they should claim idle workers
–Manager creates a new worker group by adding the idle workers to the group
–Manager restarts the simulation on the new group
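One way a worker group might decide whether claiming idle workers is worthwhile is to compare the estimated finish time with and without a restart. The slides do not give the actual criterion; the sketch below is an assumption, with a made-up efficiency model standing in for the application knowledge.

```python
from typing import Callable


def should_claim(remaining_work: float, current: int, idle: int,
                 efficiency: Callable[[int], float], restart_cost: float) -> bool:
    """Claim idle workers only if restarting on the larger group finishes sooner.

    remaining_work is in worker-minutes; efficiency(n) is the parallel
    efficiency (0..1) of the simulation on n workers.
    """
    stay = remaining_work / (current * efficiency(current))
    grown = current + idle
    grow = restart_cost + remaining_work / (grown * efficiency(grown))
    return grow < stay


def toy_efficiency(n: int) -> float:
    """Hypothetical model: each extra worker adds 5% overhead."""
    return 1.0 / (1.0 + 0.05 * (n - 1))


long_run = should_claim(100.0, current=2, idle=2,
                        efficiency=toy_efficiency, restart_cost=5.0)
short_run = should_claim(10.0, current=2, idle=2,
                         efficiency=toy_efficiency, restart_cost=5.0)
```

With these numbers the long run benefits from growing to four workers, while the short run would spend more time restarting than it saves.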

Preemption
Total time: 4 hr 30 min
[Schedule chart: 1st to 6th batches]

Evaluation: Resource Allocation

| Knowledge used | Total time | Utilization rate | Avg. time per run | Improvement |
| None (run on 1 worker) | 12 hr 35 min | 56.3% | 6 hr 17 min | N/A |
| None (run 1 experiment) | 20 hr 35 min | 100% | 34.3 min | N/A |
| + Active Sampling | 6 hr 10 min | 71.1% | 63.4 min | 51% / 70% |
| + Reuse classes | 5 hr 10 min | 71.3% | 39.7 min | 59% / 75% |
| + Preemption | 4 hr 30 min | 91.8% | 34.5 min | 64% / 78% |
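The two figures in the Improvement column are consistent with the reduction in total time relative to each of the two naïve baselines (first figure: one worker per task; second figure: one experiment at a time). A short check of that arithmetic, using only the total times from the table:

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an 'H hr M min' total time into minutes."""
    return hours * 60 + mins


def improvement(baseline: int, total: int) -> int:
    """Percentage reduction in total time relative to a baseline, rounded."""
    return round(100 * (1 - total / baseline))


one_worker_per_task = minutes(12, 35)   # naive: run each task on 1 worker
one_task_at_a_time = minutes(20, 35)    # naive: run 1 experiment on all workers

strategies = {
    "+ Active Sampling": minutes(6, 10),
    "+ Reuse classes": minutes(5, 10),
    "+ Preemption": minutes(4, 30),
}

table = {
    name: (improvement(one_worker_per_task, total),
           improvement(one_task_at_a_time, total))
    for name, total in strategies.items()
}

# Speedup of the final configuration over the slower naive baseline.
speedup = one_task_at_a_time / strategies["+ Preemption"]
```

The final ratio, 1235 min / 270 min, is about 4.6, which matches the response-time improvement of roughly 4X quoted in the talk.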

Related Work
Scheduling policies on traditional batch systems:
–Fair Share
–Dynamic Re-partitioning
–Affinity Scheduling
Multi-Processor Scheduling (MPS) problem:
–Theoretical results for various heuristics
Grid-based parameter sweep infrastructures:
–Nimrod, Condor, Globus, NetSolve, Virtual Instrument

Conclusion
Application-awareness yields a more than 4X improvement in response time
Conclusions:
–The application-level view is important
–Domain knowledge is important
–A system API and infrastructure to exploit domain knowledge are important
Task Queue API for batching
SISOL & Resource Allocator API for preemption
