Published by Edgar Cobb. Modified over 9 years ago.
1
Loads Balanced with CQoS
Nicole Lemaster, Damian Rouson, Jaideep Ray
Sandia National Laboratories
Sponsor: DOE
CCA Meeting – January 22, 2009
2
Computational Quality of Service
Definition: The ability to change a simulation code (a collection of CCA components) on-the-fly in order to maintain optimality
  – Optimality: Determined by a user-defined cost function of simulation behavior and/or solution properties
Choose components to make the resultant code
  – Fast
  – Robust
  – Accurate
Component behavior/performance depends on the problem at hand (i.e., the input)
  – Requirement: quantification of problem “difficulty”
  – Requirement: metrics for component performance and input difficulty
Competing constraints
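As an illustration of a user-defined cost function, the following is a minimal sketch and not the authors' implementation. The metric fields, the weights, and the component names are all assumptions chosen for the example; the slide only states that optimality is set by a user-defined cost function and that fast, robust, and accurate compete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentMetrics:
    """Hypothetical per-component measurements; lower is better for each."""
    name: str
    runtime_s: float     # proxy for "fast"
    failure_rate: float  # proxy for "robust"
    error_norm: float    # proxy for "accurate"


def cost(m, w_time=1.0, w_fail=1.0, w_err=1.0):
    """Weighted sum of the three competing metrics (assumed form)."""
    return w_time * m.runtime_s + w_fail * m.failure_rate + w_err * m.error_norm


def choose_component(candidates, **weights):
    """Return the candidate with the lowest cost under the given weights."""
    return min(candidates, key=lambda m: cost(m, **weights))


candidates = [
    ComponentMetrics("fast_solver", runtime_s=1.0, failure_rate=0.2, error_norm=0.1),
    ComponentMetrics("robust_solver", runtime_s=3.0, failure_rate=0.0, error_norm=0.05),
]
default_choice = choose_component(candidates)
robust_choice = choose_component(candidates, w_fail=20.0)
```

Changing the weights changes the winner, which is one way to see why the three goals are competing constraints.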
3
Theoretical Needs
What metrics define performance?
  – FLOPS, iterations to convergence, % load imbalance
What metrics define robustness?
  – Convergence failure, bad load-imbalance, etc.
What metrics define accuracy?
  – Global, local, statistical, deterministic, etc.
All metrics are functions of the component’s input
We need a model that, given component inputs and machine characteristics, predicts the component’s performance
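A performance model of this kind could, in its simplest form, be an empirical fit of runtime against one input parameter. The sketch below assumes a linear relationship between problem size and runtime and uses made-up measurements; the slides do not specify the model form, so both are illustrative assumptions.

```python
def fit_linear_model(sizes, runtimes):
    """Ordinary least-squares fit of runtime = intercept + slope * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, runtimes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, size):
    """Predict the runtime for a new problem size from a fitted model."""
    intercept, slope = model
    return intercept + slope * size


# Hypothetical measurements (problem size, seconds); not data from the talk.
sizes = [100, 200, 300, 400]
runtimes = [1.5, 2.5, 3.5, 4.5]
model = fit_linear_model(sizes, runtimes)
predicted = predict(model, 500)
```

A real model would also take machine characteristics as inputs and would be fitted to measured data rather than to the placeholder values used here.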
4
Practical Needs
To make performance models, we need
  – A collection of components to choose from
  – A test harness for components
  – A performance measurement tool (TAU)
  – A database for storing empirical performance data
  – Statistical tools for model making
To use performance models and make adaptive codes, we need a control system that contains
  – An optimization system to choose the “best” component
  – A feedback system that can take corrective action if a bad component is chosen by the optimization system
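The two parts of the control system listed above, optimization and feedback, can be sketched as a single loop. This is a hypothetical illustration: the function name, the tolerance rule, and the fallback policy are assumptions, not the design presented in the talk.

```python
def run_with_feedback(predicted_cost, measure, tolerance=1.5):
    """Try components in order of predicted cost; accept the first whose
    observed cost stays within `tolerance` times its prediction.

    predicted_cost: dict mapping component name -> predicted cost
    measure: callable taking a component name and returning its observed cost
    """
    # Optimization step: rank candidates by the model's prediction.
    ranked = sorted(predicted_cost, key=predicted_cost.get)
    history = []
    for name in ranked:
        observed = measure(name)
        history.append((name, observed))
        # Feedback step: keep the component only if it performed as predicted.
        if observed <= tolerance * predicted_cost[name]:
            return name, history
    # No component matched its prediction: fall back to the best one observed.
    best = min(history, key=lambda item: item[1])[0]
    return best, history


# Hypothetical predictions and observations for three components.
predicted = {"A": 1.0, "B": 2.0, "C": 3.0}
observed = {"A": 5.0, "B": 2.5, "C": 3.0}
chosen, history = run_with_feedback(predicted, observed.get)
```

Here component A is predicted to be best but performs badly, so the feedback step rejects it and the loop settles on B.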
5
Our Interests
Problem: Create a control system that can choose the best load-balancer for a simulation
  – Load per grid cell varies in both space and time over a Cartesian mesh
What numerical techniques lead to imbalance?
  – Adaptive Cartesian meshes
  – Some operator-split constructions
What applications show such behavior?
  – Hydrocarbon combustion
  – Astrophysics: Type II supernovae
Net result: Simulation becomes load-imbalanced
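To make "load-imbalanced" concrete, one commonly used measure is the excess of the busiest processor's load over the mean load, expressed as a percentage. The slides do not define the metric, so this particular formula is an assumption.

```python
def percent_load_imbalance(loads):
    """Percentage by which the most loaded processor exceeds the mean load.

    loads: per-processor work estimates (assumed non-empty, positive mean).
    """
    mean = sum(loads) / len(loads)
    return 100.0 * (max(loads) - mean) / mean


balanced = percent_load_imbalance([10, 10, 10, 10])
skewed = percent_load_imbalance([10, 10, 10, 30])
```

In the second case one processor carries twice the mean load, so every other processor waits for it at each synchronization point.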
6
Considerations
Solution: Repartition frequently using a fast, dynamic load-balancer
  – Speed is achieved mainly by sacrificing partition quality
  – Some load-balancers favor load balance; others minimize communication time or data migration during repartitioning
Physics and numerics determine whether the simulation is computation- or communication-dominated, so
  – The same load-balancer may not work throughout the run
  – We need to choose a load-balancer anew every time we repartition
7
Control System Configuration
What would a control infrastructure for an analytical control law look like?
[Diagram: Driver, Mesh, Mesh Characterizer, Meta-Partitioner (if-then-else), Partitioner-A, Partitioner-B, Partitioner-C; connections labeled "Mesh", "Partitions", and "Mesh, Control law"]
8
Load-Balancer Selection
Model the simulation to formulate metrics that depend on the current state
  – e.g., communication/computation cost
Characterize the dynamic load-balancers with simplified metrics
  – e.g., communication time, data migration effort, grid shape, runtime
Develop rules to pair simulation state with appropriate partitioner
  – Implement a “meta-partitioner” to select a load-balancing partitioner using the rules
Essentially, the code adapts to the problem!
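A rule-based meta-partitioner of the if-then-else kind described above might look like the sketch below. The thresholds, the rule ordering, and the partitioner labels are hypothetical; the slides do not give the actual rules.

```python
def select_partitioner(comm_time, comp_time, imbalance_pct,
                       comm_threshold=1.0, imbalance_threshold=20.0):
    """Pick a partitioner label from the current simulation state.

    comm_time / comp_time: measured communication and computation cost
    imbalance_pct: current load imbalance in percent
    Thresholds and returned labels are illustrative assumptions.
    """
    ratio = comm_time / comp_time
    if ratio > comm_threshold:
        # Communication-dominated: prefer partitions with small interfaces.
        return "min_communication"
    elif imbalance_pct > imbalance_threshold:
        # Computation-dominated and badly balanced: prefer even loads.
        return "best_load_balance"
    else:
        # Otherwise keep repartitioning cheap.
        return "min_data_migration"


choice_comm = select_partitioner(comm_time=3.0, comp_time=1.0, imbalance_pct=5.0)
choice_load = select_partitioner(comm_time=0.5, comp_time=2.0, imbalance_pct=40.0)
choice_cheap = select_partitioner(comm_time=0.5, comp_time=2.0, imbalance_pct=5.0)
```

The function would be re-evaluated at every repartitioning step, so the selected partitioner can change as the simulation state changes.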
9
Control Systems Research
Mostly done by J. Steensland & H. Johansson
  – Johansson H.; Design and Implementation of a Dynamic and Adaptive Meta-Partitioner for Parallel SAMR Grid Hierarchies
They have a set of parameterized load-balancers
They modeled the relationship between mesh characteristics and the load-balancer inputs that lead to optimal partitions
They tested whether, given a mesh, the model can predict the best load-balancer
  – It cannot predict it reliably, but
  – It provides a set of (~10) candidates that contains the best one
  – Brute-force solution: test all candidates and select the best (takes ~10 seconds)
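The brute-force step, running every candidate and keeping the best, can be sketched on a toy problem. The two partitioners below operate on a 1D list of per-cell loads and are stand-ins invented for this example; they are not the parameterized load-balancers from the cited work, and the quality measure (largest per-part load) is likewise an assumption.

```python
def split_equal_cells(loads, nparts):
    """Toy partitioner: contiguous chunks with equal cell counts."""
    size = -(-len(loads) // nparts)  # ceiling division
    return [loads[i:i + size] for i in range(0, len(loads), size)]


def split_equal_load(loads, nparts):
    """Toy partitioner: close a chunk once it reaches the average load."""
    target = sum(loads) / nparts
    parts, current = [], []
    for load in loads:
        current.append(load)
        if sum(current) >= target and len(parts) < nparts - 1:
            parts.append(current)
            current = []
    parts.append(current)
    return parts


def max_part_load(parts):
    """Quality measure: the heaviest part bounds the parallel step time."""
    return max(sum(p) for p in parts)


def brute_force_select(candidates, loads, nparts):
    """Run every candidate partitioner and return the best name and all scores."""
    scores = {name: max_part_load(fn(loads, nparts))
              for name, fn in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


candidates = {"equal_cells": split_equal_cells, "equal_load": split_equal_load}
best, scores = brute_force_select(candidates, [1, 1, 1, 1, 8, 8], nparts=2)
```

With roughly ten candidates the cost of this search is ten partitioning runs per repartitioning step, which is where the quoted ~10 seconds would come from.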
10
Required Components
Essentially, things in the CCA toolkit
  – Simulation components: a mesh, some integrators, some linear solvers, some physics components, etc.
  – A variety of load-balancers
And a control system to choose load-balancers
11
Mesh Component Status
2D mesh already exists
Part of the tutorial and toolkit
Parallel capabilities
Works in Bocca
Used in reaction-diffusion problems, with multiple integration techniques
Can accommodate slab-wise and block-wise decomposition
No connection to load-balancers yet
  – Does its own simple domain decomposition
Great for tutorials, but too simple for CQoS
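For readers unfamiliar with the two decompositions named above, the sketch below shows one simple way to compute them for an nx-by-ny Cartesian mesh. It is an illustration under assumed conventions (half-open index ranges, remainder cells given to the first parts) and is not the mesh component's actual code.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous half-open (start, stop) ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def slab_decomposition(nx, ny, nprocs):
    """Slab-wise: each process owns full-width strips along y."""
    return [((0, nx), yr) for yr in split_range(ny, nprocs)]


def block_decomposition(nx, ny, px, py):
    """Block-wise: processes form a px-by-py grid of rectangles."""
    return [(xr, yr)
            for yr in split_range(ny, py)
            for xr in split_range(nx, px)]


slabs = slab_decomposition(8, 10, 3)
blocks = block_decomposition(4, 4, 2, 2)
```

Both schemes assign equal cell counts, not equal work, which is why they are too simple when the load per cell varies in space and time.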
12
Over the next 6 months...
Extend mesh to 3D to tackle harder problems
Lemaster, Stone, & Gardiner (2007)
13
Over the next 6 months...
Extend mesh to 3D to tackle harder problems
Extend it to incorporate domain decomposition beyond slab- and block-wise
14
Over the next 6 months...
Extend mesh to 3D to tackle harder problems
Extend it to incorporate domain decomposition beyond slab- and block-wise
  – Sub-domains consisting of a disjoint set of abutting rectangles
Design ports to load-balancers
Identify more interesting applications for use in CQoS testing
  – Construct any extra components needed
  – Solve the problem; quantify the degree of difficulty
15
Results to come!
Contact info:
Nicole Lemaster
Sandia National Labs
mnlemas@sandia.gov