
1 Department of Computer Science at Florida State
LFTI: A Performance Metric for Assessing Interconnect Topology and Routing Design
Background
‒ Innovation in interconnect topology and routing design is essential for future-generation ultra-scale supercomputers.
‒ Current methods for evaluating topology and routing designs are not ideal.

2 Current methods for evaluating interconnect topology and routing design
Topology and routing are evaluated separately.
Topology
‒ Diameter, bisection bandwidth, nodal degree, etc.
‒ Not directly related to application-level performance
Routing with topology
‒ Simulation to obtain throughput and packet latency
‒ Limited network sizes and numbers of scenarios
‒ Simulation sees the trees, but not the forest.
Two kinds of metrics: simple metrics that do not relate directly to performance, and detailed metrics that are too expensive to obtain.

3 Impact of evaluation methods
Evaluation methods set the design optimization objective.
Recent proposals (dragonfly, jellyfish) all have large bisection bandwidth and support certain traffic patterns effectively.
‒ Think of how the designs are justified!
‒ Excellent designs by traditional metrics.
‒ Are these designs good for typical HPC workloads?
‒ There is no metric that can be used to compare different topology and routing designs for HPC workloads.

4 What kind of metrics are we looking for?
Desirable properties:
‒ Reflects overall network performance
‒ Simple enough to be computed quickly; we do not want to run simulations.
A related attempt, effective bisection bandwidth, summarizes network performance by the average performance over all bisection communication patterns.
‒ Is this metric reflective?

5 LFTI: LANL-FSU throughput indices
A metric for throughput performance
High-level ideas
‒ Use modeling to obtain the average throughput for one communication pattern.
‒ Find the set of representative communication patterns to be used in the metric.
‒ Summarize the overall network performance using the average throughput for a large number of communication patterns common to HPC applications.

6 LFTI: LANL-FSU throughput indices
High-level ideas
‒ Once the patterns to be included are determined, LFTI can be derived from most topology and routing specifications without detailed simulation.
‒ If an interconnect can achieve high overall performance for many common HPC patterns, it is likely to provide high performance for HPC workloads.
‒ Unlike some other metrics, LFTI is much harder to cheat.

7 LFTI: LANL-FSU throughput index
LFTI summarizes the throughput of an interconnect over a large number of communication patterns common in HPC applications.
‒ For each communication pattern, the interconnect's performance is quantified by a metric (sustained throughput) that is closely related to the application-level performance for that pattern.
‒ For a class of patterns (e.g. 2DNN patterns), the expected sustained throughput is used to quantify the performance.
‒ LFTI is the aggregate of the performance over many classes of patterns.
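The three levels described on this slide can be written compactly. The slides do not state how the per-class values are aggregated, so the weighted average below is only an assumption, and the symbols $T(p)$, $I_c$, $w_c$ and $\mathcal{C}$ are introduced here for illustration.

```latex
% T(p): sustained throughput of one pattern p
% I_c : throughput index of pattern class c (expected value over the class)
% w_c : assumed class weight; the aggregation rule is not given in the slides
\[
  I_c = \mathbb{E}_{p \in c}\bigl[\, T(p) \,\bigr],
  \qquad
  \mathrm{LFTI} = \sum_{c \in \mathcal{C}} w_c \, I_c,
  \qquad
  \sum_{c \in \mathcal{C}} w_c = 1 .
\]
```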

8 Computing the sustained throughput for a pattern (single-path routing)
‒ Compute the link load (the number of flows going through each link).
‒ The sustained throughput of each flow is either its equal share of the bandwidth of its bottleneck link or its max-min fair rate.
‒ The sustained throughput for the pattern is the aggregate throughput of all flows in the pattern.
‒ Normalization: each per-flow throughput is divided by the input link bandwidth.
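The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes that every link has the same capacity, that each flow's single path is given as a non-empty list of link identifiers, and that max-min fairness is computed by progressive filling. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def link_loads(paths):
    """Count the flows crossing each link; `paths` maps flow -> list of links."""
    load = defaultdict(int)
    for links in paths.values():
        for link in links:
            load[link] += 1
    return dict(load)


def bottleneck_share(paths, capacity=1.0):
    """Each flow gets an equal share of its most heavily loaded link."""
    load = link_loads(paths)
    return {flow: capacity / max(load[l] for l in links)
            for flow, links in paths.items()}


def max_min_fair(paths, capacity=1.0):
    """Max-min fair rates by progressive filling (assumes non-empty paths)."""
    remaining = {l: capacity for links in paths.values() for l in links}
    active = set(paths)
    rate = {}
    while active:
        # Number of still-unfrozen flows on each link.
        count = defaultdict(int)
        for flow in active:
            for l in paths[flow]:
                count[l] += 1
        # The tightest link offers the smallest equal share to its flows.
        tight = min(count, key=lambda l: remaining[l] / count[l])
        share = remaining[tight] / count[tight]
        # Freeze every active flow that crosses the tightest link.
        for flow in [f for f in active if tight in paths[f]]:
            rate[flow] = share
            for l in paths[flow]:
                remaining[l] -= share
            active.remove(flow)
    return rate


def sustained_throughput(rates, input_bandwidth=1.0):
    """Aggregate of per-flow rates, each normalized by the input link bandwidth."""
    return sum(r / input_bandwidth for r in rates.values())
```

For four flows with paths A: [L1], B: [L1, L2], C: [L2], D: [L2] on unit-capacity links, the bottleneck-share rule gives an aggregate of 1.5, while max-min fairness gives 5/3, because flow A can use the capacity on L1 that flow B leaves unused.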

9 Computing the throughput index for a class of patterns
A throughput index for a class of patterns (e.g. 2DNN patterns) is the expected sustained throughput across all patterns of that class.
‒ The index can be obtained by randomly sampling a large number of patterns (e.g. 10000 patterns).
‒ A statistical method may be applied to obtain the index with a given confidence without sampling a large number of patterns.
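The sampling approach on this slide might look like the following sketch. It is not the authors' code. The normal-approximation 95% confidence interval is one possible choice of "statistical method", and the unidirectional ring with clockwise routing is a toy stand-in for a real topology and routing specification. All names are hypothetical.

```python
import math
import random
import statistics


def estimate_index(sample_pattern, throughput, n_samples=10000, seed=0):
    """Monte Carlo estimate of the expected sustained throughput of a class.

    Returns (mean, half_width), where half_width is a normal-approximation
    95% confidence half-width (an assumed choice of statistical method).
    """
    rng = random.Random(seed)
    values = [throughput(sample_pattern(rng)) for _ in range(n_samples)]
    mean = statistics.fmean(values)
    half_width = 0.0
    if n_samples > 1:
        half_width = 1.96 * statistics.stdev(values) / math.sqrt(n_samples)
    return mean, half_width


def ring_throughput(pattern, n):
    """Toy model: unidirectional n-node ring, unit links, clockwise routing.

    Each flow receives an equal share of its most loaded link; the result
    is the aggregate over all flows in the pattern.
    """
    load = [0] * n  # load[i] counts flows on link i -> (i + 1) % n
    routes = []
    for src, dst in pattern:
        hops = [(src + k) % n for k in range((dst - src) % n)]
        for link in hops:
            load[link] += 1
        routes.append(hops)
    return sum(1.0 / max(load[l] for l in hops) for hops in routes if hops)


def random_permutation(n):
    """Return a sampler that draws a random permutation pattern on n nodes."""
    def sample(rng):
        dst = list(range(n))
        rng.shuffle(dst)
        return [(s, d) for s, d in enumerate(dst) if s != d]
    return sample


if __name__ == "__main__":
    n = 16
    mean, hw = estimate_index(random_permutation(n),
                              lambda p: ring_throughput(p, n),
                              n_samples=1000)
    print(f"index = {mean:.3f} +/- {hw:.3f}")
```

Sampling can stop once the half-width falls below a chosen tolerance, which is one way to avoid drawing a fixed, large number of patterns.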

10 Communication patterns in LFTI indices
‒ Patterns with history
  ‒ All-to-all
  ‒ Bisect (effective bisection bandwidth)
‒ Low-dimensional stencil patterns: 2DNN, 2DNN_DIAG, 3DNN, 3DNN_DIAG
‒ Random patterns, for applications with unstructured meshes or adaptive mesh refinement methods: RANDOM 50, RANDOM N50
‒ Commonly used sub-communication patterns: permutation, shift
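To make a few of these pattern classes concrete, the sketch below generates some of them as lists of (source, destination) flows. The slides do not define the patterns, so several readings here are assumptions: 2DNN is read as 2-D nearest neighbor, 2DNN_DIAG as the same pattern with diagonal neighbors added, and the row-major rank mapping and the boundary wrap-around option are illustrative choices. The bisect and RANDOM patterns are omitted because their exact definitions are not given. All function names are hypothetical.

```python
import random


def all_to_all(n):
    """Every node sends to every other node."""
    return [(s, d) for s in range(n) for d in range(n) if s != d]


def shift(n, k):
    """Node i sends to node (i + k) mod n."""
    return [(i, (i + k) % n) for i in range(n)]


def permutation(n, rng):
    """Random permutation; self-pairs are dropped since they use no links."""
    dst = list(range(n))
    rng.shuffle(dst)
    return [(s, d) for s, d in enumerate(dst) if s != d]


def nn2d(rows, cols, diag=False, wrap=True):
    """2-D nearest-neighbor stencil on a rows x cols grid of ranks.

    Ranks are numbered row-major (assumption). With diag=True the four
    diagonal neighbors are included; with wrap=True the grid is periodic.
    """
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diag:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    flows = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if wrap:
                    rr, cc = rr % rows, cc % cols
                elif not (0 <= rr < rows and 0 <= cc < cols):
                    continue
                src, dst = r * cols + c, rr * cols + cc
                if src != dst:
                    flows.add((src, dst))
    return sorted(flows)


if __name__ == "__main__":
    print(shift(4, 1))
    print(len(nn2d(4, 4)), len(nn2d(4, 4, diag=True)))
    print(permutation(8, random.Random(0)))
```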

11 LFTI categories
Trying to reflect how the machine is used:
‒ Whole-system direct-map LFTI
‒ Whole-system random-map LFTI
‒ Job-allocation trace-based LFTI
‒ Largest job based on some job traces

12 Evaluating interconnects using LFTI
Fat-tree (ftree), dragonfly (dfly), hypercube (hcube), 6D torus (6D), 3D torus (3D), and jellyfish (jfish) topologies of 25K-35K nodes, the size of next-generation supercomputers.

13 Throughput index and communication time

14 Whole system direct map LFTI

15 Whole system direct map LFTI

16 Whole system random map LFTI

17 Whole system random map LFTI

18 Job allocation based

19 Job allocation based

20 LFTI summary

21 Conclusion
‒ Traditional performance metrics such as bisection bandwidth (BB) and effective bisection bandwidth (EBB) are not indicative of an interconnect's performance.
‒ Optimizing for BB and EBB may not lead to high-performance interconnects.
‒ LFTI is indicative of application-level performance, yet can be derived rapidly without detailed simulation.
‒ It is a much better metric than the current metrics.

22 LFTI weakness
Communication patterns and weights
‒ Heavily concentrated on simulation-type applications
‒ Not much coverage of data-intensive applications
‒ Calls for performance characterization work to find the truly "representative" workload to be included in the index

23 LFTI weakness
LFTI relies on fast modeling of the throughput performance for each communication pattern.
‒ Depending on the routing algorithm, the modeling can be problematic. Indirect adaptive routing is an example: there is no effective modeling method other than simulation.
‒ New models need to be developed for all existing and future routing schemes, and for whatever else can affect the "sustained throughput".

