
1 A CASE STUDY IN USING MASSIVELY PARALLEL SIMULATION FOR EXTREME-SCALE TORUS NETWORK CO-DESIGN
Misbah Mubarak, Rensselaer Polytechnic Institute
Christopher D. Carothers, Rensselaer Polytechnic Institute
Robert B. Ross, Argonne National Laboratory
Philip Carns, Argonne National Laboratory

2 Agenda
Introduction
Extreme-scale systems design challenges
CODES project
ROSS discrete-event simulator
Torus network model
Exploring torus network designs
Torus network simulation performance
Conclusion & Future work

3 Extreme-scale systems design challenges
Design requirements of an ideal interconnect: high bandwidth & messaging rate, low latency, low diameter, increased locality.

                          Today's supercomputers    Future supercomputers
  Concurrency             1.5 * 10^6                O(1 billion)
  System peak             ~20 Pflop/s               1 Eflop/s
  Node interconnect BW    20 GB/s                   200-400 GB/s
  System size (nodes)     ~100,000                  O(100,000) to O(1M)

* Source: Jack Dongarra. On the future of high-performance computing: how to think for peta and exascale computing. Hong Kong University of Science and Technology, 2012

4 CODES: Enabling Co-Design of Multi-layer Exascale Storage Architectures
Plan:
- Develop an infrastructure that accurately reflects relevant HPC system properties
- Leverage this infrastructure as a tool for investigating storage architecture design
- Eventually, provide a simulation toolkit to the community to enable broader investigation of the design space
Figure: Components of the CODES simulation toolkit (scientific application workloads driving the I/O and network workload generators)

5 Enabling CODES: parallel discrete-event simulation
Discrete-event simulation (DES): a computer model of a system in which changes in system state occur at discrete points in simulation time. Parallel DES (PDES) allows the simulation to execute on a parallel platform.
Idea: use current HPC systems/supercomputers + parallel DES to simulate future supercomputers.
The Rensselaer Optimistic Simulation System (ROSS) provides the PDES capability for CODES:
- Schedules events optimistically; rollback is realized via reverse computation (a minimal sketch follows below)
- Logical processes (LPs) model the state of the system
Simulation prerequisites for co-design: cost (time + memory), fidelity + accuracy, scalability.
* Source: Christopher D. Carothers, Misbah Mubarak, Robert B. Ross, Philip Carns and Jeffrey Vetter, Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim), 2013
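To make the reverse-computation idea concrete, here is a minimal, self-contained C sketch. It is our illustration, not the actual ROSS API: the names (`lp_state`, `arrive`, `arrive_rc`) and structures are hypothetical and deliberately simpler than ROSS's handler signatures. The pattern is that each forward event handler stashes just enough in the event to let a matching reverse handler undo the state change exactly, so rollback needs no saved copy of the state.

```c
#include <assert.h>
#include <stdio.h>

/* Per-LP state for a toy network node: counters only. */
typedef struct { long queued, forwarded; } lp_state;

/* Per-event data: records which branch the forward handler took. */
typedef struct { int was_empty; } msg;

/* Forward handler: apply the state change and stash what is needed
 * to invert it later. */
void arrive(lp_state *s, msg *m)
{
    m->was_empty = (s->queued == 0);
    s->queued++;
    if (m->was_empty)
        s->forwarded++;   /* empty queue: packet moves on immediately */
}

/* Reverse handler: exactly undo the forward handler using the flag
 * stored in the event, realizing rollback by reverse computation. */
void arrive_rc(lp_state *s, msg *m)
{
    if (m->was_empty)
        s->forwarded--;
    s->queued--;
}

int main(void)
{
    lp_state s = {0, 0};
    msg m;
    arrive(&s, &m);      /* optimistic execution...              */
    arrive_rc(&s, &m);   /* ...rolled back on a causality error  */
    assert(s.queued == 0 && s.forwarded == 0);
    printf("state restored\n");
    return 0;
}
```

The design point is that per-event flags replace state snapshots, which is what keeps optimistic execution memory-cheap at the scales discussed later in the deck.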

6 Agenda
Introduction
Extreme-scale systems design challenges
CODES project
ROSS discrete-event simulator
Torus network model
Exploring torus network designs
Torus network simulation performance
Conclusion & Future work

7 Model-net: CODES interconnect component
An abstraction layer that lets network models send messages across components: a consistent API sends messages over the dragonfly, torus, or other network models without changing the higher-level code.
The network workload component drives the interconnect component (compute, I/O, and compute-I/O interconnects).
Figure: CODES interconnect component (the model-net API layered over the torus, dragonfly, and simple-net models)
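The abstraction can be pictured with a small C sketch. This is not the model-net API itself; the names (`net_method`, `model_send`) are ours and stand in for the idea that one call site dispatches to whichever concrete network model is configured.

```c
#include <stdio.h>
#include <stdint.h>

/* A method table: each network model supplies its own send routine. */
typedef struct {
    const char *name;
    void (*send)(uint64_t src, uint64_t dst, size_t bytes);
} net_method;

static void torus_send(uint64_t s, uint64_t d, size_t n)
{ printf("torus: %zu bytes %llu -> %llu\n", n,
         (unsigned long long)s, (unsigned long long)d); }

static void dragonfly_send(uint64_t s, uint64_t d, size_t n)
{ printf("dragonfly: %zu bytes %llu -> %llu\n", n,
         (unsigned long long)s, (unsigned long long)d); }

static net_method methods[] = {
    { "torus",     torus_send     },
    { "dragonfly", dragonfly_send },
};

/* One call for all models: switching topologies changes only the
 * selected method, never the higher-level workload code. */
void model_send(int net_id, uint64_t src, uint64_t dst, size_t bytes)
{
    methods[net_id].send(src, dst, bytes);
}

int main(void)
{
    model_send(0, 3, 42, 1024);   /* same call site, torus model     */
    model_send(1, 3, 42, 1024);   /* same call site, dragonfly model */
    return 0;
}
```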

8 Torus network topology: introduction
A k-ary n-cube network: each torus node is connected to 2n other nodes.
Uses physical locality to produce high nearest-neighbor throughput; well suited for applications involving extensive nearest-neighbor communication.
Large network diameter → high hop count → can limit bisection bandwidth.
Widely used: IBM Blue Gene, Cray XT5, and Cray XE6 systems.
Figure: a 3-ary 3-D torus network
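The 2n-neighbor property follows directly from the coordinates: a node has one neighbor up and one down in each of its n dimensions, with wraparound links closing each ring. A small sketch (coordinate order and numbering are our choice) for the 3-ary 3-D torus in the figure:

```c
#include <stdio.h>

#define N 3   /* dimensions (3-D torus) */
#define K 3   /* radix (3-ary)          */

/* Print the 2*N neighbors of node c: +/-1 in each dimension,
 * wrapping modulo K around each ring. */
void print_neighbors(const int c[N])
{
    for (int d = 0; d < N; d++) {
        for (int dir = -1; dir <= 1; dir += 2) {
            int nb[N];
            for (int i = 0; i < N; i++)
                nb[i] = c[i];
            nb[d] = (c[d] + dir + K) % K;   /* wraparound link */
            printf("(%d,%d,%d)\n", nb[0], nb[1], nb[2]);
        }
    }
}

int main(void)
{
    int c[N] = {0, 0, 0};
    print_neighbors(c);   /* prints the 2*N = 6 neighbors */
    return 0;
}
```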

9 Extreme-scale torus simulation
Design and model: a detailed packet/chunk-level simulation; virtual-channel flow control regulates buffer space; deterministic routing (a routing sketch follows below).
Validate: validated ROSS torus model performance against the existing Blue Gene/P and Blue Gene/Q architectures.
Scale: use the existing Blue Gene/Q architecture to model current and future interconnects.
Explore and examine results.
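The slide says only "deterministic routing"; dimension-order routing is the usual deterministic choice on tori, so the following C sketch assumes it (the function names and the shortest-way-around tie-break are ours, and the real CODES torus model is far more detailed, with virtual channels and per-chunk events):

```c
#include <stdio.h>

#define N 3   /* dimensions */
#define K 8   /* radix: nodes per ring */

/* Dimension-order routing: resolve dimension 0 fully, then 1, etc.
 * Returns the dimension to traverse next and sets *dir to +1 or -1
 * (the shorter way around the ring); returns -1 on arrival. */
int next_hop(const int cur[N], const int dst[N], int *dir)
{
    for (int d = 0; d < N; d++) {
        if (cur[d] == dst[d])
            continue;                         /* dimension resolved   */
        int fwd = (dst[d] - cur[d] + K) % K;  /* hops going "up"      */
        *dir = (fwd <= K - fwd) ? +1 : -1;    /* shorter torus route  */
        return d;
    }
    return -1;
}

int main(void)
{
    int cur[N] = {7, 2, 0}, dst[N] = {1, 2, 5}, dir, d;
    while ((d = next_hop(cur, dst, &dir)) >= 0) {
        cur[d] = (cur[d] + dir + K) % K;      /* take one hop */
        printf("hop in dim %d (%+d) -> (%d,%d,%d)\n",
               d, dir, cur[0], cur[1], cur[2]);
    }
    return 0;
}
```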

10 Torus network model: simulation design

11 Torus network model: validation study on Blue Gene
We validated the point-to-point messaging of the ROSS torus model against the Blue Gene/P and Blue Gene/Q architectures:
- Used the mpptest performance benchmark (developed at Argonne) to measure MPI performance on Blue Gene
- Configured the ROSS torus model with Blue Gene/P and Blue Gene/Q torus bandwidth and delay parameters
- Configured MPI messages in mpptest and in the ROSS torus model to traverse a fixed number of torus hops
- Obtained close latency agreement between the ROSS torus model and the Blue Gene/P and /Q 3-D and 5-D tori
Figures: MPI message latency of the ROSS torus model vs. mpptest on the Argonne BG/P (1 mid-plane) and RPI CCNI BG/Q (1 rack) networks at 8 hops, and on the RPI CCNI BG/Q (1 rack) network at 11 hops, 1 MPI rank per node
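For readers who want the shape of such a measurement, here is a minimal MPI ping-pong latency probe in the spirit of mpptest. This is our sketch, not the mpptest source (mpptest adds careful repetition and message-size sweeps); the hop count in the real experiments comes from pinning the two ranks to nodes a known number of torus hops apart.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 1000
#define MSG_BYTES 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[MSG_BYTES];
    memset(buf, 0, sizeof buf);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {        /* send, then wait for the echo */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* echo everything back */
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency = round trip / 2 */
        printf("avg one-way latency: %.3f us\n",
               (t1 - t0) / (2.0 * ITERS) * 1e6);
    MPI_Finalize();
    return 0;
}
```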

12 Agenda
Introduction
Extreme-scale systems design challenges
CODES project
ROSS discrete-event simulator
Torus network model
Exploring torus network designs
Torus network simulation performance
Conclusion & Future work

13 Extreme-scale torus simulation: configuration & traffic
Properties: dimensionality + link bandwidth. Key: strike the right balance between dimensionality and link bandwidth.
We tuned the torus model by holding total bandwidth per torus node fixed, comparing:
- A torus node with more dimensions and narrower channels
- A torus node with fewer dimensions and wider channels
(The per-link arithmetic is worked below.)
Traffic patterns:
- Nearest-neighbor traffic: a measure of local network communication
- Diagonal traffic pattern: a measure of network bisection bandwidth
In the simulation, we assumed a mapping of 1 MPI rank per node; packet injection stops at a fixed point during the simulation.
Figure: a torus node and its neighbors in (a) a 5-D torus with wider channels @ 2.0 GB/s, (b) a 7-D torus @ 1.43 GB/s, and (c) a 9-D torus with narrower channels @ 1.11 GB/s
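The per-link figures follow from dividing a fixed per-node bandwidth budget over the 2n links of an n-D torus. The slide's numbers (2.0, 1.43, 1.11 GB/s) are consistent with a 20 GB/s budget; that total is our inference from the listed values, not stated on the slide.

```c
#include <stdio.h>

int main(void)
{
    const double node_bw = 20.0;   /* GB/s per node (assumed budget) */
    /* An n-D torus node has 2n links; split the budget evenly. */
    for (int n = 5; n <= 9; n += 2)
        printf("%d-D torus: %d links, %.2f GB/s per link\n",
               n, 2 * n, node_bw / (2 * n));
    return 0;
}
/* Output: 2.00 GB/s (5-D), 1.43 GB/s (7-D), 1.11 GB/s (9-D),
 * matching the figure's channel rates. */
```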

14 Nearest-neighbor traffic: impact of job size
Are small-scale simulations indicative of network behavior at an exascale size?
Communication patterns of scientific applications are determined by their own physics; we chose a fixed traffic pattern and mapped it onto different torus configurations.
- At small scale (1K nodes), a 5-D torus performs better than a 7-D torus and, in some cases, better than a 9-D torus
- At large scale (1.3M nodes), a 9-D torus performs better than the 7-D and 5-D tori
Figures: average and maximum latency of 1,024-node and 1,310,720-node 5-D, 7-D, and 9-D torus model simulations, with each MPI rank communicating with 18 other ranks

15 Nearest-neighbor traffic: impact of job size
Does a 9-D torus still outperform a 7-D and a 5-D torus if an application communicates with only a limited set of neighbors?
- The 5-D and 7-D tori perform better in this case
- Some links of the 9-D torus are under-utilized while others are over-utilized
- The 9-D torus has less bandwidth per link than the 5-D and 7-D tori
Figure: average and maximum latency of 1,310,720-node 5-D, 7-D, and 9-D torus model simulations, with each MPI rank communicating with 10 other ranks

16 Diagonal traffic pattern: impact of job size
Does torus dimensionality affect traffic patterns that communicate with the far end of the network?
- Executed the diagonal traffic pattern on 5-D, 7-D, and 9-D tori with 1K and 1.3M nodes
- The performance picture at 1K nodes differs from that at 1.3M nodes: at 1K nodes the 5-D torus has the highest bisection bandwidth, whereas at 1.3M nodes the 9-D torus does
- Network performance at small scale differs from performance at large scale
Figures: average and maximum latency of 1,024-node and 1,310,720-node 5-D, 7-D, and 9-D torus models under the diagonal traffic pattern
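One plausible reading of the "diagonal" pattern, which we assume here without confirmation from the deck: each node sends to the node half-way around the ring in every dimension, so every message must cross the network bisection.

```c
#include <stdio.h>

#define N 5   /* torus dimensionality */
#define K 4   /* radix per dimension  */

/* Diagonal partner (our assumed definition): the node farthest away
 * along every ring, i.e. offset K/2 in each dimension. */
void diagonal_partner(const int c[N], int p[N])
{
    for (int i = 0; i < N; i++)
        p[i] = (c[i] + K / 2) % K;
}

int main(void)
{
    int c[N] = {0, 1, 2, 3, 0}, p[N];
    diagonal_partner(c, p);
    for (int i = 0; i < N; i++)
        printf("dim %d: %d -> %d\n", i, c[i], p[i]);
    return 0;
}
```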

17 Agenda
Introduction
Extreme-scale systems design challenges
CODES project
ROSS discrete-event simulator
Torus network model
Exploring torus network designs
Torus network simulation performance
Conclusion & Future work

18 Simulation performance in ROSS
Metrics used to determine ROSS performance:
- Simulation runtime (seconds)
- Event rate (events per second)
- Event efficiency (%)
To reduce state-saving overheads, ROSS employs an event-rollback mechanism, which determines simulation efficiency: event efficiency drops as the number of rolled-back events grows (one common formula is sketched below).
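A common way to express this for optimistic PDES, which we assume is close to what ROSS reports, is the fraction of processed events that were not rolled back:

```c
#include <stdio.h>

/* Event efficiency (assumed formula, our sketch): the percentage of
 * processed events whose work was not undone by rollback. */
double event_efficiency(long long processed, long long rolled_back)
{
    return 100.0 * (1.0 - (double)rolled_back / (double)processed);
}

int main(void)
{
    /* e.g., 10 billion events processed, 1.5 billion rolled back */
    printf("efficiency: %.1f%%\n",
           event_efficiency(10000000000LL, 1500000000LL));  /* 85.0% */
    return 0;
}
```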

19 Simulation performance: million-node torus model
- Maximum simulation runtime does not exceed 70 seconds for 1.3M simulated torus nodes; the ROSS event rate reaches 750 million events per second
- Bigger picture: how far are we from real-time simulation?
Figures: performance of the ROSS torus model on the Mira Blue Gene/Q with 65,536 MPI tasks, under nearest-neighbor and diagonal pairing traffic

20 Agenda
Introduction
Extreme-scale systems design challenges
CODES project
ROSS discrete-event simulator
Torus network model
Exploring torus network designs
Torus network simulation performance
Conclusion & Future work

21 Conclusion
- Applied parallel discrete-event simulation to model a high-fidelity torus interconnect for CODES at extreme scale
- Validated the torus network model against the Blue Gene architecture
- Used relevant HPC traffic patterns to explore the behavior of various extreme-scale torus configurations
- Found that large-scale simulations are critical in the design of exascale systems
- Can simulate a large-scale torus network topology in a reasonable amount of time on today's HPC systems
- Work is in progress to replay scientific application workloads on CODES network models

22 Future work
- Model collective communication algorithms on dragonfly and torus networks
- Carry out a ROSS simulation performance study for the modeling of collective algorithms
- Perform experiments on the network models with network traces from DOE Design Forward, PHASTA, and ROSS PHOLD models

23 Acknowledgements & Developer Access
We gratefully acknowledge the support of this work by the U.S. Department of Energy (DOE).
ROSS download: https://github.com/carothersc/ROSS
CODES download: http://www.mcs.anl.gov/projects/codes/developer-access/

