A CASE STUDY IN USING MASSIVELY PARALLEL SIMULATION FOR EXTREME-SCALE TORUS NETWORK CO-DESIGN
Misbah Mubarak, Rensselaer Polytechnic Institute; Christopher D. Carothers, Rensselaer Polytechnic Institute; Robert B. Ross, Argonne National Laboratory; Philip Carns, Argonne National Laboratory

Agenda
- Introduction
- Extreme-scale systems design challenges
- CODES project
- ROSS discrete-event simulator
- Torus network model
- Exploring torus network designs
- Torus network simulation performance
- Conclusion & Future work

Extreme-scale systems design challenges
Design requirements of an ideal interconnect: high bandwidth & messaging rate, low latency, low diameter, increased locality.

                        Today's supercomputers   Future supercomputers
Concurrency             1.5 * 10^6               O(1 billion)
System peak             ~20 Pflop/s              1 Eflop/s
Node interconnect BW    20 GB/s                  GB/s
System size (nodes)     100,000                  O(100,000) or O(1M)

* Source: Jack Dongarra. On the future of high-performance computing: how to think for peta and exascale computing. Hong Kong University of Science and Technology, 2012.

CODES: Enabling CO-Design of Multi-layer Exascale Storage Architectures
Plan:
- Develop an infrastructure that accurately reflects relevant HPC system properties
- Leverage this infrastructure as a tool for investigating storage architecture design
- Eventually, provide a simulation toolkit to the community to enable broader investigation of the design space

Figure: Components of the CODES simulation toolkit (scientific application workloads, I/O workload generator, network workload generator)

Enabling CODES: parallel discrete-event simulation
- Discrete-event simulation (DES): a computer model of a system in which state changes occur at discrete points in simulation time
- Parallel DES (PDES) allows the simulation to execute on a parallel platform
- Idea: use current HPC systems/supercomputers + parallel DES to simulate future supercomputers
- The Rensselaer Optimistic Simulation System (ROSS) provides the PDES capability for CODES: it optimistically schedules events, with rollback realized via reverse computation; logical processes (LPs) model the state of the system
Simulation prerequisites for co-design: cost (time + memory), fidelity + accuracy, scalability
* Source: Christopher D. Carothers, Misbah Mubarak, Robert B. Ross, Philip Carns, and Jeffrey Vetter, in the Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim).
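To make the reverse-computation idea concrete, here is a minimal sketch of a torus-node LP with a forward event handler and a matching reverse handler. It follows the ROSS programming style, but the type and function names are simplified placeholders rather than the actual ROSS API.

```c
/* Minimal sketch of an optimistic-PDES logical process (LP) with
 * reverse computation.  The type and function names are illustrative
 * placeholders, not the real ROSS API. */

typedef struct {
    long packets_forwarded;   /* LP state mutated by events            */
    int  vc_credits;          /* credits left on an outgoing channel   */
} node_state;

typedef struct {
    int credits_consumed;     /* remembered so the event can be undone */
} packet_event;

/* Forward handler: applied when the event is (optimistically) executed. */
void node_event_forward(node_state *s, packet_event *e)
{
    e->credits_consumed = (s->vc_credits > 0) ? 1 : 0;
    s->vc_credits      -= e->credits_consumed;
    s->packets_forwarded++;
    /* ...schedule the packet's arrival at the next hop here... */
}

/* Reverse handler: called on rollback; undoes exactly what the forward
 * handler did, instead of restoring a saved copy of the state. */
void node_event_reverse(node_state *s, packet_event *e)
{
    s->packets_forwarded--;
    s->vc_credits += e->credits_consumed;
}
```

Reverse computation avoids saving a copy of the LP state for every event, which keeps memory costs low enough to simulate millions of torus nodes.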

Model-net: CODES interconnect component
- An abstraction layer that allows network models to send messages across components
- A consistent API that sends messages over the dragonfly, torus, or other network models without changing the higher-level code

Figure: CODES interconnect component (the network workload component drives the model-net API, which sits on top of the torus, dragonfly, and simple-net models; the interconnect can serve as a compute, I/O, or compute-I/O interconnect)
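A rough sketch of the pattern this abstraction enables is shown below; the enum values, struct, and function names are hypothetical stand-ins, not the actual model-net interface.

```c
/* Sketch of a network-model-agnostic send layer in the spirit of
 * model-net.  All names here are hypothetical, not the CODES API. */
#include <stddef.h>

typedef enum { NET_TORUS, NET_DRAGONFLY, NET_SIMPLENET } net_model_t;

typedef struct {
    net_model_t model;   /* selected by configuration, not by caller code */
    void (*send)(int dest, const void *payload, size_t len);
} net_iface;

/* Higher-level code (e.g. a synthetic workload generator) only sees
 * this call; the concrete network model behind it can be swapped freely. */
static void workload_send(net_iface *net, int dest,
                          const void *payload, size_t len)
{
    net->send(dest, payload, len);
}
```

The point of the design is that the same workload code can be rerun unchanged over a torus, a dragonfly, or a simplified placeholder network.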

Torus network topology: Introduction
- A k-ary n-cube network; each torus node is connected to 2*n other nodes
- Uses physical locality to produce high nearest-neighbor throughput
- Well suited for applications involving extensive nearest-neighbor communication
- Large network diameter and high hop count can limit bisection bandwidth
- Widely used: IBM Blue Gene, Cray XT5, and Cray XE6 systems

Figure: A 3-ary 3-D torus network
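For orientation, the sketch below enumerates the 2*n neighbors of a node in a k-ary n-cube by converting between linear node IDs and per-dimension coordinates; it is an illustration, not code from the model.

```c
/* Enumerate the 2*n neighbors of a node in a k-ary n-cube (torus). */
#include <stdio.h>

#define N_DIMS 3   /* n: number of dimensions        */
#define K      3   /* k: nodes per dimension (3-ary) */

/* Convert a linear node id to per-dimension coordinates. */
static void to_coords(int id, int coords[N_DIMS])
{
    for (int d = 0; d < N_DIMS; d++) {
        coords[d] = id % K;
        id /= K;
    }
}

/* Convert per-dimension coordinates back to a linear node id. */
static int to_id(const int coords[N_DIMS])
{
    int id = 0;
    for (int d = N_DIMS - 1; d >= 0; d--)
        id = id * K + coords[d];
    return id;
}

int main(void)
{
    int c[N_DIMS];
    to_coords(13, c);                        /* node 13 in a 3-ary 3-cube */
    for (int d = 0; d < N_DIMS; d++) {
        for (int dir = -1; dir <= 1; dir += 2) {
            int nb[N_DIMS];
            for (int i = 0; i < N_DIMS; i++) nb[i] = c[i];
            nb[d] = (nb[d] + dir + K) % K;   /* wrap-around torus link */
            printf("neighbor: %d\n", to_id(nb));
        }
    }
    return 0;
}
```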

Extreme-scale torus simulation
- Design and model: a detailed packet-chunk-level simulation; uses virtual channel flow control to regulate buffer space; deterministic routing
- Validate: validated ROSS torus model performance against the existing Blue Gene/P and Blue Gene/Q architectures
- Scale: use the existing Blue Gene/Q architecture to model current & future interconnects
- Explore and examine results
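Deterministic routing on a torus is commonly dimension-order routing: resolve one dimension at a time, taking the shorter way around each ring. A minimal sketch of that next-hop decision, under the assumption that this is the style of deterministic routing meant here, is below.

```c
/* Sketch of deterministic dimension-order routing on a k-ary n-cube:
 * resolve dimension 0 first, then dimension 1, and so on, always
 * taking the shorter way around each ring.  Illustrative only. */
#define N_DIMS 3
#define K      8

/* Picks the dimension to route in and writes +1 or -1 into *dir;
 * returns -1 once the packet has reached its destination. */
static int next_hop(const int cur[N_DIMS], const int dst[N_DIMS], int *dir)
{
    for (int d = 0; d < N_DIMS; d++) {
        if (cur[d] == dst[d])
            continue;                            /* dimension already done */
        int forward = (dst[d] - cur[d] + K) % K; /* hops in the "+" direction */
        *dir = (forward <= K / 2) ? +1 : -1;     /* shorter way around */
        return d;
    }
    return -1;                                   /* arrived */
}
```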

Torus network model: Simulation design
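As noted on the previous slide, the model regulates buffer space with virtual-channel flow control. A minimal sketch of the credit bookkeeping such a scheme typically involves is shown below; the constants and names are illustrative, not the CODES implementation.

```c
/* Minimal sketch of credit-based virtual-channel (VC) flow control:
 * a sender may inject a packet chunk into a VC only if it holds a
 * credit, and the downstream node returns the credit once the chunk
 * leaves its buffer.  Names and sizes are illustrative only. */

#define NUM_VCS        4
#define VC_BUF_CHUNKS  8   /* buffer slots (credits) per VC */

typedef struct {
    int credits[NUM_VCS];  /* credits the sender currently holds */
} vc_state;

static void vc_init(vc_state *s)
{
    for (int v = 0; v < NUM_VCS; v++)
        s->credits[v] = VC_BUF_CHUNKS;
}

/* Try to send one chunk on VC v: returns 1 on success, 0 if the
 * downstream buffer is full and the chunk must wait. */
static int vc_try_send(vc_state *s, int v)
{
    if (s->credits[v] == 0)
        return 0;
    s->credits[v]--;       /* one downstream buffer slot now occupied */
    return 1;
}

/* Called when the downstream node frees a buffer slot on VC v. */
static void vc_credit_return(vc_state *s, int v)
{
    s->credits[v]++;
}
```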

Torus network model: Validation study on Blue Gene
- We validated the point-to-point messaging of the ROSS torus model against the Blue Gene/P and Blue Gene/Q architectures
- Used the mpptest performance benchmark (developed at Argonne) to measure MPI performance on Blue Gene
- Configured the ROSS torus model with Blue Gene/P and Blue Gene/Q torus bandwidth & delay parameters
- Configured MPI messages in mpptest and the ROSS torus model to traverse a fixed number of torus hops
- Obtained close latency agreement between the ROSS torus model and the Blue Gene/P 3-D and Blue Gene/Q 5-D tori

Figures: MPI message latency of the ROSS torus model vs. mpptest on the Argonne BG/P (1 mid-plane) and RPI CCI BG/Q (1 rack) networks, 8 hops, 1 MPI rank per node; and on the RPI CCNI BG/Q (1 rack) network, 11 hops, 1 MPI rank per node
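The measurement style is a standard MPI ping-pong between two ranks placed a fixed number of hops apart. A compact benchmark in that spirit (not mpptest itself; message size and iteration count are arbitrary) looks roughly like this:

```c
/* Simple MPI ping-pong latency probe between ranks 0 and 1, in the
 * spirit of mpptest.  Rank placement (and therefore hop count) is
 * assumed to be controlled externally, e.g. via the job's mapping. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { MSG_BYTES = 1024, ITERS = 1000 };
    char buf[MSG_BYTES] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %g us\n",
               (MPI_Wtime() - t0) / (2.0 * ITERS) * 1e6);

    MPI_Finalize();
    return 0;
}
```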

Extreme-scale torus simulation: configuration & traffic
- Properties: dimensionality + link bandwidth. Key: strike the right balance between dimensionality and link bandwidth
- Tuned the torus model by fixing the total bandwidth per torus node: either a torus node with more dimensions and narrower channels, or a torus node with fewer dimensions and wider channels
- Traffic patterns: nearest-neighbor traffic (a measure of local network communication) and a diagonal traffic pattern (a measure of network bisection bandwidth)
- In the simulation, we assumed a mapping of 1 MPI rank per node; packet injection stops at a certain point during the simulation

Figure: (a) a 5-D torus node with wider 2.0 GB/s links, (b) a 7-D torus node with 1.43 GB/s links, (c) a 9-D torus node with narrower 1.11 GB/s links
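The fixed per-node bandwidth budget makes the trade-off explicit: spreading a budget of roughly 20 GB/s (inferred from the 2.0/1.43/1.11 GB/s figures above, and an assumption rather than a stated parameter) over the 2*n links of an n-dimensional torus reproduces those per-link bandwidths. A quick check:

```c
/* Per-link bandwidth for a fixed per-node bandwidth budget spread
 * over the 2*n links of an n-dimensional torus.  The ~20 GB/s budget
 * is inferred from the slide's 2.0 / 1.43 / 1.11 GB/s figures. */
#include <stdio.h>

int main(void)
{
    const double node_bw_gbs = 20.0;           /* assumed per-node budget */
    const int dims[] = { 5, 7, 9 };

    for (int i = 0; i < 3; i++) {
        int links = 2 * dims[i];
        printf("%d-D torus: %d links, %.2f GB/s per link\n",
               dims[i], links, node_bw_gbs / links);
    }
    return 0;
}
```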

Nearest neighbor traffic: Impact of job size
- Are small-scale simulations indicative of network behavior at exascale size?
- Communication patterns of scientific applications are dictated by the physics they model; we chose a fixed traffic pattern and mapped it onto different torus configurations
- At small scale (1K nodes), a 5-D torus performs better than a 7-D torus and, in some cases, better than a 9-D torus
- At large scale (1M nodes), a 9-D torus gives better performance than the 7-D and 5-D tori

Figures: average and maximum latency of 1,024-node and 1,310,720-node 5-D, 7-D, and 9-D torus model simulations, with each MPI rank communicating with 18 other ranks

Nearest neighbor traffic: impact of job size
- Does a 9-D torus still perform better than a 7-D or 5-D torus if an application communicates with only a limited set of neighbors?
- A 5-D or 7-D torus performs better in this case: some links of the 9-D torus are under-utilized while others are over-utilized, and the 9-D torus has less bandwidth per link than a 5-D or 7-D torus

Figure: average and maximum latency of a 1,310,720-node 5-D, 7-D, and 9-D torus model simulation, with each MPI rank communicating with 10 other ranks

Diagonal traffic pattern: impact of job size
- Does torus dimensionality affect traffic patterns that communicate with the far end of the network?
- Executed the diagonal traffic pattern on 5-D, 7-D, and 9-D tori with 1K and 1.3M nodes
- The performance picture at 1K nodes differs from that at 1.3M nodes: at 1K nodes the 5-D torus has the highest bisection bandwidth, whereas at 1.3M nodes the 9-D torus has higher bisection bandwidth
- Network performance at small scale differs from performance at large scale

Figures: average and maximum latency of 1,024-node and 1,310,720-node 5-D, 7-D, and 9-D torus models with the 9-D diagonal traffic pattern
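One plausible way to generate a diagonal partner is to pair each node with the node half-way around every ring, so that traffic is forced across the network bisection; this is sketched below, and the exact pairing used in the study may differ.

```c
/* One plausible "diagonal" partner function: pair each node with the
 * node offset by half the ring length in every dimension, forcing
 * traffic across the network bisection.  Illustrative only; the exact
 * pairing used in the study may differ. */
#define N_DIMS 5
#define K      16

static void diagonal_partner(const int src[N_DIMS], int dst[N_DIMS])
{
    for (int d = 0; d < N_DIMS; d++)
        dst[d] = (src[d] + K / 2) % K;   /* farthest point on each ring */
}
```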

Simulation performance in ROSS
Metrics used to determine ROSS performance:
- Simulation runtime (seconds)
- Event rate (events per second)
- Event efficiency (%)
To reduce state-saving overheads, ROSS employs an event rollback mechanism, which determines simulation efficiency; event efficiency is inversely related to the number of events rolled back
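One way to express that relationship, taking efficiency as the fraction of processed events that were committed rather than rolled back, is sketched below; ROSS's reported efficiency metric may be computed somewhat differently.

```c
/* Illustrative event-efficiency calculation: the fraction of processed
 * events that were committed rather than rolled back.  ROSS's reported
 * efficiency metric may be defined slightly differently. */
static double event_efficiency(long long net_events, long long rolled_back)
{
    long long processed = net_events + rolled_back;
    if (processed == 0)
        return 100.0;
    return 100.0 * (double)net_events / (double)processed;
}
```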

Simulation performance: Million-node torus model
- Maximum simulation runtime does not exceed 70 s for 1.3M simulated torus nodes; the ROSS event rate reaches 750 million events per second
- Bigger picture: how far are we from real-time simulations?

Figures: performance of the ROSS torus model on the Mira Blue Gene/Q with 65,536 MPI tasks, for nearest-neighbor traffic and diagonal pairing traffic

Conclusion
- Applied parallel discrete-event simulation to model a high-fidelity torus interconnect for CODES at extreme scale
- Validated the torus network model against the Blue Gene architecture
- Used relevant HPC traffic patterns to explore the behavior of various extreme-scale torus configurations
- Found that large-scale simulations are critical in the design of exascale systems
- Large-scale torus network topologies can be simulated in a reasonable amount of time on today's HPC systems
- Work is in progress to replay scientific application workloads on CODES network models

Future work
- Model collective communication algorithms on the dragonfly and torus networks
- Carry out a ROSS simulation performance study for the modeling of collective algorithms
- Perform experiments on the network models with network traces from Design Forward, PHASTA, and ROSS PHOLD models

Acknowledgements & Developer Access
We gratefully acknowledge the support of this work by the U.S. Department of Energy (DOE).
ROSS download:
CODES download: