
Performance analysis of a Pose application -- BigNetSim Nilesh Choudhury.




1 Performance analysis of a Pose application -- BigNetSim Nilesh Choudhury

2 BigNetSim ● A parallel simulator for performance prediction of parallel machines ● Two components: – Processor performance modelling – Interconnection Network modelling ● The two components could be used individually or in synergy. ● Two modes of operation: – Direct Simulation (on-line mode) – Trace-driven simulation (off-line mode)

3 Architecture of BigNetSim

4 Which part is most relevant to Pose Performance? ● The Interconnection Network simulation

5 Network Simulator Modelling ● Very detailed model: – Switch modelled as: ● Collection of Input/Output ports ● Arbitration strategies to serve incoming requests fairly ● Detailed Virtual Channel selection strategies – Input VC and Output VC ● Switch delays and arbitration costs modelled here ● Switch load and contention measures computed and updated to assist adaptive routing strategies and fault-tolerant routing ● Virtual cut-through routing; store-and-forward routing ● Number of posers per switch = # ports!

6 Network Simulator Modelling – Network Interface Card (NIC) ● 'Send NIC' packetizes and sends a message at the CPU's request ● 'Recv NIC' unpacks, reassembles and delivers a message to the CPU on receiving incoming packets ● Network card send and receive latencies modelled here ● Number of posers per NIC = 2 – Channel ● Doesn't need to be very sophisticated ● Models a simple channel delay: receives a packet from a switch/NIC and delivers it to the corresponding switch/NIC it is connected to ● Number of posers per channel = 1
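The channel model described above can be sketched as a tiny discrete-event component: it accepts a packet at some time and schedules its delivery a fixed delay later. This is an illustrative sketch only; `Channel` and its fields are hypothetical names, not BigNetSim's actual classes.

```python
import heapq

class Channel:
    """Minimal channel poser sketch: fixed delay, one destination.
    Hypothetical stand-in for the channel model described above."""
    def __init__(self, delay, dest):
        self.delay = delay
        self.dest = dest  # name of the connected switch/NIC

    def recv(self, now, packet):
        # Deliver the packet to the connected unit after `delay` time units.
        return (now + self.delay, self.dest, packet)

# Two packets entering a delay-5 channel; the event queue orders deliveries.
events = []
ch = Channel(delay=5, dest="switch_1")
heapq.heappush(events, ch.recv(0, "pkt_a"))
heapq.heappush(events, ch.recv(3, "pkt_b"))
print(heapq.heappop(events))  # (5, 'switch_1', 'pkt_a')
```

One poser per channel suffices precisely because the model is this simple: its state is just the delay and the link it feeds.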

7 Topologies and Routing Algorithms ● Topology and Routing strategies provide functions which the network uses ● Extremely modular design ● Write your own routing strategies ● Write your own topology ● We have some available: – KaryNcube; KaryNmesh; KaryNtree; Nmesh; fattree; hypercube and some hybrid variations

8 Routing Algorithms ● Minimal deadlock-free, non-minimal and fault-tolerant variations – K-ary-N-mesh / N-mesh ● Direction Ordered ● Planar Routing ● Static Direction Reversal Routing ● Optimally Fully Adaptive Routing (modified too) – K-ary-N-tree ● UpDown (modified, non-minimal) – HyperCube ● Hamming ● P-Cube (modified too)
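For the mesh family, direction-ordered routing is the simplest of the strategies listed: correct one dimension at a time in a fixed order, which keeps routes minimal and deadlock-free on a mesh. A minimal sketch, not BigNetSim's implementation:

```python
def dimension_ordered_route(src, dst):
    """Dimension-ordered (e.g. X-then-Y) routing on an N-mesh.
    src/dst are coordinate tuples; returns the list of hops visited.
    Illustrative sketch only, assuming no faults or torus wraparound."""
    hops = []
    cur = list(src)
    for d in range(len(src)):       # fixed dimension order avoids cycles
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step
            hops.append(tuple(cur))
    return hops

# Route from (0, 0) to (2, 1) on a 2-D mesh: X first, then Y.
print(dimension_ordered_route((0, 0), (2, 1)))
# [(1, 0), (2, 0), (2, 1)]
```

The adaptive and fault-tolerant variations listed above relax the fixed dimension order, which is why they need the switch-load and contention measures the simulator tracks.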

9 Input/Output VC selection ● Input Virtual Channel Selection – RoundRobin; – Shortest Length Queue – Output Buffer length ● Output Virtual Channel Selection – Max. available buffer length – Max. available buffer bubble VC – Output Buffer length

10 Building up a machine ● Involves selecting the processor capabilities ● Selecting the Interconnection network – Available set of topologies, routing algorithms, virtual channel selection strategies – Easy to build an interconnection network closely modelling the target machine – All these modules are easily extendable to create and plug in new topology, routing algorithm, etc ● Some preconfigured machines include: – Bluegene; RedStorm; lemieux; etc – Generalized hypercube, fat-tree, tori and mesh architectures

11 Hardware support for Collectives ● You could also model a network with hardware collectives for multicast, reduction and broadcast ● Collective Manager is interfaced with the basic network units ● You need to define the collective manager operations for the corresponding topology ● Already available for: – Hypercube; fattree; densegraph and hybrid variations

12 Network configuration Parameters ● In addition to the routing algorithm, topology, virtual channel selection, switch size (number of ports) and number of virtual channels per port, you can configure: – Size of network – Channel bandwidth; Switch bandwidth – NIC send/recv packet latencies – NIC packetization costs – Switch buffer size – Size of a single packet – Delays in various components – DMA delay; Processor send overhead; etc
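The parameters above could be gathered into one configuration record per machine. The sketch below is purely illustrative: every key name and value is a hypothetical stand-in, not BigNetSim's actual configuration format.

```python
# Hypothetical network configuration record; field names and values
# are illustrative only, not BigNetSim's real parameter names.
network_config = {
    "topology": "KaryNcube",
    "routing": "direction_ordered",
    "vc_selection": "round_robin",
    "num_nodes": 2048,
    "ports_per_switch": 6,          # switch size
    "vcs_per_port": 2,
    "channel_bandwidth_MBps": 175,
    "switch_buffer_bytes": 4096,
    "packet_size_bytes": 256,
    "nic_send_latency_us": 1.5,     # NIC send/recv packet latencies
    "nic_recv_latency_us": 1.5,
}

# A sanity check one might run: packets must fit in a switch buffer.
print(network_config["packet_size_bytes"]
      <= network_config["switch_buffer_bytes"])  # True
```

Keeping all of these in one place is what makes it easy to dial a generic topology into a close model of a specific target machine.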

13 How does this modelling translate into the POSE framework? ● If we model the following machine: – 'n' nodes – 's' switches – 'p' ports per switch ● There are: – 4*n + 2p*s posers – A processor, co-processor, send-NIC and recv-NIC per node – 'p' ports and 'p' channels per switch

14 An example ● Suppose we model a 2048 node bluegene network connected as a 3D torus: – n=2048; – s=2048; – p=6; – Total number of posers = 4*2048+2*6*2048 = 16*2048 = 32,768 posers. – Ample virtualization to run this simulation on 100 processors.
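The slide's poser count follows directly from the per-component breakdown; a quick sketch of the arithmetic:

```python
def total_posers(n, s, p):
    """Posers for a simulated machine: 4 per node (processor,
    co-processor, send-NIC, recv-NIC) plus 2*p per switch
    (p port posers + p channel posers)."""
    return 4 * n + 2 * p * s

# 2048-node Bluegene-like 3D torus: n = s = 2048, p = 6 ports per switch.
print(total_posers(2048, 2048, 6))  # 32768
```

With 32,768 posers across 100 processors, each processor hosts over 300 posers, which is the "ample virtualization" the slide refers to.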

15 Factors related to Performance ● Number of GVT synchronizations: – Gives insight into the amount of parallelism within the threshold controlled by the simulation ● A large number of syncs suggests little work within allowable limits ● Phase time – real time elapsed between consecutive GVT synchronizations – Indicates the amount of parallelism ● Rollback fraction – Proportion of time spent undoing speculative work – A high fraction implies too many strict dependencies in the simulation

16 contd... ● Communication fraction: – Fraction of total time spent communicating ● Simulation dependencies: – Posers should be distributed across processors so as to minimize dependencies ● Simulation strategy to use: – Optimistic; Adept; etc – Controls the amount of throttling – the speculative window ● Speedup over sequential simulation: – Sequential simulation is fast as it avoids all synchronization, provided the problem fits in memory
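The indicators on these two slides are simple ratios over wall-clock time. A hypothetical sketch of how they might be derived from raw measurements (all names and values here are illustrative, not POSE's instrumentation API):

```python
def pdes_metrics(total_time, comm_time, rollback_time, num_gvt_syncs):
    """Derive the performance indicators discussed above from raw
    timings (seconds). Inputs are hypothetical measurements."""
    return {
        "communication_fraction": comm_time / total_time,
        "rollback_fraction": rollback_time / total_time,
        # Mean real time between consecutive GVT synchronizations.
        "phase_time": total_time / num_gvt_syncs,
    }

m = pdes_metrics(total_time=100.0, comm_time=20.0,
                 rollback_time=5.0, num_gvt_syncs=20000)
print(m["rollback_fraction"])  # 0.05
print(m["phase_time"])         # 0.005 -> 5 ms phases, i.e. very small
```

Read together: a small phase time plus many GVT syncs means the simulation is paying synchronization cost frequently for little useful work per phase, which is exactly the pattern diagnosed in the case study that follows.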

17 DetailedSim – performance case study ● DetailedSim (with switch modelled as a single poser) – running on 16 processors – simulating a 2048 node hypercube network – random traffic generated at each processing node ● Speculation still within reasonable limits (<20%) ● Phase time very small (<5ms)

18 contd... ● Poor real speedup ● Breakeven with sequential at 12 procs ● Increasing number of processors worsened the problem – Synchronization more expensive ● Did not scale

19 Identify the problem ● Large switch poser – Trying to do a lot of activities – Hence had a very complex state – Handles a disproportionately large number of events – Faces a large number of rollbacks – Leading to frequent synchronizations – Not allowing the GVT to advance – Large state size made each checkpoint expensive – Large number of events meant frequently checkpointing its state

20 The Solution ● Decompose switch into fine-grained posers ● Ports are logical parallel entities in a switch ● Refactor the switch into a number of port posers ● Smaller state; infrequent events ● Meticulously refactor, so as not to increase the number of events ● Output-buffered switches were refactored ● Input-buffered switches need a complex arbitration mechanism involving a central switch state

21 Improved Results ● Phase time up ● # GVT iterations down ● Rollback fraction ok ● Simulation time halved ● We still had a problem: – Could not scale!! ● Expedited GVT calculation – First idle processor triggers a GVT calculation, and everyone gets an updated GVT without waiting for the phase to finish – GVT computation gets highest priority if any processor is idle

22 Load Imbalances ● Transient load imbalance went down ● # GVT computations up ● Improved scaling ● But, a small cyclic imbalance remained – Application-specific dependencies – Distribute posers to minimize simulation dependencies – Partition the input problem randomly

23 Communication load ● An important consideration for fine-grained simulation is communication ● Partition along the min-cut of the application communication graph – decreases communication – might increase inherent application dependencies among the partitions

24 Performance results ● Hypercube networks ● Run on Turing ● Reached over 2.5 million events/sec on 128 processors

25 Communication Challenges ● An 8192-node hypercube network across 128 procs – Fits in memory comfortably – Communication – 50MB/s per processor – Small messages (msg size ~250 bytes) – Myrinet just about handles this ● A step further: – 16384-node hypercube on 128 procs ● Still fits in memory ● Myrinet starts dropping packets at an alarming rate ● NIC freezes ● Runs out of execution time
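A back-of-the-envelope check shows why this traffic pattern stresses the interconnect: the bandwidth and message-size figures from the slide imply a very high small-message rate per processor (per-message protocol overhead is ignored in this sketch).

```python
def msgs_per_sec(bandwidth_bytes_per_s, msg_size_bytes):
    """Message rate a processor's NIC must sustain at a given byte rate.
    Rough estimate; ignores per-message protocol overhead."""
    return bandwidth_bytes_per_s / msg_size_bytes

# 50 MB/s per processor with ~250-byte messages:
rate = msgs_per_sec(50e6, 250)
print(rate)  # 200000.0 messages/sec per processor
```

Roughly 200,000 small messages per second per processor is near what a Myrinet NIC of that era could sustain, so doubling the network size pushes it past the edge, consistent with the dropped packets and NIC freezes reported above.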

26 Conclusion ● Virtualization and fine decomposition coupled with adaptive synchronization strategies help to address the challenges of large-scale fine-grained PDES ● Excellent problem-size and self-scaling ● Careful decomposition of complex objects is required ● Modelling posers correctly is essential for the simulation to have good performance and scale

27 Download charm / POSE ● Charm++ / POSE / BigNetSim are all freely downloadable at http://charm.cs.uiuc.edu/ ● For more information on the research projects: http://charm.cs.uiuc.edu/research/ ● POSE: http://charm.cs.uiuc.edu/research/pose ● BigNetSim: http://charm.cs.uiuc.edu/research/BigNetSim

