Published by Aubrie Murphy. Modified over 9 years ago.
1
Host Side Dynamic Reconfiguration with InfiniBand™ By Wei Lin Guay*, Sven-Arne Reinemo*, Olav Lysne*, Tor Skeie*, Bjørn Dag Johnsen^ and Line Holen^ (*Simula Research Laboratory, ^Sun Microsystems)
2
Introduction The quest for ever-increasing computing power drives state-of-the-art large-scale clusters. In the Top500 list, more than 20 sites run supercomputers with more than 10k processors. Increasing cluster size challenges the reliability of interconnects such as InfiniBand.
3
Introduction What are the available fault tolerance mechanisms? Checkpoint/restart: the application is halted and restarted from the last checkpoint. Disadvantage: not application transparent. Deadlock-free re-routing: application transparent. Disadvantage: inflexible. Network dynamic reconfiguration is the trend!
4
Network Dynamic Reconfiguration Network dynamic reconfiguration: move from one routing function to another while the system is up and running. Application transparent. More flexible. Challenges of network dynamic reconfiguration: deadlock freedom must be preserved in the transition phase. It assumes that the network interface attributes have not been changed.
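Why the transition phase is the hard part can be illustrated with a small sketch. It uses the channel dependency graph criterion that is commonly applied to wormhole-style routing (a routing function is deadlock-free if the graph of "channel A is followed by channel B" dependencies has no cycle). The criterion, the four-switch ring, and the function names are assumptions made for illustration and are not taken from the slides.

```python
from collections import defaultdict

def channel_dependencies(routes):
    """Map each channel (u, v) to the channels that some route uses right after it."""
    deps = defaultdict(set)
    for path in routes:
        channels = list(zip(path, path[1:]))          # directed links along the path
        for first, second in zip(channels, channels[1:]):
            deps[first].add(second)
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the channel dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(channel):
        colour[channel] = GREY
        for nxt in deps.get(channel, ()):
            if colour[nxt] == GREY:
                return True                           # back edge: cycle found
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[channel] = BLACK
        return False

    return any(colour[c] == WHITE and visit(c) for c in list(deps))

# Hypothetical ring of four switches 0 -> 1 -> 2 -> 3 -> 0.
old_routes = [[0, 1, 2], [1, 2, 3]]
new_routes = [[2, 3, 0], [3, 0, 1]]

old_ok = not has_cycle(channel_dependencies(old_routes))
new_ok = not has_cycle(channel_dependencies(new_routes))
mixed_cycle = has_cycle(channel_dependencies(old_routes + new_routes))
```

In this toy case each routing function is acyclic on its own, but packets routed by the old and the new function at the same time close a dependency cycle around the ring, which is the situation a reconfiguration scheme has to avoid.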
5
Host Side Dynamic Reconfiguration Host side dynamic reconfiguration: migrate the attributes of the connection (Queue Pair) from the old routing structure to the new one. Uses: fault tolerance mechanism; live migration; policy changes (cluster maintenance). Challenges of host side dynamic reconfiguration: Which component should trigger the change of routing path when a fault happens? Should alternative paths be set up in advance? Should the network manager be responsible for finding the new path?
6
Challenges of Dynamic Reconfiguration A Reliable Connection (RC) is established between A and B.
7
Challenges of Dynamic Reconfiguration A Reliable Connection (RC) is established between A and B. During the transmission, a link fails!
8
Challenges of Dynamic Reconfiguration A Reliable Connection (RC) is established between A and B. During the transmission, a link fails! The Subnet Manager (SM) regenerates a deadlock-free routing table.
9
Challenges of Dynamic Reconfiguration A Reliable Connection (RC) is established between A and B. During the transmission, a link fails! The Subnet Manager (SM) regenerates a deadlock-free routing table. Predefining a deadlock-free shortest path for every possible path is very difficult!
10
Host Side Dynamic Reconfiguration
17
Host Reconfiguration Keep track of the active QPs created in each host stack.
18
Host Reconfiguration Keep track of the active QPs created in each host stack. Modify the context of a QP that is in the RTS state. Options: Reset Queue Pair.
19
Host Reconfiguration Keep track of the active QPs created in each host stack. Modify the context of a QP that is in the RTS state. Options: Reset Queue Pair; Send Queue Drain (SQD).
20
Host Reconfiguration Keep track of the active QPs created in each host stack. Modify the context of a QP that is in the RTS state. Options: Reset Queue Pair; Send Queue Drain (SQD); Automatic Path Migration (APM).
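The Send Queue Drain option can be sketched as a small model: the host keeps a registry of active QPs, drains the send queue of each affected QP, rewrites its path attribute, and returns it to RTS. This is a simplified simulation, not the authors' implementation and not the verbs API. The class `ModelQP`, the function `reconfigure`, the single `dlid` path attribute, and the rule that a path change is only accepted in the SQD state are all assumptions of the model.

```python
from dataclasses import dataclass
from enum import Enum

class QPState(Enum):
    RTS = "ready to send"
    SQD = "send queue drained"

@dataclass
class ModelQP:
    """Hypothetical stand-in for a Queue Pair; only the fields the sketch needs."""
    qp_num: int
    dlid: int                      # destination LID of the current path
    state: QPState = QPState.RTS
    outstanding: int = 0           # send work requests still in flight

    def drain(self):
        # In this model the drain completes at once; real hardware signals completion later.
        self.outstanding = 0
        self.state = QPState.SQD

    def set_path(self, new_dlid):
        if self.state is not QPState.SQD:
            raise RuntimeError("model only allows a path change while the send queue is drained")
        self.dlid = new_dlid

    def resume(self):
        self.state = QPState.RTS

def reconfigure(active_qps, new_dlid_for):
    """Move every tracked QP whose destination LID changed onto its new path."""
    moved = []
    for qp in active_qps.values():
        if qp.dlid in new_dlid_for:
            qp.drain()
            qp.set_path(new_dlid_for[qp.dlid])
            qp.resume()
            moved.append(qp.qp_num)
    return moved

# Registry of active QPs on one host (values are made up for the example).
registry = {
    17: ModelQP(qp_num=17, dlid=5, outstanding=3),
    18: ModelQP(qp_num=18, dlid=7),
}
moved = reconfigure(registry, {5: 9})   # destination LID 5 is now reached via LID 9
```

After the call, QP 17 is back in RTS with the new destination LID, while QP 18 is untouched because its path was not affected.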
21
Performance Evaluation Synthetic traffic patterns: 6-3:5-2:4-1:3-6:2-5:1-4. Application traffic patterns: HPCC b_eff.
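The synthetic pattern string can be unpacked with a few lines of code. The sketch below assumes that each `a-b` item means "node a sends to node b" and that items are separated by colons; the slide does not state this, so the reading and the helper name `parse_pattern` are assumptions.

```python
def parse_pattern(pattern):
    """Split 'a-b:c-d:...' into a list of (source, destination) integer pairs."""
    pairs = []
    for item in pattern.split(":"):
        src, dst = item.split("-")
        pairs.append((int(src), int(dst)))
    return pairs

pairs = parse_pattern("6-3:5-2:4-1:3-6:2-5:1-4")
sources = sorted(src for src, _ in pairs)
destinations = sorted(dst for _, dst in pairs)

# Under the assumed reading, every node sends exactly once and receives exactly once.
is_permutation = sources == destinations == [1, 2, 3, 4, 5, 6]
```

Read this way, the pattern is a permutation in which nodes 1 to 3 exchange traffic with nodes 4 to 6 in both directions.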
22
Performance Evaluation Micro benchmark − Setup Phase: No additional overhead!
23
Performance Evaluation Synthetic traffic patterns
24
Performance Evaluation HPCC b_eff without dynamic reconfiguration: the benchmark does not complete once the first fault happens. A deadlock occurs!
25
Conclusion Novel fault tolerance mechanism: feedback from the SM; application transparent. Future work: evaluation of scalability; event notification; live migration in virtualized environments.
26
Thanks!