Host Side Dynamic Reconfiguration with InfiniBand™ By Wei Lin Guay*, Sven-Arne Reinemo*, Olav Lysne*, Tor Skeie*, Bjørn Dag Johnsen^ and Line Holen^

Host Side Dynamic Reconfiguration with InfiniBand™
By Wei Lin Guay*, Sven-Arne Reinemo*, Olav Lysne*, Tor Skeie*, Bjørn Dag Johnsen^ and Line Holen^
*Simula Research Laboratory
^Sun Microsystems

Introduction
- The quest for ever-increasing computing power drives state-of-the-art large-scale clusters.
- In the Top500 list, more than 20 sites run supercomputers with more than 10,000 processors.
- The growing cluster size challenges the reliability of interconnects such as InfiniBand.

Introduction
What are the available fault tolerance mechanisms?
- Checkpoint/restart: the application is halted and restarted from the last checkpoint. Disadvantage: not application transparent.
- Deadlock-free re-routing: application transparent. Disadvantage: inflexible.
- Network dynamic reconfiguration is the trend!

Network Dynamic Reconfiguration
Network dynamic reconfiguration:
- Moves from one routing function to another while the system is up and running.
- Application transparent.
- More flexible.
Challenges of network dynamic reconfiguration:
- Deadlock freedom during the transition phase.
- Assumes that the network interface attributes have not changed.
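The deadlock-freedom challenge can be illustrated with a channel dependency graph (CDG). The sketch below is a simplified illustration, not the method presented in this talk: it assumes each routing function is described by a list of channel-to-channel dependencies and checks whether the old and new dependencies together form a cycle. The function name `has_cycle` and the channel labels are hypothetical.

```python
from collections import defaultdict


def has_cycle(dependencies):
    """Return True if the directed graph given as (src, dst) edges has a cycle."""
    graph = defaultdict(list)
    for src, dst in dependencies:
        graph[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current DFS path, finished
    colour = defaultdict(int)         # every node starts WHITE

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:   # back edge: a cycle exists
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    # Iterate over a copy, because visit() may add leaf nodes to the graph.
    return any(colour[n] == WHITE and visit(n) for n in list(graph))


# Hypothetical dependencies: each routing function is acyclic on its own.
old_deps = [("c1", "c2"), ("c2", "c3")]
new_deps = [("c3", "c4"), ("c4", "c1")]

print(has_cycle(old_deps))             # old routing alone
print(has_cycle(new_deps))             # new routing alone
print(has_cycle(old_deps + new_deps))  # both active during the transition
```

A cycle in the combined graph signals a potential deadlock while both routing functions are in use. An acyclic union does not by itself prove safety, since packets routed partly by the old function and partly by the new one can introduce further dependencies.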

Host Side Dynamic Reconfiguration
Host side dynamic reconfiguration:
- Migrates the attributes of a connection (Queue Pair) from the old routing structure to the new one.
- Fault tolerance mechanism.
- Live migration.
- Policy changes (cluster maintenance).
Challenges of host side dynamic reconfiguration:
- Which component triggers the change of routing path when a fault occurs?
- Should alternative paths be set up in advance?
- Should the network manager be responsible for finding the new path?

Challenges of Dynamic Reconfiguration
- An RC (Reliable Connection) is established between A and B.
- During the transmission, a link fails!
- The SM (Subnet Manager) regenerates a deadlock-free routing table.
- Predefining a deadlock-free shortest alternative for every path is very difficult!

Host Side Dynamic Reconfiguration
[Figure: diagram shown in three numbered steps (1, 2, 3); not reproduced in the transcript.]

Host Reconfiguration
- Keep track of the active QPs created in each host stack.
- Modify the QP's context in the RTS state:
  - Reset Queue Pair
  - Send Queue Drain (SQD)
  - Automatic Path Migration (APM)
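The bookkeeping described above can be sketched as a toy model. This is not libibverbs code and not the authors' implementation: the classes `ModelQP` and `HostStack`, the string path labels, and the `repath` method are all hypothetical. The model only tracks the standard QP states (RESET, INIT, RTR, RTS, SQD) and shows two of the listed options, reset and send queue drain; APM is not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class QPState(Enum):
    RESET = "RESET"
    INIT = "INIT"
    RTR = "RTR"   # Ready To Receive
    RTS = "RTS"   # Ready To Send
    SQD = "SQD"   # Send Queue Drain


# Forward transitions modelled here; a move to RESET is allowed from any state.
ALLOWED = {
    (QPState.RESET, QPState.INIT),
    (QPState.INIT, QPState.RTR),
    (QPState.RTR, QPState.RTS),
    (QPState.RTS, QPState.SQD),
    (QPState.SQD, QPState.RTS),
}


@dataclass
class ModelQP:
    qp_num: int
    path: str                       # hypothetical label standing in for path attributes
    state: QPState = QPState.RESET

    def move(self, new_state):
        if new_state is not QPState.RESET and (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def bring_up(self):
        for step in (QPState.INIT, QPState.RTR, QPState.RTS):
            self.move(step)


class HostStack:
    """Keeps track of the active QPs created on one host."""

    def __init__(self):
        self.active = {}

    def create_qp(self, qp_num, path):
        qp = ModelQP(qp_num, path)
        qp.bring_up()
        self.active[qp_num] = qp
        return qp

    def repath(self, old_path, new_path, method="sqd"):
        """Move every RTS QP that uses old_path onto new_path; return their numbers."""
        moved = []
        for qp in self.active.values():
            if qp.path != old_path or qp.state is not QPState.RTS:
                continue
            if method == "sqd":
                qp.move(QPState.SQD)    # drain the send queue first
                qp.path = new_path
                qp.move(QPState.RTS)    # resume sending on the new path
            elif method == "reset":
                qp.move(QPState.RESET)  # model only: outstanding work is not tracked
                qp.path = new_path
                qp.bring_up()
            else:
                raise ValueError(f"unknown method {method!r}")
            moved.append(qp.qp_num)
        return moved


stack = HostStack()
stack.create_qp(1, "path-old")
stack.create_qp(2, "path-other")
print(stack.repath("path-old", "path-new", method="sqd"))
```

The design choice the model highlights is that the SQD route keeps the QP alive and only pauses sending, while the reset route tears the QP down and rebuilds it through INIT, RTR, and RTS.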

Performance Evaluation
Synthetic traffic patterns:
- 6-3:5-2:4-1:3-6:2-5:1-4
Application traffic patterns:
- HPCC b_eff
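The synthetic pattern string can be unpacked with a few lines of code. The sketch below assumes, without confirmation from the slides, that each `a-b` item denotes a flow from node `a` to node `b` and that items are separated by colons; the function name `parse_pattern` is hypothetical.

```python
def parse_pattern(pattern):
    """Split a pattern such as '6-3:5-2' into (source, destination) integer pairs."""
    pairs = []
    for item in pattern.split(":"):
        src, dst = item.split("-")
        pairs.append((int(src), int(dst)))
    return pairs


pairs = parse_pattern("6-3:5-2:4-1:3-6:2-5:1-4")
sources = sorted(src for src, _ in pairs)
destinations = sorted(dst for _, dst in pairs)

print(pairs)
print(sources == destinations == [1, 2, 3, 4, 5, 6])
```

Under that reading, each of the six nodes sends to exactly one peer and receives from exactly one peer.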

Performance Evaluation
Microbenchmark, setup phase: no additional overhead!

Performance Evaluation
Synthetic traffic patterns

Performance Evaluation
HPCC b_eff without dynamic reconfiguration:
- The benchmark does not complete once the first fault occurs.
- Deadlock occurs!

Conclusion
Novel fault tolerance mechanism:
- Feedback from the SM.
- Application transparent.
Evaluation of scalability:
- Event notification.
Live migration for virtualization.
Future Work

Thanks!