DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers
Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, Songwu Lu. SIGCOMM 2008. Presented by Ye Tian for course CS05112.

Overview DCN motivation DCell Network Structure Routing in DCell Simulation Results Implementation and Experiments Review

Data Center Networking (DCN) Increasing scale: Google had 450,000 servers in 2006; Microsoft doubles its number of servers every 14 months; the expansion rate exceeds Moore's Law. Two motivations: first, data centers are growing large, and the number of servers is increasing at an exponential rate; second, many infrastructure services demand high bandwidth, due to operations such as file replication in GFS and all-to-all communication in MapReduce. Network bandwidth is therefore often a scarce resource.

Existing Tree Structure The existing tree structure does not scale. First, the servers are typically in a single layer-2 broadcast domain (the network is broadcast in nature). Second, the core switches, as well as the rack switches, are bandwidth bottlenecks. The tree structure is also vulnerable to single points of failure.

Three Design Goals Scaling: the structure must physically interconnect hundreds of thousands or even millions of servers at small cost, and it has to enable incremental expansion, adding more servers to an already operational structure. Fault tolerance: there are various server, link, switch, and rack failures due to hardware, software, and power outage problems; fault tolerance in DCN requires both redundancy in physical connectivity and robust mechanisms in protocol design.

Three Design Goals High network capacity: Distributed file systems: when a server disk fails, re-replication is performed; file replication and re-replication are two representative, bandwidth-demanding one-to-many and many-to-one operations. MapReduce: a Reduce worker needs to fetch intermediate files from many servers; the traffic generated by the Reduce workers forms an all-to-all communication pattern.

Overview DCN motivation DCell Network Structure Routing in DCell Simulation Results Implementation and Experiments Review

DCell Physical Structure DCell_0 is the building block used to construct larger DCells. It has n servers and a mini-switch; all servers in a DCell_0 are connected to the mini-switch. n is a small integer (say, n=4). A DCell_1 has n+1 = 5 DCell_0s. DCell_1 connects the 5 DCell_0s as follows: assign each server a 2-tuple [a_1, a_0], where a_1 and a_0 are the level-1 and level-0 IDs, respectively. Then the two servers with 2-tuples [i, j-1] and [j, i] are connected with a link for every i and every j > i.
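The wiring rule above can be made concrete with a small sketch (illustrative Python of my own, not the paper's code), enumerating the level-1 links of a DCell_1 with n=4:

```python
# Level-1 links of a DCell_1 with n=4 servers per DCell_0.
# Rule: servers [i, j-1] and [j, i] are linked for every i and every j > i.
n = 4
g1 = n + 1  # number of DCell_0s in the DCell_1
links = [((i, j - 1), (j, i)) for i in range(g1) for j in range(g1) if j > i]
print(len(links))  # 10 links: one per pair of DCell_0s, so every pair is connected
```

Note that each of the 20 servers appears in exactly one link, so the single spare port per server is enough to fully interconnect all five DCell_0s.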

DCell Physical Structure (figure: a DCell_0 with n=2, k=0; n servers on one mini-switch)

DCell Physical Structure (figure: a DCell_1 with n=2, k=1)

DCell Physical Structure (figure: a DCell_2 with n=2, k=2)

DCell Physical Structure

For building a DCell_k: if we have built DCell_{k-1} and each DCell_{k-1} has t_{k-1} servers, then we can create a maximum of t_{k-1} + 1 DCell_{k-1}s. The number of DCell_{k-1}s in a DCell_k, g_k, and the total number of servers in a DCell_k (i.e., t_k) are: g_k = t_{k-1} + 1; t_k = g_k * t_{k-1}.
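The recursion g_k = t_{k-1} + 1, t_k = g_k * t_{k-1} is easy to iterate; a minimal sketch (function name is mine) reproduces the server counts quoted on the next slides:

```python
def dcell_sizes(n, k):
    """Return (g, t): g[l] = number of DCell_{l-1}s in a DCell_l,
    t[l] = number of servers in a DCell_l, for levels 0..k."""
    t = [n]  # t_0 = n servers in a DCell_0
    g = [1]  # a DCell_0 is a single building block
    for _ in range(k):
        g.append(t[-1] + 1)      # g_k = t_{k-1} + 1
        t.append(g[-1] * t[-1])  # t_k = g_k * t_{k-1}
    return g, t

g, t = dcell_sizes(8, 3)
print(t[3])  # 27630792 servers with n=8, k=3: doubly exponential growth
```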

DCell Physical Structure Each server in a DCell_k is assigned a (k+1)-tuple [a_k, a_{k-1}, ..., a_1, a_0], where a_i < g_i. We further denote [a_k, a_{k-1}, ..., a_{i+1}] (i > 0) as the prefix, indicating the DCell_i this node belongs to. Each server can be equivalently identified by a unique ID uid_k, taking a value from [0, t_k). A server in a DCell_k is denoted as [a_k, uid_{k-1}], where a_k identifies the DCell_{k-1} this server belongs to, and uid_{k-1} is the unique ID of the server inside this DCell_{k-1}.
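The two representations are interchangeable via uid_k = a_k * t_{k-1} + uid_{k-1}. A small sketch of both directions (my own helper names, not from the paper):

```python
def server_counts(n, k):
    """t[i] = number of servers in a DCell_i: t_0 = n, t_i = (t_{i-1}+1)*t_{i-1}."""
    t = [n]
    for _ in range(k):
        t.append((t[-1] + 1) * t[-1])
    return t

def tuple_to_uid(tup, n):
    """[a_k, ..., a_1, a_0] -> uid_k, applying uid_i = a_i * t_{i-1} + uid_{i-1}."""
    k = len(tup) - 1
    t = server_counts(n, k)
    uid = tup[-1]  # uid_0 = a_0
    for i in range(1, k + 1):
        uid = tup[k - i] * t[i - 1] + uid
    return uid

def uid_to_tuple(uid, n, k):
    """Inverse mapping: peel off one DCell index per level, highest level first."""
    t = server_counts(n, k)
    tup = []
    for i in range(k, 0, -1):
        tup.append(uid // t[i - 1])
        uid %= t[i - 1]
    tup.append(uid)
    return tup

print(tuple_to_uid([2, 1], 2))  # server [2, 1] in a DCell_1 with n=2 -> uid 5
```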

Build DCells

Part I checks whether it constructs a DCell_0. Part II recursively constructs g_l DCell_{l-1}s. Part III interconnects these DCell_{l-1}s, where any two DCell_{l-1}s are connected with one link: connect [pref, i, j-1] and [pref, j, i]. Each server in a DCell_k network has k+1 links. The first link, called a level-0 link, connects to the switch that interconnects its DCell_0. The second link, a level-1 link, connects to a node in the same DCell_1 but in a different DCell_0. Similarly, the level-i link connects to a different DCell_{i-1} within the same DCell_i.
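The three parts map onto a short recursive sketch (illustrative Python of my own, working on global server uids rather than tuples; the level-0 server-to-switch links of Part I are omitted):

```python
def build_links(n, k):
    """Enumerate the server-to-server links of a DCell_k as pairs of global uids.
    Part I (the DCell_0 mini-switch) contributes no server-to-server links."""
    t = [n]
    for _ in range(k):
        t.append((t[-1] + 1) * t[-1])  # t_l = (t_{l-1} + 1) * t_{l-1}

    def rec(base, l, links):
        if l > 0:
            g_l = t[l - 1] + 1
            for i in range(g_l):        # Part II: recursively build g_l DCell_{l-1}s
                rec(base + i * t[l - 1], l - 1, links)
            for i in range(g_l):        # Part III: connect [pref, i, j-1] and [pref, j, i]
                for j in range(i + 1, g_l):
                    links.append((base + i * t[l - 1] + (j - 1),
                                  base + j * t[l - 1] + i))
        return links

    return rec(0, k, [])

print(len(build_links(2, 2)))  # 42 links; each of the 42 servers has k = 2 such links
```

For n=2, k=1 this yields exactly the three level-1 links of the earlier figure: [0,0]-[1,0], [0,1]-[2,0], and [1,1]-[2,1].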

Properties of DCell Scalability: the number of servers scales doubly exponentially in k. With 8 servers in a DCell_0 (n=8) and 4 server ports (i.e., k=3), N = 27,630,792. Fault tolerance: the bisection width is larger than t_k / (4 log_n t_k). Bisection width denotes the minimal number of links to be removed to partition a network into two parts of equal size.

Overview DCN motivation DCell Network Structure Routing in DCell Simulation Results Implementation and Experiments Review

DCellRouting Consider two nodes src and dst that are in the same DCell_k but in two different DCell_{k-1}s. When computing the path from src to dst in a DCell_k, we first calculate the intermediate link (n1, n2) that interconnects the two DCell_{k-1}s. Routing is then divided into finding the two sub-paths, from src to n1 and from n2 to dst. The final path of DCellRouting is the combination of the two sub-paths and (n1, n2).
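The divide-and-conquer step can be sketched as follows (my own illustrative Python on global uids, not the paper's implementation; a hop between two servers in the same DCell_0 is taken through their mini-switch):

```python
def dcell_route(src, dst, n, k):
    """DCellRouting: list of server uids visited from src to dst in a DCell_k."""
    t = [n]
    for _ in range(k):
        t.append((t[-1] + 1) * t[-1])

    def rec(s, d, l):
        if s == d:
            return [s]
        if l == 0:
            return [s, d]                 # one hop via the shared mini-switch
        ts = t[l - 1]
        i, j = s // ts, d // ts
        if i == j:                        # same DCell_{l-1}: recurse into it
            base = i * ts
            return [base + h for h in rec(s - base, d - base, l - 1)]
        if i < j:                         # intermediate link (n1, n2): [i, j-1]-[j, i]
            n1, n2 = i * ts + (j - 1), j * ts + i
        else:                             # reversed orientation: [j, i-1]-[i, j]
            n1, n2 = i * ts + j, j * ts + (i - 1)
        bi, bj = i * ts, j * ts           # sub-path to n1, the link, sub-path from n2
        left = [bi + h for h in rec(s - bi, n1 - bi, l - 1)]
        right = [bj + h for h in rec(n2 - bj, d - bj, l - 1)]
        return left + right

    return rec(src, dst, k)

print(dcell_route(0, 5, 2, 1))  # [0, 1, 4, 5]: [0,0] -> [0,1] -> [2,0] -> [2,1]
```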

DCellRouting (figure: src reaches n1 inside its DCell_{k-1}, crosses the intermediate link (n1, n2), then n2 reaches dst inside the other DCell_{k-1})

Routing Properties Path length: the maximum path length in DCellRouting is at most 2^(k+1) - 1. Network diameter: the maximum path length using DCellRouting in a DCell_k is at most 2^(k+1) - 1. But: 1. DCellRouting is NOT a shortest-path routing. 2. 2^(k+1) - 1 is NOT a tight diameter bound for DCell. Yet: 1. DCellRouting is close to shortest-path routing. 2. DCellRouting is much simpler: O(k) steps to decide the next hop.

Routing Properties nkNShortest-pathDCellRouting MeanMaxMeanMax , , ,263,

Traffic Distribution in DCellRouting All-to-all communication model: consider an all-to-all communication model for a DCell_k where any two servers have one flow between them. The number of flows carried on a level-i link is less than t_k * 2^(k-i) when using DCellRouting. k is small, so the difference across levels is not large.

Traffic Distribution in DCellRouting One-to-many and many-to-one communication models: given a node src, the other nodes in a DCell_k can be classified into groups based on which link src uses to reach them. The nodes reached by the level-i link of src belong to Group i. When src communicates with m other nodes, it can pick a node from each of Group 0, Group 1, etc. The maximum aggregate bandwidth at src is min(m, k+1).

DCellBroadcast In DCellBroadcast, a sender delivers a broadcast packet to all its k+1 neighbors when broadcasting in a DCell_k. A receiver drops duplicate packets but re-broadcasts a new packet on its other k links. The broadcast scope is limited by encoding a scope value k into each broadcast message; the message is broadcast only within the DCell_k that contains the source node.
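A toy flooding sketch (my own illustration, not the paper's protocol code), run on the n=2, k=1 DCell from the earlier figures, with mini-switch connectivity modeled as direct server-to-server edges:

```python
from collections import deque

def flood(adj, src):
    """Flood a packet: every node seeing the packet for the first time forwards
    it on all of its links; later copies are dropped as duplicates.
    Returns (set of nodes reached, total per-link transmissions)."""
    seen = {src}
    sends = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            sends += 1             # one copy per incident link
            if v not in seen:      # duplicates are dropped at the receiver
                seen.add(v)
                q.append(v)        # new packets are re-broadcast
    return seen, sends

# DCell_1, n=2: DCell_0 pairs (0,1), (2,3), (4,5) share a mini-switch;
# level-1 links are (0,2), (1,4), (3,5). Every server has k+1 = 2 links.
adj = {0: [1, 2], 1: [0, 4], 2: [3, 0], 3: [2, 5], 4: [5, 1], 5: [4, 3]}
reached, sends = flood(adj, 0)
print(sorted(reached), sends)  # all 6 servers reached with 12 transmissions
```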

Fault-tolerant Routing DFR handles three types of failures: server failure, rack failure, and link failure. DFR uses three techniques of local reroute, local link-state, and jump-up to address link failure, server failure, and rack failure, respectively.

Link-state Routing Use link-state routing (with the Dijkstra algorithm) for intra-DCell_b routing, and DCellRouting plus local reroute for inter-DCell_b routing. In a DCell_b, each node uses DCellBroadcast to broadcast the status of all its (k+1) links, periodically or when it detects a link failure. A node thus knows the status of all outgoing and incoming links in its DCell_b.

Local-reroute and Proxy Let nodes src and dst be in the same DCell_k. Compute a path from src to dst using DCellRouting, and assume an intermediate link (n1, n2) has failed. Local-reroute is performed at n1 as follows: first calculate the level of (n1, n2), denoted by l. Then n1 and n2 are known to be in the same DCell_l but in two different DCell_{l-1}s. Since there are g_l DCell_{l-1} subnetworks inside this DCell_l, n1 can always choose another DCell_{l-1}. There must exist a link, denoted (p1, p2), that connects this DCell_{l-1} to the one where n1 resides. Local-reroute then chooses p2 as the proxy and re-routes packets from n1 to the selected proxy p2.

Link-state Routing Using m2 as an example: m2 uses DCellRouting to calculate the route to the destination node dst, and obtains the first link that reaches out of its own DCell_b (i.e., (n1, n2)). m2 then uses intra-DCell_b routing, a local link-state based Dijkstra routing scheme, to decide how to reach n1. Upon detecting that (n1, n2) is broken, m2 invokes local-reroute to choose a proxy: it chooses a link (p1, p2) with the same level as (n1, n2) and sets p2 as the proxy.

Jump-up for Rack Failure Upon receiving the rerouted packet (implying that (n1, n2) has failed), p2 checks whether (q1, q2) has failed or not. If (q1, q2) has also failed, it is a good indication that the whole DCell i2 has failed. p2 then chooses a proxy from a DCell at a higher level (i.e., it jumps up), so the failed DCell i2 can be bypassed. If dst itself is in i2, the packet should be dropped: first, a retry count is added to the packet header; second, each packet has a time-to-live (TTL) field, which is decremented at each intermediate node.

Local-reroute and Proxy (figure: src routes toward dst; when the level-L link (n1, n2) fails, n1 re-routes to proxy p2; when (q1, q2) has also failed, the packet jumps up to a level-(L+1) proxy; servers in a same DCell_b share local link-state)

Overview DCN motivation DCell Network Structure Routing in DCell Simulation Results Implementation and Experiments Review

Simulations Compare DFR with Shortest-Path Routing (SPF), which offers a performance bound. Two DCells are used: n=4, k=3 -> N=176,820 and n=5, k=3 -> N=865,830. Under node failures, DFR performs almost identically to SPF; moreover, DFR gets even closer to SPF as n becomes larger.

Simulations Rack failure: the impact of rack failure on path length is smaller than that of node failure.

Simulations Link failure: the path failure ratio of DFR increases with the link failure ratio. DFR cannot match SPF here since it is not globally optimal; however, when the failure ratio is small, DFR is still very close to SPF.

Overview DCN motivation DCell Network Structure Routing in DCell Simulation Results Implementation and Experiments Review

Protocol Suite Addressing: a 32-bit uid identifies a server. The most significant bit (bit 0) identifies the address type: 0 for server, 1 for multicast. Header: borrows heavily from IP. Neighbor maintenance: two mechanisms are used: each node transmits heartbeat messages over all outbound links periodically, and link-layer medium sensing detects neighbor states.

Layer-2.5 DCN Prototyping On Windows Server 2003, the DCN protocol suite is implemented as a kernel-mode driver, which offers a virtual Ethernet interface to the IP layer and manages several underlying physical Ethernet interfaces. Routing and packet-forwarding operations are handled by the CPU. (figure: layering, top to bottom: APP, TCP/IP, DCN (routing, forwarding, address mapping), Ethernet)

Testbed A testbed of a DCell_1 with 20 server nodes. This DCell_1 is composed of 5 DCell_0s, each of which has 4 servers. Each server installs an Intel PRO/1000 PT Quad Port Ethernet adapter.

Testbed The Ethernet switches used to form the DCell_0s are D-Link 8-port Gigabit switches (DGS-1008D, $50 each).

Experiment Fault tolerance: set up a TCP connection between servers [0,0] and [4,3] in the topology. The link ([0,3], [4,0]) was unplugged at time 34s; the routing path then changed to [0,0], [1,0], [1,3], [4,1], [4,3].

Experiment Throughput: targets a MapReduce scenario. Each server established a TCP connection to each of the remaining 19 servers, and each TCP connection sent 5GB of data. The transmission in DCell completed at 2270 seconds, but lasted 4460 seconds in the tree structure. The 20 one-hop TCP connections using level-1 links had the highest throughput and completed first, at 350s. In the tree approach, the top-level switch is the bottleneck and soon becomes congested.

Experiment

Review What is the physical structure of DCell? DCell properties: scalability and fault tolerance. How does DCell route data flows? How are different types of failures handled?