Approaches to Improve Data Center Performance through Networking - Gurubaran.


Outline: Data Center Architectures; Common Issues; Improving Performance via Multipath TCP; Improving Performance through Multicast Routing; Conclusion

Data Center Architectures The main challenge is how to build a scalable DCN that delivers significant aggregate bandwidth. Examples: BCube, DCell, PortLand, VL2, Helios, and c-Through.

Server Centric Data Centers Servers act not only as end hosts but also as relay nodes for multihop communication. Example: BCube. A BCube0 is simply n servers connecting to an n-port switch; a BCube1 is constructed from n BCube0s and n n-port switches.

Server Centric Data Centers In BCube, two servers are neighbors if and only if they connect to the same switch, which is the case exactly when their address arrays differ in one digit. BCube builds its routing path by "correcting" one digit at each hop from the source to the destination.
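
As a rough sketch of the digit-correcting idea (the address encoding and the fixed correction order are illustrative; this is not the BCube source code):

```python
# Hypothetical sketch of BCube's digit-correcting routing: servers are addressed by
# (k+1)-digit base-n arrays; each hop fixes one digit, which corresponds to crossing
# one switch shared by the two servers.

def bcube_route(src, dst):
    """Return a server-level path from src to dst by correcting one digit per hop.

    src, dst: tuples of base-n digits, most significant first.
    """
    path = [src]
    current = list(src)
    # Correct digits in a fixed order; any order of corrections yields a valid path.
    for i in range(len(src)):
        if current[i] != dst[i]:
            current[i] = dst[i]
            path.append(tuple(current))
    return path

# Example for BCube(n=4, k=1): route from server (0, 0) to server (2, 1).
print(bcube_route((0, 0), (2, 1)))   # [(0, 0), (2, 0), (2, 1)]
```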

Server Centric Data Centers: BCube

Switch Centric Data Centers In switch-centric DCNs, switches are the only relay nodes. PortLand and VL2 belong to this category. Generally, they use a special instance of a Clos topology, called a fat tree, to interconnect commodity Ethernet switches.

Switch Centric Data Centers PortLand includes core, aggregation, and edge switches. A Pseudo MAC (PMAC) address encodes the location of the host in 48 bits as pod.position.port.vmid. Pod (16 bits): pod number of the edge switch. Position (8 bits): position within the pod. Port (8 bits): the switch port the host connects to. Vmid (16 bits): VM id of the host.
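
To make that field layout concrete, here is a small illustrative sketch (the helper names are ours, not PortLand's code) that packs the four fields into a 48-bit PMAC in the pod.position.port.vmid order given above:

```python
# Illustrative packing of a PortLand-style 48-bit PMAC: pod(16) . position(8) . port(8) . vmid(16).
def make_pmac(pod: int, position: int, port: int, vmid: int) -> int:
    assert 0 <= pod < 2**16 and 0 <= position < 2**8
    assert 0 <= port < 2**8 and 0 <= vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def pmac_to_str(pmac: int) -> str:
    """Render the 48-bit PMAC as a colon-separated MAC address string."""
    return ":".join(f"{(pmac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))

# Host behind the edge switch in pod 2, position 1, switch port 3, VM id 5.
print(pmac_to_str(make_pmac(pod=2, position=1, port=3, vmid=5)))  # 00:02:01:03:00:05
```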

Switch Centric DataCenters PortLand switches forward a packet based on its destination PMAC address.

Common Issues The top-level switches are the bandwidth bottleneck, so high-end, high-speed switches have to be used. Moreover, a high-level switch is a single point of failure for its subtree. Using redundant switches does not fundamentally solve the problem, but it does incur even higher cost.

Improving performance via Multipath TCP

Improving Performance via Multipath TCP Datacenter applications are distributed across thousands of machines, and we want any machine to be able to play any role. To achieve this: use dense, parallel datacenter topologies and map each flow to a path. Problem: naive random allocation gives poor performance, and improving performance adds complexity.

The Two Key Questions MPTCP can greatly improve performance in today's data centers. Under which circumstances does it do so, how big are the benefits, and on what do they depend? If MPTCP were deployed, how could data centers be designed differently in the future to take advantage of its capabilities?

Main Components of a data center networking architecture: the physical topology; routing over the topology; selection between the multiple paths supplied by routing; and congestion control of traffic on the selected paths.

Topology The denseness of interconnection these topologies provide poses its own problem: determining how traffic should be routed.

Fat Tree Topology [Al-Fares et al., 2008; Clos, 1953] (figure: k = 4 fat tree with 1 Gbps links; aggregation switches, k pods with k switches each, racks of servers)

Fat Tree Topology [Al-Fares et al., 2008; Clos, 1953] (figure repeated: aggregation switches, k pods with k switches each, racks of servers)
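
For reference, the standard counts for a k-ary fat tree (from Al-Fares et al., 2008) can be evaluated directly; this small sketch just computes those closed-form sizes:

```python
# Closed-form sizes of a k-ary fat tree (Al-Fares et al., 2008); k must be even.
def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0
    return {
        "pods": k,
        "edge_switches": k * (k // 2),
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,
        # Number of equal-cost paths between two hosts in different pods:
        "paths_between_pods": (k // 2) ** 2,
    }

print(fat_tree_sizes(4))   # 4 pods, 16 hosts, 4 core switches, 4 inter-pod paths
```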

Collisions

Single-path TCP collisions reduce throughput

Routing Dense interconnection topologies provide many possible parallel paths between each pair of hosts, and the routing system must spread traffic across these paths. The simplest solution is randomized load balancing: if each switch uses a link-state routing protocol to provide ECMP forwarding, then, based on a hash of the five-tuple in each packet, flows will be split roughly equally across equal-length paths, as sketched below.
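
A toy illustration of that per-flow hashing (the hash choice and field names are placeholders, not any particular switch's implementation):

```python
# Toy ECMP path selection: hash the flow five-tuple and use it to index into the list
# of equal-cost next hops, so all packets of one flow follow the same path.
import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    """five_tuple: (src_ip, dst_ip, src_port, dst_port, protocol); next_hops: list of links."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

paths = ["core1", "core2", "core3", "core4"]
flow_a = ("10.0.1.2", "10.0.3.4", 52311, 80, "tcp")
flow_b = ("10.0.1.2", "10.0.3.4", 52312, 80, "tcp")
# Two distinct flows may still hash to the same core link: that is a collision.
print(ecmp_next_hop(flow_a, paths), ecmp_next_hop(flow_b, paths))
```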

Path Selection ECMP or multiple VLANs provide the basis for randomized load balancing as the default path-selection mechanism. Randomized load balancing cannot achieve the full bisection bandwidth and is not fair: it allows hot spots to develop. To address these issues, a centralized flow scheduler has been proposed, but a scheduler running every 500 ms has performance similar to randomized load balancing when its assumptions (in particular, that flows are long-lived) do not hold.

Collision

Not fair

No matter how you do it, mapping each flow to a path is the wrong goal

Instead, pool capacity from different links

Multipath Transport

Multipath Transport can pool datacenter networks – Instead of using one path for each flow, use many random paths – Don’t worry about collisions. – Just don’t send (much) traffic on colliding paths

MPTCP is a drop-in replacement for TCP. MPTCP spreads application data over multiple subflows. Multipath TCP Primer [IETF MPTCP WG]

Congestion Control MPTCP can establish multiple subflows on different paths between the same pair of endpoints for a single TCP connection. “By linking the congestion control dynamics on these multiple subflows, MPTCP can explicitly move traffic off more congested paths and place it on less congested ones”

Congestion Control Given sufficiently many randomly chosen paths, MPTCP will find at least one good, unloaded path and move most of its traffic onto it. This relieves congestion on links that receive more than their fair share of ECMP-balanced flows and allows the competing flows to achieve their full potential, maximizing the bisection bandwidth of the network and also improving fairness.

Congestion Control Each MPTCP subflow has its own sequence space and maintains its own congestion window. For each ACK on subflow r, increase the window w_r by min(a/w_total, 1/w_r). For each loss on subflow r, halve the window w_r.
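
As a concrete reading of that rule, here is a minimal Python sketch (the class names and the choice of the parameter a are illustrative; this is not the MPTCP implementation, and the computation of a is omitted):

```python
# Sketch of the coupled increase rule stated above: per ACK on subflow r,
# w_r += min(a / w_total, 1 / w_r); per loss, w_r is halved (floored at one packet here).
# Windows are in packets, kept as floats for simplicity.

class MptcpSubflow:
    def __init__(self, cwnd=1.0):
        self.cwnd = cwnd

def on_ack(subflows, r, a):
    w_total = sum(s.cwnd for s in subflows)
    s = subflows[r]
    s.cwnd += min(a / w_total, 1.0 / s.cwnd)

def on_loss(subflows, r):
    subflows[r].cwnd = max(subflows[r].cwnd / 2.0, 1.0)

# Two subflows of one connection: the coupled increase keeps the pair from being
# more aggressive in aggregate than a single TCP flow on its best path.
flows = [MptcpSubflow(), MptcpSubflow()]
for _ in range(100):
    on_ack(flows, 0, a=1.0)
    on_ack(flows, 1, a=1.0)
print([round(f.cwnd, 2) for f in flows])
```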

Multipath TCP: Congestion Control [NSDI, 2011]

MPTCP better utilizes the FatTree network

MPTCP on EC2 Amazon EC2: infrastructure as a service – We can borrow virtual machines by the hour – These run in Amazon data centers worldwide – We can boot our own kernel A few availability zones have multipath topologies – 2-8 paths available between hosts not on the same machine or in the same rack – Available via ECMP

Amazon EC2 Experiment 40 medium CPU instances running MPTCP For 12 hours, they sequentially ran all-to-all iperf cycling through: – TCP – MPTCP (2 and 4 subflows)

MPTCP improves performance on EC2 (figure; the "Same Rack" label marks flows between hosts in the same rack)

Analysis Examine how MPTCP performs in a range of topologies and with a varying number of subflows

What do the benefits depend on? How many subflows are needed? How does the topology affect results? How does the traffic matrix affect results?

At most 8 subflows are needed (figure: total throughput versus number of subflows, with single-path TCP as the baseline)

MPTCP improves fairness in VL2 topologies (figure: VL2). Fairness is important: jobs finish when the slowest worker finishes.

MPTCP improves throughput and fairness in BCube (figure: BCube)

Oversubscribed Topologies To saturate the full bisection bandwidth, there must be no traffic locality, all hosts must send at the same time, and host links must not be bottlenecks. It therefore makes sense to under-provision the network core, and this is what happens in practice. Does MPTCP still provide benefits?

Performance improvements depend on the traffic matrix (figure: as load increases, the network moves from underloaded through a "sweet spot" to overloaded)

What is an optimal datacenter topology for multipath transport?

In single-homed topologies, host links are often bottlenecks, and ToR switch failures wipe out tens of hosts for days. Multi-homing servers is the obvious way forward.

Fat Tree Topology

(figure: servers attached to a ToR switch, which connects up to an upper pod switch)

Dual Homed Fat Tree Topology (figure: servers, ToR switches, upper pod switches)

Is DHFT any better than a fat tree? Not for traffic matrices that fully utilize the core. Let's examine random traffic patterns (other traffic matrices are covered in the paper).

DHFT provides significant improvements when the core is not overloaded (figure: core overloaded vs. core underloaded)

Improving performance through Multicast Routing

Introduction Multicast benefits group communication by saving network traffic and improving application throughput. The technical trends in future data center design pose new challenges for efficient and scalable Multicast routing.

Challenges Densely connected networks make traditional receiver-driven Multicast routing protocols inefficient at Multicast tree formation. It is also difficult for the low-end switches used in data centers to hold the routing entries of a massive number of Multicast groups.

Approach Use a source-to-receiver expansion approach to build efficient Multicast trees, excluding many of the unnecessary intermediate switches used in receiver-driven Multicast. For scalable Multicast routing, combine in-packet Bloom Filters and in-switch entries to trade off the number of Multicast groups supported against the additional bandwidth overhead.

General Multicast

Multicast Trees

Multicast Tree Formation Building the lowest-cost Multicast tree covering a given set of nodes on a general graph is the well-known Steiner tree problem. The problem is NP-hard, and there are many approximation algorithms.

Multicast Tree Formation in Data Centers For data center Multicast, BCube proposes an algorithm to build server-based Multicast trees, with switches used only as dummy crossbars. But network-level Multicast that involves the switches can obviously save much more bandwidth than the server-based approach. In VL2, traditional IP Multicast protocols are used for tree building.

Scalable Multicast Routing For data center networks built from low-end switches with limited routing table space, scalable Multicast routing is very challenging. One possible solution is to aggregate a number of Multicast routing entries into a single one, as is done for Unicast. A Bloom Filter can be used to compress in-switch Multicast routing entries.

Scalable Multicast Routing Alternatively, encode the tree information into an in-packet Bloom Filter, so there is no need to install any Multicast routing entries in network equipment. But the in-packet Bloom Filter field adds bandwidth overhead. In this paper, they achieve scalable Multicast routing by trading off the number of Multicast groups supported against the additional bandwidth overhead.

Bloom Filters A Bloom filter is a probabilistic data structure designed to tell, rapidly and memory-efficiently, whether an element is present in a set. It reports that an element is either definitely not in the set or possibly in the set.
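
A minimal Bloom filter sketch (parameters and hashing are illustrative, not the paper's implementation) that shows why the answers are "definitely not present" or "possibly present":

```python
# Minimal Bloom filter: k hash functions set/check k bit positions in an m-bit array.
# A zero bit proves absence; all-ones only means the element *may* be present.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=128, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)          # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for link in ["w0->v1", "v1->w5", "w5->v5"]:      # e.g., encode a multicast tree's links
    bf.add(link)
print(bf.might_contain("w5->v5"))   # True (definitely added)
print(bf.might_contain("w4->v8"))   # usually False; True only on a false positive
```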

Efficient Multicast Tree Building: The Problem Densely connected data center networks imply a large number of candidate trees for a group. Given the multiple equal-cost paths between servers/switches, it is undesirable to run traditional receiver-driven Multicast routing protocols such as PIM for tree building. Why? Because independent path selection by receivers can result in many unnecessary intermediate links.

BCube Example

Assume the receiver set is {v5, v6, v9, v10} and the sender is v0. Using receiver-driven Multicast routing, the resulting Multicast tree can have 14 links, representing the tree as the paths from the sender to each receiver:
v0 -> w0 -> v1 -> w5 -> v5
v0 -> w4 -> v4 -> w1 -> v6
v0 -> w4 -> v8 -> w2 -> v9
v0 -> w0 -> v2 -> w6 -> v10

BCube Example However, an efficient Multicast tree for this case includes only 9 links if we construct it the following way:
v0 -> w0 -> v1 -> w5 -> v5
v0 -> w0 -> v2 -> w6 -> v6
v0 -> w0 -> v1 -> w5 -> v9
v0 -> w0 -> v2 -> w6 -> v10
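
The 14-link and 9-link counts follow from taking the union of hop-by-hop links over the listed paths; a quick sketch of that bookkeeping:

```python
# Count the distinct links in a multicast tree described as sender-to-receiver paths.
def tree_links(paths):
    links = set()
    for path in paths:
        hops = [h.strip() for h in path.split("->")]
        links.update(zip(hops, hops[1:]))      # consecutive pairs are the links
    return links

receiver_driven = [
    "v0 -> w0 -> v1 -> w5 -> v5", "v0 -> w4 -> v4 -> w1 -> v6",
    "v0 -> w4 -> v8 -> w2 -> v9", "v0 -> w0 -> v2 -> w6 -> v10",
]
source_driven = [
    "v0 -> w0 -> v1 -> w5 -> v5", "v0 -> w0 -> v2 -> w6 -> v6",
    "v0 -> w0 -> v1 -> w5 -> v9", "v0 -> w0 -> v2 -> w6 -> v10",
]
print(len(tree_links(receiver_driven)), len(tree_links(source_driven)))  # 14 9
```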

The Approach, Continued Receivers send join/leave requests to a Multicast Manager. The Multicast Manager then calculates the Multicast tree based on the data center topology and the group membership distribution. Data center topologies are regular graphs, so the Multicast Manager can easily maintain the topology information (with failure management). The problem then becomes how to calculate an efficient Multicast tree on the Multicast Manager.

Source Driven Tree Building Recently proposed data center architectures (BCube, PortLand, VL2) use several levels of switches for server interconnection, and switches within the same level are not directly connected; hence, they are multistage graphs. Group spanning graph: the possible paths from the Multicast source to all receivers can be expanded as a directed multistage graph with d + 1 stages. For example: the sender is v0 and the receiver set is {v1, v5, v9, v10, v11, v12, v14}.

Group Spanning Graph: BCube

Source Driven Tree Building A covers B: for any two node sets A and B in a group spanning graph, A covers B if and only if, for each node j ∈ B, there exists a directed path from some node i ∈ A to j in the group spanning graph. A strictly covers B: if A covers B and no proper subset of A covers B, then A strictly covers B.
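
These definitions can be checked mechanically; below is a small sketch (illustrative, not from the paper) that tests "covers" via reachability and "strictly covers" by checking all proper subsets:

```python
# Sketch of the "covers" / "strictly covers" checks on a directed group spanning graph,
# given as an adjacency dict {node: [successors, ...]}.
from itertools import combinations

def reachable(graph, start):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

def covers(graph, A, B):
    """A covers B iff every node in B is reachable from some node in A."""
    reach = set().union(*(reachable(graph, a) for a in A)) if A else set()
    return set(B) <= reach

def strictly_covers(graph, A, B):
    """A covers B and no proper subset of A still covers B."""
    return covers(graph, A, B) and not any(
        covers(graph, subset, B)
        for r in range(len(A))
        for subset in combinations(A, r)
    )

# Tiny example: v0 reaches v1 and v2 through w0, so {v0} covers {v1, v2},
# while {v0, v1} does not strictly cover it (v1 is redundant).
g = {"v0": ["w0"], "w0": ["v1", "v2"]}
print(covers(g, ["v0"], ["v1", "v2"]), strictly_covers(g, ["v0", "v1"], ["v1", "v2"]))  # True False
```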

Source Driven Tree Building They propose to build the Multicast tree in a source-to-receiver expansion way upon the group spanning graph, with the tree node set from each stage strictly covering the downstream receivers. The merits are twofold. 1) Many unnecessary intermediate switches used in receiver-driven Multicast routing are eliminated. 2) The source-to-receiver latency is bounded by the number of stages of the group spanning graph, i.e., the diameter of the data center topology, which favors delay-sensitive applications such as redirecting search queries to indexing servers.

Source Driven Tree Building: BCube The tree node selection in a BCube network can be conducted iteratively on the group spanning graph. For a BCube(n,k) with sender s, first select the set of servers from stage 2 of the group spanning graph that are covered by both s and a single switch in stage 1. Assume the server set in stage 2 is E and the switch selected in stage 1 is W. The tree node set for BCube(n,k) is then the union of the tree node sets for |E| + 1 BCube(n,k-1)s. |E| of the BCube(n,k-1)s each have a server in E as the source p, with the receivers in stage 2*(k+1) covered by p as the receiver set.

Source Driven Tree Building: BCube The other BCube(n,k-1) has s as the source, and its receiver set consists of the receivers in stage 2*k that are covered by s but not covered by W. In the same way, obtain the tree node set in each BCube(n,k-1) by dividing it into several BCube(n,k-2)s. The process iterates until all the BCube(n,0)s are reached. Hence, the computation complexity is O(N), where N is the total number of servers in BCube.

Dynamic Receiver Join A dynamic receiver join/leave does not change the source-to-end paths of the other receivers in the group. When a new receiver rj joins an existing group in a BCube(n,k), first recompute the group spanning graph involving rj. Then, in the group spanning graph, check whether there is a BCube(n,0) from the previous Multicast tree calculation that can hold rj. If so, add rj to that BCube(n,0). Otherwise, try to find a BCube(n,1) from the previous Multicast tree calculation that can hold rj and add to it a BCube(n,0) containing rj. If no such BCube(n,1) can be found, try to find a BCube(n,2) and add a corresponding BCube(n,1), and so on, until rj is successfully added to the Multicast tree. In this way, the final tree obeys the proposed tree-building algorithm, and there is no need to change the source-to-end paths of the existing receivers in the Multicast group.

Dynamic Receiver Leave When a receiver rl leaves the group in a BCube(n,k), regenerate the group spanning graph by eliminating rl. Then, if the deletion of rl leaves zero BCube(n,m-1)s in some BCube(n,m) from the previous Multicast tree calculation, eliminate the nodes in that BCube(n,m). This process does not change the source-to-end paths of the other receivers either.

Results: Number of Links

Computation Time

Bandwidth Overhead Ratio The ratio of the additional traffic caused by the in-packet Bloom Filter to the actual payload traffic to be carried. Assume the packet length (including the Bloom Filter field) is p, the length of the in-packet Bloom Filter field is f, the number of links in the Multicast tree is t, and the number of links actually covered by Bloom Filter based forwarding is c; the ratio then follows from these quantities.
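
The transcript cuts off before the expression itself. From the definitions above, one natural reconstruction (an assumption, not a quotation from the paper) is that Bloom-filter forwarding sends full packets of length p over c links, while the useful payload is p - f bytes over the t tree links:

```latex
% Reconstructed from the stated definitions (assumption): extra links beyond t come
% from false positives, and every forwarded packet carries the f-byte filter field.
\[
  \text{overhead ratio}
  = \frac{c \cdot p - t \cdot (p - f)}{t \cdot (p - f)}
  = \frac{c \cdot p}{t \cdot (p - f)} - 1
\]
```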

Bandwidth Overhead Ratio To reduce the bandwidth overhead of the in-packet Bloom Filter, either control the false-positive ratio during packet forwarding or limit the size of the Bloom Filter field. Inferences: when the Bloom Filter length is shorter than the optimal value, false-positive forwarding is the major contributor to the bandwidth overhead ratio; but when the length grows beyond the optimal value, the Bloom Filter field itself dominates the bandwidth overhead.

Bandwidth Overhead Ratio

Summary “One flow, one path” thinking has constrained datacenter design – Collisions, unfairness, limited utilization Multipath transport enables resource pooling in datacenter networks: – Improves throughput – Improves fairness – Improves robustness “One flow, many paths” frees designers to consider topologies that offer improved performance for similar cost. Receiver-driven Multicast routing does not perform well in densely connected data center networks. For scalable Multicast routing on low-end data center switches, combine both in-packet Bloom Filters and inswitch entries. This can save 40% - 50% of network traffic.