1
Data Center Networking
Dan LI CS Department, Tsinghua University 2018/11/21
2
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
3
Data Centers
While cloud computing is a logical concept, it is in fact realized by data centers. A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls (e.g., air conditioning, fire suppression), and security devices.
[Figure: data center concerns: collaboration, empowered users, SLA metrics, global availability, regulatory compliance, power and cooling, asset utilization, provisioning, security threats, business continuance.]
4
History of Data Centers
Data centers have their roots in the huge computer rooms of the early days of the computing industry. Early computer systems were complex to operate and maintain and required a special environment in which to operate. Many cables were necessary to connect all the components, and methods to accommodate and organize these were devised, such as standard racks to mount equipment, elevated floors, and cable trays (installed overhead or under the elevated floor). Old computers also required a great deal of power and had to be cooled to avoid overheating. Security was important: computers were expensive and were often used for military purposes, so basic design guidelines for controlling access to the computer room were devised.
During the boom of the microcomputer industry, and especially during the 1980s, computers started to be deployed everywhere, in many cases with little or no care about operating requirements. As information technology (IT) operations grew in complexity, however, companies grew aware of the need to control IT resources. With the advent of client-server computing during the 1990s, microcomputers (now called "servers") started to find their places in the old computer rooms. The availability of inexpensive networking equipment, coupled with new standards for network cabling, made it possible to use a hierarchical design that put the servers in a specific room inside the company. The use of the term "data center," as applied to specially designed computer rooms, started to gain popular recognition about this time.
5
History of Data Centers (Cont.)
The boom of data centers came during the dot-com bubble. Companies needed fast Internet connectivity and nonstop operation to deploy systems and establish a presence on the Internet, and installing such equipment was not viable for many smaller companies. Many companies therefore started building very large facilities, called Internet data centers (IDCs), which provide businesses with a range of solutions for systems deployment and operation. New technologies and practices were designed to handle the scale and the operational requirements of such large-scale operations; these practices eventually migrated toward private data centers and were adopted largely because of their practical results.
As of 2007, data center design, construction, and operation is a well-known discipline. Standard documents from accredited professional groups specify the requirements for data center design, and well-known operational metrics for data center availability can be used to evaluate the business impact of a disruption. There is still a lot of development in operational practice and in environmentally friendly data center design. Data centers are typically very expensive to build and maintain; for instance, Amazon.com's new 116,000 sq ft (10,800 m2) data center in Oregon is expected to cost up to $100 million.
6
Requirements for modern data centers
A data center must keep high standards for assuring the integrity and functionality of its hosted computer environment. If a system becomes unavailable, company operations may be impaired or stopped completely, so it is necessary to provide a reliable infrastructure for IT operations in order to minimize any chance of disruption. Effective data center operation requires a balanced investment in both the facility and the housed equipment. The first step is to establish a baseline facility environment suitable for equipment installation; standardization and modularity can yield savings and efficiencies in the design and construction of telecommunications data centers. Information security is also a concern, and for this reason a data center has to offer a secure environment which minimizes the chances of a security breach. This is accomplished through redundancy of both fiber optic cables and power, which includes emergency backup power generation.
7
Data center classification
The TIA-942 Data Center Standards Overview describes the requirements for the data center infrastructure. The four tier levels are defined, and copyrighted, by the Uptime Institute, a Santa Fe, New Mexico-based think tank and professional services organization. The levels describe the availability of data from the hardware at a location: the higher the tier, the greater the availability.
Tier 1: single, non-redundant distribution path serving the IT equipment; non-redundant capacity components; basic site infrastructure guaranteeing 99.671% availability.
Tier 2: fulfils all Tier 1 requirements; redundant site infrastructure capacity components guaranteeing 99.741% availability.
Tier 3: fulfils all Tier 1 and Tier 2 requirements; multiple independent distribution paths serving the IT equipment; all IT equipment dual-powered and fully compatible with the topology of the site's architecture; concurrently maintainable site infrastructure guaranteeing 99.982% availability.
Tier 4: fulfils all Tier 1, Tier 2 and Tier 3 requirements; all cooling equipment independently dual-powered, including chillers and Heating, Ventilating and Air Conditioning (HVAC) systems; fault-tolerant site infrastructure with electrical power storage and distribution facilities guaranteeing 99.995% availability.
The simplest is a Tier 1 data center, which is basically a server room following basic guidelines for the installation of computer systems. The most stringent level is a Tier 4 data center, designed to host mission-critical computer systems, with fully redundant subsystems and compartmentalized security zones controlled by biometric access control methods. Another consideration is the placement of the data center in a subterranean context, for data security as well as environmental considerations such as cooling requirements.
8
Physical layout A data center can occupy one room of a building, one or more floors, or an entire building. Most of the equipment is often in the form of servers mounted in 19 inch rack cabinets, which are usually placed in single rows forming corridors between them. This allows people access to the front and rear of each cabinet. Servers differ greatly in size from 1U servers to large freestanding storage silos which occupy many tiles on the floor. Some equipment such as mainframe computers and storage devices are often as big as the racks themselves, and are placed alongside them. Very large data centers may use shipping containers packed with 1,000 or more servers each; when repairs or upgrades are needed, whole containers are replaced (rather than repairing individual servers).
9
Physical layout (Cont.)
The physical environment of a data center is rigorously controlled. Air conditioning is used to control the temperature and humidity; a temperature range of 16–24 °C (61–75 °F) and a humidity range of 40–55% with a maximum dew point of 15 °C are considered optimal for data center conditions. The temperature in a data center will naturally rise because the electrical power used heats the air; unless the heat is removed, the ambient temperature will rise, resulting in electronic equipment malfunction. By controlling the air temperature, the server components at the board level are kept within the manufacturer's specified temperature/humidity range. Air conditioning systems help control humidity by cooling the return space air below the dew point. Modern data centers try to use economizer cooling, where they use outside air to keep the data center cool; Washington state now has a few data centers that cool all of their servers using outside air 11 months out of the year, without chillers or air conditioners.
10
Physical layout (Cont.)
Backup power consists of one or more uninterruptible power supplies and/or diesel generators. To prevent single points of failure, all elements of the electrical systems, including backup systems, are typically fully duplicated, and critical servers are connected to both the "A-side" and "B-side" power feeds. This arrangement is often made to achieve N+1 redundancy in the systems. Static switches are sometimes used to ensure instantaneous switchover from one supply to the other in the event of a power failure. Data centers typically have raised flooring made up of 60 cm (2 ft) removable square tiles; the trend is towards an 80–100 cm (31–39 in) void to cater for better and more uniform air distribution. These provide a plenum for air to circulate below the floor, as part of the air conditioning system, as well as providing space for power cabling.
11
Systems & Power Density
Estimating data center power density is difficult (15+ year horizon). Power is 40% of DC costs, and power plus mechanical is 55% of cost; the shell is roughly 15% of DC cost, so it is cheaper to waste floor than power. Densities are typically 100 to 200 W/sq ft, rarely as high as 350 to 600 W/sq ft. A modular DC eliminates the impossible shell-to-power trade-off: add modules until the power is absorbed, with 480VAC to the container and high-efficiency DC distribution within; high voltage to the rack can save >5% over 208VAC. Over 20% of entire DC costs is in power redundancy: batteries able to supply up to 12 min at some facilities, N+2 generation at over $2M each. Instead, use more smaller, cheaper data centers: eliminate redundant power and the bulk of shell costs, and equalize resources. 2018/11/21
12
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
13
Data Center Networking
Major theme: what new networking issues are posed by large-scale data centers? Network architecture? Topology design? Addressing? Routing? Forwarding?
14
Data Center Interconnection Structure
Nodes in the system: racks of servers. How are the nodes (racks) inter-connected? Typically by a hierarchical inter-connection structure. Today's typical (Cisco-recommended) data center structure, starting from the bottom level: rack switches, 1-2 layers of (layer-2) aggregation switches, access routers, and core routers. Is such an architecture good enough?
15
Cisco Recommended DC Structure: Illustration
[Figure: hierarchical data center network from the Internet down through a Layer-3 tier and a Layer-2 tier. Key: CR = L3 Core Router, AR = L3 Access Router, S = L2 Switch, LB = Load Balancer, A = Rack of 20 servers with Top of Rack switch.]
16
Data Center Design Requirements
Data centers typically run two types of applications: outward facing (e.g., serving web pages to users) and internal computations (e.g., MapReduce for web indexing). Workloads are often unpredictable: multiple services run concurrently within a DC, and demand for new services may spike unexpectedly. A spike of demand for a new service means success! But this is when success spells trouble (if not prepared). Failures of servers are the norm: recall that GFS, MapReduce, etc., resort to dynamic re-assignment of chunkservers and jobs/tasks (worker servers) to deal with failures, and data is often replicated across racks. The "traffic matrix" between servers is constantly changing.
17
Data Center Costs (amortized*): ~45% servers (CPU, memory, disk); ~25% power infrastructure (UPS, cooling, power distribution); ~15% power draw (electrical utility costs); ~15% network (switches, links, transit). *3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money. Total cost runs upwards of $1/4 B for a mega data center: server costs dominate, and network costs are significant. Long provisioning timescales: new servers are purchased quarterly at best.
18
Overall Data Center Design Goal
Agility – Any service, Any Server Turn the servers into a single large fungible pool Let services “breathe” : dynamically expand and contract their footprint as needed We already see how this is done in terms of Google’s GFS, BigTable, MapReduce Benefits Increase service developer productivity Lower cost Achieve high performance and reliability These are the three motivators for most data center infrastructure projects!
19
Achieving Agility
Workload management: means for rapidly installing a service's code on a server; dynamic cluster scheduling and server assignment (e.g., MapReduce, Bigtable), virtual machines, disk images.
Storage management: means for a server to access persistent data; distributed file systems (e.g., GFS).
Network management: means for communicating with other servers, regardless of where they are in the data center; achieve high performance and reliability.
20
Networking Objectives
Uniform high capacity: capacity between servers limited only by their NICs; no need to consider topology when adding servers. In other words, high capacity between any two servers no matter which racks they are located in! Performance isolation: traffic of one service should be unaffected by others. Ease of management: "plug-&-play" (layer-2 semantics); flat addressing, so any server can have any IP address; server configuration is the same as in a LAN; legacy applications depending on broadcast must work.
21
Is Today’s DC Architecture Adequate?
Hierarchical network with 1+1 redundancy. Equipment higher in the hierarchy handles more traffic, is more expensive, and has more effort put into availability: a scale-up design. Servers connect via 1 Gbps UTP to Top-of-Rack switches; other links are a mix of 1G and 10G, fiber and copper. Uniform high capacity? Performance isolation (typically via VLANs)? Agility in terms of dynamically adding or shrinking servers? Agility in terms of adapting to failures and to traffic dynamics? Ease of management? [Figure: the same hierarchical topology as before. Key: CR = L3 Core Router, AR = L3 Access Router, S = L2 Switch, LB = Load Balancer, A = Top of Rack switch.]
22
Modern Data Center Network Architectures
Switch-centric: servers use only one port for connection, and the interconnection and routing intelligence are put into switches (Fat-Tree, PortLand, VL2). Server-centric: servers use multiple ports and participate in packet forwarding (DCell, FiConn, BCube, MDCube, CamCube).
23
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
24
A Scalable, Commodity Data Center Network Architecture
Main goal: address the limitations of today's data center network architecture: single points of failure, and over-subscription of links higher up in the topology. Key design considerations/goals: allow host communication at line speed no matter where the hosts are located; backwards compatible with existing infrastructure (no changes to applications, support of layer 2 / Ethernet); cost effective (cheap infrastructure, low power consumption and heat emission).
25
Fat-Tree Based DC Architecture
Inter-connect racks (of servers) using a fat-tree topology. Fat-Tree: a special type of Clos network (after C. Clos). K-ary fat tree: three-layer topology (edge, aggregation and core); each pod consists of (k/2)^2 servers and 2 layers of k/2 k-port switches; each edge switch connects to k/2 servers and k/2 aggregation switches; each aggregation switch connects to k/2 edge and k/2 core switches; (k/2)^2 core switches, each connecting to k pods. [Figure: fat-tree with K=2.]
26
Fat-Tree Based Topology (Cont.)
Why fat-tree? A fat tree has identical bandwidth at any bisection, and each layer has the same aggregate bandwidth. It can be built using cheap devices with uniform capacity: each port supports the same speed as an end host, and all devices can transmit at line speed if packets are distributed uniformly along the available paths. Great scalability: a k-port switch supports k^3/4 servers (see the sketch below).
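To make the scaling concrete, here is a minimal Python sketch (not from the slides; the k=48 example is illustrative) that computes the element counts of a k-ary fat-tree from the parameters above.

```python
def fat_tree_size(k):
    """Element counts of a k-ary fat-tree built from identical k-port switches."""
    servers_per_pod = (k // 2) ** 2            # k/2 edge switches x k/2 servers each
    return {
        "servers": k * servers_per_pod,        # k pods -> k^3/4 servers
        "edge_switches": k * (k // 2),         # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,        # each core switch reaches all k pods
    }

print(fat_tree_size(4))    # small example: 16 servers, 20 switches
print(fat_tree_size(48))   # commodity 48-port switches: 27,648 servers
```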
27
Cost of Maintaining Switches
Netgear ~ 3K Procurve – 4.5K
28
Fat-tree Topology is Great, But …
Is using a fat-tree topology to inter-connect racks of servers in itself sufficient? What routing protocols should we run on these switches? The layer-2 switch algorithm floods the data plane! Layer-3 IP routing uses shortest paths, which typically means only one path is used despite the path diversity in the topology; if equal-cost multi-path routing is used at each switch independently and blindly, packet re-ordering may occur, and load may not be well balanced. Aside: control-plane flooding!
29
Fat-Tree Modified: Enforce a special (IP) addressing scheme in the DC
Addresses have the form unused.PodNumber.switchnumber.Endhost. This allows hosts attached to the same switch to route only through that switch, and allows intra-pod traffic to stay within the pod. Two-level look-ups distribute traffic and maintain packet ordering: the first level is a prefix lookup, used to route down the topology to servers; the second level is a suffix lookup, used to route up towards the core, maintaining packet ordering by using the same port for the same destination server (see the sketch below).
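A minimal sketch of the two-level lookup just described, assuming IPv4 addresses of the form unused.pod.switch.host; the table contents below are invented for illustration, not taken from the paper.

```python
import ipaddress

def two_level_lookup(dst_ip, prefix_table, suffix_table):
    """First level: prefix entries route down toward a pod subnet.
    Second level: a match on the host byte routes up toward the core, so all
    packets for the same destination host leave on the same uplink port."""
    addr = ipaddress.ip_address(dst_ip)
    for prefix, port in prefix_table:
        if addr in ipaddress.ip_network(prefix):
            return port                                  # route down
    host_id = int(dst_ip.rsplit(".", 1)[1])
    return suffix_table[host_id]                         # route up

# Illustrative tables for an aggregation switch of pod 2 in a k=4 fat-tree.
prefix_table = [("10.2.0.0/24", 0), ("10.2.1.0/24", 1)]  # downlinks to edge switches
suffix_table = {2: 2, 3: 3}                              # host ID -> uplink port
print(two_level_lookup("10.2.1.3", prefix_table, suffix_table))  # 1 (down, intra-pod)
print(two_level_lookup("10.0.1.2", prefix_table, suffix_table))  # 2 (up, toward core)
```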
30
Diffusion Optimizations
Flow classification (eliminates local congestion): assign traffic to ports on a per-flow basis instead of a per-host basis. Flow scheduling (eliminates global congestion): prevent long-lived flows from sharing the same links by assigning long-lived flows to different links.
31
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
PortLand is a single "logical layer 2" data center network fabric that scales to millions of endpoints. PortLand internally separates host identity from host location: it uses the IP address as the host identifier and introduces "Pseudo MAC" (PMAC) addresses internally to encode endpoint location. PortLand runs on commodity switch hardware with unmodified hosts.
32
Design Goals for Network Fabric
Support for agility! Easy configuration and management: plug-&-play. Fault tolerance, routing and addressing: scalability. Commodity switch hardware: small switch state. Virtualization support: seamless VM migration. What are the limitations of current layer-2 and layer-3? Layer-2 (Ethernet with flat addressing) vs. layer-3 (IP with prefix-based addressing): plug-&-play? scalability? small switch state? seamless VM migration?
33
PortLand Solution. Assuming a fat-tree network topology for the DC, introduce "pseudo MAC addresses" to balance the pros and cons of flat vs. topology-dependent addressing. PMACs are "topology-dependent," hierarchical addresses, but are used only as "host locators," not "host identities"; IP addresses are used as "host identities" (for compatibility with applications). Pros: small switch state, seamless VM migration, and "eliminating" flooding in both the data and control planes. But it requires an IP-to-PMAC mapping and name resolution (a location directory service), plus a location discovery protocol and a fabric manager to support "plug-&-play".
34
PMAC Addressing Scheme
PMAC (48 bits): pod.position.port.vmid. Pod: 16 bits; position and port: 8 bits each; vmid: 16 bits. Assigned only to servers (end-hosts), by the switches (see the sketch below).
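A small sketch of packing and unpacking a 48-bit PMAC with the field widths listed above (16 + 8 + 8 + 16 bits); the example values are made up.

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod.position.port.vmid into a 48-bit PMAC (16 + 8 + 8 + 16 bits)."""
    assert pod < 1 << 16 and position < 1 << 8 and port < 1 << 8 and vmid < 1 << 16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Split a 48-bit PMAC back into its (pod, position, port, vmid) fields."""
    return ((pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF,
            (pmac >> 16) & 0xFF, pmac & 0xFFFF)

pmac = encode_pmac(pod=5, position=2, port=1, vmid=7)
print(":".join(f"{(pmac >> s) & 0xFF:02x}" for s in range(40, -1, -8)))  # 00:05:02:01:00:07
print(decode_pmac(pmac))                                                 # (5, 2, 1, 7)
```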
35
Location Discovery Protocol
Location Discovery Messages (LDMs) are exchanged between neighboring switches; switches self-discover their location on boot-up. Location characteristic and discovery technique: tree level (edge, aggregation, core): auto-discovery via neighbor connectivity; position number: aggregation switches help the edge switches decide; pod number: requested from the fabric manager (by the position-0 switch only).
36
PortLand: Name Resolution
Edge switches listen to end hosts and discover new source MACs; they install <IP, PMAC> mappings and inform the fabric manager.
37
PortLand: Name Resolution (Cont.)
Edge switches intercept ARP messages from end hosts and send a request to the fabric manager, which replies with the PMAC.
38
PortLand: Fabric Manager
Fabric manager: a logically centralized, multi-homed server that maintains the topology and <IP, PMAC> mappings in "soft state" (a minimal sketch follows).
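A toy sketch of that directory role: soft-state IP-to-PMAC mappings reported by edge switches and queried on ARP interceptions. The class and method names are mine, not PortLand's.

```python
class FabricManager:
    """Logically centralized directory holding soft-state IP -> PMAC mappings."""
    def __init__(self):
        self.ip_to_pmac = {}            # soft state, rebuilt from edge-switch reports

    def register(self, ip, pmac):       # an edge switch reports a newly seen host
        self.ip_to_pmac[ip] = pmac

    def resolve(self, ip):              # a proxied ARP request from an edge switch
        return self.ip_to_pmac.get(ip)  # None -> fall back to a (rare) broadcast

fm = FabricManager()
fm.register("10.2.1.5", 0x000502010007)
print(hex(fm.resolve("10.2.1.5") or 0))
```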
39
Loop-free Forwarding and Fault-Tolerant Routing
Switches build forwarding tables based on their position (edge, aggregation or core) and use strict "up-down semantics" to ensure loop-free forwarding. Load-balancing: use any ECMP path, via flow hashing to ensure packet ordering. Fault-tolerant routing is mostly concerned with detecting failures: the fabric manager maintains a logical fault matrix with per-link connectivity information and informs the affected switches, which re-compute their forwarding tables.
40
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
41
VL2: Motivation. The network is a bottleneck to data center computation. Today's data center network has several issues: tree architecture; congestion and computation hot spots; traffic independence; IP configuration complexity; migration; the tradeoff between reliability and utilization.
42
Conventional Data Center Architecture
packet flooding and ARP broadcasts
43
Design of VL2. Location-specific IP addresses (LAs, public): for all switches and interfaces, or external servers. Application-specific IP addresses (AAs, private): for application servers. VL2 directory system: stores the mapping of AAs to LAs and performs access control.
44
Scale-out Topologies Aggr : Int = n:m ToR : Aggr = 1:2
45
Scale-out Topologies Benefit
Risk balancing: the failure of an intermediate switch reduces the bisection bandwidth by only 1/m. Routing is extremely simple on this topology: a random path up and a random path down.
46
Randomly select Int. by ECMP
Example: the intermediate switch is randomly selected by ECMP; the encapsulated packet is created by the VL2 agent.
47
VL2 agent's work flow
① Intercepts the ARP request for the destination AA. ② Converts ① into a unicast query to the VL2 directory system. ③ Intercepts packets from the host. ④ Encapsulates each packet with the LA address obtained in ②. ⑤ Caches the mapping from AA to LA addresses. (A sketch follows.)
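A sketch of that agent path, assuming only a lookup callable for the directory system; the addresses are invented for illustration.

```python
class VL2Agent:
    """Resolve the destination AA through the directory system, cache the
    AA -> LA mapping, and encapsulate outgoing packets toward the ToR's LA."""
    def __init__(self, directory_lookup):
        self.lookup = directory_lookup   # callable: AA -> LA of the destination ToR
        self.cache = {}

    def send(self, payload, dst_aa):
        if dst_aa not in self.cache:                  # step 2: unicast directory query
            self.cache[dst_aa] = self.lookup(dst_aa)  # step 5: cache the mapping
        return {"outer_dst_la": self.cache[dst_aa],   # step 4: LA encapsulation
                "inner_dst_aa": dst_aa,
                "payload": payload}

agent = VL2Agent({"20.0.0.55": "10.0.0.4"}.get)       # toy AA -> LA mapping
print(agent.send(b"hello", "20.0.0.55"))
```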
48
VL2 Addressing and Routing
Switches run link-state routing and maintain only the switch-level topology. Servers use flat names (AAs); the directory service stores the AA-to-LA (ToR) mappings and answers lookup requests. [Figure: the directory maps AAs (x, y, z) to ToR switches; a lookup and response precedes encapsulating the payload toward the destination ToR.]
49
Random Traffic Spreading over Multiple Paths
To offer hot-spot-free performance for arbitrary traffic matrices, VL2 uses two related mechanisms: VLB and ECMP. The goals of both are similar (VLB distributes traffic across a set of intermediate nodes, ECMP distributes it across equal-cost paths), but each is needed to overcome limitations in the other. VL2 uses flows, rather than packets, as the basic unit of traffic spreading and thus avoids out-of-order delivery (see the sketch below). [Figure: up-paths and down-paths from ToRs T1–T6 through an anycast intermediate address I_ANY.]
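A minimal sketch of flow-level spreading: hashing a flow's 5-tuple to pick the intermediate switch keeps every packet of the flow on one path (no reordering) while spreading different flows across all intermediates. The hash choice and switch names are illustrative.

```python
import hashlib

def pick_intermediate(flow_5tuple, intermediates):
    """All packets of a flow bounce off the same pseudo-randomly chosen
    intermediate switch (VLB); ECMP resolves the equal-cost hops toward it."""
    digest = hashlib.sha256(repr(flow_5tuple).encode()).digest()
    return intermediates[int.from_bytes(digest[:4], "big") % len(intermediates)]

flow = ("20.0.0.55", "20.0.1.8", "tcp", 49152, 80)   # src, dst, proto, sport, dport
print(pick_intermediate(flow, ["Int1", "Int2", "Int3"]))
```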
50
VL2 Directory System
[Figure: lookups (1. lookup, 2. reply) are served by the directory servers; updates (1. update, 2. set, 3. replicate, 4. ack, 5. ack, 6. disseminate) go through the RSM servers.]
VL2 uses (1) a modest number of read-optimized, replicated directory servers (for a deployment of 100K servers) that cache AA-to-LA mappings and handle queries from VL2 agents, and (2) a small number (5-10) of write-optimized, asynchronous replicated state machine (RSM) servers that offer a strongly consistent, reliable store of AA-to-LA mappings. The directory servers ensure low latency, high throughput, and high availability for a high lookup rate, while the RSM servers ensure strong consistency and durability.
51
Advantages of VL2. Load balance: randomizing to cope with volatility, building on proven networking technology (link-state routing, equal-cost multi-path (ECMP) forwarding, IP anycast, IP multicast). Simple migration: AAs are static; only the AA-to-LA mapping needs updating. Eliminates the ARP and DHCP scaling bottlenecks.
52
Evaluation testbed: 80 servers (5 for the directory system). Intermediate switches ×3: 24 10Gbps Ethernet ports (3 for Aggr). Aggregation switches ×3: 24 10Gbps Ethernet ports (3 for Aggr, 3 for ToR). ToR switches ×4: 24 1Gbps ports.
53
Uniform high capacity: all-to-all data shuffle stress test with 75 servers, each delivering 500 MB to every other server (a 2.7 TB shuffle). The maximal achievable goodput is 62.3 Gbps; VL2 achieves an aggregate goodput of 58.8 Gbps, i.e., a network efficiency of 58.8/62.3 = 94%. [Figure: aggregate goodput during the shuffle.]
54
Fairness: 75 nodes, real data center workload. Plot Jain's fairness index for traffic to the intermediate switches, as seen at each aggregation switch. [Figure: the fairness index over time (0.94–1.00 scale) for the three aggregation switches.]
55
Performance isolation, with two types of services. Service one: 18 servers do a single TCP transfer all the time. Service two: 19 servers each start an 8 GB transfer over TCP every 2 seconds; in a second experiment, service two instead bursts short TCP connections.
56
Convergence after link failures: 75 servers, all-to-all data shuffle; links between intermediate and aggregation switches are disconnected. [Figure: aggregate goodput as all links to switches Intermediate1 and Intermediate2 are unplugged in succession and then reconnected in succession; approximate times of link manipulation are marked with vertical lines.] The network re-converges in under 1 s after each failure and demonstrates graceful degradation.
57
Critique: extra servers are needed to support the VL2 directory system, which adds device cost and is hard to implement for data centers with tens of thousands of servers; all links and switches are working all the time, which is not power efficient; and there is no evaluation of real-time performance.
58
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
59
DCell: the existing tree structure does not scale
Scaling up requires expensive high-end switches, and the tree has single points of failure and bandwidth bottlenecks; experiences from real systems motivate DCell.
60
DCell Ideas
#1: Use mini-switches to scale out. #2: Leverage servers as part of the routing infrastructure: servers have multiple ports and need to forward packets. #3: Use recursion to scale, and build complete graphs to increase capacity.
61
DCell: the Construction
[Figure: the DCell construction for n=2 servers per DCell_0, shown for k=0, 1, 2; servers and mini-switches.] The construction is recursive: end the recursion by building DCell_0 (n servers on a mini-switch), build the sub-DCells, then connect the sub-DCells to form a complete graph.
62
Scalability: the number of servers scales doubly exponentially with the level k; with 8 servers in a DCell_0 (n=8) and 4 server ports (i.e., k=3), N = 27,630,792 (see the sketch below). Fault-tolerance: the bisection width is large. No severe bottleneck links: under an all-to-all traffic pattern, the number of flows in a level-i link is small, whereas for a tree under all-to-all traffic the maximum number of flows in a link grows in proportion to the square of the number of servers.
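A quick sketch of how the server count grows, using the recursion implied by the construction (a DCell_k is a complete graph of t_{k-1} + 1 DCell_{k-1}s); it reproduces the N = 27,630,792 figure above.

```python
def dcell_servers(n, k):
    """Servers in a DCell_k: t_0 = n and t_k = t_{k-1} * (t_{k-1} + 1),
    since a DCell_k is built from t_{k-1} + 1 fully meshed DCell_{k-1}s."""
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

print(dcell_servers(8, 3))   # 27,630,792 servers for n=8, k=3 (as on the slide)
print(dcell_servers(4, 2))   # 420, matching the path-length table two slides on
```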
63
Routing without Failure: DCellRouting
[Figure: DCellRouting splits the path from src to dst at the link (n1, n2) that connects their two sub-DCells.] Routing in a DCell-based DCN cannot use a global link-state routing scheme (e.g., OSPF), since DCell is designed to network up to millions of servers. It also needs to handle failures (due to hardware, software, and power problems), which are common in data centers. DCell therefore proposes a routing scheme for the failure-free case and a broadcast scheme. Time complexity: 2k+1 steps to get the whole path, k+1 to get the next hop.
64
DCellRouting (cont.) Network diameter: the maximum path length using DCellRouting in a DCell_k is at most 2^(k+1) - 1. But DCellRouting is NOT shortest-path routing, and 2^(k+1) - 1 is NOT a tight diameter bound for DCell. However, DCellRouting is close to shortest-path routing and is much simpler: O(k) steps to decide the next hop. Mean and max path lengths of shortest-path routing vs. DCellRouting:
n=4, k=2, N=420: shortest-path mean 4.87 (max 7), DCellRouting mean 5.16
n=5, k=2, N=930: shortest-path mean 5.22, DCellRouting mean 5.50
n=6, k=2, N=1806: shortest-path mean 5.48, DCellRouting mean 5.73
n=4, k=3, N=176,820: shortest-path mean 9.96 (max 15), DCellRouting mean 11.29
n=5, k=3, N=865,830: shortest-path mean 10.74, DCellRouting mean 11.98
n=6, k=3, N=3,263,442: shortest-path mean 11.31, DCellRouting mean 12.46
65
DFR: DCell Fault-tolerant Routing
Design goal: support millions of servers. Advantages to exploit: DCellRouting and the DCell topology. Ideas: #1 local-reroute and proxy to bypass failed links, taking advantage of the complete-graph topology; #2 local link-state, to avoid the loops that local-reroute alone can cause; #3 jump-up for rack failure, to bypass a whole failed rack. DFR is a distributed, fault-tolerant routing algorithm for DCell networks without global link state, using DCellRouting and DCellBroadcast as building blocks. It handles three types of failures: node failure, rack failure, and link failure. Rack failure occurs when all the machines in a rack fail (e.g., due to a power outage); link failure is the basic case, since all of these failures manifest as link failures, so link-failure management is a basic part of DFR.
66
DFR: DCell Fault-tolerant Routing
[Figure: a failed level-L link on the DCellRouting path from src to dst is bypassed through a proxy DCell_b.] DFR uses three techniques: local reroute, local link-state, and jump-up, to address link failure, node failure, and rack failure, respectively. Servers in the same DCell_b share local link state.
67
DFR Simulations: Server failure
Two DCells: n=4, k=3 -> N=176,820 and n=5, k=3 -> N=865,830. [Figure: path failure ratio versus node failure ratio under node failures.] DFR achieves results very close to SPF: even when the node failure ratio is as high as 20%, DFR achieves a 22.3% path failure ratio for n=4, while the bound is 20%. When the node failure ratio is below 0.1, DFR performs almost identically to SPF, and DFR performs even better as n gets larger.
68
DFR Simulations: Link failure
Two DCells: n=4, k=3 -> N=176,820 and n=5, k=3 -> N=865,830. [Figure: path failure ratio under link failures, which occur when wiring is broken.] The path failure ratio of DFR increases with the link failure ratio, while the path failure ratio of SPF is almost 0, because very few nodes are disconnected from the graph (which shows the robustness of the DCell structure). DFR cannot achieve that performance since it is not globally optimal, but when the failure ratio is small (say, less than 5%) its performance is still very close to SPF.
69
Implementation
DCell protocol suite design: applications only see TCP/IP; routing is in the DCN layer (IP addresses can be flat). Software implementation: a 2.5-layer approach between TCP/IP and Ethernet, using the CPU for packet forwarding; the next step is to offload packet forwarding to hardware. DCell has been implemented by MSRA. [Figure: protocol stack APP / TCP/IP / DCN (routing, forwarding, address mapping) / Ethernet, on an Intel PRO/1000 PT Quad Port Server Adapter.]
70
Testbed: a DCell_1 with 20 servers, made of 5 DCell_0s; each DCell_0 has 4 servers; Ethernet wires and 8-port mini-switches ($50 each).
71
Fault Tolerance: DCell fault-tolerant routing can handle various failures: link failure, server/switch failure, rack failure. [Figure: TCP throughput during a link failure and a server shutdown.]
72
Related Work Here is the comparison of different network structures.
We can see from the table that DCell achieves very good performance.
73
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
74
BCube design goals High network capacity for various traffic patterns
One-to-one unicast; one-to-all and one-to-several reliable groupcast; all-to-all data shuffling. Only use low-end, commodity switches. Graceful performance degradation: performance degrades gracefully as server/switch failures increase.
75
BCube structure and connecting rule
A BCube_k network supports n^(k+1) servers, where n is the number of servers in a BCube_0 and k is the level of that BCube. A server is assigned a BCube address (a_k, a_k-1, ..., a_0) where a_i is in [0, n-1] and i is in [0, k]; neighboring server addresses differ in only one digit (see the sketch below). BCube is a recursively defined structure with two types of devices: servers with multiple ports, and switches that connect a constant number of servers. Connecting rule: the i-th server in the j-th BCube_0 connects to the j-th port of the i-th level-1 switch. [Figure: a BCube_1 with n=4: level-1 switches <1,0>-<1,3> on top, and four BCube_0s with level-0 switches <0,0>-<0,3> connecting servers 00-33.]
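A small sketch of the address structure just described: a BCube_k holds n^(k+1) servers, and the servers behind a level-i switch differ only in digit a_i. Addresses are written (a_k, ..., a_0) as on the slides; the helper names are mine.

```python
def bcube_servers(n, k):
    """A BCube_k built from n-port switches holds n^(k+1) servers."""
    return n ** (k + 1)

def level_neighbors(addr, level, n):
    """Servers sharing addr's level-`level` switch: only digit a_level differs.
    With addresses written (a_k, ..., a_0), digit a_level sits at index k - level."""
    k = len(addr) - 1
    p = k - level
    return [addr[:p] + (d,) + addr[p + 1:] for d in range(n) if d != addr[p]]

print(bcube_servers(4, 1))             # 16 servers in the BCube_1 above
print(level_neighbors((2, 0), 0, 4))   # behind level-0 switch <0,2>: 21, 22, 23
print(level_neighbors((2, 0), 1, 4))   # behind level-1 switch <1,0>: 00, 10, 30
```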
76
BCube: a server-centric network
Switches never connect to other switches and only act as L2 crossbars; servers control routing, load balancing, and fault tolerance. [Figure: a packet from server 20 (src) to server 03 (dst) is relayed by server 23: the first hop carries MAC20 -> MAC23 through switch <0,2>, the second hop MAC23 -> MAC03 through switch <1,3>, while the BCube addresses 20 and 03 in the header stay unchanged; each switch holds a simple port-to-MAC table.]
77
Multi-paths for one-to-one traffic
Theorem 1: the diameter of a BCube_k is k+1. Theorem 2: there are k+1 parallel paths between any two servers in a BCube_k. Two paths between a source server and a destination server are parallel if they are node-disjoint, i.e., the intermediate servers and switches on one path do not appear on the other. In practice k is a small integer, typically at most 3, so BCube is a low-diameter network (see the sketch below). [Figure: parallel paths in the BCube_1 example.]
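A sketch of where the k+1 parallel paths come from in the simplest case, when the two addresses differ in every digit: each path corrects one digit per hop, each starting from a different digit position and cycling through the rest, so no intermediate server repeats across paths. (The BCube paper's BuildPathSet also handles partially matching addresses via one-hop detours, which this sketch omits.)

```python
def parallel_paths(src, dst):
    """k+1 node-disjoint paths between BCube servers whose addresses differ
    in every digit: path i corrects digit positions in a cyclic order that
    starts at position i, one digit (i.e., one switch hop) at a time."""
    k = len(src) - 1
    paths = []
    for start in range(k + 1):
        hop, path = list(src), [tuple(src)]
        for step in range(k + 1):
            pos = (start - step) % (k + 1)   # which digit to correct at this hop
            hop[pos] = dst[pos]
            path.append(tuple(hop))
        paths.append(path)
    return paths

for p in parallel_paths((2, 0), (0, 3)):     # servers 20 and 03 in the BCube_1 above
    print(" -> ".join("".join(map(str, hop)) for hop in p))
# 20 -> 00 -> 03   and   20 -> 23 -> 03
```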
78
Speedup for one-to-several traffic
Theorem 3: server A and the set of servers {d_i | d_i is A's level-i neighbor} form an edge-disjoint complete graph of diameter 2. Edge-disjoint complete graphs with k+2 servers can be efficiently constructed in a BCube_k; these complete graphs can speed up data replication in distributed file systems like GFS. [Figure: the replica paths P1 and P2 in the BCube_1 example.]
79
Speedup for one-to-all traffic
Theorem 4: there are k+1 edge-disjoint spanning trees in a BCube_k. The one-to-all and one-to-several spanning trees can be implemented with TCP unicast to achieve reliability. [Figure: the two edge-disjoint server spanning trees with server 00 as the source in the BCube_1 network.]
80
BCube Source Routing (BSR)
Server-centric source routing: the source server decides the best path for a flow by probing a set of parallel paths, and adapts to network conditions by re-probing periodically or upon failures. BSR design rationale: network structural properties, scalability, and routing performance. The routing protocol must fully utilize the high capacity (e.g., multi-path) of BCube and automatically load-balance the traffic; existing routing protocols such as OSPF and IS-IS cannot meet these requirements, and it is unlikely that they can scale to several thousands of routers. BCube therefore uses a source routing protocol, BSR, that leverages BCube's topological properties; BSR achieves load balance and fault tolerance, and enables graceful performance degradation.
81
Path compression and fast packet forwarding
A traditional address array needs 16 bytes: Path(00,13) = {02, 22, 23, 13}. The Next Hop Index (NHI) array needs only 4 bytes: Path(00,13) = {0:2, 1:2, 0:3, 1:1}, where each entry records the level of the digit that changes and its new value (see the sketch below). Forwarding table of server 23 (NHI -> output port, next-hop MAC): 0:0 -> port 0, MAC20; 0:1 -> port 0, MAC21; 0:2 -> port 0, MAC22; 1:0 -> port 1, MAC03; 1:1 -> port 1, MAC13; 1:3 -> port 1, MAC33. [Figure: the path 00 -> 02 -> 22 -> 23 -> 13 in the BCube_1 example.]
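A sketch of the NHI encoding, which reproduces the example above: each hop stores only the level of the digit that changes and its new value.

```python
def nhi_array(path):
    """Compress a BCube path into Next Hop Index entries "level:digit".
    Consecutive hops differ in exactly one digit; record which level changed
    (digit a_level sits at index k - level) and the digit's new value."""
    nhi = []
    for prev, nxt in zip(path, path[1:]):
        changed = [i for i in range(len(prev)) if prev[i] != nxt[i]]
        assert len(changed) == 1, "consecutive BCube hops differ in exactly one digit"
        pos = changed[0]
        level = len(prev) - 1 - pos
        nhi.append(f"{level}:{nxt[pos]}")
    return nhi

# Path(00,13) from the slide: 00 -> 02 -> 22 -> 23 -> 13
print(nhi_array([(0, 0), (0, 2), (2, 2), (2, 3), (1, 3)]))  # ['0:2', '1:2', '0:3', '1:1']
```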
82
Routing to external networks
Ethernet has a two-level link-rate hierarchy: 1G for end hosts and 10G for the uplink aggregator. So far we have focused on routing packets inside a BCube network; now consider how internal servers communicate with external computers in the Internet or in other containers. BCube uses aggregators and gateways for external communication: an aggregator is simply a commodity layer-2 switch with 10G uplinks, and the servers connected to it act as gateways. [Figure: a 10G aggregator attached above the level-1 switches; servers 01, 11, 21, 31 act as gateways.]
83
Implementation
Like DCell, BCube has an implementation, done by MSRA: the stack is a kernel driver in Windows Server 2003, located between the TCP/IP protocol driver and the Ethernet NDIS (Network Driver Interface Specification) driver. The BCube driver sits at layer 2.5: to the TCP/IP driver it is an NDIS driver, and to the real Ethernet driver it is a protocol driver; TCP/IP applications are therefore compatible with BCube since they see only TCP/IP. [Figure: software stack (BCube configuration app, TCP/IP protocol driver, intermediate BCube driver with packet send/receive, BSR path probing and selection, packet forwarding, neighbor maintenance, flow-path cache, available-bandwidth calculation, Ethernet miniport driver) over the server ports of an Intel PRO/1000 PT Quad Port Server Adapter or NetFPGA hardware.]
84
Bandwidth-intensive application support
[Figure: per-server throughput of the bandwidth-intensive application support experiments under different traffic patterns.] BCube performs much better than the tree structure.
85
Support for all-to-all traffic
[Figure: total throughput for all-to-all traffic; the green curve is BCube and the red curve is the tree structure.]
86
Related work: a performance (speedup) comparison between BCube and other data center network structures. BCube performs very well and outperforms the other structures.
87
Outline Data Center Introduction Data Center Network Architectures
Fat-Tree VL2 DCell BCube FiConn 2018/11/21
88
Motivation
Data center networking (DCN): how to interconnect thousands or even hundreds of thousands of servers within a data center, and the corresponding routing protocols on top. Companies such as Amazon, Google, and Microsoft run large-scale data centers. Applications on DCN include online services (web search, web mail, ...) and infrastructural computations (GFS, MapReduce, ...).
89
If we can build a data center with commodity servers that have only two ports and with low-end switches, it will be easier to build data center testbeds and more academic research can get involved. Current commodity servers have two networking ports: one for the network connection and one for backup. Can existing solutions solve this? No: the current tree-based practice requires expensive high-end switches at the top levels, the number of servers in Fat-Tree is limited by the number of ports at a switch, and DCell typically requires about 4 ports per server.
90
Outline Motivation Structure Routing Evaluation Related Work
Conclusion. We now introduce the design of the FiConn structure.
91
Structure of FiConn Basic construction unit
FiConn_0, the basic construction unit, is n servers connected to a commodity switch. Higher-level FiConns are constructed recursively: a FiConn_{k-1} uses half of its available backup ports to construct a FiConn_k. Every server uses two ports: one connected to the switch in its FiConn_0, and one connected to a higher-level server.
92
This figure shows the process of constructing FiConn. A switch connected to four servers is the FiConn_0 in this example. Since there are four available backup ports in a FiConn_0, half of them (two servers) are chosen to connect to other FiConn_0s, so three FiConn_0s form a FiConn_1 and are connected as a mesh. Of the 12 servers in the FiConn_1, 6 still have available backup ports; again half of them (three servers) connect to other FiConn_1s, and the four FiConn_1s form a mesh to build the FiConn_2. The general construction algorithm is given in the paper.
93
Basic Properties
Theorem 1: the total number of servers in a FiConn_k is larger than 2^(k+2) * (n/4)^(2^k), so the server number increases double-exponentially with k; for n=16 and k=3, N = 3,553,776 (see the sketch below). Theorem 2: the average server node degree of a FiConn_k is 2 - 1/2^k. Theorem 3: if L_l denotes the number of level-l links, then L_0 = 4*L_1 and L_l = 2*L_{l+1} for l > 0. Proofs are in the paper.
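A quick sketch of Theorem 1's growth, following the construction rule (each FiConn_{k-1} gives half of its remaining backup ports to level-k links, so b/2 + 1 copies form a complete graph); it reproduces the N = 3,553,776 example.

```python
def ficonn_servers(n, k):
    """Servers in a FiConn_k. Start from N_0 = n servers and b_0 = n free
    backup ports; at each level, b/2 + 1 FiConn_{k-1}s form a complete graph,
    and each copy keeps b/2 free backup ports for the next level."""
    servers, backup = n, n
    for _ in range(k):
        copies = backup // 2 + 1
        servers, backup = servers * copies, (backup // 2) * copies
    return servers

print(ficonn_servers(16, 3))   # 3,553,776 -- the n=16, k=3 example above
print(ficonn_servers(32, 2))   # 74,528 -- the simulation topology used later
```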
94
Routing in FiConn Traffic Oblivious Routing (TOR)
Traffic-Oblivious Routing (TOR): simple, depends only on the FiConn structure. Traffic-Aware Routing (TAR): aims to better utilize link capacities based on traffic state. Design principles: do not depend on central server(s), and do not exchange link states among servers, because the server population is so large.
95
TOR – Traffic Oblivious Routing
TOR exploits the level-based structure of FiConn: first route up to the lowest common FiConn level of the two servers, then down to the destination. Example: src [2,1], dst [0,1]; the TOR path is [2,1], [2,0], [0,2], [0,1].
96
Load Balancing under TOR
In a typical TOR path, the number of level-l links is twice that of level-(l+1) links when l > 0, and four times that of level-(l+1) links when l = 0. Recalling Theorem 3 (L_0 = 4*L_1 and L_l = 2*L_{l+1} for l > 0), TOR therefore makes balanced use of the different levels of FiConn links.
97
More Properties of FiConn using TOR
Theorem 4: The diameter of FiConn is at most: (typically, k<=3) Or Theorem 5: The bisection width of FiConn is at least: Using TOR, we can get more properties of FiConn. The diameter of FiConn is at most 2 power k+1 minus 1. Typically, k is less than or equal to 3. So the diameter of FiConn is a small number. Or, we can see the diameter of FiConn is O(Log N), where N is the total number of servers in FiConn. On the other hand, the bisection with of FiConn is O N dividing log N.
98
Drawbacks of TOR: a pair of servers cannot use both ports on each server to double the end-to-end throughput, and TOR cannot better utilize link capacities based on traffic state. These motivate TAR, traffic-aware routing.
99
TAR – Basic Idea Working process Path establishment
Working process: a per-flow probing packet is periodically sent out by the source server to establish the TAR path, and each flow is then routed along the path established by its probing packet. Path establishment uses a greedy approach: each intermediate server seeks to balance the traffic between its two outgoing links; if the outgoing link chosen by TOR is the high-level (level-l) link but the level-0 link has higher available bandwidth, the level-l link is bypassed by using a third FiConn_{l-1} as a relay.
100
TAR – Basic Idea (Cont.)
Scenario: src [2,1], dst [0,1], and there is already one flow from [2,0] to [0,2]. FiConn_0[1] is used for bypassing, so the TAR path is [2,1], [2,2], [1,2], [1,0], [0,0], [0,1]. Server [1,2] is the proxy server: the entry server of the relaying FiConn_0.
101
TAR - Challenges
Routing back: based on TOR, server [2,2] would route the packet back to [2,0] because the destination is [0,1]. Multiple bypassing: if there is also a flow from [2,2] to [1,2], the packet could fall into a loop between [2,2] and [2,0]. Path redundancy: server [2,0] can actually be removed from the TAR path. Imbalance trap: once there is one flow from [2,0] to [0,2], all subsequent flows from FiConn_0[0] to FiConn_0[2] will bypass the level-1 link of [2,0], which itself causes imbalance.
102
TAR - Techniques
Progressive Route (PR), for challenges 1 and 2: a PR field in the header of the probing packet records the proxy server and the number of bypasses. Source ReRoute (SRR), for challenge 3: when a proxy server is used to bypass, the packet is backed off to the source server to reroute. Virtual Flow (VF), for challenge 4: when a flow bypasses a high-level link the VF counter is incremented by one, and when a flow is routed over the level-0 link it is decremented by one.
103
TAR - Algorithm: this slide gives the algorithm for TAR; for details, please refer to the paper.
104
Simulation setup. Topology: a FiConn_2 with n=32, i.e., N=74,528 servers in total. Traffic patterns: random traffic (N/2 randomly chosen servers communicate with the other N/2) and burst traffic (all-to-all between two FiConn_1s). We evaluate the aggregate throughput under TOR and TAR.
105
Evaluation Results Aggregate throughput under TOR and TAR
Under the random traffic pattern, the aggregate throughput of TOR and TAR is almost the same; TOR is slightly higher because the average path length in TAR is usually a little longer. Under the burst traffic pattern, the aggregate throughput of TAR is much higher than that of TOR: in TOR only one level-2 link can be used to route all flows between the two FiConn_1s, whereas TAR can leverage many other paths by using servers in the other FiConn_1s as relays. [Figures: aggregate throughput under random traffic and under burst traffic.]
106
Q & A 2018/11/21