6.829: Computer Networks. Data Center Networks. Mohammad Alizadeh, Fall 2018. Slides adapted from presentations by Albert Greenberg and Changhoon Kim (Microsoft).
What are Data Centers? Large facilities with 10s of thousands of networked servers, with compute, storage, and networking working in concert: "Warehouse-Scale Computers". Huge investment: ~$0.5 billion for a large datacenter.
Data Center Costs
Amortized cost*   Component              Sub-components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit
The Cost of a Cloud: Research Problems in Data Center Networks. SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.
*3-yr amortization for servers, 15-yr for infrastructure; 5% cost of money.
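As a rough illustration of how amortization with a cost of money turns capex into the shares above, here is a minimal sketch using the standard annuity formula; the capex figures and the function name are hypothetical placeholders for illustration, not numbers from the paper.

```python
# Hypothetical sketch: amortized monthly cost of a capital expense using the
# standard annuity formula with a 5% annual cost of money.
# The capex numbers below are made-up placeholders, not figures from the paper.

def monthly_amortized_cost(capex, years, annual_rate=0.05):
    """Monthly payment that repays `capex` over `years` at `annual_rate`."""
    r = annual_rate / 12.0          # monthly interest rate
    n = years * 12                  # number of monthly payments
    return capex * r / (1 - (1 + r) ** -n)

servers = monthly_amortized_cost(200e6, 3)    # servers amortized over 3 years
infra   = monthly_amortized_cost(100e6, 15)   # power infrastructure over 15 years
print(f"servers: ${servers/1e6:.1f}M/month, infrastructure: ${infra/1e6:.1f}M/month")
```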
Server Costs. 30% utilization is considered "good" in most data centers! Uneven application fit: each server has CPU, memory, and disk, but most applications exhaust one resource, stranding the others. Uncertainty in demand: demand for a new service can spike quickly. Risk management: not having spare servers to meet demand brings failure just when success is at hand.
Goal: Agility – Any Service, Any Server. Turn the servers into a single large fungible pool: dynamically expand and contract each service's footprint as needed. Benefits: lower cost (higher utilization); higher developer productivity (services can be deployed much faster); high performance and reliability.
Datacenter Networks: 10,000s of ports connecting compute and storage (disk, flash, …). Goal: provide the illusion of "One Big Switch".
Datacenter Traffic Growth. Today: Petabits/s inside one DC, more than the core of the Internet! Source: "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network", SIGCOMM 2015.
Latency is King. A traditional application keeps its data structures on a single machine (<< 1µs latency between app logic and data). A large-scale web application spreads the app tier and data tier across the DC fabric (10µs-1ms latency). One user request (e.g., Alice's page: who does she know? what has she done? pics, apps, videos) fans out into 1000s of messages over the DC network, so microseconds of latency matter, even at the tail (e.g., the 99.9th percentile). Based on a slide by John Ousterhout (Stanford).
Conventional DC Network Problems
Conventional DC Network. [Topology figure: Internet at the top, connected to core routers (CR, L3); access routers (AR, L3) form the DC layer-3 boundary; Ethernet switches (S, L2) connect racks of application servers (A); ~1,000 servers per pod == one IP subnet.] L2 pros and cons? L3 pros and cons? Reference: "Data Center: Load Balancing Data Center Services", Cisco 2004.
Reminder: Layer 2 vs. Layer 3. Ethernet switching (layer 2): fixed IP addresses and auto-configuration (plug & play); seamless mobility, migration, and failover; but broadcast (ARP) limits scale, and there is no multipath (Spanning Tree Protocol). IP routing (layer 3): scalability through hierarchical addressing; multipath routing through equal-cost multipath; but hosts can't migrate without changing IP address, and configuration is complex.
Conventional DC Network Problems. [Topology figure: oversubscription of ~5:1 at the ToR uplinks, ~40:1 at the aggregation layer, ~200:1 at the core routers.] Extremely limited server-to-server capacity.
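To make "extremely limited" concrete, here is a minimal sketch of the arithmetic; the 1 Gbps server NIC speed is an assumption for illustration, not stated on the slide.

```python
# Hypothetical sketch: per-server bandwidth that survives each level of
# oversubscription in the conventional tree. Assumes 1 Gbps server NICs.
nic_gbps = 1.0
oversub = {"ToR uplink (~5:1)": 5, "aggregation (~40:1)": 40, "core (~200:1)": 200}

for level, ratio in oversub.items():
    print(f"{level}: ~{nic_gbps / ratio * 1000:.0f} Mbps per server")
# At ~200:1, two servers in different pods may get only a few Mbps of the
# 1 Gbps their NICs could deliver.
```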
Conventional DC Network Problems (cont.). [Same topology figure, with servers partitioned into IP subnet (VLAN) #1 and IP subnet (VLAN) #2.] Extremely limited server-to-server capacity; resource fragmentation.
And More Problems … [Same topology figure with the two IP subnets (VLANs).] Complicated manual L2/L3 re-configuration.
VL2 Paper. Measurements; VL2 design: Clos topology, Valiant load balancing, name/location separation (a precursor to network virtualization). http://research.microsoft.com/en-US/news/features/datacenternetworking-081909.aspx
VL2 Goals: The Illusion of a Huge L2 Switch. 1. L2 semantics. 2. Uniform high capacity. 3. Performance isolation. [Figure: many servers (A) attached to one giant virtual switch.]
VL2 Clos Topology: offer huge capacity via multiple paths (scale out, not up). [Figure: intermediate (Int) switches at the top, aggregation (Aggr) switches below, ToRs at the bottom, 20 servers per ToR.]
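A minimal sketch of how a VL2-style Clos scales, following the dimensioning described in the paper (aggregation switches with D_A ports, intermediate switches with D_I ports, 2 uplinks per ToR, 20 servers per ToR); treat the formula as a paraphrase from memory and the 72-port switch counts as example values, not the paper's exact configuration.

```python
# Sketch of VL2-style Clos sizing: each aggregation switch splits its D_A ports
# half down (to ToRs) and half up (to intermediates); each ToR has 2 uplinks.
def vl2_scale(d_a, d_i, servers_per_tor=20):
    n_int  = d_a // 2                 # intermediate switches
    n_aggr = d_i                      # aggregation switches (one port per intermediate)
    n_tor  = d_a * d_i // 4           # ToRs (2 uplinks each)
    return n_int, n_aggr, n_tor, n_tor * servers_per_tor

n_int, n_aggr, n_tor, n_servers = vl2_scale(d_a=72, d_i=72)
print(f"{n_int} intermediates, {n_aggr} aggregates, {n_tor} ToRs, {n_servers} servers")
# With 72-port switches: 36 intermediates, 72 aggregates, 1296 ToRs, ~25,920 servers.
```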
Facebook's data center fabric: https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/ and https://youtu.be/mLEawo6OzFM?t=56
Building Block: Merchant Silicon Switching Chips. [Images: a switch ASIC, Facebook's "6-pack" and "Wedge" switches. Images courtesy of Facebook.]
ECMP Load Balancing. Pick among equal-cost paths by a hash of the 5-tuple: randomized load balancing that preserves packet order (e.g., H(f) % 3 selects one of 3 uplinks). Problems? Hash collisions.
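A minimal sketch of ECMP next-hop selection by hashing the 5-tuple; the hash function and field encoding here are illustrative, real switches use vendor-specific hardware hashes.

```python
import hashlib

# Illustrative ECMP: pick an uplink by hashing the flow's 5-tuple.
# All packets of a flow hash identically, so packet order is preserved,
# but two large flows can collide onto the same uplink.
def ecmp_next_hop(src_ip, dst_ip, proto, sport, dport, n_paths):
    key = f"{src_ip},{dst_ip},{proto},{sport},{dport}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths   # H(f) % n_paths

print(ecmp_next_hop("10.0.0.4", "10.0.0.6", "tcp", 45321, 80, n_paths=3))
```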
Do Hash Collisions Matter? The VL2 paper argues that probabilistic flow distribution works well enough, because flows are numerous and not huge (no elephants), and commodity switch-to-switch links are substantially thicker (~10x) than the maximum thickness of a flow.
Impact of Fabric vs. NIC Speed Differential. Three non-oversubscribed topologies: 5 racks, 100 x 10G servers; each rack has 20 x 10Gbps downlinks and uplinks of either 20 x 10Gbps, 5 x 40Gbps, or 2 x 100Gbps. Link speed is an important decision for network scale & cost; here we ask: what is the impact on performance?
Impact of Fabric vs. NIC Speed Differential: "web search" workload. The 40Gbps and 100Gbps fabrics achieve roughly the same FCT as an ideal output-queued (OQ) switch; the 10Gbps fabric's FCT is up to 40% worse than OQ.
Intuition: higher-speed links improve probabilistic routing. Example: 11 x 10Gbps flows (55% load). With 20 x 10Gbps uplinks, the probability that every flow gets 100% throughput is 3.27%; with 2 x 100Gbps uplinks it is 99.95%.
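A minimal Monte-Carlo sketch of this intuition, assuming a flow gets full throughput iff the total demand hashed onto its uplink fits within the link capacity; the estimates land close to, but will not exactly reproduce, the slide's numbers.

```python
import random

# Estimate the probability that 11 x 10Gbps flows all get full throughput
# when hashed uniformly onto the uplinks (i.e., no uplink is overloaded).
def prob_full_throughput(n_flows=11, flow_gbps=10, n_links=20, link_gbps=10,
                         trials=200_000):
    ok = 0
    for _ in range(trials):
        load = [0.0] * n_links
        for _ in range(n_flows):
            load[random.randrange(n_links)] += flow_gbps
        if max(load) <= link_gbps:
            ok += 1
    return ok / trials

print("20 x 10G uplinks :", prob_full_throughput(n_links=20, link_gbps=10))   # ~0.033
print("2 x 100G uplinks :", prob_full_throughput(n_links=2, link_gbps=100))   # ~0.999
```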
DC Load Balancing Research. Centralized: [Hedera, Planck, Fastpass, …]. Distributed, in-network: congestion-oblivious [ECMP, WCMP, packet-spray, …] or congestion-aware [Flare, TeXCP, CONGA, DeTail, HULA, LetFlow, …]. Distributed, host-based: congestion-oblivious [Presto, Juggler] or congestion-aware [MPTCP, MP-RDMA, Clove, FlowBender, …].
VL2 Performance Isolation Why does VL2 need TCP for performance isolation?
Addressing and Routing: Name-Location Separation. VL2 switches run link-state routing and maintain only switch-level topology; this allows the use of low-cost switches and protects the network from host-state churn. Servers use flat names (application addresses), and a directory service maps each name to the ToR where the server currently lives (e.g., x -> ToR2, y -> ToR3, z -> ToR3; after z migrates, z -> ToR4). Senders resolve the destination's ToR via the directory (lookup & response) and tunnel packets to it.
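A minimal sketch of the name-location separation idea, assuming a directory that maps a server's flat application address (AA) to the locator address (LA) of its current ToR; the class name and in-memory dict stand in for VL2's replicated directory service.

```python
# Illustrative directory service: flat application addresses (AAs) map to the
# locator address (LA) of the ToR currently hosting the server. Switches only
# know switch-level topology; only the directory tracks host placement.
class Directory:
    def __init__(self):
        self.aa_to_la = {}                     # AA -> ToR LA

    def register(self, aa, tor_la):            # called when a server boots or migrates
        self.aa_to_la[aa] = tor_la

    def lookup(self, aa):
        return self.aa_to_la[aa]

d = Directory()
d.register("20.0.0.56", "10.0.0.6")            # z lives behind ToR 10.0.0.6
d.register("20.0.0.56", "10.0.0.8")            # z migrates: only the directory changes
print(d.lookup("20.0.0.56"))                   # -> 10.0.0.8; z keeps its flat AA
```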
VL2 Agent in Action. [Figure: the VL2 agent on the source double-encapsulates each packet: the outer header is addressed to an anycast LA shared by the intermediate switches (e.g., 10.1.1.1), the inner header to the LA of the destination's ToR (e.g., 10.0.0.6), and the original packet with its AA addresses (e.g., 20.0.0.55 -> 20.0.0.56) rides as the payload. A hash of the flow, H(ft), is placed in the source address field.] Why use a hash for the src IP? Why anycast & double encapsulation? (VLB + ECMP)
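A minimal sketch of the agent's double encapsulation; the addresses reuse the slide's example, and the representation of headers as nested dicts is purely illustrative.

```python
import hashlib

# Illustrative VL2 agent: wrap the original AA packet in two outer headers.
# Outer dst = anycast LA shared by the intermediate switches (VLB bounce);
# inner dst = LA of the destination's ToR (from the directory lookup).
# A flow hash goes in the outer src so ECMP spreads different flows across
# paths while all packets of one flow take the same path (no reordering).
def encapsulate(payload, src_aa, dst_aa, dst_tor_la,
                intermediate_anycast_la="10.1.1.1"):
    flow_hash = int.from_bytes(
        hashlib.md5(f"{src_aa}-{dst_aa}".encode()).digest()[:4], "big")
    inner = {"src": src_aa, "dst": dst_tor_la,
             "payload": {"dst": dst_aa, "data": payload}}
    outer = {"src": f"hash:{flow_hash}", "dst": intermediate_anycast_la,
             "payload": inner}
    return outer

pkt = encapsulate("hello", src_aa="20.0.0.55", dst_aa="20.0.0.56",
                  dst_tor_la="10.0.0.6")
print(pkt["dst"], "->", pkt["payload"]["dst"])   # anycast Int LA, then dest ToR LA
```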
VL2 Topology. [Figure: Clos network of intermediate (Int) and aggregation (Aggr) switches; each ToR connects 20 servers.]
VL2 Directory System. Read-optimized directory servers handle lookups: low latency, high throughput, high availability for a high lookup rate. Write-optimized replicated state machines (RSMs) handle updates: a strongly consistent, reliable store of AA-to-LA mappings. Stale mappings? Cache updates are reactive: a stale host mapping needs to be corrected only when it is used to deliver traffic. Non-deliverable packets are forwarded to a directory server, which corrects the stale mapping in the source's cache via unicast.
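A minimal sketch of the reactive cache-update path described above; the function and class names and the message flow are illustrative, not VL2's actual RPC interface.

```python
# Illustrative reactive cache update: a stale AA->LA mapping is corrected only
# when it is used to deliver traffic. A ToR that cannot deliver a packet punts
# it to the directory, which unicasts the fresh mapping back to the source.
directory = {"20.0.0.56": "10.0.0.8"}          # authoritative AA -> ToR LA (RSM)

class SourceCache:
    def __init__(self):
        self.cache = {"20.0.0.56": "10.0.0.6"} # stale: host has since migrated

    def on_correction(self, aa, fresh_la):     # unicast update from the directory
        self.cache[aa] = fresh_la

def tor_receive(tor_la, aa, src_cache):
    """Called at the ToR named by the sender's (possibly stale) cached mapping."""
    if directory.get(aa) != tor_la:            # host is no longer behind this ToR
        src_cache.on_correction(aa, directory[aa])   # directory corrects the source
        return "corrected, resend via " + directory[aa]
    return "delivered"

cache = SourceCache()
print(tor_receive("10.0.0.6", "20.0.0.56", cache))   # stale path triggers correction
print(cache.cache["20.0.0.56"])                       # now 10.0.0.8
```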
Other details. How does L2 broadcast work? There is no ARP: the directory system answers address resolution. DHCP is intercepted at the ToR using conventional DHCP relay agents and unicast-forwarded to DHCP servers. Remaining broadcast becomes rate-limited IP multicast, one group per service. How does Internet communication work? The fabric is an L3 routing fabric, so external traffic does not need its headers rewritten; servers that talk to the Internet have both an LA and an AA, and the LA is externally reachable and announced via BGP. Traffic flows Internet <-> intermediate switches <-> server with LA.