1
Data Center Networks
6.829: Computer Networks, Mohammad Alizadeh, Fall 2018
Slides adapted from presentations by Albert Greenberg and Changhoon Kim (Microsoft)
2
What are Data Centers?
Large facilities with 10s of thousands of networked servers
Compute, storage, and networking working in concert
“Warehouse-Scale Computers”
Huge investment: ~$0.5 billion for a large datacenter
3
Data Center Costs
Amortized Cost* | Component | Sub-Components
~45% | Servers | CPU, memory, disk
~25% | Power infrastructure | UPS, cooling, power distribution
~15% | Power draw | Electrical utility costs
~15% | Network | Switches, links, transit
*3-year amortization for servers, 15-year for infrastructure; 5% cost of money
Source: “The Cost of a Cloud: Research Problems in Data Center Networks.” Greenberg, Hamilton, Maltz, Patel. SIGCOMM CCR 2009.
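The footnote's amortization can be made concrete with a short sketch. Only the 3-year/15-year terms and the 5% cost of money come from the slide; the annuity formula and the dollar amounts below are illustrative assumptions, not figures from the paper.

```python
def monthly_amortized_cost(capex, years, annual_rate=0.05):
    """Spread a capital expense over its amortization period using a
    standard annuity (fixed-payment loan) formula, which charges the
    cost of money on the outstanding balance."""
    months = years * 12
    r = annual_rate / 12
    return capex * r / (1 - (1 + r) ** -months)

# Hypothetical capex figures, for illustration only.
servers_per_month = monthly_amortized_cost(200e6, years=3)   # servers: 3-yr amortization
infra_per_month = monthly_amortized_cost(100e6, years=15)    # power infrastructure: 15-yr
print(f"servers: ~${servers_per_month / 1e6:.1f}M/month")
print(f"infrastructure: ~${infra_per_month / 1e6:.1f}M/month")
```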
4
Server Costs
30% utilization considered “good” in most data centers!
Uneven application fit: each server has CPU, memory, disk; most applications exhaust one resource, stranding the others
Uncertainty in demand: demand for a new service can spike quickly
Risk management: not having spare servers to meet demand brings failure just when success is at hand
5
Goal: Agility – Any service, Any Server
Turn the servers into a single large fungible pool
Dynamically expand and contract service footprint as needed
Benefits:
Lower cost (higher utilization)
Increased developer productivity (services can be deployed much faster)
High performance and reliability
6
Datacenter Networks
10,000s of ports; compute; storage (disk, flash, …)
Provide the illusion of “One Big Switch”
7
Datacenter Traffic Growth
Today: Petabits/s in one DC, more than the core of the Internet!
Source: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network”, SIGCOMM 2015.
8
Latency is King
One user request fans out into 1000s of messages over the DC network.
Traditional application: application logic and data structures on a single machine (<< 1µs latency).
Large-scale web application: app tier and data tier spread across the data center fabric (10µs-1ms latency).
Microseconds of latency matter, even at the tail (e.g., 99.9th percentile).
Based on a slide by John Ousterhout (Stanford).
9
Conventional DC Network Problems
10
Conventional DC Network
Hierarchical topology: the Internet connects to Core Routers (CR), which connect to Access Routers (AR) at DC-Layer 3; below them, Ethernet switches (S) and racks of application servers (A) form DC-Layer 2. ~1,000 servers/pod == one IP subnet.
Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers
Discussion: L2 pros, cons? L3 pros, cons?
Reference – “Data Center: Load balancing Data Center Services”, Cisco 2004
11
Reminder: Layer 2 vs. Layer 3
Ethernet switching (Layer 2)
+ Fixed IP addresses and auto-configuration (plug & play)
+ Seamless mobility, migration, and failover
- Broadcast limits scale (ARP)
- No multipath (Spanning Tree Protocol)
IP routing (Layer 3)
+ Scalability through hierarchical addressing
+ Multipath routing through equal-cost multipath (ECMP)
- Can’t migrate without changing IP address
- Complex configuration
12
Conventional DC Network Problems
Same hierarchy (CRs, ARs, switches, racks of servers), with heavy oversubscription at every level: ~5:1 at the ToR uplinks, ~40:1 at the aggregation layer, ~200:1 at the core.
Extremely limited server-to-server capacity
13
Conventional DC Network Problems
Same hierarchy, with servers partitioned into IP subnet (VLAN) #1 and IP subnet (VLAN) #2 under different access routers.
Extremely limited server-to-server capacity
Resource fragmentation
14
And More Problems …
Complicated manual L2/L3 re-configuration across the same hierarchy of core routers, access routers, switches, and IP subnets (VLANs)
15
VL2 Paper
Measurements
VL2 design:
Clos topology
Valiant load balancing (VLB)
Name/location separation (precursor to network virtualization)
16
VL2 Goals: The Illusion of a Huge L2 Switch
1. L2 semantics
2. Uniform high capacity
3. Performance isolation
17
Clos Topology
Offer huge capacity via multiple paths (scale out, not up)
VL2 topology: Intermediate (Int) switches at the top, Aggregation (Aggr) switches in the middle, and ToRs at the bottom, each connecting 20 servers.
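As a rough illustration of scale-out, the sketch below sizes a VL2-style folded Clos from switch port counts, following my reading of the paper's construction (each ToR has 2 uplinks; aggregation switches split their ports half-down, half-up). The 144-port radix is an assumption for the example, not the paper's deployment.

```python
def vl2_clos_size(d_a, d_i, servers_per_tor=20):
    """Size a VL2-style folded Clos: D_I aggregation switches with D_A
    ports each (half down to ToRs, half up), D_A/2 intermediate switches
    with D_I ports each, and 2 uplinks per ToR."""
    n_int = d_a // 2
    n_aggr = d_i
    n_tor = (d_a * d_i) // 4
    return {
        "intermediate": n_int,
        "aggregation": n_aggr,
        "ToRs": n_tor,
        "servers": n_tor * servers_per_tor,
    }

# Example with 144-port switches (an assumed radix).
print(vl2_clos_size(d_a=144, d_i=144))
# {'intermediate': 72, 'aggregation': 144, 'ToRs': 5184, 'servers': 103680}
```

The point of the calculation: aggregate capacity grows with the product of switch radices, so huge networks can be built from many small commodity switches rather than a few large ones.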
18
https://code.facebook.com
19
Building Block: Merchant Silicon Switching Chips
Switch ASIC; Facebook Wedge; Facebook “6-pack”. Images courtesy of Facebook.
20
ECMP Load Balancing
Pick among equal-cost paths by a hash of the 5-tuple, e.g., path = H(f) % 3 when there are three uplinks
Randomized load balancing
Preserves packet order
Problems? Hash collisions
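A minimal sketch of the per-flow hashing that ECMP performs: all packets of a flow carry the same 5-tuple, so they take the same next hop (preserving order), while different flows spread across paths. The hash function and field layout here are illustrative, not a specific switch ASIC's implementation.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, num_paths):
    """Pick an equal-cost path from a hash of the flow's 5-tuple.
    Every packet of the flow gets the same path, so packet order is
    preserved; unlucky flows can still collide onto one path."""
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Two different flows may or may not share an uplink (hash collision).
print(ecmp_next_hop("10.0.0.1", "10.0.1.2", "tcp", 5123, 80, num_paths=3))
print(ecmp_next_hop("10.0.0.3", "10.0.1.2", "tcp", 6001, 80, num_paths=3))
```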
21
Do Hash Collisions Matter?
VL2 paper: probabilistic flow distribution would work well enough, because …
Flows are numerous and not huge – no elephants
Commodity switch-to-switch links are substantially thicker (~10x) than the maximum thickness of a flow
22
Impact of Fabric vs. NIC Speed Differential
Three non-oversubscribed topologies: 5 racks, 100×10G servers; each rack has 20×10Gbps downlinks, with uplinks of either 20×10Gbps, 5×40Gbps, or 2×100Gbps.
Link speed is an important decision for network scale & cost. Here, we ask: what is the impact on performance?
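A quick arithmetic check of the "non-oversubscribed" claim, assuming the setup above (5 racks, 20 servers per rack, 10 Gbps NICs): all three uplink options match the 200 Gbps of server-facing capacity per rack.

```python
# Per-rack oversubscription check for the three fabric options.
servers_per_rack, nic_gbps = 20, 10
downlink_gbps = servers_per_rack * nic_gbps          # 200 Gbps toward servers

uplink_options = {
    "20 x 10G": 20 * 10,
    "5 x 40G":  5 * 40,
    "2 x 100G": 2 * 100,
}
for name, up_gbps in uplink_options.items():
    print(f"{name}: oversubscription = {downlink_gbps / up_gbps:.1f}:1")  # 1.0:1 for all
```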
23
Impact of Fabric vs. NIC Speed Differential
“Web search” workload
40/100Gbps fabric: ~same flow completion time (FCT) as an ideal output-queued (OQ) switch
10Gbps fabric: FCT up to 40% worse than OQ
24
Intuition Higher speed links improve probabilistic routing
11×10Gbps flows (55% load) hashed across the uplinks:
20×10Gbps uplinks: Prob of 100% throughput = 3.27%
2×100Gbps uplinks: Prob of 100% throughput = 99.95%
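The 3.27% figure can be reproduced exactly by counting hash assignments in which no two flows share a 10G uplink; the 100G case depends only on the (rare) event that all flows hash onto one uplink. A small sketch of both calculations; my counting for the second case gives roughly 99.9%, close to the slide's 99.95%.

```python
from math import perm

n_flows = 11                                   # 11 x 10 Gbps flows (55% load)

# 20 x 10G uplinks: 100% throughput only if every flow gets its own uplink.
p_10g = perm(20, n_flows) / 20 ** n_flows
print(f"20 x 10G uplinks: P(full throughput) = {p_10g:.2%}")      # ~3.27%

# 2 x 100G uplinks: each uplink absorbs up to ten 10G flows, so throughput
# drops only if all 11 flows hash onto the same uplink.
p_100g = 1 - 2 * 0.5 ** n_flows
print(f"2 x 100G uplinks: P(full throughput) = {p_100g:.2%}")     # ~99.9%
```

The intuition: with links much faster than any single flow, many flows can share one link without loss, so random hashing almost never creates a hotspot.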
25
DC Load Balancing Research
Centralized [Hedera, Planck, Fastpass, …]
Distributed:
  In-Network:
    Congestion-oblivious [ECMP, WCMP, packet-spray, …]
    Congestion-aware [Flare, TeXCP, CONGA, DeTail, HULA, LetFlow, …]
  Host-Based:
    Congestion-oblivious [Presto, Juggler]
    Congestion-aware [MPTCP, MP-RDMA, Clove, FlowBender, …]
26
VL2 Performance Isolation
Why does VL2 need TCP for performance isolation?
27
Addressing and Routing: Name-Location Separation
VL2: Switches run link-state routing and maintain only the switch-level topology.
Allows use of low-cost switches
Protects the network from host-state churn
Servers use flat names; a Directory Service maps each server name to its current ToR (e.g., x → ToR2, y → ToR3, z → ToR3; after z moves, z → ToR4).
On a lookup & response, the directory returns the destination's current ToR and the payload is tunneled there.
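A minimal sketch of the name/location split, assuming a simple in-memory directory: application addresses (AAs) stay fixed while the locator (the ToR) can change when a server or VM moves. Names and the API shape are illustrative, not VL2's implementation.

```python
# Directory maps flat application addresses (AAs) to locators (ToR addresses).
directory = {
    "x": "ToR2",
    "y": "ToR3",
    "z": "ToR3",
}

def send(src_aa, dst_aa, payload):
    """Resolve the destination's current ToR, then tunnel the packet there.
    The underlay only ever routes on switch (ToR) addresses."""
    tor = directory[dst_aa]                  # lookup & response
    return {"outer_dst": tor, "inner_dst": dst_aa, "src": src_aa, "payload": payload}

print(send("x", "y", b"hello"))              # tunneled to ToR3

directory["z"] = "ToR4"                      # z migrates; only the directory changes
print(send("x", "z", b"hi"))                 # now tunneled to ToR4, AA unchanged
```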
28
VL2 Agent in Action
The VL2 agent double-encapsulates each packet: the original packet (src AA, dst AA, payload) is wrapped with the destination ToR's address, then with an Intermediate switch's (anycast) address, and the flow hash H(ft) is carried in the outer source IP field.
VLB: bounce the flow off a randomly chosen Intermediate switch via the anycast outer header.
ECMP: spreads flows across the equal-cost paths toward that Intermediate switch.
Questions: Why use a hash for the source IP? Why anycast & double encapsulation?
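A sketch of the double encapsulation described above. The field names, addresses, and anycast locator are illustrative assumptions; the idea is that the outer headers carry only switch locators, and the per-flow hash in the outer source lets switch ECMP spread flows without reordering packets within a flow.

```python
import hashlib

INT_ANYCAST_LA = "20.0.0.1"        # anycast locator shared by Intermediate switches (illustrative)

def flow_hash(five_tuple):
    """Hash of the inner flow's 5-tuple, reused for every packet of the flow."""
    digest = hashlib.sha256(",".join(map(str, five_tuple)).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def vl2_encapsulate(inner_pkt, dst_tor_la):
    """Double encapsulation done by the sending VL2 agent:
    inner AA packet -> tunnel to the destination ToR -> tunnel to an
    Intermediate switch (the VLB bounce)."""
    h = flow_hash(inner_pkt["five_tuple"])
    tor_hdr = {"src": h, "dst": dst_tor_la, "payload": inner_pkt}
    int_hdr = {"src": h, "dst": INT_ANYCAST_LA, "payload": tor_hdr}
    return int_hdr

pkt = {"five_tuple": ("10.1.0.5", "10.2.0.9", "tcp", 4321, 80), "payload": b"data"}
print(vl2_encapsulate(pkt, dst_tor_la="10.0.3.1"))
```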
29
VL2 Topology: Intermediate (Int) switches, Aggregation (Aggr) switches, and ToRs with 20 servers each.
30
VL2 Directory System
Read-optimized Directory Servers for lookups: low latency, high throughput, high availability for a high lookup rate.
Write-optimized Replicated State Machines (RSMs) for updates: a strongly consistent, reliable store of AA-to-LA mappings.
Stale mappings? Reactive cache updates: a stale host mapping needs to be corrected only when that mapping is used to deliver traffic. Non-deliverable packets are forwarded to a directory server, which corrects the stale mapping in the source's cache via unicast.
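A minimal sketch of the reactive correction path, reusing the toy in-memory directory idea from the earlier sketch as a stand-in for the RSM-backed store; class and ToR names are illustrative.

```python
# Authoritative AA -> locator mappings (stand-in for the RSM-backed directory).
directory = {"y": "ToR3", "z": "ToR4"}

class Vl2HostCache:
    """Per-sender cache of AA -> locator mappings with reactive correction:
    a stale entry is fixed only after a packet is reported non-deliverable
    and the directory unicasts the fresh mapping to the sender."""
    def __init__(self):
        self.cache = {}

    def resolve(self, dst_aa):
        if dst_aa not in self.cache:
            self.cache[dst_aa] = directory[dst_aa]      # normal lookup
        return self.cache[dst_aa]

    def on_non_deliverable(self, dst_aa):
        # ToR couldn't deliver; directory corrects the stale entry via unicast.
        self.cache[dst_aa] = directory[dst_aa]

sender = Vl2HostCache()
print(sender.resolve("z"))          # caches ToR4
directory["z"] = "ToR7"             # z migrates again (hypothetical ToR)
print(sender.resolve("z"))          # stale: still ToR4
sender.on_non_deliverable("z")      # reactive update after failed delivery
print(sender.resolve("z"))          # now ToR7
```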
31
Other details
How does L2 broadcast work? How does Internet communication work?
Broadcast:
-- No ARP (the directory system handles address resolution)
-- DHCP: intercepted at the ToR using conventional DHCP relay agents and unicast-forwarded to DHCP servers
-- IP multicast per service, rate-limited
Internet:
-- The fabric is an L3 routing fabric, so external traffic does not need headers rewritten
-- Servers needing external reachability have both an LA and an AA; the LA is externally reachable and announced via BGP
-- External traffic flows between the Internet and servers with LAs via the Intermediate switches