Data Center Networking
Kai Chen (HKUST)
Nov 1, 2013, USTC, Hefei
Outline
– Introduction to Data Centers
– Data Center Network Architecture
– Data Center Routing
– Data Center Transport Control
Big Data Applications and Cloud Computing
– Scientific: 200GB of astronomy data a night
– Business: 1 million customer transactions, 2.5PB of data per hour
– Social network: 60 billion photos in its user base, 25TB of log data per day
– Web search: 20PB of search data per day
– …
Data Centers as Infrastructures
– Google’s 36 worldwide data centers (2008)
– Walmart’s 125,000-square-foot data center in Arkansas (10,000s – 100,000s of servers)
– Modular data centers with 1000s of servers (20- or 40-foot container)
Data Center Network Architectures
Today’s Production DCN Structure
A production DCN structure adopted from Cisco
– Tiers, bottom to top: Top-of-Rack (ToR) switch, Aggregation switch, Core switch
– Oversubscription, growing up the tree: 1:1, 1:5 ~ 1:20, 1:240
– Communication bottleneck
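To give a feel for what these ratios mean, here is a minimal sketch of the worst-case bandwidth left per server when every server sends at once. The 1 Gbps NIC speed is an assumption, not from the slide, and so is the mapping of each ratio to a tier (read off the figure order).

```python
# Oversubscription ratios come from the slide; the NIC speed and the
# tier labels attached to each ratio are assumptions for illustration.
NIC_MBPS = 1000.0  # assumed 1 Gbps server NIC

OVERSUBSCRIPTION = {
    "same rack (ToR)": 1,
    "aggregation, best case": 5,
    "aggregation, worst case": 20,
    "across the core": 240,
}


def worst_case_share_mbps(nic_mbps, ratio):
    """Bandwidth left per server when all servers send through a 1:ratio layer."""
    return nic_mbps / ratio


shares = {
    scope: worst_case_share_mbps(NIC_MBPS, ratio)
    for scope, ratio in OVERSUBSCRIPTION.items()
}

for scope, mbps in shares.items():
    print(f"{scope:26s} {mbps:8.1f} Mbps")
```

Under these assumptions, traffic that has to cross the core is left with only a few Mbps per server, which is the bottleneck the slide points at.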
Research Community: Fattree
– Three layers: ToR, Aggregation, Core
– Rearrangeably non-blocking, 1:1 provisioning
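The slide does not spell out how the topology is sized. A minimal sketch, assuming the usual k-ary fat-tree built from identical k-port switches (k pods, each with k/2 ToR and k/2 aggregation switches, and (k/2)^2 core switches); the function name is hypothetical.

```python
def fat_tree_size(k):
    """Element counts for a k-ary fat-tree (assumed standard construction)."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 ToR switches per pod
        "aggregation_switches": k * half,  # k/2 aggregation switches per pod
        "core_switches": half * half,
        "hosts": k * half * half,          # k/2 hosts per ToR switch, i.e. k^3/4
    }


for k in (4, 24, 48):
    print(k, fat_tree_size(k))
```

With 1:1 provisioning every layer has as much uplink capacity as downlink capacity, which is why the host count is tied to the switch port count k.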
VL2
– 1:1 provisioning
BCube
[Figure legend: server, switch]
BCube
[Figure: a BCube1 built from BCube0s; legend: server, switch; switch levels: Level-0, Level-1]
Connecting rule
- The i-th server in the j-th BCube0 connects to the j-th port of the i-th level-1 switch
A BCube_k network supports $n^{k+1}$ servers
- n is the number of servers in a BCube0
- k is the level of that BCube
A server is assigned a BCube address $(a_k, a_{k-1}, \dots, a_0)$ where $a_i \in [0, n-1]$
Neighboring server addresses differ in only one digit
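A minimal sketch of the addressing scheme, assuming the standard BCube sizing of n^(k+1) servers with every address digit in [0, n-1]. The function names are hypothetical.

```python
from itertools import product


def bcube_servers(n, k):
    """All server addresses (a_k, ..., a_0) in a BCube_k with n servers per BCube0."""
    return list(product(range(n), repeat=k + 1))


def neighbors(addr, n):
    """Servers one switch hop away: addresses differing from addr in exactly one digit."""
    result = []
    for pos in range(len(addr)):
        for digit in range(n):
            if digit != addr[pos]:
                result.append(addr[:pos] + (digit,) + addr[pos + 1:])
    return result


def switches_per_level(n, k):
    """Each of the k+1 levels is assumed to hold n**k switches with n ports each."""
    return n ** k


servers = bcube_servers(4, 1)
print(len(servers))            # number of servers in a BCube_1 with n = 4
print(neighbors((0, 0), 4))    # one-hop neighbors of server 00
```

Each server has (k+1)(n-1) one-hop neighbors, one group per switch level it is attached to.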
Helios (Optical/Electrical)
Many other DCN Structures
– DCell, 2008
– MDCube, 2009
– c-Through, 2010
– Flyways (wireless), 2011
– Jellyfish, 2012
– OSA, 2012
– Mordia, 2013
– …
Data Center Routing
Fattree Routing: exploiting hierarchy on a Fattree topology
[Figure: fat-tree labeled by pod number: Pod 0, Pod 1, Pod 2, Pod 3]
Positional Pseudo MAC Addresses
PMAC = pod.position.port.vmid
Pod 0: 00:00:00:02:00:01  00:00:00:03:00:01  00:00:01:02:00:01  00:00:01:03:00:01
Pod 1: 00:01:00:02:00:01  00:01:00:03:00:01  00:01:01:02:00:01  00:01:01:03:00:01
Pod 2: 00:02:00:02:00:01  00:02:00:03:00:01  00:02:01:02:00:01  00:02:01:03:00:01
Pod 3: 00:03:00:02:00:01  00:03:00:03:00:01  00:03:01:02:00:01  00:03:01:03:00:01
Installing AMAC to PMAC mappings
Proxy-based Address Resolution (ARP)
Routing on Fattree (PortLand)
Encode the topology information into PMACs, and route based on PMAC addresses
– Each host sees its own AMAC and the remote host’s PMAC
– Ingress switch does AMAC -> PMAC rewriting for the src host
– Egress switch does PMAC -> AMAC mapping for the dst host
– Longest prefix match routing on destination PMAC
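A minimal sketch of PMAC packing and prefix-based forwarding. The field widths (16-bit pod, 8-bit position, 8-bit port, 16-bit vmid) are an assumption chosen to match the 6-byte addresses shown earlier, and the forwarding table is a hypothetical one for the edge switch at pod 1, position 0.

```python
from typing import NamedTuple


class PMAC(NamedTuple):
    pod: int       # assumed 16 bits
    position: int  # assumed 8 bits
    port: int      # assumed 8 bits
    vmid: int      # assumed 16 bits


def encode(p: PMAC) -> str:
    """Pack the four fields into 48 bits and format them as a MAC string."""
    raw = (p.pod << 32) | (p.position << 24) | (p.port << 16) | p.vmid
    return ":".join(f"{b:02x}" for b in raw.to_bytes(6, "big"))


def decode(text: str) -> PMAC:
    """Split a MAC string back into pod, position, port, and vmid."""
    raw = int(text.replace(":", ""), 16)
    return PMAC(raw >> 32, (raw >> 24) & 0xFF, (raw >> 16) & 0xFF, raw & 0xFFFF)


def longest_prefix_match(table, dst: str):
    """table maps (prefix_value, prefix_length_in_bits) to an output name."""
    raw = int(dst.replace(":", ""), 16)
    best_len, best_out = -1, None
    for (prefix, length), out in table.items():
        if length > best_len and (raw >> (48 - length)) == prefix:
            best_len, best_out = length, out
    return best_out


# Hypothetical table: 32-bit prefixes (pod.position.port) for local hosts,
# and a zero-length default prefix that sends everything else upward.
EDGE_TABLE = {
    (0x00010002, 32): "host-port-2",
    (0x00010003, 32): "host-port-3",
    (0, 0): "uplink",
}

print(encode(PMAC(pod=1, position=0, port=2, vmid=1)))
print(decode("00:03:01:03:00:01"))
print(longest_prefix_match(EDGE_TABLE, "00:02:01:03:00:01"))
```

Because location is encoded in the address itself, a switch needs only a few prefix entries rather than one entry per host.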
BCube Routing: exploiting server location in a hypercube topology
[Figure: a BCube1 built from BCube0s, with Level-0 and Level-1 switches]
BCube ID: indicates the location of a server in a BCube topology
Example paths from server 00 to server 23: 00 -> 20 -> 23, or 00 -> 03 -> 23
BCube: multi-paths for one-to-one traffic
The diameter of a BCube_k is k+1
There are k+1 parallel paths between any two servers in a BCube_k
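A minimal sketch of how a source can derive paths from BCube IDs: correct one differing digit per hop, and vary the order in which digits are corrected. This is a simplification, not the full BCube path-set algorithm. It assumes single-character digits (n <= 10), and trying every digit order only yields parallel paths when all digits differ, as in the 00 to 23 example. Function names are hypothetical.

```python
from itertools import permutations


def digit_correcting_path(src, dst, order):
    """Fix the differing digits in the given order; each fix is one server hop."""
    path = [src]
    cur = list(src)
    for pos in order:  # pos indexes the address string from the left
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]
            path.append("".join(cur))
    return path


def candidate_paths(src, dst):
    """Distinct paths obtained by trying every digit-correction order."""
    paths = []
    for order in permutations(range(len(src))):
        path = digit_correcting_path(src, dst, order)
        if path not in paths:
            paths.append(path)
    return paths


for path in candidate_paths("00", "23"):
    print(" -> ".join(path))
```

Each path has at most k+1 hops because an address has only k+1 digits to correct, which is where the diameter bound comes from.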
BCube: Server centric and source routing
[Figure: BCube1 with per-switch MAC tables and example packet headers (dst MAC, src MAC, BCube addresses, data)]
Switch MAC table (switch connecting servers 03, 13, 23, 33): MAC03 -> port 0, MAC13 -> port 1, MAC23 -> port 2, MAC33 -> port 3
Switch MAC table (switch connecting servers 20, 21, 22, 23): MAC20 -> port 0, MAC21 -> port 1, MAC22 -> port 2, MAC23 -> port 3
Server-centric BCube
- Switches never connect to other switches and only act as L2 crossbars
- Servers control routing, load balancing, fault-tolerance
Routing on BCube
Encode the topology information into BCube IDs, and route based on BCube IDs
– Source derives the routing paths from the src and dst BCube IDs
– Source writes the whole path information into packet headers
– Forwarding is based on the path information embedded in packet headers
– BCube switches are dumb; intelligence is within servers
Summary on DCN Routing
Topology-dependent (VL2, DCell, CamCube, Jellyfish, …)
– Customized to a specific topology, fully exploiting its characteristics
– Constrained: does not generalize to generic DCNs
– Some are not deployable on commodity switches
SDN/OpenFlow
– Flexible, explicit routing control
– Scalability: limited OpenFlow entries (1K – 4K)
– Heavy-weight: active path swapping
– Error-prone
Data Center Transport Control
Partition/Aggregate Application Structure
[Figure: a query for "Picasso" fans out from the top-level aggregator (TLA) through mid-level aggregators (MLA) to worker nodes; each worker returns matching quotes, which are ranked and merged on the way back up]
Deadlines shrink down the tree: Deadline = 250ms, Deadline = 50ms, Deadline = 10ms
Time is money
– Strict deadlines (SLAs)
Missed deadline
– Lower quality result
Generality of Partition/Aggregate
The foundation for many large-scale web applications.
– Web search, social network composition, ad selection, etc.
Example: Facebook
Partition/Aggregate ~ Multiget
– Aggregators: Web Servers
– Workers: Memcached Servers
[Figure: Internet -> Web Servers -> (Memcached protocol) -> Memcached Servers]
Workloads
– Partition/Aggregate (Query) [~2KB]: delay-sensitive
– Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive
– Large flows [1MB-50MB] (Data update): throughput-sensitive
Impairments
– Incast
– Queue Buildup
– Buffer Pressure
Incast
[Figure: Workers 1-4 respond to one Aggregator at the same time; a lost response ends in a TCP timeout]
RTO_min = 300 ms
Synchronized mice collide.
– Caused by Partition/Aggregate.
Queue Buildup
[Figure: Sender 1 and Sender 2 share one switch queue toward the Receiver]
Big flows build up queues.
– Increased latency for short flows.
Measurements in Bing cluster
– For 90% of packets: RTT < 1ms
– For 10% of packets: 1ms < RTT < 15ms
Data Center Transport Requirements
1. High Burst Tolerance
– Incast due to Partition/Aggregate is common
2. Low Latency
– Short flows, queries
3. High Throughput
– Continuous data updates, large file transfers
4. Meet flow deadlines
– Some flows carry deadlines (e.g., search); flows that miss their deadlines are not useful
The challenge is to achieve these together.
Tension Between Requirements
Requirements in tension: High Burst Tolerance, High Throughput, Low Latency
– Deep Buffers: queuing delays increase latency
– Shallow Buffers: bad for bursts & throughput
DCTCP
– Objective: low queue occupancy & high throughput
Many Recent DCN Transport Designs
– Reduce RTO_min, 2009
– Data Center TCP (DCTCP), 2010
– Deadline-Aware: D³ (centralized), 2011, and D² (decentralized), 2012
– Preemptive Flow Scheduling (PDQ), 2012
– pFabric
Review: The TCP/ECN Control Loop
[Figure: Sender 1 and Sender 2 share a switch queue toward the Receiver; congestion is signaled with an ECN mark (1 bit)]
ECN = Explicit Congestion Notification
DCTCP: two key ideas
1. React in proportion to the extent of congestion, not its presence.
– Reduces variance in sending rates, lowering queuing requirements.
2. Mark based on instantaneous queue length.
– Fast feedback to better deal with bursts.

ECN Marks        TCP                  DCTCP
heavy marking    Cut window by 50%    Cut window by 40%
light marking    Cut window by 50%    Cut window by 5%
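Data Center TCP Algorithm
Switch side:
– Mark packets when Queue Length > K.
[Figure: switch buffer of size B with marking threshold K; mark above K, don’t mark below]
Sender side:
– Maintain a running average of the fraction of packets marked (α).
  In each RTT: $F = \frac{\#\,\text{marked ACKs}}{\#\,\text{total ACKs}}$, $\alpha \leftarrow (1 - g)\,\alpha + g F$, where $g$ ($0 < g < 1$) is the averaging weight
– Adaptive window decrease: $\text{Cwnd} \leftarrow \left(1 - \frac{\alpha}{2}\right) \text{Cwnd}$
– Note: the window is divided by a factor between 1 and 2.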
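A minimal sketch of both sides of the algorithm, assuming the standard DCTCP update rules: α is an exponentially weighted average of the marked fraction, and the window is scaled by (1 - α/2). The weight g = 1/16 and all function names are assumptions for illustration; a real sender cuts the window only in RTTs where it saw at least one mark.

```python
def should_mark(queue_len, K):
    """Switch side: mark a packet when the instantaneous queue exceeds K."""
    return queue_len > K


def update_alpha(alpha, marked_acks, total_acks, g=1 / 16):
    """Sender side, once per RTT: running average of the marked fraction."""
    F = marked_acks / total_acks if total_acks else 0.0
    return (1 - g) * alpha + g * F


def dctcp_cwnd(cwnd, alpha):
    """DCTCP: cut in proportion to the extent of congestion."""
    return cwnd * (1 - alpha / 2)


def tcp_cwnd(cwnd):
    """Plain TCP with ECN: halve the window on any mark."""
    return cwnd / 2


# Let alpha settle while 8 out of every 10 packets are marked.
alpha = 0.0
for _ in range(200):
    alpha = update_alpha(alpha, marked_acks=8, total_acks=10)

print(f"alpha after heavy marking: {alpha:.3f}")
print(f"DCTCP window from 100: {dctcp_cwnd(100, alpha):.1f}")
print(f"TCP window from 100:   {tcp_cwnd(100):.1f}")
```

With α near 0.8 the DCTCP cut is about 40%, and with α near 0.1 it is about 5%, while TCP halves the window in both cases.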
Why DCTCP Works
1. High Burst Tolerance
– Large buffer headroom → bursts fit.
– Aggressive marking → sources react before packets are dropped.
2. Low Latency
– Small buffer occupancies → low queuing delay.
3. High Throughput
– ECN averaging → smooth rate adjustments, low variance.
This talk touched only a small part of Data Center Networking; there are a lot of interesting things to do and a long way to go.
Thanks!