A Scalable, Commodity Data Center Network Architecture Jingyang Zhu
Outline Motivation Background Fat Tree Architecture –Topology –Routing –Fault Tolerance Results
Motivation Large Data Shuffle (e.g., MapReduce)
Intuitive Approach High End Hardware (e.g., InfiniBand) FDR: Fourteen Data Rate; EDR: Enhanced Data Rate; HDR: High Data Rate
Alternative Approach A dedicated interconnection network –Scalability –Cost –Compatibility (i.e., app, OS, hardware)
Typical Topology
Clos Network (m, n, r) = (5, 3, 4) 1. strictly non-blocking (m >= 2n - 1) 2. rearrangeably non-blocking (m >= n)
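The two conditions on this slide can be checked mechanically. The sketch below assumes the usual three-stage Clos notation (n inputs per ingress switch, m middle-stage switches, r ingress/egress switches). The function name is illustrative and not part of the slides.

```python
def clos_blocking_class(m: int, n: int, r: int) -> str:
    """Classify a three-stage Clos(m, n, r) network by its blocking behavior.

    r (number of ingress/egress switches) does not enter either condition;
    it is accepted only so the call mirrors the (m, n, r) triple.
    """
    if m >= 2 * n - 1:
        return "strictly non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"


# The slide's example: m = 5 >= 2*3 - 1 = 5
print(clos_blocking_class(5, 3, 4))
```

With the slide's parameters the first condition holds with equality, so (5, 3, 4) is the smallest strictly non-blocking configuration for n = 3.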
Benes Network A Clos Network with 2x2 switches
Fat Tree Multi-path Routing: UpLink (right) + DownLink (left) Oversubscription: ideal BW : actual BW at the host end, e.g., 1 : 1 means hosts can use their full link bandwidth; 5 : 1 means only 20% of it is available Node 1 (0001) -> Node 6 (0110): 2 possible paths
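The oversubscription ratio translates directly into worst-case host bandwidth. This is a minimal sketch of that arithmetic; the function name and the 1 Gb/s link speed in the example are illustrative assumptions.

```python
def worst_case_host_bandwidth(link_gbps: float, oversubscription: float) -> float:
    """Bandwidth each host can count on under an `oversubscription`:1 ratio."""
    if oversubscription < 1:
        raise ValueError("oversubscription ratio must be >= 1")
    return link_gbps / oversubscription


# 1:1 leaves the full link usable; 5:1 leaves one fifth of it.
print(worst_case_host_bandwidth(1.0, 1))
print(worst_case_host_bandwidth(1.0, 5))
```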
Topo of Data Center - Hierarchy Multi-Path = 2 Conventional Topo in Data Center GigE Link 10 GigE Link
Topo of Data Center - Fat Tree Fat Tree Topo (k = 4) Core: (k/2)^2 k-port switches k pods, each with two layers of k/2 k-port switches # of hosts: k^3 / 4, e.g., k = 48 => # of hosts: 27,648 (Scalability!!!)
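The element counts follow from k alone. The sketch below assumes the standard k-ary fat tree layout, in which each pod holds k/2 edge switches and k/2 aggregation switches. The function name and dictionary keys are illustrative.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a k-ary fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,      # (k/2)^2
        "hosts": k ** 3 // 4,              # k/2 hosts per edge switch
    }


print(fat_tree_size(4))
print(fat_tree_size(48)["hosts"])
```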
Addressing - Compatibility!!! Pod switches: 10.pod #.switch #.1 Core Address: 10.k.j.i (k: radix; j, i: coordinates), j, i = 1, 2, ..., k/2
Addressing (cont'd) Host: 10.pod #.switch #.ID The addressing format is designed to support the routing scheme that follows
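The three address formats can be enumerated for a small tree. The slides do not state the numbering ranges, so the sketch assumes pods and pod switches are numbered from 0, edge switches take numbers 0 to k/2 - 1, and host IDs run from 2 to k/2 + 1 (since .1 belongs to the switch). All function names are illustrative.

```python
def pod_switch_address(pod: int, switch: int) -> str:
    return f"10.{pod}.{switch}.1"


def core_switch_address(k: int, j: int, i: int) -> str:
    return f"10.{k}.{j}.{i}"


def host_address(pod: int, switch: int, host_id: int) -> str:
    return f"10.{pod}.{switch}.{host_id}"


def all_addresses(k: int):
    """Enumerate every address in a k-ary fat tree (numbering is assumed)."""
    half = k // 2
    pod_switches = [pod_switch_address(p, s) for p in range(k) for s in range(k)]
    cores = [core_switch_address(k, j, i)
             for j in range(1, half + 1) for i in range(1, half + 1)]
    hosts = [host_address(p, s, h)
             for p in range(k)            # every pod
             for s in range(half)         # edge switches only (assumed 0..k/2-1)
             for h in range(2, half + 2)] # host IDs (assumed 2..k/2+1)
    return pod_switches, cores, hosts


pod_switches, cores, hosts = all_addresses(4)
print(len(pod_switches), len(cores), len(hosts))
```

Because all addresses fall inside 10.0.0.0/8, the pod and switch numbers can be read straight from the second and third octets, which is what the routing tables rely on.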
2-level table routing - pod switch Prefix (MSB) match: downlink to host Suffix (8-bit LSB) match: uplink to core Traffic diffusion occurs only in the first half of a packet’s journey
Generation of routing table addPrefix (pod switch, pre, port) addSuffix (pod switch, suf, port)
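A two-level table for an upper-layer pod switch can be generated and queried as sketched below. `add_prefix` and `add_suffix` mirror the slide's addPrefix and addSuffix. The host ID range (2 to k/2 + 1), the uplink port formula, and the use of `None` to mark the default prefix that falls through to the suffix table are assumptions made here, not details taken from the slide.

```python
from ipaddress import ip_address, ip_network


class TwoLevelTable:
    def __init__(self):
        self.prefixes = []  # (network, port); port None = consult suffix table
        self.suffixes = []  # (host_id, port)

    def add_prefix(self, prefix: str, port):
        self.prefixes.append((ip_network(prefix), port))

    def add_suffix(self, host_id: int, port: int):
        self.suffixes.append((host_id, port))

    def lookup(self, dst: str):
        addr = ip_address(dst)
        matches = [(net.prefixlen, port) for net, port in self.prefixes if addr in net]
        if not matches:
            return None
        _, port = max(matches, key=lambda m: m[0])  # longest prefix wins
        if port is not None:
            return port                # terminating prefix: downlink
        host_id = int(addr) & 0xFF     # last octet
        for suffix, up_port in self.suffixes:
            if suffix == host_id:
                return up_port         # suffix match: uplink
        return None


def build_upper_pod_switch_table(k: int, pod: int, z: int) -> TwoLevelTable:
    """Table for upper-layer switch z (k/2 <= z < k) in the given pod."""
    half = k // 2
    table = TwoLevelTable()
    for subnet in range(half):                   # one /24 per edge switch
        table.add_prefix(f"10.{pod}.{subnet}.0/24", subnet)
    table.add_prefix("0.0.0.0/0", None)          # everything else goes up
    for host_id in range(2, half + 2):
        # Assumed spreading rule: rotate uplinks by z so switches differ.
        table.add_suffix(host_id, (host_id - 2 + z) % half + half)
    return table


table = build_upper_pod_switch_table(k=4, pod=2, z=2)
print(table.lookup("10.2.0.3"))  # same pod: downlink
print(table.lookup("10.3.0.3"))  # other pod: uplink chosen by host ID
```

Keying the uplink on the host ID rather than the destination subnet is what spreads traffic for the same remote subnet across several core switches.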
1-level table routing - core switch Table: Prefix -> Output Port (four /16 prefix entries; the last maps to port 3)
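A core switch only needs one terminating prefix per pod. The sketch below assumes that port x of every core switch connects to pod x and that each pod owns the prefix 10.x.0.0/16. Both are conventions adopted for illustration, and the function names are hypothetical.

```python
from ipaddress import ip_address, ip_network


def build_core_table(k: int):
    """One (prefix, output port) entry per pod; port x is assumed to face pod x."""
    return [(ip_network(f"10.{pod}.0.0/16"), pod) for pod in range(k)]


def core_lookup(table, dst: str):
    addr = ip_address(dst)
    for network, port in table:
        if addr in network:
            return port
    return None  # not a host address inside any pod


core_table = build_core_table(4)
print(core_lookup(core_table, "10.3.1.2"))
```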
Routing Table Implementation Content Addressable Memory (CAM) Input: data; output: match / mismatch
Routing Table Implementation (cont'd) Host Address -> CAM Match -> RAM Address
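The lookup path can be modeled in software as a ternary match followed by a RAM read. This sketch assumes a ternary CAM (entries carry a mask of don't-care bits) and that the lowest matching index wins, so prefix entries placed first take priority over suffix entries. The class name and the example entries are illustrative.

```python
from ipaddress import ip_address


def _to_int(dotted: str) -> int:
    return int(ip_address(dotted))


class Tcam:
    def __init__(self):
        self.entries = []  # (value, mask); mask bit 1 = must match, 0 = don't care
        self.ram = []      # ram[i] holds the output port for entries[i]

    def add(self, value: str, mask: str, port: int):
        m = _to_int(mask)
        self.entries.append((_to_int(value) & m, m))
        self.ram.append(port)

    def lookup(self, dst: str):
        key = _to_int(dst)
        for index, (value, mask) in enumerate(self.entries):
            if (key & mask) == value:  # hardware compares all entries in parallel
                return self.ram[index] # matching index addresses the RAM
        return None


tcam = Tcam()
tcam.add("10.2.0.0", "255.255.255.0", 0)  # prefix entries first
tcam.add("10.2.1.0", "255.255.255.0", 1)
tcam.add("0.0.0.2", "0.0.0.255", 2)       # suffix entries after
tcam.add("0.0.0.3", "0.0.0.255", 3)
print(tcam.lookup("10.2.0.3"), tcam.lookup("10.3.0.3"))
```

The loop is only a software stand-in: a real CAM evaluates every entry at once and a priority encoder picks the winning index.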
Routing Example: Hierarchical Tree
Routing Example: Fat Tree No Contention!!!
Dynamic Routing Up to now, the routing algorithm is based on static tables. Any improvement? –Yes, using dynamic routing Dynamic Routing –Flow Classification –Flow Scheduling
Dynamic Routing 1 - Flow Classification Flow: a set of packets whose order must be preserved Dynamic Routing –Avoid reordering within the same flow –Reassign a minimum number of flows to minimize the disparity between ports Flow Classifier: identify flows
Flow Classification Check src & dst address –Balance the port load dynamically –Avoid reordering Balance the port load DYNAMICALLY –Rearrange flows every t seconds –At most 3 flows are rearranged each time –Rearranging risks reordering packets within a flow! This is a performance concern, not a correctness concern
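The classification and rearrangement steps can be sketched as follows. The slide gives only the outline, so several details are assumptions: flows are keyed by (src, dst), load is measured in bytes seen so far, a new flow goes to the least-loaded uplink, and each rearrangement moves the largest flow that is smaller than the load gap. The class and method names are illustrative.

```python
from collections import defaultdict


class FlowClassifier:
    def __init__(self, up_ports, max_moves=3):
        self.up_ports = list(up_ports)
        self.max_moves = max_moves          # "max 3 flows" per rearrangement
        self.flow_port = {}                 # (src, dst) -> assigned uplink
        self.flow_bytes = defaultdict(int)  # (src, dst) -> bytes seen

    def port_load(self, port):
        return sum(size for flow, size in self.flow_bytes.items()
                   if self.flow_port[flow] == port)

    def route(self, src, dst, size):
        """Return the uplink for this packet; same flow -> same port."""
        flow = (src, dst)
        if flow not in self.flow_port:
            self.flow_port[flow] = min(self.up_ports, key=self.port_load)
        self.flow_bytes[flow] += size
        return self.flow_port[flow]

    def rearrange(self):
        """Call every t seconds: move at most max_moves flows to even out load."""
        for _ in range(self.max_moves):
            p_max = max(self.up_ports, key=self.port_load)
            p_min = min(self.up_ports, key=self.port_load)
            gap = self.port_load(p_max) - self.port_load(p_min)
            movable = [f for f, p in self.flow_port.items()
                       if p == p_max and self.flow_bytes[f] < gap]
            if not movable:
                break
            largest = max(movable, key=lambda f: self.flow_bytes[f])
            self.flow_port[largest] = p_min  # may reorder packets of this flow


fc = FlowClassifier(up_ports=[2, 3])
fc.route("10.0.0.2", "10.2.0.2", 1500)
fc.route("10.0.0.3", "10.2.0.3", 1500)
fc.rearrange()
```

Only moving flows smaller than the gap ensures a move never makes the imbalance worse than it was.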
Dynamic Routing 2 - Flow Scheduling Large flows are critical - schedule the large flows independently
// edge switches
if (length(flow_in) > threshold)
    notify central scheduler
else
    route as normal
// central scheduler
if (receive notification)
    foreach possible path
        if (path not reserved)
            reserve the path & notify switches along the path
            break
Data Structure: Function
bool Link[LINKSIZE]: link status
hash Flow: record reservation & clear reservation (retire the flow)
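The scheduler logic can be sketched in runnable form. This simplification handles the candidate paths of a single source/destination pair, takes those paths as a dictionary of link names, and uses a hypothetical byte threshold. `CentralScheduler`, `handle_at_edge`, `THRESHOLD_BYTES` and the link names are all illustrative.

```python
THRESHOLD_BYTES = 1_000_000  # hypothetical cutoff for a "large" flow


class CentralScheduler:
    def __init__(self, paths):
        # paths: {path_id: [link, ...]}, the candidate paths (assumed input)
        self.paths = paths
        self.link_reserved = {link: False                  # bool Link[LINKSIZE]
                              for links in paths.values() for link in links}
        self.flow_path = {}                                # hash Flow

    def on_large_flow(self, flow):
        """Reserve the first path with no reserved link; None if all are busy."""
        for path_id, links in self.paths.items():
            if not any(self.link_reserved[l] for l in links):
                for l in links:
                    self.link_reserved[l] = True
                self.flow_path[flow] = path_id  # switches on the path would be notified here
                return path_id
        return None

    def retire(self, flow):
        """Clear the reservation once the flow has finished."""
        path_id = self.flow_path.pop(flow, None)
        if path_id is not None:
            for l in self.paths[path_id]:
                self.link_reserved[l] = False


def handle_at_edge(flow, flow_bytes, scheduler):
    """Edge switch: report large flows, route everything else as normal."""
    if flow_bytes > THRESHOLD_BYTES:
        return scheduler.on_large_flow(flow)
    return None  # small flow: keep the default route


scheduler = CentralScheduler({
    "via-core0": ["a0-c0", "c0-a2"],
    "via-core2": ["a1-c2", "c2-a3"],
})
print(handle_at_edge("f1", 5_000_000, scheduler))
```

When every path is reserved the sketch simply returns `None`, leaving the flow on its default route. What a real scheduler does in that case is outside what the slide states.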
Discussion Which one is better? –Flow classification: locally, within a pod switch –Flow scheduling: globally, among all the paths and switches
Fault Tolerance How to detect failed links or switches? Bidirectional Forwarding Detection (BFD)
Fault Tolerance (cont'd) Basic ideas –Mark the link unavailable when routing, e.g., set its load to infinity in flow classification –Broadcast the fault to other switches so they avoid routing through it
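Marking a failed link's load as infinite fits the least-loaded port selection used in flow classification. This is a minimal sketch, assuming failures have already been detected and are passed in as a set of port numbers; the function name is illustrative.

```python
import math


def pick_uplink(port_load, failed_ports):
    """Least-loaded uplink, treating failed ports as infinitely loaded.

    port_load: {port: current load}; failed_ports: set of ports marked down.
    Returns None when no usable uplink remains.
    """
    if not port_load:
        return None
    effective = {port: (math.inf if port in failed_ports else load)
                 for port, load in port_load.items()}
    best = min(effective, key=effective.get)
    return None if math.isinf(effective[best]) else best


print(pick_uplink({2: 10, 3: 50}, failed_ports={2}))
```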
Cost 1:1
Power & Heat Power and Heat for different switches 10 GigE
Performance Different Benchmarks Percentage of ideal bisection bandwidth
Conclusion Fat tree for data center interconnection –Scalable –Cost efficient –Compatible Routing details, locally & globally Fault tolerant