Data Center Networking
Kai Chen (HKUST)
Nov 1, 2013, USTC, Hefei

Outline
– Introduction to Data Centers
– Data Center Network Architecture
– Data Center Routing
– Data Center Transport Control

Big Data Applications and Cloud Computing
– Scientific: 200 GB of astronomy data a night
– Business: 1 million customer transactions, 2.5 PB of data per hour
– Social network: 60 billion photos in the user base, 25 TB of log data per day
– Web search: 20 PB of search data per day
– …

Data Centers as Infrastructures
– Google's 36 worldwide data centers (2008)
– A 125,000-square-foot Walmart data center in Arkansas (10,000s – 100,000s of servers)
– Modular data centers with 1000s of servers (20- or 40-foot containers)

Data Center Network Architectures

Today's Production DCN Structure
A production DCN structure adapted from Cisco: a tree of Top-of-Rack (ToR), aggregation, and core switches. Oversubscription grows toward the top: 1:1 at the servers, 1:5 ~ 1:20 at the aggregation layer, and up to 1:240 at the core, making the upper layers a communication bottleneck.

Research Community: Fattree
A three-layer ToR / aggregation / core topology built from identical commodity switches; rearrangeably non-blocking, with 1:1 provisioning.
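
The slide does not give the sizing arithmetic, but for a standard k-ary fat-tree built from k-port switches the element counts follow directly from the topology. A small sketch (standard fat-tree formulas, not code from the talk):

```python
# Sizing a k-ary fat-tree built from identical k-port switches
# (standard fat-tree arithmetic; k must be even).

def fattree_size(k: int) -> dict:
    """Return element counts for a k-ary fat-tree."""
    assert k % 2 == 0, "port count k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),   # k/2 ToR/edge switches per pod
        "agg_switches": k * (k // 2),    # k/2 aggregation switches per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,            # (k/2)^2 hosts per pod * k pods
    }

print(fattree_size(4))   # small example: 16 hosts
print(fattree_size(48))  # commodity 48-port switches: 27648 hosts
```

With cheap 48-port switches this already reaches tens of thousands of servers at full bisection bandwidth, which is the point of the fat-tree proposal.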

VL2
A folded-Clos topology, also with 1:1 provisioning.

BCube
A server-centric structure built from multi-port servers and commodity switches.

BCube
– Connecting rule: the i-th server in the j-th BCube_0 connects to the j-th port of the i-th level-1 switch
– A BCube_k network supports n^(k+1) servers, where n is the number of servers in a BCube_0 and k is the level of that BCube
– A server is assigned a BCube address (a_k, a_k-1, …, a_0) where a_i ∈ [0, n-1]
– Neighboring server addresses differ in only one digit
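
The addressing rules above can be sketched in a few lines (an illustrative helper, not code from the BCube paper): each address is a (k+1)-digit tuple with digits in [0, n-1], and neighbors are exactly the addresses differing in one digit.

```python
# BCube addressing sketch: a server address is a tuple (a_k, ..., a_0)
# with each digit in [0, n-1]; a BCube_k has n**(k+1) servers, and
# neighbors differ in exactly one digit.
from itertools import product

def bcube_servers(n: int, k: int):
    """All server addresses in a BCube_k with n-port level-0 switches."""
    return list(product(range(n), repeat=k + 1))

def neighbors(addr, n):
    """Servers one switch away: addresses differing in exactly one digit."""
    result = []
    for level in range(len(addr)):
        for digit in range(n):
            if digit != addr[level]:
                result.append(addr[:level] + (digit,) + addr[level + 1:])
    return result

servers = bcube_servers(n=4, k=1)
print(len(servers))             # 4**2 = 16 servers in a BCube_1
print(neighbors((0, 0), n=4))
```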

Helios (Optical/Electrical)

Many Other DCN Structures
DCell (2008), MDCube (2009), c-Through (2010), Flyways (wireless, 2011), Jellyfish (2012), OSA (2012), Mordia (2013), …

Data Center Routing

Fattree Routing: exploiting hierarchy on a Fattree topology
The topology is organized into pods (POD 0 – POD 3 in the example), and each switch and host is identified by its pod number and position within the pod.

Positional Pseudo MAC Addresses
PMAC = pod.position.port.vmid. In the example Fattree, host PMACs run from 00:00:00:02:00:01 and 00:00:00:03:00:01 in pod 0 up to 00:03:01:02:00:01 and 00:03:01:03:00:01 in pod 3.
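
The PMAC layout can be made concrete with a small encode/decode sketch. PortLand packs the four fields into a 48-bit MAC (16-bit pod, 8-bit position, 8-bit port, 16-bit vmid); the helper names below are mine, not PortLand's:

```python
# PMAC encode/decode sketch following the pod.position.port.vmid layout
# (16-bit pod, 8-bit position, 8-bit port, 16-bit vmid = 48 bits).

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    raw = value.to_bytes(6, "big")
    return ":".join(f"{b:02x}" for b in raw)

def decode_pmac(pmac: str):
    value = int(pmac.replace(":", ""), 16)
    return ((value >> 32) & 0xFFFF, (value >> 24) & 0xFF,
            (value >> 16) & 0xFF, value & 0xFFFF)

print(encode_pmac(pod=0, position=0, port=2, vmid=1))  # 00:00:00:02:00:01
print(decode_pmac("00:01:00:03:00:01"))                # (1, 0, 3, 1)
```

Because the pod and position occupy the high-order bytes, a switch can route on a short prefix of the PMAC instead of holding a flat MAC table.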

Installing AMAC-to-PMAC Mappings

Proxy-based Address Resolution (ARP)

Routing on Fattree (PortLand)
Encode the topology information into PMACs, and route based on PMAC addresses
– Each host sees its own AMAC and remote hosts' PMACs
– The ingress switch does AMAC -> PMAC rewriting for the src host
– The egress switch does PMAC -> AMAC mapping for the dst host
– Longest-prefix-match routing on the destination PMAC

BCube Routing: exploiting server location in a hypercube topology
A BCube ID indicates the location of a server in the BCube topology. Routing corrects one digit of the address per hop, e.g., 00 -> 20 -> 23, or 00 -> 03 -> 23.

BCube: multi-paths for one-to-one traffic
– The diameter of a BCube_k is k+1
– There are k+1 parallel paths between any two servers in a BCube_k

BCube: server-centric with source routing
– Switches never connect to other switches and only act as L2 crossbars; each switch holds a small MAC table mapping each port to the attached server's MAC (e.g., port 0 -> MAC20, …, port 3 -> MAC23)
– Servers control routing, load balancing, and fault-tolerance
– In the example, a packet from server 00 to server 23 is re-addressed hop by hop (src/dst MAC rewritten at each intermediate server)

Routing on BCube
Encode the topology information into BCube IDs, and route based on BCube IDs
– The source derives the routing path from the src and dst BCube IDs
– The source writes the whole path into the packet header
– Forwarding is based on the path information embedded in the packet header
– BCube switches are dumb; the intelligence is in the servers
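
The path derivation above amounts to correcting one address digit per hop; the order in which digits are corrected picks one of the parallel paths. A minimal sketch (illustrative, not the BCube Source Routing protocol itself):

```python
# Sketch of BCube-style source routing: derive a path by correcting one
# digit of the address per hop; the digit order selects among the
# parallel paths between the two servers.

def derive_path(src, dst, order=None):
    """Return the hop-by-hop server addresses from src to dst."""
    if order is None:
        order = [i for i in range(len(src)) if src[i] != dst[i]]
    path = [src]
    cur = list(src)
    for level in order:
        if cur[level] != dst[level]:
            cur[level] = dst[level]       # correct this digit at this hop
            path.append(tuple(cur))
    return path

# Reproducing the slide's example (addresses as (level-1, level-0) digits):
print(derive_path((0, 0), (2, 3), order=[0, 1]))  # [(0,0), (2,0), (2,3)]
print(derive_path((0, 0), (2, 3), order=[1, 0]))  # [(0,0), (0,3), (2,3)]
```

The two orderings reproduce the slide's 00 -> 20 -> 23 and 00 -> 03 -> 23 paths.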

Summary of DCN Routing
Topology-dependent (VL2, DCell, CamCube, Jellyfish, …)
– Customized to a specific topology; fully exploits its characteristics
– Constrained; does not generalize to generic DCNs
– Some are not deployable on commodity switches
SDN/OpenFlow
– Flexible, explicit routing control
– Limited scalability: 1K – 4K OpenFlow entries per switch
– Heavy-weight active path swapping
– Error-prone

Data Center Transport Control

Partition/Aggregate Application Structure
– A top-level aggregator (TLA) partitions each request across mid-level aggregators (MLAs), which fan it out to worker nodes; partial results are aggregated back up (in the slide's example, workers each return matching Picasso quotes that the aggregators merge into a ranked answer)
– Time is money: strict deadlines (SLAs), e.g., 250 ms for the whole request, 50 ms at an MLA, 10 ms at a worker
– A missed deadline means a lower-quality result

Generality of Partition/Aggregate
The foundation for many large-scale web applications
– Web search, social-network composition, ad selection, etc.
– Example: Facebook's multiget over the memcached protocol, with web servers as aggregators and memcached servers as workers
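
The fan-out/merge pattern can be sketched in a few lines (purely illustrative; a real deployment fans out over the network to memcached or worker servers under strict per-level deadlines):

```python
# Minimal partition/aggregate sketch: each worker scans only its own
# shard of the data, and the aggregator fans the query out and merges
# the partial results.

def worker(shard, query):
    """A worker answers the query over its shard only."""
    return [item for item in shard if query in item]

def aggregator(shards, query):
    """Fan the query out to every shard, then merge the partial results."""
    partials = [worker(shard, query) for shard in shards]
    return sorted(sum(partials, []))

shards = [["apple pie", "banana"], ["apple tart"], ["cherry"]]
print(aggregator(shards, "apple"))  # ['apple pie', 'apple tart']
```

The transport problem arises exactly at the merge step: all workers answer the aggregator at nearly the same instant, which is what produces incast.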

Workloads
– Partition/Aggregate queries [~2 KB]: delay-sensitive
– Short messages [50 KB – 1 MB] (coordination, control state): delay-sensitive
– Large flows [1 MB – 50 MB] (data update): throughput-sensitive

Impairments
– Incast
– Queue Buildup
– Buffer Pressure

Incast
– Synchronized mice collide: the workers' responses to the aggregator arrive at the switch simultaneously and overflow its buffer
– Dropped responses trigger TCP timeouts (RTO_min = 300 ms)
– Caused by Partition/Aggregate

Queue Buildup
– Big flows build up queues, increasing latency for short flows
– Measurements in a Bing cluster: for 90% of packets, RTT < 1 ms; for 10% of packets, 1 ms < RTT < 15 ms

Data Center Transport Requirements
1. High burst tolerance: Incast due to Partition/Aggregate is common
2. Low latency: short flows, queries
3. High throughput: continuous data updates, large file transfers
4. Meet flow deadlines: some flows carry deadlines (e.g., search); flows that miss their deadlines are not useful
The challenge is to achieve these together.

Tension Between Requirements
– Deep buffers: queuing delays increase latency
– Shallow buffers: bad for bursts and throughput
– DCTCP's objective: low queue occupancy and high throughput, reconciling high burst tolerance, low latency, and high throughput

Many Recent DCN Transport Designs
– Reduce RTO_min (2009)
– Data Center TCP (DCTCP, 2010)
– Deadline-aware: D3 (centralized, 2011) and D2TCP (decentralized, 2012)
– Preemptive flow scheduling: PDQ (2012)
– pFabric (2013)

Review: The TCP/ECN Control Loop
The congested switch marks senders' packets with a 1-bit ECN mark, and the receiver echoes the marks back to the senders. ECN = Explicit Congestion Notification.

DCTCP: two key ideas
1. React in proportion to the extent of congestion, not its presence: reduces variance in sending rates, lowering queuing requirements
2. Mark based on instantaneous queue length: fast feedback to better deal with bursts
Example over one RTT: when most packets are ECN-marked, TCP cuts its window by 50% while DCTCP cuts by 40%; when only a few packets are marked, TCP still cuts by 50% while DCTCP cuts by just 5%.

Data Center TCP Algorithm
Switch side:
– Mark packets when the queue length exceeds a threshold K
Sender side:
– Maintain a running average of the fraction of packets marked (α): each RTT, α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT
– Adaptive window decrease: W ← W·(1 − α/2)
– Note: the window is divided by a factor between 1 (α = 0) and 2 (α = 1)
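
The sender-side update can be sketched directly from the two equations above (a simplified model, not the kernel implementation; g = 1/16 is the gain suggested in the DCTCP paper):

```python
# Sketch of the DCTCP sender-side per-RTT update: alpha is an EWMA of
# the marked fraction F, and the window is cut in proportion to alpha.

def dctcp_update(alpha, cwnd, marked, total, g=1.0 / 16):
    """One per-RTT update; returns (new_alpha, new_cwnd)."""
    F = marked / total                  # fraction of packets marked this RTT
    alpha = (1 - g) * alpha + g * F     # running average of marked fraction
    if marked:                          # cut only when congestion is signaled
        cwnd = cwnd * (1 - alpha / 2)   # divide by a factor between 1 and 2
    return alpha, cwnd

alpha, cwnd = 0.0, 100.0
alpha, cwnd = dctcp_update(alpha, cwnd, marked=10, total=10)  # all marked
print(alpha, cwnd)
alpha, cwnd = dctcp_update(alpha, cwnd, marked=0, total=10)   # none marked
print(alpha, cwnd)
```

Note how a single heavily-marked RTT barely moves α (and hence the window) because of the low gain; sustained marking is what drives the cut toward the TCP-like 50%.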

Why DCTCP Works
1. High burst tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped
2. Low latency: small buffer occupancies → low queuing delay
3. High throughput: ECN averaging → smooth rate adjustments, low variance

This talk has only touched a small part of Data Center Networking; there are a lot of interesting things to do, and a long way to go. Thanks!