CS434/534: Topics in Network Systems Cloud Data Centers: Topology, Control; VL2 Yang (Richard) Yang Computer Science Department Yale University 208A Watson Email: yry@cs.yale.edu http://zoo.cs.yale.edu/classes/cs434/ Acknowledgement: slides contain content from conference presentations by authors of VL2.

Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
  - architecture
  - data model and operations: OpenDaylight as an example
  - distributed data store
    - overview
    - basic Paxos
    - multi-Paxos
    - Raft
  - south-bound: consistent network updates
- Cloud data centers (CDC)

Admin PS1 is posted on the Schedule page. Please start talking to me about potential projects.

Recap: Raft From leader-less to leader-based consensus. Basic leader election mechanisms: terms, heartbeats, a per-server finite-state machine; a candidate becomes leader when it receives votes from a majority. Basic commitment of a log entry: the leader marks an entry committed once it receives confirmation from a majority.

Recap: Raft Safety Raft safety property: if a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders; no special steps are needed by a new leader to revise its log. Solution: use a combination of election rules and commitment rules to achieve safety. Extended leader election (condition on leader election): a voting server V denies its vote if its own log is more complete than the candidate C's, i.e., if (lastTermV > lastTermC) || ((lastTermV == lastTermC) && (lastIndexV > lastIndexC)). Extended commitment (condition on commitment): an entry is committed only if it is stored on a majority of servers, and at least one new entry from the leader's own term is also stored on a majority of servers. Together these rules ensure: committed => present in future leaders' logs.
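To make the two rules concrete, here is a minimal, illustrative Python sketch; the function and argument names are hypothetical (not from any particular Raft implementation), and the surrounding machinery (terms, RPCs, persistence) is omitted.

```python
# Sketch of Raft's extended election and commitment rules (illustrative only;
# names are hypothetical, not from a specific Raft implementation).

def candidate_log_ok(last_term_c, last_index_c, last_term_v, last_index_v):
    """Voter V grants its vote only if candidate C's log is at least as
    up-to-date as V's own log (the negation of the denial condition above)."""
    deny = (last_term_v > last_term_c) or \
           (last_term_v == last_term_c and last_index_v > last_index_c)
    return not deny

def can_commit(entry_term, replication_count, cluster_size, leader_term):
    """Extended commitment rule: an entry may be marked committed only if it
    is stored on a majority AND it is from the current leader's term
    (earlier entries become committed indirectly once such an entry is)."""
    majority = cluster_size // 2 + 1
    return replication_count >= majority and entry_term == leader_term
```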

Recap: Big Picture Key goal: provide applications with high-level views (network view, service/policy) and make the views highly available (e.g., 99.99%) and scalable. Key component - the logically centralized data store: data model, data operation model, data store availability. (Figure: application programs sit on top of the logically centralized data store, which in turn sits above the NE datapaths.)

Discussion What should happen in a Net OS when a link weight is changed? (Figure: an example topology with nodes a, b, c, d, e, f and link weights 1, 1, 1, 1, 2, and 3.)

Recap: Big Picture - Dependency Table (aka inverted index) (Figure: a trace tree rooted at Assert: TcpDst==22; branches read EthSrc and EthDst and end in (hostTable, i) and topology() lookups that yield path1 and path2.) The dependency table records, for each component, the traces that read it:
Component      | Traces
(hostTable, 1) | [false, 1]
(hostTable, 2) | [false, 1, 2]
(hostTable, 3) | [false, 3]
(hostTable, 4) | [false, 3, 4]
topology       | [false, 1, 2], [false, 3, 4]
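A small illustrative sketch of the inverted-index idea (class and method names are made up, not from an actual controller): each computed trace records the components it read, and when a component changes the table tells us exactly which traces must be recomputed.

```python
from collections import defaultdict

# Illustrative dependency table (inverted index): component -> traces that read it.

class DependencyTable:
    def __init__(self):
        self.index = defaultdict(set)   # e.g. ("hostTable", 1) -> {trace ids}

    def record_read(self, trace_id, component):
        self.index[component].add(trace_id)

    def affected_traces(self, component):
        # When `component` changes, only these traces need recomputation.
        return self.index.get(component, set())

# Usage mirroring the slide: "path1" read (hostTable,1), (hostTable,2), and the
# topology; "path2" read (hostTable,3), (hostTable,4), and the topology.
dt = DependencyTable()
for c in [("hostTable", 1), ("hostTable", 2), "topology"]:
    dt.record_read("path1", c)
for c in [("hostTable", 3), ("hostTable", 4), "topology"]:
    dt.record_read("path2", c)

print(dt.affected_traces(("hostTable", 2)))   # only path1
print(dt.affected_traces("topology"))         # both path1 and path2
```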

Recap: Example Transaction Link weight change => the path for flow a->d changes from a -> b -> c -> d to a -> e -> f -> d. A high-level transaction can generate a set of operations (ops) at local devices; the ops should be executed under some ordering constraints (a dependency graph).

Recap: Example Transaction: Updating a Set of Flows Assume each link has capacity 10; "Fi: b" means that flow i needs b units of bandwidth. A transaction can be more complex, and hence coordination can be more complex as well. (Example from Dionysus.)

Dependency Graph w/ Resources

Dependency Graph Scheduling (Figure: legend for the dependency graph - nodes are operations and resources; an edge labeled "before" orders one operation after another; an edge from an operation to a resource labeled "release N" frees N units of that resource; an edge from a resource to an operation labeled "demand N" means the operation needs N free units before it can run.)
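A minimal scheduling sketch in the spirit of the legend above (this is not the Dionysus algorithm; the operation names, links, and capacities are hypothetical): an operation runs once its "before" predecessors are done and its resource demands fit in the free capacity, and completing it releases capacity that may unblock later operations.

```python
# Simplified sketch of scheduling operations subject to a dependency graph with
# resource nodes. All names and numbers below are hypothetical.

ops = {
    # move flow 1 to path a-e-f-d: needs 5 units on link e-f, frees 5 on link b-c
    "move_f1": {"before": [], "demand": {"l_ef": 5}, "release": {"l_bc": 5}},
    # move flow 2 onto link b-c: can only run after capacity on b-c is freed
    "move_f2": {"before": [], "demand": {"l_bc": 5}, "release": {}},
}
free = {"l_bc": 0, "l_ef": 10}   # currently free capacity per link

order, done = [], set()
while len(done) < len(ops):
    progressed = False
    for name, op in ops.items():
        if name in done:
            continue
        deps_ok = all(d in done for d in op["before"])                  # "before" edges
        fits = all(free[r] >= amt for r, amt in op["demand"].items())   # "demand" edges
        if deps_ok and fits:
            for r, amt in op["demand"].items():
                free[r] -= amt          # consume the demanded capacity
            for r, amt in op["release"].items():
                free[r] += amt          # completing the op frees capacity elsewhere
            order.append(name)
            done.add(name)
            progressed = True
    if not progressed:
        raise RuntimeError("no schedulable operation (a different plan is needed)")

print(order)   # ['move_f1', 'move_f2'] -- the order forced by the resource dependency
```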

Potential Project: Continuous, Consistent Network Updates Discussion: how to define the problem, what is a good data structure, …

Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
- Cloud data center (CDC) networks
  - Background, high-level goal

The Importance of Data Centers Internal users: line-of-business apps. External users: web portals, web services, multimedia applications, cloud services (e.g., Azure, AWS, ...). "A 1-millisecond advantage in trading applications can be worth $100 million a year to a major brokerage firm."

Datacenter Traffic Growth Today: petabits/s within one DC - today's datacenter networks carry more traffic than the entire core of the Internet! Source: "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network", SIGCOMM 2015.

Data Center Costs (amortized cost*)
~45% | Servers: CPU, memory, disk
~25% | Power infrastructure: UPS, cooling, power distribution
~15% | Power draw: electrical utility costs
~15% | Network: switches, links, transit
*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money.
The Cost of a Cloud: Research Problems in Data Center Networks. SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.
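As a side note on what "amortized with a 5% cost of money" means, here is a small sketch using the standard annuity formula; only the 3-year/15-year lifetimes and the 5% rate come from the slide, and the capex figures are made-up placeholders.

```python
def monthly_amortized_cost(capex, years, annual_rate=0.05):
    """Standard annuity payment: spread `capex` over `years` at `annual_rate`."""
    r = annual_rate / 12.0
    n = years * 12
    return capex * r / (1.0 - (1.0 + r) ** (-n))

# Hypothetical capex figures purely for illustration (not from the slide or paper):
servers_capex = 50_000_000   # $50M of servers, 3-year amortization
infra_capex   = 25_000_000   # $25M of power/cooling infrastructure, 15-year amortization
print(round(monthly_amortized_cost(servers_capex, 3)))   # ~ $1.5M per month
print(round(monthly_amortized_cost(infra_capex, 15)))    # ~ $0.2M per month
```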

Server Costs Ugly secret: 30% utilization is considered "good" in data centers. Causes:
- Uneven application fit: each server has CPU, memory, and disk, but most applications exhaust one resource, stranding the others.
- Long provisioning timescales: new servers are purchased quarterly at best.
- Uncertainty in demand: demand for a new service can spike quickly.
- Risk management: not having spare servers to meet demand brings failure just when success is at hand.
- Session state and storage constraints: if the world were stateless servers, life would be good.

Goal: Agility - Any Service, Any Server Turn the servers into a single large fungible pool: dynamically expand and contract each service's footprint as needed. Benefits: lower server component cost; high performance and reliability; higher service-developer productivity, because services can be deployed much faster. Agility is the aim of most infrastructure projects.

Achieving Agility
- Workload management: means for rapidly installing a service's code on a server (virtual machines, disk images, containers).
- Storage management: means for a server to access persistent data easily (distributed filesystems, e.g., HDFS, blob stores).
- Network: means for communicating with other servers, regardless of where they are in the data center.

Datacenter Network Ultimate Goal Provide the illusion of "One Big Switch" with 10,000s of ports connecting compute and storage (disk, flash, ...).

Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs the one-big-switch abstraction

Conventional DC Architecture (Figure: the Internet connects to core routers (CR), which connect to access routers (AR) in DC-Layer 3; below them, DC-Layer 2 consists of Ethernet switches (S) connecting racks of application servers (A); ~1,000 servers per pod, where one pod == one IP subnet.) Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app servers. Reference: "Data Center: Load Balancing Data Center Services", Cisco 2004.

Conventional DC: Topology Problem (Figure: the conventional tree annotated with typical oversubscription ratios - roughly 5:1 at the edge switches, 40:1 at the access-router/aggregation layer, and 200:1 toward the core.) Problems: dependence on high-cost proprietary routers; poor reliability/utilization; heterogeneous server-to-server capacity. Because the topology is fundamentally a tree, the higher up in the tree, the more potential competition for resources, limiting "any server for any service".
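A back-of-the-envelope sketch of what these oversubscription ratios imply for server-to-server bandwidth; the 1 Gbps NIC speed is an assumed example value, not from the slide.

```python
# Worst-case per-server bandwidth under oversubscription (illustrative only).
# Assumes 1 Gbps server NICs; the ratios are the ones annotated on the slide.

nic_gbps = 1.0
oversubscription = {
    "within a rack": 1,
    "within an edge subtree": 5,
    "across the aggregation layer": 40,
    "through the core": 200,
}

for where, ratio in oversubscription.items():
    mbps = nic_gbps / ratio * 1000
    print(f"{where:>30}: ~{mbps:.0f} Mbps per server when everyone talks at once")
# e.g. through the core: ~5 Mbps per server, vs ~1000 Mbps within a rack.
```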

Conventional DC: Topology Problem (cont.) (Figure: the same tree topology.) Poor reliability: because the topology is fundamentally a tree, link failures near the top of the tree can cause a large fraction of capacity to be lost.

Conventional DC: Control Problem (Figure: the same tree, with racks partitioned into IP subnet (VLAN) #1 and IP subnet (VLAN) #2.) Partitioning by IP subnet limits agility: for a VM to move to a server in a different subnet (e.g., to use that server's spare resources), the VM's IP address must change.

Discussion: L2 vs L3 In the conventional architecture (CR/AR are layer 3; the switches below are layer 2; ~1,000 servers per pod == one IP subnet), what are the pros and cons of each layer? Ethernet switching (layer 2): fixed IP addresses and auto-configuration (plug & play); seamless mobility, migration, and failover; but broadcast (ARP) limits scale, and the Spanning Tree Protocol restricts the usable topology. IP routing (layer 3): scalability through hierarchical addressing; multipath routing through equal-cost multipath (ECMP); but a VM can't migrate without changing its IP address, and configuration is complex. Reference: "Data Center: Load Balancing Data Center Services", Cisco 2004.

Layer 2 vs. Layer 3 Ethernet switching (layer 2): fixed IP addresses and auto-configuration (plug & play); seamless mobility, migration, and failover; but broadcast (ARP) limits scale, and the Spanning Tree Protocol. IP routing (layer 3): scalability through hierarchical addressing; multipath routing through equal-cost multipath; but more complex configuration, and can't migrate without changing IP address.

Layer 2 vs. Layer 3 for Data Centers

Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs the one-big-switch abstraction
  - VL2 design and implementation

Measurements Informing VL2 Design Data-center traffic analysis: the ratio of traffic between servers to traffic entering/leaving the data center is 4:1; demand for bandwidth between servers is growing faster; the network is the bottleneck of computation. Traffic patterns are highly volatile: a large number of distinctive patterns appear even within a day; the patterns are unstable and cannot be predicted easily. Failure characteristics: for networking equipment failures, 95% last < 1 min, 98% < 1 hr, 99.6% < 1 day, and 0.09% last > 10 days. Flow distribution analysis: the majority of flows are small (the biggest flow sizes are around 100 MB); the distribution of internal flows is simpler and more uniform; 50% of the time a machine has about 10 concurrent flows, and 5% of the time more than 80 concurrent flows.

Discussion How may you handle dynamic traffic patterns?

VL2 Goals: The Illusion of a Huge L2 Switch 1. L2 semantics. 2. Uniform high capacity. 3. Performance isolation. (Figure: all servers A attached to one big virtual layer-2 switch.)

Discussion What may performance isolation mean?

Objectives in Detail Layer-2 semantics: easily assign any server to any service; assigning servers to a service should be independent of network topology; configure a server with whatever IP address the service expects; a VM keeps the same IP address even after migration. Uniform high capacity: the maximum rate of server-to-server traffic flow should be limited only by the capacity of the servers' network cards. Performance isolation: the traffic of one service should not be affected by the traffic of other services (this needs the above bound).

VL2 Topology: Basic Idea (Figure: move from a single-root tree to a multi-root tree.)

Foundation of Data Center Networks: Clos Networks The bigger m is, the more flexible the switching (ingress, middle, and egress stages). Q: How big must m be so that each new call can be established without moving current calls? Strict-sense nonblocking Clos networks (m ≥ 2n−1), the original 1953 Clos result: if m ≥ 2n−1, the Clos network is strict-sense nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch without having to re-arrange existing calls. This is the result which formed the basis of Clos's classic 1953 paper. Assume that there is a free terminal on the input of an ingress switch, and this has to be connected to a free terminal on a particular egress switch. In the worst case, n−1 other calls are active on the ingress switch in question, and n−1 other calls are active on the egress switch in question. Assume, also in the worst case, that each of these calls passes through a different middle-stage switch. Hence in the worst case, 2n−2 of the middle-stage switches are unable to carry the new call. Therefore, to ensure strict-sense nonblocking operation, another middle-stage switch is required, making a total of 2n−1. If m ≥ n, the Clos network is rearrangeably nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, but for this to take place, existing calls may have to be rearranged by assigning them to different centre-stage switches. To prove this, it is sufficient to consider m = n, with the Clos network fully utilised, that is, r×n calls in progress. The proof shows how any permutation of these r×n input terminals onto r×n output terminals may be broken down into smaller permutations which may each be implemented by the individual crossbar switches in a Clos network with m = n. The proof uses Hall's marriage theorem, which is given this name because it is often explained as follows. Suppose there are r boys and r girls. The theorem states that if every subset of k boys (for each k such that 0 ≤ k ≤ r) between them know k or more girls, then each boy can be paired off with a girl that he knows. It is obvious that this is a necessary condition for pairing to take place; what is surprising is that it is sufficient. In the context of a Clos network, each boy represents an ingress switch, and each girl represents an egress switch. A boy is said to know a girl if the corresponding ingress and egress switches carry the same call. Each set of k boys must know at least k girls because k ingress switches are carrying k×n calls and these cannot be carried by fewer than k egress switches. Hence each ingress switch can be paired off with an egress switch that carries the same call, via a one-to-one mapping. These r calls can be carried by one middle-stage switch. If this middle-stage switch is now removed from the Clos network, m is reduced by 1, and we are left with a smaller Clos network. The process then repeats itself until m = 1, and every call is assigned to a middle-stage switch. Q: If you are allowed to move existing calls, only m ≥ n is needed. https://en.wikipedia.org/wiki/Clos_network
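The following sketch simply re-does the worst-case counting argument numerically for a few values of n; it is an illustration of the bounds, not a proof.

```python
# Numerical check of the counting argument for Clos networks (a sketch, not a
# proof). n = terminals per ingress/egress switch, m = number of middle switches.

def strict_sense_nonblocking_m(n):
    # Worst case: n-1 busy calls on the ingress switch and n-1 on the egress
    # switch each occupy a distinct middle switch; one more middle switch is
    # then needed for the new call.
    return (n - 1) + (n - 1) + 1      # = 2n - 1

def rearrangeably_nonblocking_m(n):
    # If existing calls may be re-arranged, m >= n suffices (Hall's theorem).
    return n

for n in (2, 4, 8, 16):
    print(f"n={n}: strict-sense needs m >= {strict_sense_nonblocking_m(n)}, "
          f"rearrangeable needs m >= {rearrangeably_nonblocking_m(n)}")
```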

Folded Clos (Fat-Tree) Topology https://www.nanog.org/sites/default/files/monday.general.hanks.multistage.10.pdf

Generic K-ary Fat Tree K-ary fat tree: a three-layer topology (edge, aggregation, and core) built from k-port switches; k pods, with each pod consisting of 2 layers of k/2 k-port switches; each edge switch connects to k/2 servers and k/2 aggregation switches; each aggregation switch connects to k/2 edge and k/2 core switches; each core switch connects to k pods (one port per pod). Per pod: k/2 edge switches, k/2 aggregation switches, and (k/2)*(k/2) servers; in total, k^3/4 servers and (k/2)*(k/2) core switches.
http://www.fiber-optic-tutorial.com/100g-switch-price-and-configuration.html
http://web.stanford.edu/class/ee384y/Handouts/clos_networks.pdf
https://en.wikipedia.org/wiki/Clos_network
http://www.cs.cornell.edu/courses/cs5413/2014fa/lectures/08-fattree.pdf

Generic K-ary Fat Tree A fat tree is a special type of Clos network. Q: How many servers per pod? Q: How many links between each two layers? Q: How many servers in total? Q: How many servers for k = 48, 96, 144? Q: How many core switches? (The sketch below works out these counts.)
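A small calculator, as a sketch following the counting rules above, that answers these questions, including for k = 48, 96, 144.

```python
def fat_tree_sizes(k):
    """Counts for a k-ary fat tree built from k-port switches."""
    assert k % 2 == 0
    servers_per_pod = (k // 2) ** 2        # k/2 edge switches x k/2 servers each
    total_servers   = servers_per_pod * k  # = k^3 / 4
    core_switches   = (k // 2) ** 2
    links_per_layer = total_servers        # server-edge, edge-aggr, and aggr-core
                                           # layers each have k^3/4 links
    return servers_per_pod, total_servers, core_switches, links_per_layer

for k in (48, 96, 144):
    spp, total, core, links = fat_tree_sizes(k)
    print(f"k={k}: {spp} servers/pod, {total} servers total, "
          f"{core} core switches, {links} links between adjacent layers")
# k=48:  576 servers/pod,  27648 servers,  576 core switches
# k=96:  2304 servers/pod, 221184 servers, 2304 core switches
# k=144: 5184 servers/pod, 746496 servers, 5184 core switches
```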

VL2 Topology Assume each Intermediate (Int) switch has D_I ports and each Aggregation (Aggr) switch has D_A ports. Each Aggr switch uses half of its ports to connect to TOR switches and half to Int switches => D_A/2 Int switches. Each Int switch connects to every Aggr switch => D_I Aggr switches. Each TOR connects to two Aggr switches => (D_I * D_A/2) / 2 = D_I*D_A/4 TOR switches. Each TOR connects 20 servers => 20 * (D_I*D_A/4) servers in total.
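The same arithmetic for the VL2 topology, as a sketch (assuming 20 servers per ToR as on the slide); it reproduces the table on the next slide.

```python
# VL2 sizing: D_I-port intermediate switches, D_A-port aggregation switches,
# 20 servers per ToR (as assumed on the slide).

def vl2_sizes(d_i, d_a, servers_per_tor=20):
    n_int  = d_a // 2            # each Aggr uses half its ports for Int switches
    n_aggr = d_i                 # each Int connects to every Aggr switch
    n_tor  = d_i * d_a // 4      # each ToR uses 2 uplinks to Aggr switches
    return n_int, n_aggr, n_tor, n_tor * servers_per_tor

for d in (48, 96, 144):
    n_int, n_aggr, n_tor, n_srv = vl2_sizes(d, d)
    print(f"D={d}: {n_int} Int, {n_aggr} Aggr, {n_tor} ToR, {n_srv} servers")
# D=48:  24 Int,  48 Aggr,  576 ToR,  11520 servers
# D=96:  48 Int,  96 Aggr, 2304 ToR,  46080 servers
# D=144: 72 Int, 144 Aggr, 5184 ToR, 103680 servers
```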

VL2 Topology: Maximum DC Size
D = D_I = D_A (# of 10G ports) | Max DC size (# of servers)
48  | 11,520
96  | 46,080
144 | 103,680

Summary: Why Fat-Tree? A fat tree has identical bandwidth at any bisection: each layer has the same aggregate bandwidth. It can be built using cheap devices with uniform capacity: each port supports the same speed as an end host, and all devices can transmit at line speed if packets are distributed uniformly along the available paths. Great scalability: a k-port switch supports k^3/4 servers.

Some Other Topologies Fat-tree [SIGCOMM'08], Jellyfish (random) [NSDI'12], BCube [SIGCOMM'09].

Offline Read Current Facebook data center topology: https://code.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/

Single-Chip "Merchant Silicon" Switches (Figure: a switch ASIC, and Facebook's "6-pack" and "Wedge" switches; image courtesy of Facebook.)

Multiple switching layers (Why?) https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/

Long cables (fiber) https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/