
1 CS434/534: Topics in Network Systems Cloud Data Centers: Topology, Control; VL2 Yang (Richard) Yang Computer Science Department Yale University 208A Watson Acknowledgement: slides contain content from conference presentations by authors of VL2.

2 Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
  - architecture
  - data model and operations: OpenDaylight as an example
  - distributed data store: overview, basic Paxos, multi-Paxos, Raft
  - south-bound: consistent network updates
- Cloud data centers (CDC)

3 Admin PS1 posted on the Schedule page
Please start to talk to me about potential projects

4 Recap: Raft
- Leaderless => leader-based
- Basic leader election mechanisms: term, heartbeat, finite-state machine; a candidate becomes leader when it receives votes from a majority
- Basic commitment of a log entry: the leader commits after receiving confirmation from a majority

5 Recap: Raft Safety
Raft safety property: if a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders; no special steps are needed by a new leader to revise its log.
Solution: use a combination of election rules and commitment rules to achieve safety.
- Extended leader election (condition on leader election): a voter V denies its vote if its own log is more complete than candidate C's, i.e., (lastTermV > lastTermC) || (lastTermV == lastTermC && lastIndexV > lastIndexC)
- Extended commitment (condition on commitment): an entry is committed only if it is stored on a majority of servers, and at least one new entry from the leader's term is also stored on a majority of servers
Together these ensure: committed => present in future leaders' logs.
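To make the extended election rule concrete, here is a minimal Python sketch (not from the slides; the function and variable names are illustrative) of the check a voter applies before granting a vote:

```python
def log_more_complete(last_term_v, last_index_v, last_term_c, last_index_c):
    """True if the voter's log is strictly more complete than the candidate's,
    in which case the voter denies the vote (Raft election restriction)."""
    return (last_term_v > last_term_c) or (
        last_term_v == last_term_c and last_index_v > last_index_c)

def grant_vote(voter_last_term, voter_last_index,
               cand_last_term, cand_last_index,
               already_voted_this_term):
    # A voter grants at most one vote per term, and only if the candidate's
    # log is at least as up-to-date as its own.
    if already_voted_this_term:
        return False
    return not log_more_complete(voter_last_term, voter_last_index,
                                 cand_last_term, cand_last_index)

# Candidate's log ends at (term=3, index=7), voter's at (term=3, index=9):
# the voter's log is more complete, so the vote is denied.
print(grant_vote(3, 9, 3, 7, False))  # False
```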

6 Recap: Big Picture
Key goal: provide applications with high-level views and make the views highly available (e.g., 99.99%) and scalable.
Key component - data store:
- data model
- data operation model
- data store availability
(Figure: programs on top of a logically centralized data store exposing a network view and service/policy layer, above NE datapaths.)

7 Discussion
What should happen in a Net OS when a link weight is changed?
(Figure: example topology with nodes a-f and link weights.)

8 Recap: Big Picture
Dependency table (aka inverted index):
  Component      | Traces
  (hostTable, 1) | [false, 1]
  (hostTable, 2) | [false, 1, 2]
  (hostTable, 3) | [false, 3]
  (hostTable, 4) | [false, 3, 4]
  topology       | [false, 1, 2], [false, 3, 4]
(Figure: a trace tree rooted at "Assert: TcpDst==22", with reads of EthSrc and EthDst leading to (hostTable, i) and topology() lookups along path1 and path2.)
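As a rough illustration of the dependency-table idea (a sketch, not the actual controller code; the component names mirror the figure and the trace ids are hypothetical):

```python
# Minimal sketch of a dependency table (inverted index): map each data-store
# component read by a computation to the traces (program paths) that read it,
# so that when a component changes we know exactly which traces to re-run.
from collections import defaultdict

dependency_table = defaultdict(list)

def record_read(component, trace_id):
    """Record that `trace_id` read `component` while being computed."""
    if trace_id not in dependency_table[component]:
        dependency_table[component].append(trace_id)

def affected_traces(changed_component):
    """Traces that must be recomputed when `changed_component` changes."""
    return dependency_table[changed_component]

# Populate with the reads shown in the figure.
record_read(("hostTable", 1), "path1")
record_read(("hostTable", 2), "path1")
record_read("topology", "path1")
record_read(("hostTable", 3), "path2")
record_read(("hostTable", 4), "path2")
record_read("topology", "path2")

print(affected_traces("topology"))        # ['path1', 'path2']
print(affected_traces(("hostTable", 1)))  # ['path1']
```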

9 Recap: Example Transaction
Link weight change => the path for flow a->d changes from a -> b -> c -> d to a -> e -> f -> d.
A high-level transaction can generate a set of operations (ops) at local devices; the ops should be executed with some order constraint (dependency graph).

10 Recap: Example Transaction: Updating a Set of Flows
Assume each link has capacity 10; "Fi: b" means that flow i needs b amount of bandwidth.
A transaction can be more complex, and hence coordination can be more complex as well. (Example from Dionysus.)

11 Dependency Graph w/ Resources

12 Dependency Graph Scheduling
(Figure legend: operation and resource nodes; edges labeled "before" for ordering constraints, "release N" for freeing N units of capacity, and "demand N" for requiring N units.)
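A minimal sketch of dependency-graph scheduling in the spirit of the Dionysus-style graph above (assumptions: operations demand link capacity on their new path and release capacity on their old path when done; all names and numbers are illustrative):

```python
# Sketch of dependency-graph scheduling with resource (link-capacity) nodes.
# An operation may run once its predecessors are done and every link it
# demands has enough free capacity; finishing an operation releases the
# capacity used by the old path.

capacity = {"l1": 5, "l2": 5}   # free capacity per link (total 10, 5 in use)

ops = {
    "move_F1": {"deps": [],          "demand": {"l2": 5}, "release": {"l1": 5}},
    "move_F2": {"deps": ["move_F1"], "demand": {"l1": 5}, "release": {"l2": 5}},
}
done = set()

def runnable(name):
    op = ops[name]
    deps_ok = all(d in done for d in op["deps"])
    bw_ok = all(capacity[r] >= bw for r, bw in op["demand"].items())
    return name not in done and deps_ok and bw_ok

def run(name):
    op = ops[name]
    for r, bw in op["demand"].items():
        capacity[r] -= bw            # take capacity on the new path
    for r, bw in op["release"].items():
        capacity[r] += bw            # free capacity on the old path
    done.add(name)

# Greedy scheduler: repeatedly run any operation whose constraints are met.
progress = True
while progress:
    progress = False
    for name in ops:
        if runnable(name):
            run(name)
            progress = True

print(done, capacity)
```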

13 Potential Project: Continuous, Consistent Network Updates
Discussion: how to define the problem, what is a good data structure, …

14 Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
- Cloud data center (CDC) networks
  - Background, high-level goal

15 The Importance of Data Centers
Internal users: line-of-business apps. External users: web portals, web services, multimedia applications, cloud services (e.g., Azure, AWS, …).
"A 1-millisecond advantage in trading applications can be worth $100 million a year to a major brokerage firm."

16 Datacenter Traffic Growth
Today: petabits/s of traffic within a single datacenter, which is more than the entire core of the Internet carries.
Source: "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network", SIGCOMM 2015.

17 Data Center Costs
  Amortized Cost* | Component            | Sub-Components
  ~45%            | Servers              | CPU, memory, disk
  ~25%            | Power infrastructure | UPS, cooling, power distribution
  ~15%            | Power draw           | Electrical utility costs
  ~15%            | Network              | Switches, links, transit
*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money.
Source: "The Cost of a Cloud: Research Problems in Data Center Networks", SIGCOMM CCR 2009, Greenberg, Hamilton, Maltz, Patel.
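As a rough back-of-the-envelope illustration of what "amortized cost" means here (a sketch using the standard annuity formula and made-up dollar amounts, not figures from the paper):

```python
# Convert an upfront purchase into an equivalent monthly cost, using the
# slide's assumptions: 3-year amortization for servers, 15-year for
# infrastructure, 5% annual cost of money.
def monthly_cost(price, years, annual_rate=0.05):
    n = years * 12                 # number of monthly payments
    r = annual_rate / 12           # monthly interest rate
    return price * r / (1 - (1 + r) ** -n)

# Hypothetical numbers purely for illustration:
print(round(monthly_cost(2_000, 3), 2))       # a $2,000 server over 3 years
print(round(monthly_cost(1_000_000, 15), 2))  # $1M of infrastructure over 15 years
```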

18 Server Costs
Ugly secret: 30% utilization considered "good" in data centers.
- Uneven application fit: each server has CPU, memory, disk, but most applications exhaust one resource, stranding the others
- Long provisioning timescales: new servers purchased quarterly at best
- Uncertainty in demand: demand for a new service can spike quickly
- Risk management: not having spare servers to meet demand brings failure just when success is at hand
- Session state and storage constraints: if the world were stateless servers, life would be good

19 Goal: Agility – Any Service, Any Server
Turn the servers into a single large fungible pool: dynamically expand and contract each service's footprint as needed.
Benefits:
- Lower server component cost
- Achieve high performance and reliability
- Increase service developer productivity: services can be deployed much faster
Agility: the aim of most infrastructure projects.

20 Achieving Agility
- Workload management: means for rapidly installing a service's code on a server (virtual machines, disk images, containers)
- Storage management: means for a server to access persistent data easily (distributed filesystems, e.g., HDFS, blob stores)
- Network: means for communicating with other servers, regardless of where they are in the data center

21 Datacenter Network Ultimate Goal
Provide the illusion of "One Big Switch" connecting 10,000s of ports of compute and storage (disk, flash, …).

22 Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs the one-big-switch abstraction

23 Conventional DC Architecture
(Figure: the Internet at the top connects to core routers (CR) in DC-Layer 3, which connect to access routers (AR); below them, Ethernet switches (S) in DC-Layer 2 connect to racks of application servers (A); ~1,000 servers/pod == IP subnet.)
Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers.
Reference: "Data Center: Load Balancing Data Center Services", Cisco 2004.

24 Conventional DC: Topology Problem
(Figure: the same tree topology, with oversubscription ratios of roughly 5:1 near the edge switches, 40:1 higher up, and 200:1 at the core.)
Problems: dependence on high-cost proprietary routers; poor reliability/utilization; heterogeneous server-to-server capacity. The network is fundamentally a tree: the higher up in the tree, the more potential competition for resources, limiting "any server for any service".
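To make the oversubscription ratios concrete, here is a small sketch (assuming, purely for illustration, 1 Gbps server NICs) of the bandwidth left per server when all servers communicate across a given tree level:

```python
# Effective per-server bandwidth under the oversubscription ratios shown in
# the figure, for a hypothetical 1 Gbps server NIC.
nic_gbps = 1.0
for level, ratio in [("within rack", 1), ("edge (~5:1)", 5),
                     ("aggregation (~40:1)", 40), ("core (~200:1)", 200)]:
    print(f"{level}: {nic_gbps / ratio * 1000:.0f} Mbps per server")
# within rack: 1000 Mbps; edge: 200 Mbps; aggregation: 25 Mbps; core: 5 Mbps
```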

25 Conventional DC: Topology Problem
(Figure: the same tree topology.)
Problems: dependence on high-cost proprietary routers; poor reliability/utilization. Because the network is fundamentally a tree, link failures near the top of the tree can cause a large fraction of capacity to be lost, hurting reliability.

26 Conventional DC: Control Problem
(Figure: the tree topology partitioned into IP subnet (VLAN) #1 and IP subnet (VLAN) #2 under different access routers.)
Partitioning by IP subnet limits agility: for a VM to move to a different subnet (e.g., to use idle resources there), the VM's IP address must change.

27 Discussion: L2 vs L3
(Figure: the conventional DC architecture from slide 23; ~1,000 servers/pod == IP subnet. Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers. Reference: "Data Center: Load Balancing Data Center Services", Cisco 2004.)
- Ethernet switching (layer 2): pros, cons?
- IP routing (layer 3): pros, cons?

28 Layer 2 vs. Layer 3
Ethernet switching (layer 2):
- Fixed IP addresses and auto-configuration (plug & play)
- Seamless mobility, migration, and failover
- Broadcast limits scale (ARP)
- Spanning Tree Protocol
IP routing (layer 3):
- Scalability through hierarchical addressing
- Multipath routing through equal-cost multipath
- Can't migrate without changing IP address
- More complex configuration

29 Layer 2 vs. Layer 3 for Data Centers

30 Outline
- Admin and recap
- Controller software framework (network OS) supporting programmable networks
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs the one-big-switch abstraction
  - VL2 design and implementation

31 Measurements Informing VL2 Design
Data-center traffic analysis:
- The ratio of traffic between servers to traffic entering/leaving the data center is 4:1
- Demand for bandwidth between servers is growing faster; the network is the bottleneck of computation
- Traffic patterns are highly volatile: a large number of distinct patterns even within a day
- Traffic patterns are unstable and cannot be predicted easily
Failure characteristics:
- Pattern of networking equipment failures: 95% < 1 min, 98% < 1 hr, 99.6% < 1 day, 0.09% > 10 days
Flow distribution analysis:
- The majority of flows are small; the biggest flow size is about 100 MB
- The distribution of internal flows is simpler and more uniform
- More than 50% of the time a machine has about 10 concurrent flows; at least 5% of the time it has more than 80 concurrent flows

32 Discussion How may you handle dynamic traffic patterns?

33 VL2 Goals: The Illusion of a Huge L2 Switch
1. L2 semantics
2. Uniform high capacity
3. Performance isolation
(Figure: many servers (A) attached to one big virtual switch.)

34 Discussion What may performance isolation mean?

35 Objectives in Detail
Layer-2 semantics:
- Easily assign any server to any service; assigning servers to services should be independent of network topology
- Configure a server with whatever IP address the service expects; a VM keeps the same IP address even after migration
Uniform high capacity:
- The maximum rate of a server-to-server traffic flow should be limited only by the capacity of the network cards
Performance isolation:
- Traffic of one service should not be affected by traffic of other services (needs the above bound)

36 VL2 Topology: Basic Idea
(Figure: from a single-root tree to a multi-root tree.)

37 Foundation of Data Center Networks: Clos Networks
Three stages: ingress, middle, egress. The bigger m (the number of middle-stage switches), the more flexible the switching.
Q: How big must m be so that each new call can be established without moving current calls?
Strict-sense nonblocking Clos networks (m ≥ 2n−1), the original 1953 Clos result: if m ≥ 2n−1, the Clos network is strict-sense nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch without re-arranging existing calls. Proof sketch: assume there is a free terminal on the input of an ingress switch that must be connected to a free terminal on a particular egress switch. In the worst case, n−1 other calls are active on the ingress switch in question, and n−1 other calls are active on the egress switch in question. Assume, also in the worst case, that each of these calls passes through a different middle-stage switch. Hence in the worst case, 2n−2 of the middle-stage switches are unable to carry the new call. Therefore, to ensure strict-sense nonblocking operation, another middle-stage switch is required, making a total of 2n−1.
If m ≥ n, the Clos network is rearrangeably nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, but existing calls may have to be rearranged by assigning them to different middle-stage switches. Proof sketch: it suffices to consider m = n with the Clos network fully utilized, i.e., r×n calls in progress. The proof shows how any permutation of these r×n input terminals onto r×n output terminals may be broken down into smaller permutations that may each be implemented by the individual crossbar switches in a Clos network with m = n. It uses Hall's marriage theorem, often explained as follows: suppose there are r boys and r girls; the theorem states that if every subset of k boys (for each k with 0 ≤ k ≤ r) between them know k or more girls, then each boy can be paired off with a girl that he knows. It is obvious that this is a necessary condition for pairing to take place; what is surprising is that it is sufficient. In the context of a Clos network, each boy represents an ingress switch and each girl an egress switch; a boy is said to know a girl if the corresponding ingress and egress switches carry the same call. Each set of k boys must know at least k girls, because k ingress switches carry k×n calls and these cannot be carried by fewer than k egress switches. Hence each ingress switch can be paired off with an egress switch that carries the same call, via a one-to-one mapping, and these r calls can be carried by one middle-stage switch. If this middle-stage switch is removed from the Clos network, m is reduced by 1, leaving a smaller Clos network; the process then repeats until m = 1 and every call is assigned to a middle-stage switch.
Q: If you are allowed to move existing calls, only m ≥ n is needed.
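A tiny sketch that just restates the two thresholds above, using the standard Clos(m, n, r) notation (the function name is illustrative):

```python
# Clos(m, n, r): r ingress switches with n inputs each, m middle-stage
# switches, r egress switches with n outputs each.
def clos_blocking_class(m, n):
    if m >= 2 * n - 1:
        return "strict-sense nonblocking (no rearrangement needed)"
    if m >= n:
        return "rearrangeably nonblocking (may need to move existing calls)"
    return "blocking"

print(clos_blocking_class(m=5, n=3))  # 5 >= 2*3-1 -> strict-sense nonblocking
print(clos_blocking_class(m=3, n=3))  # rearrangeably nonblocking
print(clos_blocking_class(m=2, n=3))  # blocking
```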

38 Folded Clos (Fat-Tree) Topology

39 Generic K-ary Fat Tree
K-ary fat tree: a three-layer topology (edge, aggregation, and core) built from k-port switches:
- k pods, each consisting of 2 layers (edge and aggregation) of k/2 k-port switches
- each edge switch connects to k/2 servers and k/2 aggregation switches
- each aggregation switch connects to k/2 edge and k/2 core switches
- each core switch connects to k pods
Counts: k/2 edge switches, k/2 aggregation switches, and (k/2)*(k/2) servers per pod; (k/2)*(k/2)*k = k^3/4 servers in total; (k/2)*(k/2) core switches in total.

40 Generic K-ary Fat Tree
A fat tree is a special type of Clos network.
Q: How many servers per pod?
Q: How many links between each two layers?
Q: How many servers in total?
Q: How many servers for k = 48, 96, 144?
Q: How many core switches? (see the sketch below)
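A small Python sketch answering the counting questions for a k-ary fat tree (a worked example following the counts on the previous slide; the function name is illustrative):

```python
# Counts for a k-ary fat tree built from k-port switches, with k pods.
def fat_tree_counts(k):
    half = k // 2
    return {
        "servers_per_pod": half * half,
        "total_servers": half * half * k,                  # k^3 / 4
        "edge_switches": k * half,
        "aggr_switches": k * half,
        "core_switches": half * half,                      # (k/2)^2
        "links_between_adjacent_layers": half * half * k,  # k^3 / 4
    }

for k in (48, 96, 144):
    c = fat_tree_counts(k)
    print(k, c["servers_per_pod"], c["total_servers"], c["core_switches"])
# k=48:  576 servers/pod,  27,648 servers,   576 core switches
# k=96:  2,304 servers/pod, 221,184 servers, 2,304 core switches
# k=144: 5,184 servers/pod, 746,496 servers, 5,184 core switches
```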

41 VL2 Topology
Assume each intermediate (Int) switch has DI ports and each aggregation (Aggr) switch has DA ports.
- Each Aggr switch uses half of its ports to connect to TOR switches and half to Intermediate switches => DA/2 Int switches
- Each Int switch connects to every Aggr switch => DI Aggr switches
- Each TOR connects to two Aggr switches => DI * DA / 2 / 2 = DI*DA/4 TOR switches
- Each TOR connects 20 servers => 20 * (DI*DA/4) servers in total

42 VL2 Topology
  D = DI = DA (# of 10G ports) | Max DC size (# of servers)
  48                           | 11,520
  96                           | 46,080
  144                          | 103,680
(Figure: Int, Aggr, and TOR layers, with 20 servers per TOR.)
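A sketch reproducing the table above from the formula on the previous slide (20 servers per TOR, DI = DA = D; the function name is illustrative):

```python
# Max servers in a VL2 network with D-port Int/Aggr switches and 20 servers
# per TOR: 20 * (D_I * D_A / 4).
def vl2_max_servers(d_i, d_a, servers_per_tor=20):
    return servers_per_tor * (d_i * d_a // 4)

for d in (48, 96, 144):
    print(d, vl2_max_servers(d, d))
# 48 -> 11,520; 96 -> 46,080; 144 -> 103,680  (matches the table)
```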

43 Summary: Why Fat-Tree?
- A fat tree has identical bandwidth at every bisection: each layer has the same aggregate bandwidth
- It can be built from cheap devices with uniform capacity: each port supports the same speed as an end host, and all devices can transmit at line speed if packets are distributed uniformly along the available paths
- Great scalability: k-port switches support k^3/4 servers

44 Some Other Topologies
- Fat-tree [SIGCOMM'08]
- Jellyfish (random) [NSDI'12]
- BCube [SIGCOMM'09]

45 Offline Read Current facebook data center topology:

46 Single-Chip “Merchant Silicon” Switches
(Images: a switch ASIC, and Facebook's "6-pack" and "Wedge" switches. Image courtesy of Facebook.)

47 Multiple switching layers
(Why?)

48 Long cables (fiber)


Download ppt "CS434/534: Topics in Network Systems Cloud Data Centers: Topology, Control; VL2 Yang (Richard) Yang Computer Science Department Yale University 208A."

Similar presentations


Ads by Google