A Network-State Management Service Peng Sun Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin Princeton & Microsoft.

Slides:



Advertisements
Similar presentations
1 Perspectives from Operating a Large Scale Website Dennis Lee VP Technical Operations, Marchex.
Advertisements

Mobile IP How Mobile IP Works? Agenda What problems does Mobile IP solve? Mobile IP: protocol overview Scope Requirements Design goals.
A Flexible Cloud-Computing Platform Focus on solving business problems
Chapter 3: Planning a Network Upgrade
View the home as a computer Ratul Mahajan Microsoft Research IEEE CCW, Oct 2011 Joint work with Sharad Agarwal, AJ Brush, Colin Dixon, Bongshin Lee, Stefan.
Incremental Update for a Compositional SDN Hypervisor Xin Jin Jennifer Rexford, David Walker.
The Top 10 Reasons Why Federated Can’t Succeed And Why it Will Anyway.
Traffic Engineering with Forward Fault Correction (FFC)
Software-defined networking: Change is hard Ratul Mahajan with Chi-Yao Hong, Rohan Gandhi, Xin Jin, Harry Liu, Vijay Gill, Srikanth Kandula, Mohan Nanduri,
Dynamic Scheduling of Network Updates Xin Jin Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, Roger Wattenhofer.
Logically Centralized Control Class 2. Types of Networks ISP Networks – Entity only owns the switches – Throughput: 100GB-10TB – Heterogeneous devices:
Dynamic Scheduling of Network Updates Based on the slides by Xin Jin Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer.
Copyright 2004 Koren & Krishna ECE655/DataRepl.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
The Case for Enterprise Ready Virtual Private Clouds Timothy Wood, Alexandre Gerber *, K.K. Ramakrishnan *, Jacobus van der Merwe *, and Prashant Shenoy.
Making Cellular Networks Scalable and Flexible Li Erran Li Bell Labs, Alcatel-Lucent Joint work with collaborators at university of Michigan, Princeton,
Network Management Overview IACT 918 July 2004 Gene Awyzio SITACS University of Wollongong.
Network Architecture for Joint Failure Recovery and Traffic Engineering Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.
High Performance All-Optical Networks with Small Buffers Yashar Ganjali High Performance Networking Group Stanford University
A Routing Control Platform for Managing IP Networks Jennifer Rexford Princeton University
Rethinking Internet Traffic Management: From Multiple Decompositions to a Practical Protocol Jiayue He Princeton University Joint work with Martin Suchara,
Handout # 4: Scaling Controllers in SDN - HyperFlow
Definition of terms Definition of terms Explain business conditions driving distributed databases Explain business conditions driving distributed databases.
Óscar González de Dios PCE, the magic component of Segment Routing Telefónica I+D.
Router modeling using Ptolemy Xuanming Dong and Amit Mahajan May 15, 2002 EE290N.
Building a Strong Foundation for a Future Internet Jennifer Rexford ’91 Computer Science Department (and Electrical Engineering and the Center for IT Policy)
Draft-li-rtgwg-cc-igp-arch-00IETF 88 RTGWG1 An Architecture of Central Controlled Interior Gateway Protocol (IGP) draft-li-rtgwg-cc-igp-arch-00 Zhenbin.
Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.
Bandwidth DoS Attacks and Defenses Robert Morris Frans Kaashoek, Hari Balakrishnan, Students MIT LCS.
Remote Monitoring and Desktop Management Week-7. SNMP designed for management of a limited range of devices and a limited range of functions Monitoring.
Hands-On Microsoft Windows Server 2008
Data Center Infrastructure
Use Case for Distributed Data Center in SUPA
Software-Defined Networks Jennifer Rexford Princeton University.
VeriFlow: Verifying Network-Wide Invariants in Real Time
NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang.
Rapid software development 1. Topics covered Agile methods Extreme programming Rapid application development Software prototyping 2.
Central Control over Distributed Routing fibbing.net SIGCOMM Stefano Vissicchio 18th August 2015 UCLouvain Joint work with O. Tilmans (UCLouvain), L. Vanbever.
CP Summer School Modelling for Constraint Programming Barbara Smith 2. Implied Constraints, Optimization, Dominance Rules.
Scalable Multi-Class Traffic Management in Data Center Backbone Networks Amitabha Ghosh (UtopiaCompression) Sangtae Ha (Princeton) Edward Crabbe (Google)
Vytautas Valancius, Nick Feamster, Akihiro Nakao, and Jennifer Rexford.
Robustness of complex networks with the local protection strategy against cascading failures Jianwei Wang Adviser: Frank,Yeong-Sung Lin Present by Wayne.
A survey of SDN: Past, Present and Future of Programmable Networks Speaker :Yu-Fu Huang Advisor :Dr. Kai-Wei Ke Date:2014/Sep./30 1.
Aaron Gember, Theophilus Benson, Aditya Akella University of Wisconsin-Madison.
CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.
An Application of VoIP and MPLS Advisor: Dr. Kevin Ryan
SDN Management Layer DESIGN REQUIREMENTS AND FUTURE DIRECTION NO OF SLIDES : 26 1.
1 | © 2015 Infinera Open SDN in Metro P-OTS Networks Sten Nordell CTO Metro Business Group
CellSDN: Software-Defined Cellular Core networks Xin Jin Princeton University Joint work with Li Erran Li, Laurent Vanbever, and Jennifer Rexford.
Improving Network Management with Software Defined Network Group 5 : z Xuling Wu z Haipeng Jiang z Sichen Wu z Aparna Sanil.
Review of Parnas’ Criteria for Decomposing Systems into Modules Zheng Wang, Yuan Zhang Michigan State University 04/19/2002.
A Better Way Huawei Financial Agile Network Solution Success Cases.
Software Defined Networking BY RAVI NAMBOORI. Overview  Origins of SDN.  What is SDN ?  Original Definition of SDN.  What = Why We need SDN ?  Conclusion.
Maslow’s hierarchy of network programming and the unmet needs Ratul Mahajan Microsoft Research.
Software–Defined Networking Meron Aymiro. What is Software-Defined Networking?  Software-Defined Networking (SDN) has the potential of to transcend the.
SDN challenges Deployment challenges
A Network-State Management Service
University of Maryland College Park
Author: Ragalatha P, Manoj Challa, Sundeep Kumar. K
RSVP: A New Resource ReSerVation Protocol
Towards Reliable Application Deployment in the Cloud
Enterprise vCPE use case requirement
Software Defined Networking (SDN)
Dynamic Scheduling of Network Updates
Chapter 16: Distributed System Structures
The Top 10 Reasons Why Federated Can’t Succeed
Software Defined Networking (SDN)
Enabling Innovation Inside the Network
Hardware-less Testing for RAS Software
Cisco Meraki Digital Solutions for K-12 Education
Distributed Systems and Concurrency: Distributed Systems
Presentation transcript:

A Network-State Management Service Peng Sun Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin Princeton & Microsoft

Complex Infrastructure Variety of vendors/models/time 1 Number of2010 Data CenterA few Network Device 1,000s Network Capacity 10s of Tbps Number of Data CenterA few10s Network Device 1,000s10s of 1,000s Network Capacity 10s of TbpsPbps Microsoft Azure

Management Applications 2 Traffic Engineering Load Balancing Link Corruption Mitigation Device Firmware Upgrade ……

Our Question How to safely run multiple management applications on shared infrastructure 3

It does not work due to 2 problems Naïve Solution Run independently 4 Traffic Engineering Link Corruption Mitigation Firmware Upgrade Network Devices

Agg A ToRs Agg B Core12 Problem #1: Conflict 5 Link-corruption- mitigation adjusts traffic away from Core1 TE tunes traffic among links to Core1, 2

Agg A ToRs Agg B Core12 Problem #2: Safety Violation 6 Link-corruption- mitigation shuts down faulty Agg A Firmware-upgrade schedules Agg B to upgrade

Potential Solution #1 One monolithic application Central control of all actions 7 Traffic Engineering Firmware Upgrade Link Corruption Mitigation

Too Complex to Build Difficult to develop Combine all applications that are already individually complicated High maintenance cost for such huge software in practice 8

Potential Solution #2 Explicit coordination among applications Consensus over network changes 9 Traffic Engineering Firmware Upgrade Link Corruption Mitigation

Still Too Complex Hard to understand each other Diverse network interactions 10 ApplicationRouting Device Config Traffic Engineering Firmware upgrade

Main Enemy: Complexity Application development Application coordination 11 Monolithic Indepen- dent Explicitly coordinate SimpleComplex

What We Advocate Loose coupling of applications Design principle: Simplicity with safety guarantees Forgo joint optimization Worthwhile tradeoff for simplicity Applications could do it out-of-band 12

Overview of Statesman Network operating system for safe multi-application operation Uses network state abstraction Three views of network state Dependency model of states 13

The “State” in Statesman Complexity of dealing with devices Heterogeneity Device-specific commands 14 Network Devices Network State

State Variable Examples State VariableValue Device Power StatusUp, down Device FirmwareVersion number Device SDN Agent BootUp, down Device Routing StateRouting rules Link Admin StatusUp, down Link Control PlaneBGP, OpenFlow, … 15

Simplify Device Interaction Past Now 16 SNMP, OF, vendor API, … ReadWrite Network Devices Network State Application Device Statistics Application Device- specific cmds

Views of Network State 17 Network Devices Observed State Actual state of the whole network Target State Desired state to be updated on the whole network Target State Network State Application

Network Devices Two Views Are Not Enough 18 Observed State Target State One More View Proposed State A group of entity-variable-values desired by an application Proposed State Application

How Merging Works Combine multiple proposed states into a safe target state Conflict resolution Last-writer-wins Priority-based locking Sufficient for current deployment Safety invariant checking Partial rejection & Skip update 19

Choose Safety Invariants Our current choice Connectivity: Every pair of ToRs in one DC is connected Capacity: 99% of ToR pairs have at least 50% capacity 20 Hinder application too frequently TightLoose Cannot protect network operation

Recap of Three-View Model Simplify network management 21 Observed State Target State Proposed State What we see from the network What we want the network to be What can be actually done on the network Statesman Application

Yet Another Problem What’s in Proposed State Small number of state variables that application cares Implicit conflicts arises Caused by state dependency 22

A BC D Implicit Conflict 23 TE writes new value of routing state of B for tunneling traffic Firmware-upgrade writes new value of firmware state of B

Dependency Relations 24 PowerState FirmwareVersion ConfigurationState RoutingState AdminState ConfigurationState PathState Device Link bgpdSDN

Build in Dependency Model Statesman calculates it internally Only exposes the result for each state variable Whether the variable is controllable 25

Statesman System 26 Target State MonitorUpdater Checker Proposed State Observed State Storage Service

Deployment Overview Operational in Microsoft Azure for 10 months Cover 10 DCs of 20K devices 27

Production Applications 3 diverse applications built Device firmware upgrade Link corruption mitigation Traffic engineering Finish within months Only thousands of lines of code 28

Case #1: Resolve Conflict Inter-DC TE & Firmware-upgrade 29 BR 1 BR 2 DC 1 BR 8 BR 7 DC 4 BR 3 BR 4 DC 2 BR 5 DC 3 BR 6 DC = Data Center BR = Border Router

30 Firmware-upgrade acquires lock of BR1 TE fails to acquire lock, and moves traffic away BR1 firmware upgrade starts BR1 firmware upgrade ends. Lock released. TE re-acquires lock, and moves traffic back … … … …

Case #1 Summary Each application: Simple logic Unaware of the other Statesman enables: Conflict resolution Necessary coordination 31

Case #2: Maintain Capacity Invariant Firmware-upgrade & Link-corruption-mitigation 32 … ToR Agg … … … Core … Pod n … Pod n … Pod n 14 Link corrupting packets

33 Upgrade proceeds in normal speed in Pod 3 and 5 Upgrade in Pod 4 is slowed down by checker due to lost capacity … … … … …

Case #2 Summary Statesman: Automatically adjusts application progresses Keeps the network within safety requirements 34

Conclusion Need network operating system for multiple management applications Statesman Loose coupling of applications Network state abstraction Deployed and operational in Azure 35

36 Thanks! Questions ? Check paper for related works