Network Automation
Albert Greenberg, Nick Feamster, Richard Mortier, Mark Poepping, Lun Li, Sharad Agarwal, Changhoon Kim, Ramveer Chandra, et al.
2 What is network automation?
The performance of the following network tasks with minimal human involvement:
–Provisioning
–Detection
–Diagnosis
–Remediation
Corollary: Humans become involved with network operation at higher levels (i.e., not repeatedly doing the same painful tasks)
3 Some Questions
Why automate?
What to automate? (desired end states)
How do we get there?
    Robotize current methodology, or rethink?
    Self-correction (like biological systems, e.g., DNA)
What are the roadblocks?
    Are our network element building blocks and their behavior fit for automation?
    Big guard rails?
4 Why Automate?
Human cost
–Are we talking about making operators redundant?
–No…it’s more about automating folklore?
–Care costs >> Ops costs, so self-help >> self-managing?
Reliability!!!
–Continuous high-quality service: very high availability
–Faster detection, remediation, etc.
Scale!!!
–How else to keep up with feature creep?
–“Every case is a special case” (we don’t really believe this)
5 What to Automate?
Proactive Piece
–Is-ness spec driving automation?
Reactive Piece
–Detection (See)
    Is it possible to monitor and detect network problems?
    What datasets are needed?
    How to correlate those datasets? (metadata)
    The role of detection vs. statistical analysis
–Diagnosis (Know)
    Again, what data needs to be collected to make this possible?
    Statistics-based vs. model-based?
–Remediation (Restore)
    Do we want automated scripts?
    How far along this spectrum to go? (Many answers.)
6 Vision
Network operators plug in boxes, and walk away…sort of
–A small set of policies triggers programs which write programs which write programs which … realize the network
–A small set of probes provides all measurements and event collection/correlation needed to support internal metrics and external SLAs
    Knowledge database
–Operators become specialists: forensics, software development, etc. (operation at a higher level, less fire-fighting)
Caveat: there will always be a need for amazing people, but doing more introspective work (design, test, certification... and … automation override when needed)
7 Roadblocks
Cost
Complexity
Data
Knowledge
Human factors
8 Obstacle 1: Cost
Automation costs money and time
–Worth detecting if there’s nothing to do about it?
–Worth automating if the operation only happens once?
Alternate solution 1: Monkeys
–At what point is it time to automate the corner case?
Alternate solution 2: Overprovision
–Perhaps we can ride out the storm… (or expect failures and design low-cost systems so that they don’t really matter)
–Server community has seen that repeatable simple components + software can provide a whole that is both very low cost and resilient (e.g., Google switching and computing platform)
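The "worth automating if the operation only happens once?" question can be framed as a break-even calculation. The sketch below is only an illustration under simple assumptions: all costs are expressed in operator hours, and the function names and parameters (`build_hours`, `manual_hours_per_run`, `automated_hours_per_run`) are hypothetical, not taken from the slides. It ignores reliability gains and maintenance of the automation itself.

```python
def break_even_runs(build_hours: float,
                    manual_hours_per_run: float,
                    automated_hours_per_run: float = 0.0) -> float:
    """Number of runs at which automation has paid back its build cost.

    Assumption: every run of the task costs the same, and the only
    benefit of automation is operator time saved per run.
    """
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        # Automation saves nothing per run, so it never pays off.
        return float("inf")
    return build_hours / saved_per_run


def worth_automating(expected_runs: float,
                     build_hours: float,
                     manual_hours_per_run: float,
                     automated_hours_per_run: float = 0.0) -> bool:
    """True if the task is expected to run more often than the break-even point."""
    return expected_runs > break_even_runs(
        build_hours, manual_hours_per_run, automated_hours_per_run
    )


if __name__ == "__main__":
    # A one-off operation: 40 hours to automate, 2 hours by hand.
    print(worth_automating(expected_runs=1, build_hours=40, manual_hours_per_run=2))
    # The same task as a recurring chore, run 100 times.
    print(worth_automating(expected_runs=100, build_hours=40, manual_hours_per_run=2))
```

Under this model the one-off operation is left to a human and the recurring one is automated; the corner-case question on the slide is then about how `expected_runs` changes over time.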
9 Obstacle 2: Complexity
How to manage it?
–Dummy boxes and lots of wires/stitching
–Monolithic box with complexity in configuration
Fewer types of boxes, templates, ways to do essentially the same thing?
–Coke’s network vs Pepsi’s network?
10 Obstacle 3: Data
Lots of inputs
–Topology
–Configuration
–Fault events (measured and logged)
–Performance events (e.g., active measurements)
–Version numbers
–Fiber mappings
Metadata
    Crucial! Version numbers, gaps in data collection, collection method, staleness…
    If this data becomes inconsistent, big surprises!
Challenges
–Correlation: what to do when data isn’t correlated?
–Privacy and sharing issues
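Two of the metadata items above, staleness and gaps in data collection, can be checked mechanically if each data feed records when its samples were taken. The following is a minimal sketch, assuming a hypothetical `FeedMetadata` record and hypothetical slack factors (2.0 and 1.5); none of these names or thresholds come from the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FeedMetadata:
    """Hypothetical metadata kept alongside one measurement feed."""
    name: str                      # e.g. "link-utilization"
    method: str                    # collection method, e.g. "active-probe"
    expected_interval: timedelta   # how often a sample should arrive
    sample_times: list = field(default_factory=list)  # datetimes of samples


def is_stale(feed: FeedMetadata, now: datetime, slack: float = 2.0) -> bool:
    """A feed is stale if its newest sample is older than slack * interval."""
    if not feed.sample_times:
        return True  # never collected counts as stale
    age = now - max(feed.sample_times)
    return age > feed.expected_interval * slack


def find_gaps(feed: FeedMetadata, slack: float = 1.5) -> list:
    """Return (start, end) pairs where consecutive samples are too far apart."""
    times = sorted(feed.sample_times)
    limit = feed.expected_interval * slack
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > limit]


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    feed = FeedMetadata(
        name="link-utilization",
        method="active-probe",
        expected_interval=timedelta(minutes=5),
        sample_times=[t0 + timedelta(minutes=m) for m in (0, 5, 10, 30, 35)],
    )
    print(find_gaps(feed))
    print(is_stale(feed, now=t0 + timedelta(minutes=60)))
```

Flagging stale or gappy feeds before correlating them is one way to avoid the "big surprises" the slide warns about; what to do with a flagged feed is left open here.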
11 Obstacle 4: Human Nature/Corner Cases
Operators are used to touching routers
Automation effectively adds a “shim”
Humans will likely want a way to bypass the configuration database
–How to maintain consistency between human tweakage and the database?
–How to evolve the automation database? (when does a corner case become “normal”?)
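One way to approach the consistency question is to compare what the configuration database intends with what is actually running on a device, and report the differences. The sketch below assumes, purely for illustration, that both views can be flattened into key/value dictionaries; the function name `config_drift` and the example keys are hypothetical. It only detects drift and does not decide whether the human tweak or the database should win.

```python
def config_drift(intended: dict, running: dict) -> dict:
    """Map each differing key to an (intended, running) pair.

    A value of None on one side means the key is absent there
    (assumption: None is never used as a real configuration value).
    """
    drift = {}
    for key in sorted(intended.keys() | running.keys()):
        want = intended.get(key)
        have = running.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical database view of one router.
    intended = {"mtu": "1500", "ospf_cost": "10"}
    # Hypothetical running state after an operator's manual changes.
    running = {"mtu": "9000", "ospf_cost": "10", "description": "temp fix"}
    for key, (want, have) in config_drift(intended, running).items():
        print(f"{key}: database={want!r} device={have!r}")
```

A tweak that keeps reappearing in such a report is a candidate for being folded into the database, which is one reading of a corner case becoming “normal”.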