Rethinking NetFlow: A Case for a Coordinated “RISC” Architecture for Flow Monitoring
Vyas Sekar
Joint work with Mike Reiter, Hui Zhang, David Andersen, Anupam Gupta, Ramana Kompella, Walter Willinger

Flow Monitoring is critical for effective Network Management
Many management applications: Traffic Engineering, Analyze new user apps, Anomaly Detection, Network Forensics, Worm Detection, Accounting, Botnet analysis, ...
– Evolving and growing over time
– Need high-fidelity measurements

Requirements for monitoring
Flow reports go to the Network Operations Center; report = (flow = same src-dst, ports, proto) + pkt/byte counters
– Respect resource constraints
– High flow coverage
– Provide network-wide goals
– Low data management overhead
– High-fidelity for all applications

Sampling due to resource constraints
Routers cannot record every packet/flow
– Constraints: CPU, Memory, Bandwidth
Resource constraints don’t go away!
– Network demands scale even as routers become more powerful
Some form of sampling is inevitable
– Record/report only a subset of the traffic

Current solution
Uniform packet sampling, e.g., Cisco NetFlow
– Each router independently samples packets
– Aggregates sampled packets into flow reports
Requirements: Respect resource constraints, High flow coverage, Provide network-wide goals, Low data management overhead, High-fidelity for all applications
Shortcomings:
– Biased towards large flows
– Redundant measurements
– Too coarse
– Not very good for security

How do we meet the requirements?
– High-fidelity for all applications
– Respect resource constraints
– High flow coverage
– Provide network-wide goals
– Low data mgmt overhead
Part 1: Coordinated Sampling
Part 2: “RISC” monitoring

High-level idea
– Packet sampling has low flow coverage due to bias toward large flows → sampling algorithm not biased to large flows
– Routers sample independently → wasted measurements, can’t reason about network-wide goals
– Treat routers in the network as a system to be managed in a coordinated fashion!

Part 1 Outline
– Motivation
– Design of cSamp (Coordinated Sampling)
– Evaluation
– Practical deployment

Design
Random flow sampling (single router)
– Sample flows not packets
Hash-based coordination (single path)
– Efficient, non-redundant sampling
– Coordination without explicit communication
Network-wide optimization (whole network)
– Satisfy network-wide constraints and objectives

Design (single router)
Random flow sampling
– Sample flows not packets

Flow sampling
Sample flows, not packets, to increase flow coverage
– Hash the flow id taken from the packet header (source IP address, destination IP address, source port, destination port, protocol) into [0,Max]
– Compute hash, log if in range (example hash range: [3,10])
– Flow memory holds (flow, counter #pkts)
[Figure: packet stream, IP packet header fields, hash range, and flow memory]
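The hash-and-log step on this slide can be sketched in a few lines. This is only an illustration under assumptions: the names `flow_hash`, `FlowSampler`, and `HASH_MAX`, the use of SHA-256 as the hash, and the toy hash space [0, 10] are all hypothetical choices, not details from the talk.

```python
import hashlib

HASH_MAX = 10  # assumed toy hash space [0, HASH_MAX]


def flow_hash(five_tuple):
    """Map a flow 5-tuple deterministically onto an integer in [0, HASH_MAX]."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % (HASH_MAX + 1)


class FlowSampler:
    """Counts every packet of a flow if and only if the flow's hash is in [lo, hi]."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.flow_memory = {}  # flow 5-tuple -> packet counter

    def process(self, five_tuple):
        if self.lo <= flow_hash(five_tuple) <= self.hi:
            self.flow_memory[five_tuple] = self.flow_memory.get(five_tuple, 0) + 1


# 50 packets belonging to 5 distinct flows (10 packets per flow).
packets = [("10.0.0.%d" % (i % 5), "10.0.1.1", 1000 + i % 5, 80, "tcp")
           for i in range(50)]
sampler = FlowSampler(3, 10)  # hash range [3,10] as in the slide's example
for pkt in packets:
    sampler.process(pkt)
```

Because the decision depends only on the flow id, a sampled flow has all of its packets counted, and a small flow is as likely to be sampled as a large one.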

Design (single path)
Random flow sampling (single router)
– Sample flows not packets
Hash-based coordination
– Efficient, non-redundant sampling
– Coordination without explicit communication

Hash-based coordination
– Non-overlapping hash ranges avoid redundant monitoring
– Coordination without communication
Example: routers R1 and R2 see the same packet stream; R1 uses hash range [1,4], R2 uses hash range [7,9]
[Figure: packet stream passing R1 and R2, each with its own hash range and flow memory]
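A minimal sketch of why non-overlapping ranges remove redundancy, assuming both routers apply the same hash function to the same stream. The ranges [1,4] and [7,9] come from the slide; `flow_hash`, the SHA-256 hash, the hash space [1, 9], and reading the stream digits as flow ids are assumptions made for illustration.

```python
import hashlib

HASH_MAX = 9  # assumed toy hash space [1, HASH_MAX]


def flow_hash(flow_id):
    """Same deterministic hash on every router: no messages need to be exchanged."""
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % HASH_MAX + 1


# Hash ranges from the slide: they do not overlap.
ranges = {"R1": (1, 4), "R2": (7, 9)}
flow_memory = {router: {} for router in ranges}

# Both routers on the path see the same packets (each entry is a flow id).
stream = [1, 1, 8, 1, 6, 1, 3, 5]
for flow_id in stream:
    h = flow_hash(flow_id)
    for router, (lo, hi) in ranges.items():
        if lo <= h <= hi:
            counters = flow_memory[router]
            counters[flow_id] = counters.get(flow_id, 0) + 1

logged_r1 = set(flow_memory["R1"])
logged_r2 = set(flow_memory["R2"])
```

A flow's hash can fall in at most one of the two ranges, so `logged_r1` and `logged_r2` never share a flow.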

Design (whole network)
Random flow sampling (single router)
– Sample flows not packets
Hash-based coordination (single path)
– Efficient, non-redundant sampling
– Coordination without explicit communication
Network-wide optimization
– Satisfy network-wide constraints and objectives

Network-wide view
Moving from a single path to a network?
Many paths = Origin-Destination (OD) pairs in a network, e.g., NYC-PIT, PIT-SFO

Network-wide coordination
Assign non-overlapping ranges per OD-pair/path
[Figure: example hash ranges at routers: [1,5], [1,3], [3,7], [1,2], [7,9], [5,8]]

cSamp algorithm on each router
1. Get OD-pair from packet (red vs. green in the figure?)
2. Compute hash (flow = packet 5-tuple)
3. Look up hash range for OD-pair from sampling manifest
4. Log if hash falls in range for this OD-pair
Sampling manifest: OD → Range, e.g., [5,10] and [1,4]
[Figure: router with sampling manifest and flow memory]
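The four per-packet steps can be sketched as below. The class `CSampRouter`, the packet dictionary layout, the SHA-256 hash, and the OD-pair labels are hypothetical; the sketch also assumes the OD-pair is already available on the packet (for example, marked by the ingress router).

```python
import hashlib

HASH_MAX = 10  # assumed toy hash space [1, HASH_MAX]


def flow_hash(five_tuple):
    key = "|".join(map(str, five_tuple)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % HASH_MAX + 1


class CSampRouter:
    def __init__(self, manifest):
        self.manifest = manifest   # sampling manifest: OD-pair -> (lo, hi)
        self.flow_memory = {}      # flow 5-tuple -> packet counter

    def process(self, packet):
        od_pair = packet["od_pair"]                 # 1. get OD-pair from packet
        h = flow_hash(packet["five_tuple"])         # 2. compute hash of the 5-tuple
        hash_range = self.manifest.get(od_pair)     # 3. look up range for OD-pair
        if hash_range is None:
            return False                            # no responsibility for this OD-pair
        lo, hi = hash_range
        if lo <= h <= hi:                           # 4. log if hash falls in range
            key = packet["five_tuple"]
            self.flow_memory[key] = self.flow_memory.get(key, 0) + 1
            return True
        return False


router = CSampRouter({"NYC-PIT": (5, 10), "PIT-SFO": (1, 4)})
pkts = [{"od_pair": "NYC-PIT" if i % 2 == 0 else "PIT-SFO",
         "five_tuple": ("10.0.0.1", "10.0.1.%d" % i, 1234, 80, "tcp")}
        for i in range(20)]
results = [router.process(p) for p in pkts]
```

The router keeps no per-OD-pair state beyond the manifest, so the per-packet work is one hash and one table lookup.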

Overall system architecture
Network Operations Center:
– Generate sampling manifests
– Configuration dissemination to routers
– Flow reports feed the applications
[Figure: network with example hash ranges [1,5], [5,9], [3,7], [1,2], [7,9], [5,8]]

Framework for generating manifests
Inputs:
– OD-pair info: Traffic, Path (routers)
– Router constraints, e.g., SRAM for flow records
Network-wide optimization (linear program):
– Objective: $\max \sum_{i \in \mathrm{ODPairs}} \mathrm{Coverage}_i \times \mathrm{Traffic}_i$
– Subject to achieving the maximum $\min_{i \in \mathrm{ODPairs}} \{\mathrm{Coverage}_i\}$
Output: sampling manifests per router
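One way the objective on this slide could be written out as a linear program is sketched below. The notation is an assumption introduced here, not taken from the slide: $d_{ij}$ is the fraction of OD-pair $i$'s flows logged by router $j$, $P_i$ is the set of routers on the path of OD-pair $i$, $L_j$ is the memory budget of router $j$, and $f^{*}$ is the best achievable minimum fractional coverage (found by a first LP that maximizes $f$ subject to $\mathrm{Coverage}_i \ge f$). Coverage adds up across routers because their hash ranges do not overlap.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i \in \mathrm{ODPairs}} \mathrm{Coverage}_i \times \mathrm{Traffic}_i \\
\text{subject to}\quad
  & \mathrm{Coverage}_i = \sum_{j \in P_i} d_{ij}, \qquad \mathrm{Coverage}_i \le 1
      && \forall i \\
  & \sum_{i \,:\, j \in P_i} \mathrm{Traffic}_i \times d_{ij} \le L_j
      && \forall j \\
  & \mathrm{Coverage}_i \ge f^{*}
      && \forall i \\
  & d_{ij} \ge 0
      && \forall i,\ j \in P_i
\end{align*}
```

Under this reading, each router's manifest would follow from its $d_{ij}$ values by assigning OD-pair $i$ a hash range of that width that does not overlap the ranges of the other routers on $P_i$.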

cSamp vs. other sampling solutions
Metrics reflect initial goals
– Coverage, network-wide goals, redundancy
Flow sampling
– Fixed-rate and Maximal flow sampling
– Use same memory (400K flow records)
Packet sampling
– 1-in-100 and 1-in-50 (edge)
– Allow infinite memory

Total flow coverage
cSamp is 2-3X better than packet sampling and 30% better than maximal flow sampling

Minimum fractional coverage
cSamp is significantly better than other solutions!
Maximal flow sampling is inadequate for network-wide objectives

How do these solutions fare?
Requirements compared across Packet Sampling (NetFlow), Flow Sampling, and cSamp:
– Respect resource constraints
– High flow coverage
– Network-wide goals
– Minimize redundant measurements
[Table: per-solution check/cross marks not recoverable from the transcript]

Practical Issues
1. What about traffic dynamics? History + short-term adaptation
2. Is the optimization scalable? Need two improvements (binary search + max-flow)
3. What about multi-path routing? Simple, lightweight extension
4. How do interior routers identify OD-pairs? Assume ingress routers mark packets

How do interior routers identify OD-pairs? Assume ingress routers mark packets
Why we may want to avoid this:
– Extra overhead on ingress
– OD-pair id might be ambiguous (multi-egress peers)
– Need to modify packet headers or add shim header
– May require overhaul of routing infrastructure

Can we realize the benefits of cSamp without requiring OD-pair identification?
– Use local info at the router to make sampling decisions
– “Stitch” coverage for a path across routers on that path

What local info can I get from packet and routing table? {Previous Hop, My Id, NextHop}
– SamplingSpec: granularity at which sampling decisions are made
– How much traffic to sample for this SamplingSpec?
– SamplingAtom: discrete hash ranges, select some of them to log
[Figure: example paths through routers R0 to R4]

“Stitching” together coverage
Coverage for a path = union of what the routers on that path log
[Figure: coverage stitched across routers R1 to R7]
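As a small illustration of stitching, the sketch below computes a path's coverage as the union of the hash-range pieces logged at each {previous hop, my id, next hop} tuple along it. The helper names, the use of `None` for "outside the network", the split of the hash space into 10 equal pieces, and the example values are all assumptions.

```python
def specs_on_path(path):
    """{previous hop, my id, next hop} tuples seen along a list of routers."""
    padded = [None] + list(path) + [None]   # None stands for outside the network
    return [tuple(padded[k:k + 3]) for k in range(len(path))]


def path_coverage(path, logged_atoms, num_atoms):
    """Fraction of the hash space logged by at least one tuple on the path."""
    covered = set()
    for spec in specs_on_path(path):
        covered |= logged_atoms.get(spec, set())
    return len(covered) / num_atoms


NUM_ATOMS = 10  # hash space split into 10 equal pieces, numbered 0..9
logged_atoms = {
    (None, "R1", "R2"): {0, 1, 2},
    ("R1", "R2", "R4"): {2, 3},
    ("R2", "R4", None): {7, 8, 9},
    ("R3", "R2", "R4"): {4, 5},   # lies on a different path, so it does not count here
}
coverage = path_coverage(["R1", "R2", "R4"], logged_atoms, NUM_ATOMS)
```

Here pieces 0, 1, 2, 3, 7, 8, 9 are covered, so `coverage` is 7 out of 10; piece 2 is logged twice, which wastes memory without adding coverage.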

Problem Formulation
$C_i$: coverage for path $P_i$; $\mathrm{Load}_j$: load on router $R_j$
Maximize:
– Total flow coverage: $\sum_i T_i C_i$
– Minimum fractional coverage: $\min_i \{C_i\}$
Subject to: $\forall j,\ \mathrm{Load}_j \le L_j$

Maximize:
– Total flow coverage: $\sum_i T_i C_i$
– Min. frac. coverage: $\min_i \{C_i\}$
Subject to: $\forall j,\ \mathrm{Load}_j \le L_j$
Sorry.. NP-hard! Can’t even approximate min without resource augmentation
Total flow coverage:
– Submodular maximization with partition-knapsack constraints
– Efficient greedy algorithm with near-optimal performance
Min. fractional flow coverage:
– Intelligent augmentation much better than theoretical guarantee
– Partial/incremental deployment of adding OD-pair identifiers
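To make the greedy idea concrete, here is a plain greedy heuristic for the total-coverage objective: repeatedly log the (tuple, hash-range piece) pair with the largest marginal gain in $\sum_i T_i C_i$ that still fits the router's budget $L_j$. It is a sketch under assumptions, not necessarily the algorithm from the talk; the cost model (logging one of `num_atoms` pieces at a tuple costs that tuple's traffic divided by `num_atoms`) and all names are hypothetical.

```python
def specs_on_path(path):
    padded = [None] + list(path) + [None]
    return [tuple(padded[k:k + 3]) for k in range(len(path))]


def greedy_total_coverage(paths, budgets, num_atoms):
    """paths: name -> (router list, traffic T_i); budgets: router -> L_j."""
    spec_paths = {}                      # tuple -> names of paths traversing it
    for name, (routers, _) in paths.items():
        for spec in specs_on_path(routers):
            spec_paths.setdefault(spec, []).append(name)

    covered = {name: set() for name in paths}   # pieces already logged per path
    load = {router: 0.0 for router in budgets}
    chosen = set()                               # selected (tuple, piece) pairs

    while True:
        best, best_gain, best_cost = None, 0.0, 0.0
        for spec in sorted(spec_paths, key=str):
            cost = sum(paths[n][1] for n in spec_paths[spec]) / num_atoms
            if load[spec[1]] + cost > budgets[spec[1]] + 1e-9:
                continue                         # would exceed this router's budget
            for atom in range(num_atoms):
                if (spec, atom) in chosen:
                    continue
                # Marginal gain: traffic of paths on which this piece is still uncovered.
                gain = sum(paths[n][1] for n in spec_paths[spec]
                           if atom not in covered[n]) / num_atoms
                if gain > best_gain + 1e-12:
                    best, best_gain, best_cost = (spec, atom), gain, cost
        if best is None:
            break                                # nothing feasible adds coverage
        spec, atom = best
        chosen.add(best)
        load[spec[1]] += best_cost
        for n in spec_paths[spec]:
            covered[n].add(atom)

    coverage = {n: len(atoms) / num_atoms for n, atoms in covered.items()}
    total = sum(paths[n][1] * coverage[n] for n in paths)
    return chosen, coverage, total, load


paths = {"P1": (["R1", "R2", "R3"], 100.0), "P2": (["R4", "R2", "R3"], 50.0)}
budgets = {"R1": 40.0, "R2": 30.0, "R3": 60.0, "R4": 10.0}
chosen, coverage, total, load = greedy_total_coverage(paths, budgets, 10)
```

The marginal gain of a pair shrinks as more pieces are covered, which is the diminishing-returns (submodular) structure that makes greedy selection a natural fit.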

Total flow coverage
cSamp-T (“tuple”, “tuple+”) gives near-ideal total coverage
cSamp-T (tuple+) gives near-ideal total flow coverage vs. cSamp

Minimum fractional coverage
With smart resource augmentation, cSamp-T gives good min. frac. coverage

What functionality should we put on routers?
Applications: FSD (flow size distribution), Heavy Hitters, Entropy, Super Spreaders, Change Detection, Outdegree histogram
[Figure: applications drawing on port, address, and src/dst traffic features]

Current Research: Application-Specific!
Separate counters and estimation algorithms per app: FSD, Heavy Hitters, Entropy, Super Spreaders, Change Detection, Outdegree histogram
Why? Application-specific approaches provide higher fidelity
[Figure: traffic feeding separate per-application counters over port, address, and src/dst features]

Alternative: “RISC”
Generic data collection from traffic serves all applications: FSD, Heavy Hitters, Entropy, Super Spreaders, Change Detection, Outdegree histogram
Decouple collection and computation
Why? Late binding to applications, easier to implement, “future-proof”

RISC vs. Application-Specific
Revisit the perception that RISC does not provide good performance
Requirements compared across Application-Specific, RISC 1.0 (NetFlow), and RISC 2.0 (each marked “?” for RISC 2.0):
– Fidelity across applications
– Implementation complexity
– Processing overhead
– Changing applications
– Enable diagnostics
[Table: check/cross marks for the other columns not recoverable from the transcript]

Why might this make sense?
– Each app-specific algorithm requires dedicated counters
– Primary bottleneck for high-speed monitoring = SRAM counters
– Look at aggregate memory usage across applications
– Pool these resources into a few sampling primitives
– Run these with sufficient fidelity!

Challenges
– What RISC primitives should we implement? Combination of flow sampling, sample and hold, cSamp
– Does it perform comparably to application-specific approaches? Yes! RISC with aggregate resources is comparable or even better

What RISC primitives should we implement?
Two broad classes:
– “Structure” → Flow Sampling
– “Volume” → Sample and Hold
Provide flow reports like NetFlow
Coordination: network-wide optimization
Requirements for RISC 2.0 (all still marked “?”): Fidelity across applications, Implementation complexity, Processing overhead, Changing applications, Enable diagnostics

Sample and Hold
Accurate counts of “heavy hitters” with few counters
Algorithm:
– If flow is already logged → update
– Sample packet with probability p
– If new flow → create counter
Flow memory holds (flow, counter #pkts)
[Figure: packet stream and flow memory]
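The three algorithm steps translate almost directly into code. The function name, the stream contents, and the value of p below are illustrative assumptions.

```python
import random


def sample_and_hold(packets, p, rng):
    """packets: iterable of flow ids, one per packet. Returns flow -> packets counted."""
    flow_memory = {}
    for flow in packets:
        if flow in flow_memory:
            flow_memory[flow] += 1     # flow already logged: update its counter
        elif rng.random() < p:         # otherwise sample the packet with probability p
            flow_memory[flow] = 1      # new flow: create a counter
    return flow_memory


rng = random.Random(0)  # seeded so the run is repeatable
stream = ["heavy"] * 1000 + ["mouse%d" % i for i in range(200)]
rng.shuffle(stream)
counters = sample_and_hold(stream, p=0.05, rng=rng)
```

Once a flow is held, every later packet is counted, so a large flow is very likely to get a counter early and its count misses only the packets seen before it was first sampled; a single-packet flow gets a counter only with probability p.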

Putting the pieces together

Application-specific algorithms: FSD, Heavy Hitters, Entropy, Super Spreaders, Change Detection, Outdegree histogram
– Calculate aggregate memory usage
RISC: FlowSamp + Sample & Hold
– Compute “Relative Accuracy Difference”: + → good, - → bad

Sensitivity to Application Portfolio
Bigger app. portfolio or some resource-intensive apps → better gains for RISC approach
Bigger portfolio → more resources
“Relative Accuracy Difference”: + → good, - → bad

Evaluation: Single Router
RISC > Application-specific for most applications
Worse for heavy hitters, but not by much!
“Relative Accuracy Difference”: + → good, - → bad

Network-wide evaluation: Per-Ingress
Results for network-wide per-ingress mirror the single-router evaluation
RISC > Application-specific for most applications
“Relative Accuracy Difference”: + → good, - → bad

RISC vs. Application-Specific
Requirements compared across Application-Specific, RISC 1.0 (NetFlow), and RISC 2.0 (FS, SH, cSamp):
– Fidelity across applications
– Implementation complexity
– Processing overhead
– Changing applications
– Enable diagnostics
Yes, We Can!
[Table: per-approach check/cross marks not recoverable from the transcript]

Meeting requirements for Flow Monitoring
– High-fidelity for all applications
– Respect resource constraints
– High flow coverage
– Provide network-wide goals
– Low data mgmt overhead
Coordinated Sampling with “RISC” primitives

Looking beyond flow monitoring
WAN optimization, Content-Aware Networking
– “Intelligent and targeted caching of traffic at selected nodes of the network can help engineer traffic by reducing load and avoiding hot spots, cut end-to-end latency, and enhance the end-user experience.”
– SmartRE (with Ashok Anand, Aditya Akella)
Intrusion Detection/Prevention for Enterprises (ongoing work)

Thanks!
www.cs.cmu.edu/~vyass
www.cs.cmu.edu/~4D
vyass@cs.cmu.edu
