SCREAM: Sketch Resource Allocation for Software-defined Measurement (CoNEXT’15)
Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat
Notes: I will show you how to fit many measurement tasks with high accuracy onto switches with limited resources. Masoud is on the job market.
Measurement is Crucial for Network Management
Network management across multiple tenants: traffic engineering, anomaly detection, accounting, DDoS detection
Need fine-grained visibility into how networks work
Measurement tasks: heavy hitter detection (HH), hierarchical heavy hitter detection (HHH), change detection, super source detection (SSD)
Notes: Multiple tenants have many management goals, and these goals are achieved with measurement tasks. Anomaly detection finds performance bugs. Traffic engineering algorithms need to know the big flows. For DDoS detection we need to know whether a set of IPs sends too much traffic. One anomaly is a host that communicates with too many destinations; it can be detected with a super source detection task.
Software Defined Measurement
Controller: DREAM [SIGCOMM’14] / SCREAM [CoNEXT’15], running Task 1 and Task 2
Configure: the controller configures the switches
Collect: the controller collects counters from the switches
Switch A: Task 1 counters, Task 2 counters
Switch B: Task 1 counters, Task 2 counters
Notes: Emphasize the workflow: tasks configure switches, switches measure and send counters to the controller, and tasks report to the user.
Our Focus: Sketch-based Measurement
Sketches: summaries of streaming data that approximately answer specific queries (memory efficient, streaming, approximate, query-specific)
Ex: Bitmap for counting unique items
OpenFlow counters (DREAM [SIGCOMM’14]) vs. sketches (SCREAM [CoNEXT’15]):
Memory: expensive, power-hungry TCAM vs. cheaper SRAM
Counters: volume counters vs. volume and connection counters
Flows: selected prefixes vs. all traffic all-the-time
Sketches use a cheaper memory and are more expressive
Notes: Don’t present sketches as our contribution. Because they cover all traffic all the time, sketches are more accurate for variable traffic and support more tasks, such as flow size distribution.
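The bitmap example on this slide can be illustrated with a minimal Python sketch. It assumes the linear-counting estimator (hash each item to one bit, then estimate the number of unique items from the fraction of bits still zero); the slides do not say which bitmap variant is meant. The class name `BitmapCounter`, the SHA-256 hash, and the 1024-bit size are illustrative choices, not part of SCREAM.

```python
import hashlib
import math


class BitmapCounter:
    """Approximate count of unique items using one bit per hash bucket."""

    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        self.bits = [False] * num_bits

    def add(self, item):
        # Hash the item to a single bit position; repeated items hit the same bit.
        digest = hashlib.sha256(str(item).encode()).digest()
        self.bits[int.from_bytes(digest[:8], "big") % self.num_bits] = True

    def estimate(self):
        # Linear counting: n is approximately -m * ln(fraction of zero bits).
        zeros = self.bits.count(False)
        if zeros == 0:
            raise ValueError("bitmap is saturated; use more bits")
        return -self.num_bits * math.log(zeros / self.num_bits)


counter = BitmapCounter()
for i in range(100):
    counter.add(f"10.0.0.{i}")
    counter.add(f"10.0.0.{i}")  # a duplicate does not change the bitmap
print(round(counter.estimate()))
```

The memory cost is fixed by `num_bits` regardless of how many packets arrive, which is the property that makes such summaries attractive on switches.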
Sketch Example: Count-Min Sketch
A Count-Min sketch has d rows of w counters, with one hash function per row (h1, h2, h3 here).
At packet arrival (IP, 1 Kbytes): add 1 to the hashed counter in each row: h1(IP): 2+1=3, h2(IP): 4+1=5, h3(IP): 1+1=2
At query: What is the traffic size of IP? = row with min collision = Min(3,5,2) = 2
Resource accuracy trade-off: Provable error bound given traffic properties
Notes: Explain what Count-Min means: at query time we take the minimum, that is, the row that had the fewest hash collisions. Highlight the traffic dependence early: given a sketch of a specific size, its error depends on traffic properties such as total traffic size.
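The update and query steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SCREAM's implementation: the class name `CountMinSketch`, the salted SHA-256 row hashes, and the sizes are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """d rows of w counters; each row uses its own hash function."""

    def __init__(self, depth, width):
        self.depth = depth  # d: number of rows
        self.width = width  # w: counters per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Illustrative per-row hash: salt the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, key, size):
        # Packet arrival: add the packet size to one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += size

    def query(self, key):
        # Collisions only add to a counter, so the minimum across rows
        # is the row with the least collision and never under-estimates.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))


sketch = CountMinSketch(depth=3, width=64)
sketch.update("10.0.0.1", 1)  # (IP, 1 Kbytes)
sketch.update("10.0.0.2", 4)
print(sketch.query("10.0.0.1"))
```

Shrinking `width` saves memory but raises the chance of collisions, which is the resource accuracy trade-off the slide refers to.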
Challenges: Limited Counters for Many Tasks
Limited shared resources: SRAM capacity (e.g., 128 MB), shared with other functions (e.g., routing)
Guaranteeing accuracy takes a lot of memory: 1 MB-32 MB per task, so fewer than 4-128 tasks fit in SRAM
Many task instances: 3 types (heavy hitter, hierarchical heavy hitter, super source), different flow aggregates (Rack, App, Src/Dst/Port), 1000s of tenants
Goal: Many Accurate Sketch-based Measurements
Users dynamically instantiate a variety of measurement tasks
SCREAM supports the largest number of measurement tasks while maintaining measurement accuracy
Notes: At a high level, our contribution is to enable flexible measurement in networks where users can dynamically instantiate a number of complex measurement tasks. Our system, SCREAM, accommodates the largest number of tasks while maintaining accuracy by leveraging the trade-off between aggregate switch resource consumption and measurement accuracy.
Approach: Dynamic Resource Allocation
Resource accuracy trade-off depends on traffic
Count-Min: provable error bound given traffic properties. Ex: skew of traffic from each IP
Provisioning for the worst case uses more than 10x the counters of the average case
[Figure: required memory vs. skew]
Dynamic allocation for current traffic
Opportunity: Temporal Multiplexing
Memory requirement varies over time
[Figure: required memory of Task 1 and Task 2 over time]
Multiplex memory among tasks over time
Notes: This gives us the opportunity of temporal multiplexing to support more tasks.
Opportunity: Spatial Multiplexing
Memory requirement varies across switches
[Figure: required memory of Task 1 and Task 2 on Switch A and Switch B]
Multiplex memory among tasks across switches
Notes: It also gives us the opportunity of spatial multiplexing to support more tasks.
Key Insight
Leverage spatial and temporal multiplexing and dynamically allocate switch memory per task to achieve sufficient accuracy for many tasks
DREAM has the same insight; SCREAM applies it to sketches
SCREAM Contributions
1- Allocate memory among sketch-based task instances across switches while maintaining sufficient accuracy
2- Supports 3 sketch-based task types
[Figure: the SCREAM dynamic resource allocator allocates memory to heavy hitter (HH), hierarchical heavy hitter (HHH) and super source (SSD) tasks, which serve anomaly detection, traffic engineering and DDoS detection]
SCREAM Iterative Workflow
Collect & report: produces counters & output
Estimate accuracy: produces accuracy
Allocate resources: produces memory size
SCREAM Iterative Workflow
Collect & report: Merge counters from switches
Estimate accuracy: Task1 accuracy <80%
Allocate resources: Give more memory to task1
SCREAM Iterative Workflow
Collect & report: Merge counters from switches
Estimate accuracy: Skew of traffic for task2 changes, task2 accuracy <80%
Allocate resources: Give more memory to task2
SCREAM Challenges
Collect & report: Network-wide task implementation using sketches
Estimate accuracy: Accuracy estimation without the ground-truth
Allocate resources: Fast & stable allocation in DREAM [SIGCOMM’14]
Challenge: Merge Sketches of Different Sizes
Network-wide task: heavy hitter (HH), source IPs sending > 10 Mbps
1- Tasks can have traffic from multiple switches
2- The sketches on these switches may have different sizes
3- However, previous work can only merge sketches of the same size
[Figure: a flow sends 10 Mbps through Switch A and 15 Mbps through Switch B, 25 Mbps in total; both sketches have d rows but different widths w1 and w2]
Notes: Let me give an example. Consider a heavy hitter detection task that finds source IPs that send > 10 Mbps. It has traffic from two switches, A and B, and it wants to find the size of a flow that sends 10 Mbps on one and 15 Mbps on the other. It uses a Count-Min sketch. When we change the memory of its sketch on a switch, we change the number of counters per row (w). Previous work just adds the two arrays counter by counter, which is impossible if the arrays have different sizes.
SCREAM Solution to Merge Sketches for HH Detection
[Figure: the flow sends 10 through Switch A and 15 through Switch B, 25 in total. Its per-row counters are 10, 30, 70 on Switch A and 40, 50, 20 on Switch B]
Previous work: Min of sums = Min(10+40, 30+50, 70+20) = Min(50, 80, 90) = 50
SCREAM: Sum of mins = Min(10, 30, 70) + Min(40, 50, 20) = 10 + 20 = 30
Min of sums ≥ sum of mins. Both over-approximate, so smaller is more accurate
Notes: Previous work can only merge sketches of the same size. A natural extension for sketches of different sizes is to find the corresponding counter for each prefix in each row and sum the counters of the same row across sketches. We call this approach min of sums. Here, the counters of each row are shown in a different color. Taking the minimum of the sums results in 50. Another approach is to get the approximation from each sketch and add them together. We call this sum of mins, and it results in 30 in this example. Sum of mins is never larger than min of sums. Because Count-Min always over-approximates, smaller is more accurate, so SCREAM uses sum of mins.
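The two merge strategies can be compared with a short Python sketch. The function names are hypothetical, and the inputs are the counters that one flow maps to in each row of each switch's sketch, read from the slide as 10, 30, 70 on Switch A and 40, 50, 20 on Switch B (an assumption about the figure layout that matches the results of 50 and 30 quoted in the notes).

```python
def min_of_sums(per_switch_counters):
    """Sum the counters of the same row across switches, then take the minimum."""
    return min(sum(row) for row in zip(*per_switch_counters))


def sum_of_mins(per_switch_counters):
    """Take each switch's own Count-Min estimate, then add the estimates."""
    return sum(min(counters) for counters in per_switch_counters)


# Counters hit by the flow in rows 1..3 of each switch's sketch.
switch_a = [10, 30, 70]
switch_b = [40, 50, 20]

print(min_of_sums([switch_a, switch_b]))  # min(50, 80, 90) = 50
print(sum_of_mins([switch_a, switch_b]))  # 10 + 20 = 30
```

Because `sum_of_mins` only needs the per-switch counters for the key, it does not care whether the sketches have the same width, and each switch is free to pick its own best row.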
SCREAM Solutions
Collect & report: Network-wide task implementation using sketches
Merge sketches of different sizes for HH, HHH, SSD
SSD algorithm with higher and more stable accuracy
Estimate accuracy: Accuracy estimation without the ground-truth
Allocate resources: Fast & stable allocation in DREAM [SIGCOMM’14]
Precision Estimation for Heavy Hitter Detection
Precision = (true detected HHs) / (detected HHs) = Sum(P[Detected HH is true]) / (detected HHs)
Insight: Relate probability to Error on counters of detected HHs
[Figure: estimated and real sizes of a true HH and a false HH relative to the threshold, comparing Error with Estimate-Threshold]
P[Detected HH is true] = 1 - P[Error ≥ Estimate-Threshold]
Notes: Thus, to estimate the probability that a detected HH is a true HH, we need the probability that the error is larger than the difference between the estimated value and the threshold. Next I describe how to find this probability.
Precision Estimation Step 1: Find a Bound on The Error
Insight: Relate probability to Error on counters of detected HHs
Idea: Use average Error in Markov’s inequality to bound it
P[Detected HH is true] = 1 - P[Error ≥ Estimate-Threshold]
Notes: This is the strawman solution. We know the average error on each counter of Count-Min. Thus, using Markov’s inequality, we bound the probability that the error exceeds the estimate minus the threshold.
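Step 1 can be written out as a small Python sketch. Markov's inequality gives P[Error ≥ a] ≤ E[Error] / a for a > 0, so with a = Estimate - Threshold we get a lower bound on the probability that a detected HH is true. The function names, the way the average error is passed in, and the convention for an empty detection list are assumptions for illustration; how SCREAM derives the average error is not shown on the slide.

```python
def prob_true_hh(estimate, threshold, avg_error):
    """Lower bound on P[detected HH is true] from Markov's inequality."""
    gap = estimate - threshold
    if gap <= 0:
        # Markov's inequality says nothing useful here; fall back to 0.
        return 0.0
    # P[Error >= gap] <= avg_error / gap, clamped so the result stays in [0, 1].
    return max(0.0, 1.0 - avg_error / gap)


def estimate_precision(estimates, threshold, avg_error):
    """Average the per-item probabilities over all detected HHs."""
    if not estimates:
        return 1.0  # no detections, so no false positives (a convention chosen here)
    probs = [prob_true_hh(e, threshold, avg_error) for e in estimates]
    return sum(probs) / len(probs)


# Three detected HHs with threshold 100 and an assumed average counter error of 10.
print(estimate_precision([150, 300, 110], threshold=100, avg_error=10))
```

Items whose estimate sits just above the threshold contribute little to the precision estimate, which matches the intuition that they are the most likely false positives.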
Precision Estimation Step 2: Improve The Bound
Insight: Average Error = heavy items collision + small items collision
Counter indices of detected HHs show heavy collisions
Idea: Markov’s inequality only for small items
[Figure: a row in Count-Min, with Step 1 and Step 2 marked]
Notes: We know the counter indices of heavy items, so we can find their collisions directly. Thus we don’t need to rely on Markov’s inequality for their errors.
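The refinement in Step 2 can be sketched for a single Count-Min row as follows. The idea is to subtract the known collision with other detected heavy items from the slack first, and apply Markov's inequality only to the remaining small-item error. The function name, the parameters, and the single-row simplification are assumptions for illustration; the slide does not give the exact formula SCREAM uses.

```python
def prob_true_hh_refined(estimate, threshold, heavy_collision, avg_small_error):
    """Lower bound on P[detected HH is true] for one row.

    heavy_collision: traffic of other detected HHs that hash to the same
                     counter (known from their counter indices).
    avg_small_error: average collision from small items (bounded with Markov).
    """
    # Slack left for small-item collisions once heavy collisions are removed.
    gap = estimate - threshold - heavy_collision
    if gap <= 0:
        # The known heavy collision alone already explains the excess.
        return 0.0
    return max(0.0, 1.0 - avg_small_error / gap)


# Estimate 150, threshold 100, 20 units of known heavy collision,
# and an assumed average small-item error of 3.
print(prob_true_hh_refined(150, 100, heavy_collision=20, avg_small_error=3))
```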
SCREAM Solutions
Collect & report: Network-wide task implementation using sketches
Merge sketches of different sizes for HH, HHH, SSD
SSD algorithm with higher and more stable accuracy
Estimate accuracy: Accuracy estimation without the ground-truth
Precision estimators for HH, HHH and SSD tasks
Allocate resources: Fast & stable allocation in DREAM [SIGCOMM’14]
Evaluation
Metrics:
Satisfaction of a task: Fraction of task’s lifetime with sufficient accuracy
% of rejected tasks
Alternatives:
OpenSketch: Allocate for bounded relative error for worst-case traffic at task instantiation (test with different bounds)
Oracle: Knows required resource for a task in each switch in advance
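The satisfaction metric can be computed as in the following Python sketch. It assumes accuracy is sampled once per measurement epoch over the task's lifetime and uses the 80% accuracy bound from the evaluation setting as the default; the function name and the sample values are hypothetical.

```python
def satisfaction(accuracy_per_epoch, accuracy_bound=0.8):
    """Fraction of a task's lifetime (in epochs) with sufficient accuracy."""
    if not accuracy_per_epoch:
        raise ValueError("task has no recorded epochs")
    good = sum(1 for a in accuracy_per_epoch if a >= accuracy_bound)
    return good / len(accuracy_per_epoch)


# A task observed for four epochs; one epoch falls below the bound.
print(satisfaction([0.9, 0.7, 0.85, 0.8]))  # 3 of 4 epochs -> 0.75
```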
Evaluation Setting
Simulation for 8 switches:
256 task instances (HH, HHH, SSD, combination)
Accuracy bound = 80%
5-minute tasks arriving within 20 minutes
2-hour CAIDA trace
Notes: We tested each task type and a combination of them.
SCREAM Provides High Accuracy for More Tasks
SCREAM: High satisfaction and low reject
OpenSketch: A loose bound under-provisions, giving low satisfaction; a tight bound over-provisions, giving high reject
Notes: We tested OpenSketch with 10%, 50% and 90% relative error bounds.
SCREAM’s Performance Is Close to An Oracle
SCREAM satisfaction is lower because:
Iterative allocation takes time
Accuracy estimation has error
Other Evaluations
Changing traffic skew: SCREAM supports more accurate tasks than OpenSketch
Accuracy estimation error: SCREAM accuracy estimation has 5% error on average
Other accuracy metrics: Tasks in SCREAM have high recall (low false negative)
Conclusion
Measurement is crucial for SDN management in a resource-constrained environment
Practical sketch-based SDM by dynamic memory allocation
Implementing network-wide tasks using sketches
Estimating accuracy for 3 types of tasks
SCREAM is available at github.com/USC-NSL/SCREAM