A New Paradigm for Distributed Monitoring
Ling Huang, Minos Garofalakis, Nina Taft and Anthony Joseph
hling@cs.berkeley.edu, {minos.garofalakis, nina.taft}@intel.com, adj@cs.berkeley.edu
Sys Lunch ▪ Feb. 2006

Outline
- Introduction & Motivation
- The Problem Definition
- Related Work
- The Proposed Solution
  - The Platform
  - Extensions
- The Research Plan

Introduction: Network Monitoring
- Large-scale network monitoring and intrusion detection systems
  - Distributed and collaborative monitoring boxes
  - Continuously generating time series data
- Existing research focuses on data streaming
  - Collect, store and aggregate network state
  - Monitor and correlate data for trend analysis
  - Well suited to answering approximate queries and continuously recording system state
[Figure: Operation Center with Monitor 1, Monitor 2, Monitor 3]

The Need for Distributed Triggers
- Streaming protocol-based approaches suffer from excessive query overhead
  - Always approximate to a fixed accuracy, regardless of system conditions
  - Waste resources if applications care only about 0-1 information
- I aim to design distributed triggering protocols
  - Trigger alarms based on aggregate conditions and thresholds
- Monitoring systems call for a triggering component [Ankur04]
  - Detect and react to constraint violations and system anomalies
  - Maintain system-wide logical predicates and invariants
  - Does not provide a guarantee

A Typical Example
- A set of distributed monitors
  - Each produces a time series signal
  - Each sends a filtered version of its signal to the coordinator
  - No communication among monitors
- A coordinator X
  - Is the aggregation, detection and coordination center
  - Fires the trigger upon violations
  - Informs monitors of the accuracy level required for signal updates

Streaming vs. Triggering
- Streaming protocols
  - Aim at approximation
  - Accurate system state
  - Rich information for detailed analysis
  - Always incur overhead
- Triggering protocols
  - Aim at detection
  - 0-1 system state
  - Concise information indicating anomalies
  - Incur overhead only when necessary
  - Provide a strong detection guarantee

Outline: The Problem Definition

Problem Setup
- Constraints on an aggregate
  - Conditions on a subset of nodes
  - Accrue penalty when the aggregate passes threshold C
  - Fire the trigger whenever the penalty exceeds the error tolerance
- Aggregate function
  - Current work supports simple queries: the focus here is on SUM and AVG; extending to MIN and MAX is ongoing work
  - Future work will support general and complex queries
[Figure label: Sum]

Problem Statement
- User inputs:
  - Constraint violation threshold: C
  - Tolerable error zone around the constraint
  - Tolerable false alarm rate
  - Tolerable missed detection rate
- Goal: fire the trigger whenever the penalty exceeds the error tolerance
  - with the required accuracy level
  - AND with minimum communication overhead (monitor updates)

Three Types of Violations
Let V(t, window) be the size of the penalty at time t over a past window.
- Instantaneous violation
- Fixed-window violation: for a user-given fixed window
- Varying-window violation: for any window in [1, t]

Detection of Varying-Window Violation
- Key insight: a varying-window trigger is equivalent to a queue overflow problem
- The centralized queuing model
[Figure: value, penalty and queue; the trigger fires when the queue overflows]
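The queue view can be sketched in a few lines of Python. This is only an illustration under assumptions the transcript does not confirm: time is discrete, each epoch enqueues the excess of the aggregate over C (or dequeues the shortfall), the queue never goes below zero, and the trigger fires while the queue length exceeds an error tolerance. The names `queue_trigger`, `aggregate` and `eps` are hypothetical.

```python
def queue_trigger(aggregate, C, eps):
    """Return the epochs at which a centralized virtual queue overflows.

    aggregate: per-epoch aggregate values (for example, the SUM over monitors)
    C:         constraint threshold
    eps:       error tolerance, used here as the queue capacity (assumption)
    """
    queue = 0.0
    fired = []
    for t, x in enumerate(aggregate):
        # Enqueue the excess over C, or dequeue the shortfall; never below empty.
        queue = max(0.0, queue + (x - C))
        if queue > eps:
            fired.append(t)
    return fired


print(queue_trigger([2, 3, 2, 7, 6, 7], C=4, eps=2))
```

The sketch keeps accumulating after an alarm; whether the queue should be reset once the trigger fires is not stated on the slide.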

The Relationship Between Violation Types
General problem: detecting when the penalty over a past window exceeds the error tolerance
1) If the window is given, it is the fixed-window version
2) If the window covers only the current instant, it is the instantaneous version
3) If the window can be any value, it is the varying-window version
  - Penalty violation independent of time
  - Strong and strict guarantee
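A small Python sketch may make the three cases concrete. The exact definition of the penalty is lost in this transcript, so the code assumes a discrete-time penalty equal to the summed excess of the aggregate over C across the window, compared against an error tolerance. The names `penalty`, `tau` and `eps` are illustrative assumptions, not notation recovered from the slides.

```python
def penalty(aggregate, t, tau, C):
    """Assumed penalty: summed excess over C in the last tau epochs ending at t."""
    window = aggregate[t - tau + 1 : t + 1]
    return sum(x - C for x in window)


def instantaneous(aggregate, t, C, eps):
    # The window covers only the current epoch.
    return penalty(aggregate, t, 1, C) > eps


def fixed_window(aggregate, t, tau, C, eps):
    # The window length tau is given by the user.
    return t + 1 >= tau and penalty(aggregate, t, tau, C) > eps


def varying_window(aggregate, t, C, eps):
    # A violation over any window length ending at t counts.
    return any(penalty(aggregate, t, tau, C) > eps for tau in range(1, t + 2))


agg = [5, 5, 5, 5]   # slightly above C = 4 in every epoch
print(instantaneous(agg, 3, C=4, eps=2))          # small excess per epoch
print(fixed_window(agg, 3, tau=3, C=4, eps=2))    # excess accumulated over 3 epochs
print(varying_window(agg, 3, C=4, eps=2))
```

Under this assumed definition, a signal that stays slightly above C never causes an instantaneous violation, but its accumulated penalty eventually does over a long enough window.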

Proposed Research
- Distributed triggering system
  - Open platform to support general queries with general constraints (SUM, MIN, MAX, Quantile, ...) and operation on general time series
  - Controllable detection performance via the user-given error tolerance, false alarm rate and missed detection rate
  - Communication-efficient: minimize communication at a given detection performance, and provide flexibility to trade off performance against overhead
  - Applicable to a broad range of applications

Outline: Related Work

Related work: Database
- Data streaming
  - Adaptive filtering from Olston & Widom: answers with bounded error to simple queries; adaptive local thresholds to achieve optimal results
  - Sketching streams from Cormode & Garofalakis: answers with bounded error to general and complex queries
  - Key difference: I focus on detection instead of approximation
- TAG and its follow-ons focus on tree-based in-network processing
- PIER brings DB-style queries to Internet scale

Related work: Monitoring and Detection
- Lots of progress in distributed monitoring, profiling and intrusion detection
  - Share information and foster collaboration between distributed boxes
  - Systematic coordination for security operations
  - Little consideration of efficient management of distributed data
  - Provide examples of why a triggering tool would be useful

Outline: The Proposed Solution

Key Contributions Achieved
- The first distributed triggering protocol which, for SUM and AVG queries,
  - Achieves controllable detection performance
  - Minimizes communication overhead
- Mathematical definition of the distributed triggering problem
- Queuing framework, analytical solution and probabilistic guarantee for varying-window triggers
- Adaptive protocol and deterministic guarantee for instantaneous and fixed-window triggers
- System implementation of instantaneous and varying-window triggers; deployment and evaluation on PlanetLab

Problem Space and Current Status
[Figure: problem space along three axes, with a Yes/No legend for current status]
- Query support: SUM, AVG, MIN, MAX; Quantile, Entropy, Hist., ...
- Violation types: Instantaneous Triggers, Fixed-window Triggers, Varying-window Triggers
- Distributed: One-level, Multi-level, P2P

Distributed Trigger Tracking Framework
[Figure labels: User inputs; Distr. Monitors; Original monitored time series; Filtered time series; Coordinator; Alarms]

Solution Overview
- Minimize communication cost by:
  - Having monitors send as few updates as possible
  - Carefully managing the discrepancy between the coordinator's view of the global state and the actual global state
  - Providing the coordinator with an accurate enough view so that it fires the trigger with the prescribed accuracy
- Key idea
  - Filter the monitored signal; don't send an update unless a surprising change has occurred
  - When far away from the trigger threshold, monitors can afford to be less accurate. The coordinator informs them when they can do this, and by how much.

1) Varying-window Triggers
Fire an alarm when the queue overflows

The Distributed Queuing Model
Distributed queuing model for varying-window triggers
[Figure: (a) Distributed queuing model; (b) Queue-based filtering of the number of TCP requests, showing under-estimate and over-estimate]
Legend: one queue size per monitor (monitors 1, ..., n) and one coordinator queue size

Adaptive Protocol for Varying-win. Triggers
- Each monitor simulates a virtual queue of a given size. Whenever its local queue underflows or overflows, the monitor
  - Predicts a new value
  - Sends the update to the coordinator
  - Resets and repeats the virtual queue simulation
- The coordinator simulates a virtual queue of a given size. On getting an update, the coordinator
  - Dequeues an amount that depends on the time elapsed since the last update
  - Enqueues or dequeues according to the update
  - Updates its state
  - Fires the alarm if the queue gets full
  - If necessary, re-computes the queue parameters
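The following Python sketch shows one way such a monitor and coordinator pair could interact. The quantities on this slide are lost in the transcript, so every detail here is an assumption: each monitor reports a predicted rate (simply its last value), tracks the accumulated gap between its signal and that rate in a local virtual queue, and sends the new rate plus the accumulated residual when the gap leaves the range [-delta, delta]. The coordinator drifts its queue by the reported rates and corrects it with residuals. The class and function names are hypothetical.

```python
class Monitor:
    """One monitor with a local virtual queue (illustrative sketch)."""

    def __init__(self, delta):
        self.delta = delta    # assumed local queue size, used as a symmetric bound
        self.rate = 0.0       # rate last reported to the coordinator
        self.backlog = 0.0    # accumulated (value - reported rate)

    def observe(self, x):
        """Return (new_rate, residual) when an update is due, else None."""
        self.backlog += x - self.rate
        if abs(self.backlog) > self.delta:   # local underflow or overflow
            residual = self.backlog
            self.rate = x                    # last-value prediction (assumption)
            self.backlog = 0.0               # reset and repeat
            return (x, residual)
        return None


class Coordinator:
    def __init__(self, n, C, eps):
        self.C, self.eps = C, eps
        self.rates = [0.0] * n
        self.queue = 0.0

    def step(self, updates):
        """Advance one epoch; updates maps monitor index to (rate, residual)."""
        self.queue += sum(self.rates) - self.C   # drift from the reported rates
        for i, (rate, residual) in updates.items():
            self.queue += residual               # discrete correction
            self.rates[i] = rate
        self.queue = max(0.0, self.queue)
        return self.queue > self.eps             # alarm when the queue is full


def simulate(signals, C, eps, delta):
    """Run the sketch; return (epochs with alarm, number of update messages)."""
    monitors = [Monitor(delta) for _ in signals]
    coord = Coordinator(len(signals), C, eps)
    fired, messages = [], 0
    for t in range(len(signals[0])):
        updates = {}
        for i, m in enumerate(monitors):
            u = m.observe(signals[i][t])
            if u is not None:
                updates[i] = u
        messages += len(updates)
        if coord.step(updates):
            fired.append(t)
    return fired, messages


signals = [[1, 1, 1, 5, 5, 5], [1, 2, 1, 2, 1, 2]]
print(simulate(signals, C=4, eps=2, delta=0))     # every change is reported
print(simulate(signals, C=4, eps=2, delta=1.5))   # small changes are absorbed locally
```

With `delta=0` the coordinator's queue tracks the centralized queue exactly in this sketch; a larger `delta` sends fewer messages at the price of a coordinator queue that can temporarily over- or under-estimate the true one.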

Queuing Analysis: The Model
- Each input is decomposed into two parts
  - Continuous enqueuing at a given rate
  - Discrete enqueuing/dequeuing of a given size
- How does the detection behavior of the solution model differ from that of the centralized model?
[Figure: (a) The centralized model; (b) The distributed solution model]

Queuing Analysis: The Setup
- Let us start the analysis with uniform monitor queue sizes, which are easy to analyze; the analysis is also applicable to the non-uniform case
- We want the monitor queues as large as possible to reduce communication overhead
- However, large monitor queues bring large bursts into the system, which require a large coordinator queue to absorb them
- The queue sizes are constrained by the error tolerance
- Using queuing theory, we can analyze the overflow probability of the queue and thus determine the queue sizes

Queuing Analysis: Missed Detection
The centralized model overflows ... the solution model does not overflow!

Queuing Analysis: False Alarm
The centralized model does not overflow ... the solution model overflows!

Adaptivity and Heterogeneous Per-Monitor Parameters
- Adaptivity
- Heterogeneous per-monitor parameters
  - After computing the uniform setting, set the per-monitor values
  - The optimal setting is solved by Olston & Widom using a convex optimization approach

Results for Varying-Window Triggers
Desired vs. achieved detection performance
  - Missed detection rate
  - False alarm rate
The achieved rates are always less than the target rates, indicating that the analytical model finds upper bounds on the detection performance.

Results for Varying-Window Triggers
Parameter design and tradeoff between false alarm, missed detection and communication overhead
  - Error tolerance = 0.2C
  - Overhead = # of messages sent / total # of monitoring epochs

2) Instantaneous Triggers
Fire an alarm if the aggregate exceeds the threshold C

Adaptive Protocol for Inst. Triggers
- Each monitor sends an update to the coordinator if its signal leaves its local slack, where the local slack is determined by the coordinator
- Coordinator X checks the trigger condition, in which the global slack is adaptively computed and optimally split among the monitors
- Simply fixing the slack is the data streaming approach

Results for Instantaneous Triggers
[Figure: communication cost compared with existing approaches; curves labeled "our schemes"]

The Detection Performance Guarantee
- We guarantee a band of uncertainty around threshold C
- Theorem: the described protocol guarantees that the trigger (1) always fires if the aggregate is above the band and (2) never fires if the aggregate is below the band
- Key decision: tradeoff between communication cost and triggering performance
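A simplified Python sketch can illustrate why local filtering yields such a band. It makes several assumptions that the slides do not confirm: the band has half-width `eps`, the global slack is fixed at `eps` and split equally among the monitors (the adaptive slack computation is not reproduced here), monitors send their initial values, and the coordinator fires when the sum of the last reported values exceeds C. The function name `run_instantaneous` is hypothetical.

```python
def run_instantaneous(signals, C, eps):
    """Filtered SUM trigger with a fixed, equally split slack (illustrative).

    Returns (per-epoch fire decisions, number of messages sent).
    """
    n = len(signals)
    delta = eps / n                        # local slack per monitor (assumed equal split)
    reported = [s[0] for s in signals]     # initial values are sent once
    messages = n
    fired = []
    for t in range(len(signals[0])):
        for i in range(n):
            x = signals[i][t]
            if abs(x - reported[i]) > delta:   # signal left its local slack
                reported[i] = x
                messages += 1
        # The coordinator's view is within eps of the true sum, so it must fire
        # when the true sum is above C + eps and cannot fire below C - eps.
        fired.append(sum(reported) > C)
    return fired, messages


signals = [[1, 1.2, 3, 3.1], [1, 0.9, 1, 3.2]]
fired, messages = run_instantaneous(signals, C=5, eps=1)
print(fired, messages)
```

In this example only four messages are sent, compared with eight if both monitors streamed every value. A true sum inside the band may or may not fire, which is the uncertainty the slide refers to.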

Benefit of Adaptive Global Slack
[Figure: input signals, adaptive global slack, fixed global slack, band of uncertainty]
Key observation: adaptive slack is substantially larger than fixed slack

Outline: The Proposed Solution

Extensions
- Platform
  - Probabilistic guarantee for instantaneous and fixed-window triggers
  - Supporting general queries with general constraints
- Applications
  - Distributed workload alarming system
  - Coordinated end-host profiling & detection system

State-of-the-Art of Profiling & Detection
- Profiling network traffic at the gateway using entropy
  - Initial success with entropy metrics on packet headers
  - Has not been applied to end-host profiling
- Profiling end-hosts using graphlets
  - Anomalies show up as distinct perturbations in the graph
  - Initial success in detecting scanning, DDoS, ICMP attacks and web service attacks
[Figure: graphlet with columns srcIP, protocolID, dstIP, srcPort, dstPort, dstIP]

Coordinated End-Host Profiling & Detection
- Limitations of the graphlet model:
  - Graphlets currently do not support time series
  - Interaction between host and group profiles is thin
- Integrating end-host profiling with the triggering system to enable coordinated detection
  - Build time series profiles to facilitate anomaly detection
  - Extend profiling systems by providing underlying triggering support
  - Identify new functionalities for the triggering system
- How can profiling for security be improved?
- How should the triggering system be extended?

Outline: The Research Plan

The Research Plan
- Complete solution for simple queries (month 0-3)
  - Providing a probabilistic guarantee for instantaneous triggers
  - Supporting triggers with MIN and MAX operations
- Applications (month 4-8)
  - End-host profiling to facilitate anomaly detection
  - Triggering on profiles to enable coordinated detection
- Solution to support complex queries (month 9-12)
  - Sketching techniques
  - Prediction models
- Write dissertation and apply for jobs (month 12-18)

Thank You!
http://www.cs.berkeley.edu/~hling/
hling@cs.berkeley.edu

Backup Slides

Handle Data Loss: Overview
- Local filtering is data loss!
- Data loss is due to
  - Filtering (voluntary)
  - Network delay (involuntary)
  - Network congestion (involuntary)
- Mechanisms
  - QoS: priority delivery for monitoring data, which consumes little bandwidth and is affordable
  - Statistical estimation: data interpolation and extrapolation, with a dual prediction model at both monitors and coordinator

Data Acquisition with Statistical Estimation
The Dual-Module Data Acquisition Mechanism
- Monitor: the streaming source feeds the prediction model. Is the prediction outside the slack bound?
  - Yes: send an update to the coordinator and calibrate the model
  - No: drop the data
- Coordinator: is an update available from the monitors?
  - Yes: use the update value and calibrate the model
  - No: request a prediction and use the prediction value
  - The resulting value goes to aggregation/queuing
- The prediction model can be any of: 1) last value, 2) simple averaging, 3) ARMA, 4) multi-level prediction, 5) Kalman filtering, etc.
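The dual-module idea can be sketched in Python as two identical predictors, one on each side, that are calibrated only with the values actually transmitted. The sketch below uses simple averaging over the last `k` transmitted values as the prediction model; the class `MeanPredictor`, the function `dual_acquire`, the parameter `k`, and the rule of always sending the first value are illustrative assumptions rather than details taken from the slides.

```python
class MeanPredictor:
    """Predicts the mean of the last k calibrated values (simple averaging)."""

    def __init__(self, k=2):
        self.k = k
        self.history = []

    def predict(self):
        if not self.history:
            return None                    # nothing to predict from yet
        recent = self.history[-self.k:]
        return sum(recent) / len(recent)

    def calibrate(self, value):
        self.history.append(value)


def dual_acquire(stream, slack, k=2):
    """Return (values seen by the coordinator, number of updates sent)."""
    monitor_model = MeanPredictor(k)       # copy held by the monitor
    coord_model = MeanPredictor(k)         # identical copy held by the coordinator
    views, sent = [], 0
    for x in stream:
        pred = monitor_model.predict()
        if pred is None or abs(x - pred) > slack:
            # Prediction is outside the slack bound: send and calibrate both sides.
            sent += 1
            monitor_model.calibrate(x)
            coord_model.calibrate(x)
            views.append(x)
        else:
            # No update arrives, so the coordinator falls back to its own prediction.
            views.append(coord_model.predict())
    return views, sent


stream = [10, 10.5, 12, 12.2, 11.8]
views, sent = dual_acquire(stream, slack=1)
print(views, sent)
```

Because both copies see exactly the same calibration data, the coordinator's prediction always equals the monitor's, so whenever no update is sent the coordinator's value is within the slack of the true reading.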

Handle Network Failure
- Detect failure
  - Heartbeats to keep alive
- Handle failure
  - Multiple paths to the coordinator
  - Multiple coordinators: a backup coordinator, or different triggers on different coordinators
  - P2P protocol to maintain a resilient topology: P2P has an embedded tree, gracefully handles nodes joining and leaving, and can exploit alternative paths for fault-tolerant routing

A Paradox
The triggering protocol uses more resources when the system is in a critical state, which is when fewer resources are available
  - Separate resources for monitoring data and normal traffic
  - When the system is persistently in a critical state, the coordinator tells monitors not to send updates unless their states change substantially

3) Fixed-window Triggers
Fire an alarm, for a given window, if the penalty over that window exceeds the error tolerance

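The Transformation
Let's define a window-based version of each monitor's signal. Then the fixed-window condition becomes an instantaneous condition on the windowed signals. So, protocols for instantaneous triggers work for fixed-window triggers.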

Framework for Fixed-window Triggers
Window-based local sum, then filtering ...
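The window-based local sum can be sketched as follows. The exact transformation on the previous slide is lost in the transcript, so this Python sketch only assumes that each monitor replaces its signal by the sum over its last `tau` epochs (with a shorter window at the start), after which any instantaneous-trigger protocol can run on the transformed signals. The names `windowed_sums` and `tau` are illustrative.

```python
def windowed_sums(signal, tau):
    """Sliding sum over the last tau epochs (shorter at the start of the signal)."""
    out = []
    running = 0
    for t, x in enumerate(signal):
        running += x
        if t >= tau:
            running -= signal[t - tau]   # drop the value that left the window
        out.append(running)
    return out


a = [1, 2, 3, 4]
b = [0, 1, 0, 1]
tau = 2

# Each monitor transforms its own signal locally ...
local = [ya + yb for ya, yb in zip(windowed_sums(a, tau), windowed_sums(b, tau))]
# ... and the sum of the local results equals the windowed sum of the aggregate.
aggregate = windowed_sums([x + y for x, y in zip(a, b)], tau)
print(local, aggregate)
```

Because summation over monitors and summation over the window commute, a threshold test on the windowed aggregate is the same as an instantaneous test on the locally transformed signals.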

Examples
- Enterprise security operations
  - Distributed monitors are IDS boxes
  - Coordinator for the global log repository and analysis inside the security operations center
- ISP IT teams
  - Monitors on each link
  - Network operation center, which pulls data to detect hot spots, failures and attacks, and to check when upgrades are needed
- Monitored time series can be
  - Number of TCP requests
  - Number of DNS transactions
  - Traffic volume per port 80
  - ...

Queuing Analysis: Some Intuitions
- Large monitor queues enable large local smoothing, which reduces the communication cost. However, they may
  - absorb too much update "space", thus causing missed detections
  - make the system globally bursty, thus causing false alarms
- A missed detection happens when the queue in the centralized model overflows (real violation), but the queue in the solution model does not (no alarm)
- A false alarm happens when the queue in the centralized model does not overflow (no violation), but the queue in the solution model overflows (fires an alarm)
