Ranking the Importance of Alerts for Problem Determination in Large Computer Systems
Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena
NEC Laboratories America, Princeton
Outline
Introduction
– Motivation & Goal
System Invariants
– Invariants extraction
– Value propagation
Collaborative peer review mechanism
– Rules & Fault model
– Ranking alerts
Experiment result
Conclusion
ICAC 2009 : 6/16/2009
Motivation
Large and complex systems are deployed by integrating many heterogeneous components:
– Servers, routers, storage and software from multiple vendors
– Hidden dependencies
Log and performance data are collected from the components:
– Operators set many rules to check the data and trigger alerts, e.g. Web > 70%
– Each rule is set independently and in isolation, based on the operator's own system knowledge
Goal
Which alerts should we analyze first?
– Get more consensus from others
– Blend system management knowledge from multiple operators
We introduce a "peer-review" mechanism to rank the importance of alerts, so that operators can prioritize the problem determination process.
[Figure: four alert rules (> 70% Alert 1, > 150 Alert 2, > 60% Alert 3, > 35k Alert 4) ranked in the order Alert 3, Alert 1, Alert 2, Alert 4]
Alerts Ranking Process
Offline:
1. Extract invariants from the monitoring data of the large system to build the invariants model [ICAC 2006] [TDSC 2006] [TKDE 2007] [DSN 2006]
2. Define alert rules (operators with domain knowledge), e.g. > 70% Alert 1, > 150 Alert 2, > 60% Alert 3, > 35k Alert 4
3. Sort alert rules
Online, at the time alerts are received:
4. Rank alerts (real alerts, e.g. Alert 1 and Alert 4)
[Figure labels: Full automation; Domain information]
System Invariants
User requests flow through the system endlessly, and many internal monitoring data react to the volume of user requests accordingly.
Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.
We search for relationships among these internal measurements collected at various points. If a modeled relationship continues to hold all the time, it can be regarded as an invariant of the system.
[Figure: user requests enter the target system; time series of measurements m1, m2, ..., mn; is there any constant relationship among them?]
Invariant Examples
We check the implicit relationships rather than the real values of flow intensities, which are always changing. Many relationships, however, are constant.
– Example: x and y are changing, but the equation y = f(x) is constant.
– Load balancer with input I1 and outputs O1, O2, O3: I1 = O1 + O2 + O3
– Database server with packet volume V1 and SQL query number N1: V1 = f(N1)
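The load-balancer example can be checked numerically. The following is a minimal sketch, with made-up sample values and a hypothetical tolerance, that tests whether I1 = O1 + O2 + O3 keeps holding while the individual flow intensities change.

```python
def invariant_holds(samples, tolerance=0.01):
    """Return True if I1 = O1 + O2 + O3 holds for every sample.

    Each sample is a tuple (i1, o1, o2, o3). `tolerance` is a hypothetical
    relative error bound, not a value taken from the slides.
    """
    for i1, o1, o2, o3 in samples:
        if abs(i1 - (o1 + o2 + o3)) > tolerance * max(abs(i1), 1.0):
            return False
    return True


# Made-up flow intensities: the values change, the relationship does not.
healthy = [(300, 100, 100, 100), (90, 30, 35, 25), (1200, 400, 390, 410)]
# Same data plus one sample where the outputs no longer add up to the input.
broken = healthy + [(600, 100, 100, 100)]
```

Here `invariant_holds(healthy)` is true and `invariant_holds(broken)` is false, which is the sense in which a broken invariant signals a problem even when no single value looks unusual.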
Automated Invariants Search
1. Monitor the target system and collect observation data for [t0-t1].
2. Pick any two measurements i, j and a template f from the model library to learn f_ij. The learned f_ij are the invariant candidates.
3. Sequential validation: with new data [t1-t2], does f_ij hold? If no, drop the variant f_ij. If yes, keep it and update its confidence score.
4. Repeat with each new window of observation data [tk-tk+1], giving the confidence scores P0, P1, ..., PK.
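The sequential validation loop can be sketched as follows. This is an illustration under stated assumptions: the slide names a confidence score P_k but does not define it, so the sketch takes P_k to be the fraction of windows in which f_ij held. The names `fitness_threshold` and `min_confidence` and their default values are hypothetical.

```python
def sequential_validation(window_fitness, fitness_threshold=50.0,
                          min_confidence=0.9):
    """Validate one invariant candidate f_ij over successive data windows.

    window_fitness: fitness scores of f_ij on windows [t0-t1], [t1-t2], ...
    Returns (kept, confidence_history). The candidate is dropped as soon as
    its confidence score falls below min_confidence.
    The confidence definition and both thresholds are assumptions.
    """
    passed = 0
    history = []
    for k, fitness in enumerate(window_fitness, start=1):
        if fitness > fitness_threshold:
            passed += 1
        confidence = passed / k  # assumed P_k: share of windows where f_ij held
        history.append(confidence)
        if confidence < min_confidence:
            return False, history  # drop the variant
    return True, history
```

With these defaults a candidate that fails early is dropped at once, while one that fails a single window after a long run of passes can survive. That is a design choice of the sketch, not something the slide specifies.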
One example in model library
We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow intensity measurements.
Given a sequence of real observations, we learn the model with LMS by minimizing the error.
A fitness function can be used to evaluate how well the learned model fits the real data.
[Equations: model definition, parameter estimate and fitness function]
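For orientation, an ARX model relating an input flow intensity x(t) to an output y(t) is commonly written as below. This is a hedged reconstruction, not the slide's own equations: the orders n and m, the delay k, the regressor φ(t) and the exact form of the fitness score F(θ) are assumptions about notation. The least-squares estimate shown is the textbook one and may differ in detail from the LMS procedure the slide names.

```latex
\begin{align}
y(t) + a_1 y(t-1) + \dots + a_n y(t-n)
  &= b_0 x(t-k) + \dots + b_m x(t-k-m) + b \\
\theta &= [a_1, \dots, a_n, b_0, \dots, b_m, b]^{\top} \\
\varphi(t) &= [-y(t-1), \dots, -y(t-n), x(t-k), \dots, x(t-k-m), 1]^{\top} \\
y(t) &= \varphi(t)^{\top} \theta \\
\hat{\theta}_N &= \Big[ \sum_{t=1}^{N} \varphi(t)\varphi(t)^{\top} \Big]^{-1}
                  \sum_{t=1}^{N} \varphi(t)\, y(t) \\
F(\theta) &= \Bigg[ 1 - \sqrt{ \frac{\sum_{t=1}^{N} |y(t) - \hat{y}(t \mid \theta)|^{2}}
                                   {\sum_{t=1}^{N} |y(t) - \bar{y}|^{2}} } \Bigg] \cdot 100
\end{align}
```

Under this form a higher F(θ) means a better fit, with 100 corresponding to a model that reproduces the observations exactly.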
Value Propagation with Invariants
Extract invariants: y = f(x), z = g(y), u = h(x), v = s(u)
Multi-hop propagation with the ARX model gives the converged set: z = g(f(x)), v = s(h(x))
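Multi-hop propagation amounts to composing the invariant functions along a path. The sketch below assumes each invariant has already been reduced to a static function of one variable (how an ARX model is reduced to such a map is not shown here). The linear functions standing in for f, g, h and s are hypothetical.

```python
from collections import deque


def propagate(invariants, source, value):
    """Propagate `value` of `source` along invariant edges, hop by hop.

    invariants: dict mapping (src, dst) -> function giving dst from src.
    Returns a dict with the value implied at every reachable measurement.
    """
    values = {source: value}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for (src, dst), fn in invariants.items():
            if src == node and dst not in values:
                values[dst] = fn(values[node])
                queue.append(dst)
    return values


# Hypothetical linear stand-ins for f, g, h and s on the slide.
invariants = {
    ("x", "y"): lambda x: 2.0 * x + 1.0,  # y = f(x)
    ("y", "z"): lambda y: 0.5 * y,        # z = g(y)
    ("x", "u"): lambda x: x + 10.0,       # u = h(x)
    ("u", "v"): lambda u: 3.0 * u,        # v = s(u)
}
result = propagate(invariants, "x", 4.0)
```

Starting from x = 4.0, the values at z and v are obtained as g(f(x)) and s(h(x)) without any direct invariant between x and z or between x and v.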
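Rules and Fault Model
Rule: predicate and action
Fault model for each rule: the probability of fault occurrence (between 0 and 1) as a function of the measured value x, with the rule's threshold at x_T.
[Figure: ideal model versus realistic model, with false positive and false negative regions marked]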
Probability of Reporting a True Positive Alert
Importance of an alert: the Probability of Reporting a True Positive (PRTP) for an alert generated by value x.
Even a very small false positive (FP) rate leads to a large number of false positive reports.
– Example: one measurement is checked every minute and its FP rate is 0.1% => 60 x 24 x 365 x 0.1% ≈ 526 FP reports a year. With thousands of measurements the number grows accordingly.
– Example: in a real operation support system, 80% of reports are FPs.
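The arithmetic in the first example can be reproduced directly. In this small sketch the fleet size of 1000 measurements is a hypothetical figure chosen to illustrate "thousands of measurements".

```python
def false_positives_per_year(checks_per_day, fp_rate):
    """Expected number of false positive reports in a 365-day year."""
    return checks_per_day * 365 * fp_rate


# One measurement checked every minute with a 0.1% false positive rate.
per_measurement = false_positives_per_year(60 * 24, 0.001)

# Hypothetical fleet of 1000 such measurements.
fleet = per_measurement * 1000
```

`per_measurement` evaluates to about 525.6, which the slide rounds to 526. The hypothetical fleet produces roughly 1000 times as many.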
Local Context Mapping to Global Context
Alert rules on the Web, AP and DB tiers have different semantics: > 70% Alert 1, > 150 Alert 2, > 60% Alert 3, > 35k Alert 4.
Global context: each threshold is mapped to its equivalent CPU%Web value and placed on the fault model of CPU%Web (PRTP from 0 to 1 against x, local threshold x_T).
Resulting order of Prob(true|x): Alert 3 > Alert 1 (at x_T) > Alert 2 > Alert 4
Local Context Mapping to Global Context
The same thresholds mapped onto the fault model of Network%AP (PRTP from 0 to 1 against x, local threshold x_T).
Resulting order of Prob(true|x): Alert 3 > Alert 1 > Alert 2 > Alert 4 (at x_T)
Alert ranking: no change
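The mapping step can be illustrated with a short sketch. It assumes that the fault model of the reference measurement is monotonically increasing, so a larger mapped value means a higher PRTP. The mapping functions in `to_cpu_web` are invented stand-ins for the invariants, chosen only so that the outcome matches the order shown on the slide.

```python
def rank_alert_rules(thresholds, to_global):
    """Rank alert rules by their thresholds mapped into one global context.

    thresholds: dict rule name -> local threshold value.
    to_global:  dict rule name -> function mapping the local value to the
                equivalent value of the reference measurement.
    Assumes a monotonically increasing fault model for the reference
    measurement, so sorting by mapped value sorts by PRTP.
    """
    mapped = {name: to_global[name](t) for name, t in thresholds.items()}
    ranking = sorted(mapped, key=mapped.get, reverse=True)
    return ranking, mapped


# Local thresholds from the slide.
thresholds = {"Alert 1": 70.0, "Alert 2": 150.0,
              "Alert 3": 60.0, "Alert 4": 35000.0}

# Hypothetical invariants mapping each measurement to CPU%Web.
to_cpu_web = {
    "Alert 1": lambda v: v,           # already expressed as CPU%Web
    "Alert 2": lambda v: v / 2.5,     # invented mapping
    "Alert 3": lambda v: 1.5 * v,     # invented mapping
    "Alert 4": lambda v: v / 1000.0,  # invented mapping
}
ranking, mapped = rank_alert_rules(thresholds, to_cpu_web)
```

With these invented mappings the ranking is Alert 3, Alert 1, Alert 2, Alert 4. Because the invariants are consistent with one another, choosing a different reference measurement would be expected to leave the order unchanged, which is the point of the second slide.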
Alerts Ranking Process
Online, at the time alerts are received: rank alerts
[Figure: real alerts Alert 1 and Alert 4]
Ranking Alerts (Case I)
Case I: receive ONLY ALERTS, no monitoring data from components.
Sorted alert rules (from operators' knowledge and configuration plus the system invariants network): Alert 6, Alert 2, Alert 3, Alert 7, Alert 5, Alert 9, Alert 1, Alert 8, Alert 4
Alerts ranking for the alerts generated: Alert 2, Alert 3, Alert 7, Alert 5, Alert 1
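In Case I the ranking reduces to reading the received alerts in the order of the pre-sorted rules. A minimal sketch using the alert names from the slide follows. The arrival order in `received` is made up.

```python
def rank_received_alerts(sorted_rules, received):
    """Order received alerts by the position of their rule in sorted_rules."""
    received_set = set(received)
    return [rule for rule in sorted_rules if rule in received_set]


# Offline result: alert rules sorted by importance.
sorted_rules = ["Alert 6", "Alert 2", "Alert 3", "Alert 7", "Alert 5",
                "Alert 9", "Alert 1", "Alert 8", "Alert 4"]

# Alerts generated online, in a made-up arrival order.
received = ["Alert 1", "Alert 5", "Alert 2", "Alert 7", "Alert 3"]

ranking = rank_received_alerts(sorted_rules, received)
```

The result is Alert 2, Alert 3, Alert 7, Alert 5, Alert 1, matching the ranking shown on the slide. No monitoring data is needed because the sort was done offline.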
Ranking Alerts (Case II)
Case II: receive both alerts and monitoring data from components.
Number of Threshold Violations (NTV): the number of thresholds, mapped into a measurement's fault model, that the observed value violates.
– Fault model of CPU%Web with observed value X(CPU%Web): NTV = 3
– Fault model of Network%AP with observed value X(Network%AP): NTV = 2
The alert from CPU%Web is more important than the one from Network%AP.
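Counting threshold violations is a one-line comparison once all thresholds are expressed in the measurement's own context. In the sketch below the mapped thresholds and observed values are hypothetical numbers picked to reproduce NTV = 3 and NTV = 2. Treating a violation as "observed value above threshold" is also an assumption.

```python
def count_threshold_violations(observed, mapped_thresholds):
    """Number of mapped thresholds that the observed value exceeds."""
    return sum(1 for t in mapped_thresholds if observed > t)


# Hypothetical thresholds of all rules mapped into each context.
cpu_web_thresholds = [90.0, 70.0, 60.0, 35.0]
network_ap_thresholds = [50.0, 40.0, 30.0, 20.0]

# Hypothetical observed values at the time the alerts fire.
ntv = {
    "CPU%Web": count_threshold_violations(75.0, cpu_web_thresholds),
    "Network%AP": count_threshold_violations(35.0, network_ap_thresholds),
}

# Alerts with a higher NTV are ranked first.
ranking = sorted(ntv, key=ntv.get, reverse=True)
```

Here the CPU%Web alert violates three thresholds and the Network%AP alert two, so the CPU%Web alert is ranked first.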
Experimental system
Flow intensities (symbols not recovered):
– the number of EJBs created at time t
– the JVM processing time at time t
– the number of SQL queries at time t
[Figure: system components A, B, C, D; invariant examples]
Extracted Invariants Network
[Figure: invariants network over measurements m1 to m6]
Thresholds of Measurements
[Figure: measurements m1 to m6 with their thresholds m1^T to m6^T]
Thresholds of Measurements
[Figure: thresholds m1^T to m6^T mapped across measurements m1 to m6]
Ranking Alerts with NTVs (1)
[Figure: thresholds m1^T to m6^T across measurements m1 to m6, with observed values and NTVs]
Ranking Alerts with NTVs (2)
[Figure: thresholds m1^T to m6^T across measurements m1 to m6, with observed values and NTVs]
Ranking Alerts with NTVs (2)
A problem (SCP copy) is injected into the Web server.
Conclusion
We introduce a peer review mechanism to rank alerts from heterogeneous components:
– By mapping the local thresholds of various rules into their equivalent values in a global context
– Based on the system invariants network model
This helps operators prioritize problem determination.
Thank You! Questions?