
1 Ranking the Importance of Alerts for Problem Determination in Large Computer Systems
Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena
NEC Laboratories America, Princeton

2 Outline
Introduction
– Motivation & Goal
System Invariants
– Invariants extraction
– Value propagation
Collaborative peer-review mechanism
– Rules & Fault model
– Ranking alerts
Experimental results
Conclusion

3 Motivation
Large, complex systems are deployed by integrating many heterogeneous components:
– servers, routers, storage, and software from multiple vendors
– hidden dependencies among components
Log and performance data come from the components:
– Operators set many rules to check the data and trigger alerts, e.g. CPU% @Web > 70%.
– Rule setting is independent and isolated, based on each operator's own system knowledge.

4 Goal
Which alerts should we analyze first?
– Get more consensus from others.
– Blend system-management knowledge from multiple operators.
We introduce a "peer-review" mechanism to rank the importance of alerts, so that operators can prioritize the problem-determination process.
Example rules: Alert 1: CPU% @Web > 70%; Alert 2: DiskUsg @Web > 150; Alert 3: CPU% @DB > 60%; Alert 4: Network @AP > 35k.
Resulting rank: Alert 3 > Alert 1 > Alert 2 > Alert 4.

5 Alerts Ranking Process (full automation)
Offline:
1. Extract invariants from the large system's monitoring data to build the invariants model [ICAC 2006] [TDSC 2006] [TKDE 2007] [DSN 2006].
2. Operators (with domain knowledge) define alert rules, e.g. CPU% @Web > 70%, DiskUsg @Web > 150, CPU% @DB > 60%, Network @AP > 35k.
3. Sort the alert rules using the invariants model and the domain information.
Online:
4. At the time real alerts are received, rank them according to the sorted rules.

6 System Invariants
Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.
User requests flow through the target system endlessly, and many internal measurements (m1, m2, ..., mn, collected at various points) react to the request volume accordingly. We search for relationships among these internal measurements: is there any constant relationship between them? If a modeled relationship continues to hold all the time, it can be regarded as an invariant of the system.

7 Invariant Examples
We check implicit relationships between flow intensities, not their real values, which are always changing. Many relationships are nevertheless constant: x and y change, but the equation y = f(x) stays the same.
– Load balancer: input I1 is split across outputs O1, O2, O3, so I1 = O1 + O2 + O3.
– Database server: packet volume V1 and SQL query count N1 satisfy V1 = f(N1).

8 Automated Invariants Search
Using an initial observation window [t0–t1] of the target system's monitoring data, pick any two measurements i, j and learn a model f_ij from a template in the model library; these f_ij are the invariant candidates.
Sequential validation: with each new observation window [t1–t2], ..., [tk–tk+1], test whether each f_ij still holds. Drop the candidates that fail (the variants); for each candidate that survives, update its confidence score p_k.
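As an illustration of this search loop, here is a minimal sketch (not the paper's implementation; the single linear template, the 0.9 fitness threshold, and the window format are assumptions):

```python
import itertools
import numpy as np

def learn_linear(x, y):
    # Stand-in for one template f in the model library: fit y ~ a*x + b.
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # (a, b)

def fit_score(coef, x, y):
    # Normalized fitness; 1.0 means a perfect fit.
    resid = y - (coef[0] * x + coef[1])
    return 1.0 - np.linalg.norm(resid) / (np.linalg.norm(y - y.mean()) + 1e-12)

def search_invariants(windows, threshold=0.9):
    # windows: list of dicts, measurement name -> samples for one time window.
    # Learn candidates on the first window, then validate sequentially.
    first, names = windows[0], sorted(windows[0])
    candidates, confidence = {}, {}
    for i, j in itertools.combinations(names, 2):
        coef = learn_linear(first[i], first[j])
        if fit_score(coef, first[i], first[j]) >= threshold:
            candidates[(i, j)] = coef   # invariant candidate f_ij
            confidence[(i, j)] = 1      # p_k: number of windows survived
    for window in windows[1:]:
        for (i, j) in list(candidates):
            if fit_score(candidates[(i, j)], window[i], window[j]) >= threshold:
                confidence[(i, j)] += 1
            else:                        # the relationship broke: a variant
                del candidates[(i, j)], confidence[(i, j)]
    return candidates, confidence
```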

9 One Example in the Model Library
We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow-intensity measurements x and y. The standard ARX(n, m, k) form is
y(t) + a1*y(t-1) + ... + an*y(t-n) = b0*x(t-k) + ... + bm*x(t-k-m).
Given a sequence of real observations, we learn the coefficients with least mean squares (LMS) by minimizing the modeling error. A fitness function can then be used to evaluate how well the learned model fits the real data.
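A sketch of fitting such a model by least squares (the order parameters n, m, k, the demo data, and the particular normalized fitness score are illustrative choices, not the paper's exact setup):

```python
import numpy as np

def fit_arx(x, y, n=1, m=0, k=0):
    # Fit y(t) + a1*y(t-1) + ... + an*y(t-n) = b0*x(t-k) + ... + bm*x(t-k-m).
    # Returns theta = [a1..an, b0..bm] and a normalized fitness score.
    start = max(n, k + m)
    rows = [[-y[t - i] for i in range(1, n + 1)] +
            [x[t - k - i] for i in range(m + 1)]
            for t in range(start, len(y))]
    A, target = np.array(rows), y[start:]
    theta, *_ = np.linalg.lstsq(A, target, rcond=None)
    y_hat = A @ theta
    fitness = 1.0 - (np.linalg.norm(target - y_hat) /
                     (np.linalg.norm(target - target.mean()) + 1e-12))
    return theta, fitness

# Demo: y(t) = 0.8*y(t-1) + 0.5*x(t), i.e. a1 = -0.8, b0 = 0.5.
rng = np.random.default_rng(0)
x = rng.random(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + 0.5 * x[t]
theta, score = fit_arx(x, y, n=1, m=0, k=0)
print(theta, round(score, 3))  # theta ≈ [-0.8, 0.5], score ≈ 1.0
```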

10 Value Propagation with Invariants
Extracted invariants can be chained across multiple hops: from y = f(x), z = g(y), u = h(x), and v = s(u) (each learned with the ARX model), value propagation yields the converged set z = g(f(x)) and v = s(h(x)). A value of x can therefore be translated into an equivalent value anywhere in the invariant network.
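A tiny sketch of the idea, with made-up linear invariants standing in for the learned models:

```python
# Hypothetical learned invariants (coefficients are invented):
f = lambda x: 1.8 * x + 3.0    # y = f(x)
g = lambda y: 0.5 * y - 1.0    # z = g(y)
h = lambda x: 0.2 * x + 5.0    # u = h(x)
s = lambda u: 4.0 * u + 2.0    # v = s(u)

# Multi-hop value propagation: the converged set.
z_of_x = lambda x: g(f(x))     # z = g(f(x))
v_of_x = lambda x: s(h(x))     # v = s(h(x))
print(z_of_x(70.0), v_of_x(70.0))  # x = 70 translated into z's and v's units
```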

11 Rules and Fault Model
A rule is a predicate-action pair. For each rule we consider a fault model: the probability of fault occurrence as a function of the monitored value x, ranging from 0 to 1. The ideal model is a step function at the threshold x_T; a realistic model instead rises gradually around x_T, so the rule produces both false positives and false negatives.

12 Probability of Reporting a True Positive
The importance of an alert is the Probability of Reporting a True Positive (PRTP) generated by the observed value x.
Even a very small false-positive rate leads to a large number of false-positive reports. Example: if one measurement is checked every minute with a false-positive rate of 0.1%, that is 60 x 24 x 365 x 0.1% ≈ 526 false-positive reports per year. And what if there are thousands of measurements? In one real operation-support system, 80% of reports were false positives.
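The slides leave the realistic fault model's curve implicit; a logistic function is one plausible stand-in (the shape and steepness here are assumptions, not the paper's exact model):

```python
import math

def prtp(x, x_t, steepness=0.5):
    # Probability of Reporting a True Positive for observed value x,
    # modeled as a logistic curve centered on the rule threshold x_t.
    # The ideal model would instead be a step: 0 below x_t, 1 above.
    return 1.0 / (1.0 + math.exp(-steepness * (x - x_t)))

print(round(prtp(71.0, 70.0), 2))  # barely over the threshold -> 0.62
print(round(prtp(85.0, 70.0), 2))  # far over the threshold    -> 1.0
```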

13 Local Context Mapping to Global Context
The four rules (CPU% @Web > 70%, DiskUsg @Web > 150, CPU% @DB > 60%, Network @AP > 35k) live on different machines (Web, AP, DB) and have different local semantics. Invariants provide a global context, e.g. for CPU%@Web:
CPU%@Web = fa(Network@AP), CPU%@Web = fb(CPU%@DB), CPU%@Web = fc(DiskUsg@Web).
Mapping each rule's threshold through these functions places all four on the fault model (PRTP axis) of CPU%@Web: x_CPU@DB = fb(60%), x_DiskUsg@Web = fc(150), x_Network@AP = fa(35k), alongside CPU%@Web's own threshold x_T.
Here Prob(true | x_CPU@DB) > Prob(true | x_T) > Prob(true | x_DiskUsg@Web) > Prob(true | x_Network@AP), so the ranking is Alert 3 > Alert 1 > Alert 2 > Alert 4.
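A sketch of the mapping step. The linear forms of fa, fb, fc below are invented so that the mapped thresholds reproduce the slide's ordering; prtp() is the logistic stand-in from the previous sketch:

```python
import math
prtp = lambda x, x_t: 1.0 / (1.0 + math.exp(-0.5 * (x - x_t)))

# Invented invariants into CPU%@Web's local context:
fa = lambda net: 0.0018 * net   # CPU%@Web = fa(Network@AP)
fb = lambda cpu: 1.3 * cpu      # CPU%@Web = fb(CPU%@DB)
fc = lambda disk: 0.45 * disk   # CPU%@Web = fc(DiskUsg@Web)

mapped = {
    "Alert 1 (CPU%@Web > 70)":     70.0,         # x_T, already local
    "Alert 2 (DiskUsg@Web > 150)": fc(150.0),    # -> 67.5
    "Alert 3 (CPU%@DB > 60)":      fb(60.0),     # -> 78.0
    "Alert 4 (Network@AP > 35k)":  fa(35000.0),  # -> 63.0
}
ranking = sorted(mapped, key=lambda a: -prtp(mapped[a], 70.0))
print(ranking)  # Alert 3 > Alert 1 > Alert 2 > Alert 4, as on the slide
```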

14 Local Context Mapping to Global Context
The same four thresholds can instead be mapped onto the fault model of Network@AP, giving x_CPU@Web, x_CPU@DB, and x_DiskUsg@Web alongside Network@AP's own threshold x_T.
Again Prob(true | x_CPU@DB) > Prob(true | x_CPU@Web) > Prob(true | x_DiskUsg@Web) > Prob(true | x_T), so the alert ranking does not change: Alert 3 > Alert 1 > Alert 2 > Alert 4.

15 Alerts Ranking Process
Online, step 4: at the time alerts are received, rank the real alerts.

16 Ranking Alerts (Case I)
Case I: we receive ONLY alerts, with no monitoring data from the components.
Offline, the alert rules are sorted using the system-invariants network together with the operators' knowledge and the configuration: Alert 6, 2, 3, 7, 5, 9, 1, 8, 4.
When five alerts are generated (Alerts 1, 2, 3, 5, 7), the ranking simply follows the sorted rule order: 1. Alert 2, 2. Alert 3, 3. Alert 7, 4. Alert 5, 5. Alert 1. See the sketch below.
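Case I reduces to a lookup against the offline ordering; a minimal sketch with the slide's alert numbers:

```python
# Offline result of step 3: all nine rules sorted by importance.
sorted_rules = [6, 2, 3, 7, 5, 9, 1, 8, 4]

# Online: five alerts fire; rank them by the precomputed rule order.
fired = {1, 2, 3, 5, 7}
ranking = [r for r in sorted_rules if r in fired]
print(ranking)  # [2, 3, 7, 5, 1]: Alert 2 ranked first, Alert 1 last
```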

17 Ranking Alerts (Case II)
Case II: we receive both alerts and monitoring data from the components.
Each observed value is placed on its measurement's fault model together with all the mapped thresholds, and we count the Number of Threshold Violations (NTV). Example: the observed value X(CPU%Web) gives NTV = 3 on CPU%@Web's fault model, while X(Network@AP) gives NTV = 2 on Network@AP's fault model, so the alert from CPU%@Web is more important than the one from Network@AP.
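A sketch of the NTV count (the threshold lists below are illustrative; the real ones come from the mapped-threshold tables on the later slides):

```python
def ntv(observed, local_thresholds):
    # Number of Threshold Violations: how many thresholds (the rule's own
    # plus the peers' thresholds mapped into these local units) the
    # observed value exceeds.
    return sum(observed > t for t in local_thresholds)

print(ntv(82.0, [70.0, 75.0, 78.0, 85.0]))       # -> 3, e.g. for CPU%@Web
print(ntv(36000, [35000, 33000, 40000, 50000]))  # -> 2, e.g. for Network@AP
```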

18 Index
Introduction
– Motivation & Goal
System Invariants
– Invariants extraction
– Value propagation
Collaborative peer-review mechanism
– Rules & Fault model
– Ranking alerts
Experimental results
Conclusion

19 Experimental System
Flow intensities measured in the experimental system:
– the number of EJBs created at time t
– the JVM processing time at time t
– the number of SQL queries at time t
[Figure: system topology (components A, B, C, D) and invariant examples among these measurements.]

20 Extracted Invariants Network
[Figure: the extracted invariants network connecting measurements m1 through m6.]

21 Thresholds of Measurements
Rule thresholds: m1 > 70, m2 > 30000, m3 > 80, m4 > 30000, m5 > 70, m6 > 20000.
Mapped into m1's local context, the other five thresholds become 63.6, 70.2, 70.5, 77.0, and 59.8, alongside m1's own threshold of 70.

22 Thresholds of Measurements
Each row gives a measurement's own rule threshold followed by the other five rules' thresholds mapped into that measurement's local units:
m1 (threshold 70): 63.6, 70.2, 70.5, 77.0, 59.8
m2 (threshold 30000): 32726, 33006, 33212, 36316, 28207
m3 (threshold 80): 71.4, 78.0, 86.4, 81.0, 66.9
m4 (threshold 30000): 29540, 29646, 32613, 25469, 27018
m5 (threshold 70): 57.4, 62.8, 63.7, 54.1, 63.0
m6 (threshold 20000): 23208, 23291, 25688, 21200, 23509

23 Ranking Alerts with NTVs (1)
Using the threshold table above, compare each observed value against all six thresholds in its row (its own plus the five mapped ones) and count the Number of Threshold Violations:
m1: observed 73.6, NTV = 5
m2: observed 34319, NTV = 5
m3: observed 81.6, NTV = 5
m4: observed 30621, NTV = 5
m5: observed 71.4, NTV = 6
m6: observed 22620, NTV = 2
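The NTV row follows mechanically from the table; a quick check (same counting rule as the ntv() sketch above):

```python
thresholds = {  # own rule threshold first, then the five mapped thresholds
    "m1": [70, 63.6, 70.2, 70.5, 77.0, 59.8],
    "m2": [30000, 32726, 33006, 33212, 36316, 28207],
    "m3": [80, 71.4, 78.0, 86.4, 81.0, 66.9],
    "m4": [30000, 29540, 29646, 32613, 25469, 27018],
    "m5": [70, 57.4, 62.8, 63.7, 54.1, 63.0],
    "m6": [20000, 23208, 23291, 25688, 21200, 23509],
}
observed = {"m1": 73.6, "m2": 34319, "m3": 81.6,
            "m4": 30621, "m5": 71.4, "m6": 22620}
ntvs = {m: sum(observed[m] > t for t in ts) for m, ts in thresholds.items()}
print(ntvs)  # {'m1': 5, 'm2': 5, 'm3': 5, 'm4': 5, 'm5': 6, 'm6': 2}
# Highest NTV first: the alert on m5 ranks most important.
```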

24 Ranking Alerts with NTVs (1)
[Figure: experimental result for case (1).]

25 Ranking Alerts with NTVs (2)
In a second experiment, only m1 and m2 exceed their own thresholds:
m1: observed 73.5, NTV = 5
m2: observed 31478, NTV = 2
m3: observed 54.6, no alert
m4: observed 22712, no alert
m5: observed 46.1, no alert
m6: observed 18564, no alert
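The same check reproduces this experiment's NTVs (reusing the thresholds table from the previous sketch); rules whose own threshold is not crossed raise no alert:

```python
observed2 = {"m1": 73.5, "m2": 31478, "m3": 54.6,
             "m4": 22712, "m5": 46.1, "m6": 18564}
ntvs2 = {m: sum(observed2[m] > t for t in ts)
         for m, ts in thresholds.items()
         if observed2[m] > ts[0]}       # only measurements whose rule fired
print(ntvs2)  # {'m1': 5, 'm2': 2}: the m1 alert is ranked first
```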

26 Ranking Alerts with NTVs (2)
For this case, a problem (an SCP file copy) was injected into the Web server.
[Figure: experimental result for case (2).]

27 Conclusion
We introduced a peer-review mechanism to rank alerts from heterogeneous components:
– by mapping the local thresholds of various rules into their equivalent values in a global context
– based on the system-invariants network model
It supports operators' consultation when prioritizing problem determination.

28 Thank You! Questions?

