Application-level logs: visualization and anomaly detection

Peter Bodík UC Berkeley

Introduction
previous work:
  visualization of HTTP access logs
  automatic detection and localization of anomalies
can we extend this to application-level logs?
preliminary work on application logs from Amazon.com

Overview
review work with Ebates.com
capturing application behavior at Amazon.com: application logs
visualization of application logs
anomaly detection

Work with Ebates.com
HTTP access logs
  analyzed the top 40 pages (98% of all traffic)
detection of anomalies
  compare current traffic with normal traffic
  Naïve Bayes, chi-square test
visualization
  easy to notice anomalies
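
The comparison of current traffic with normal traffic could be sketched as a Pearson chi-square statistic over per-page hit counts. This is only a minimal illustration, not the detector used in the Ebates.com work: the function name `chi_square_statistic` and the hit counts are made up, and only the page names come from the sample anomaly in this deck. With k pages, the statistic would be compared against a chi-square distribution with k - 1 degrees of freedom.

```python
def chi_square_statistic(normal_counts, current_counts):
    """Pearson chi-square of current page hits against the normal traffic mix.

    normal_counts / current_counts: dict mapping page -> number of hits.
    Pages that never appear in normal_counts are ignored (an assumption).
    Returns (total statistic, per-page contributions).
    """
    total_normal = sum(normal_counts.values())
    if total_normal == 0:
        raise ValueError("normal traffic must contain at least one hit")
    total_current = sum(current_counts.get(p, 0) for p in normal_counts)
    per_page = {}
    for page, normal in normal_counts.items():
        # Hits we would expect now if the traffic mix were unchanged.
        expected = total_current * normal / total_normal
        if expected == 0:
            continue
        observed = current_counts.get(page, 0)
        per_page[page] = (observed - expected) ** 2 / expected
    return sum(per_page.values()), per_page


# Made-up counts, for illustration only.
normal = {"/landing.jsp": 500, "/landing_merchant.jsp": 300, "/mall_ctrl.jsp": 200}
current = {"/landing.jsp": 90, "/landing_merchant.jsp": 5, "/mall_ctrl.jsp": 5}
statistic, contributions = chi_square_statistic(normal, current)
# Pages with the largest contribution are the "most anomalous pages".
for page in sorted(contributions, key=contributions.get, reverse=True):
    print(page, round(contributions[page], 2))
```

Ranking pages by their individual contribution is one simple way to localize an anomaly to specific pages.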

Sample anomaly
warning #3:
  detection time: Sun Nov 16 19:27:00 PST 2003
  start: Sun Nov 16 19:24:00 PST 2003
  end: Sun Nov 16 21:05:00 PST 2003
  significance = 7.05
Most anomalous pages:
  /landing.jsp
  /landing_merchant.jsp
  /mall_ctrl.jsp
How long did it take you to read this?

Visualization of the same anomaly

Conclusion
Pros:
  anomaly detection worked
  visualization worked even better
    makes false positives “cheaper”
  able to detect/localize problems earlier
Cons:
  only looks at web pages
  won’t tell you enough about the problem
  night anomalies

Anomaly score in one dataset
night anomalies

Application-level logs
Can we use the same approach on app-level logs?
analyzed logs from 3 failures in Amazon.com

Amazon.com
[Diagram: a user request passes through services A through H and an HTML page is returned]

Application logs
request calls operation() in service B
service B makes remote calls to other services (C, D, E)
every request recorded in an application log

How to visualize application logs?
Is this similar to HTTP access logs?
HTTP logs:
  user requests a web page
  count number of hits to a page
application logs:
  request calls remote methods
  count number of calls to a method
  count number of requests to an operation
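
Both kinds of counting reduce to the same operation: bucket events by time window and count occurrences of a name (a page, a method, or an operation). The sketch below assumes a hypothetical event format of `(timestamp_seconds, name)` pairs and a one-minute window; neither is specified in the slides.

```python
from collections import Counter, defaultdict


def count_per_window(events, window_seconds=60):
    """Count occurrences of each name within fixed time windows.

    events: iterable of (timestamp_seconds, name), where name is a page,
            a method, or an operation (hypothetical format).
    Returns {window_start_seconds: Counter({name: count})}.
    """
    windows = defaultdict(Counter)
    for timestamp, name in events:
        # Align the timestamp to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][name] += 1
    return dict(windows)


# The same function serves HTTP logs (pages) and application logs (methods).
http_hits = [(0, "/landing.jsp"), (30, "/landing.jsp"), (61, "/mall_ctrl.jsp")]
print(count_per_window(http_hits))
```

Each window's counts can then be plotted over time, one row per page or method.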

Summary of features
for every method M:
  count #requests that called M
  #calls to M per request
  average execution time of M
#requests for operation O
#requests to host H
average Time, UserTime, SystemTime
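
The per-method features could be computed along the following lines. The record layout `(request_id, method, exec_time_seconds)` is hypothetical, since the slides do not describe the log format, and "#calls to M per request" is interpreted here as calls per request that invoked M at least once.

```python
from collections import defaultdict


def method_features(calls):
    """Aggregate per-method features from call records.

    calls: iterable of (request_id, method, exec_time_seconds)
           (hypothetical record layout).
    Returns {method: {"requests", "calls_per_request", "avg_exec_time"}}.
    """
    n_calls = defaultdict(int)
    total_time = defaultdict(float)
    requests = defaultdict(set)
    for request_id, method, exec_time in calls:
        n_calls[method] += 1
        total_time[method] += exec_time
        requests[method].add(request_id)
    return {
        method: {
            # Number of distinct requests that called this method.
            "requests": len(requests[method]),
            # Calls per request, over requests that called it at least once.
            "calls_per_request": n_calls[method] / len(requests[method]),
            "avg_exec_time": total_time[method] / n_calls[method],
        }
        for method in n_calls
    }


sample = [("r1", "M", 0.1), ("r1", "M", 0.3), ("r2", "M", 0.2), ("r2", "N", 1.0)]
print(method_features(sample))
```

Counts per operation O or per host H would follow the same pattern with a different grouping key.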

Failure 1
from Amazon.com operators: “problem from 12:37 to 12:41, affected most services, applications recovered at 12:52”

Failure 2
from Amazon.com operators: “10 minute outage from 12:52 to 13:02”

Failure 3
misconfiguration
have logs only from one service
no anomalies visible

Modeling method frequencies
same as for page frequencies in HTTP logs
assumptions:
  frequency of a page/method doesn’t change during the day
  frequencies are independent
model the frequency as a Gaussian
  compute mean and variance
anomaly score = negative log likelihood
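
A minimal sketch of this model: fit a mean and variance per method from historical frequencies, then score a new observation by its negative log likelihood under the Gaussian. Because the frequencies are assumed independent, the per-method scores add up. The function names and the small variance floor are assumptions made for this example.

```python
import math


def fit_gaussian(samples):
    """Return (mean, variance) of historical frequencies for one method."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, variance


def negative_log_likelihood(x, mean, variance):
    """Negative log density of x under a Gaussian with the given parameters."""
    variance = max(variance, 1e-9)  # floor to avoid division by zero (assumption)
    return 0.5 * math.log(2 * math.pi * variance) + (x - mean) ** 2 / (2 * variance)


def anomaly_score(observed, models):
    """Sum of per-method scores; valid under the independence assumption.

    observed: {method: current frequency}
    models:   {method: (mean, variance)}
    """
    return sum(
        negative_log_likelihood(observed[m], mean, variance)
        for m, (mean, variance) in models.items()
    )


# Made-up frequencies, for illustration only.
models = {"M": fit_gaussian([10.0, 12.0, 11.0, 9.0])}
print(anomaly_score({"M": 11.0}, models))  # close to the mean: low score
print(anomaly_score({"M": 40.0}, models))  # far from the mean: high score
```

A method whose frequency has high variance yields a flat Gaussian, so large deviations still score low; a mean that drifts during the day violates the first assumption.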

Negative log likelihood/anomaly score over time

Anomalous method
[Plot: frequency vs. time [hours]; annotations: “anomaly” and “higher variance, causes false positives”]

... another method
[Plot: frequency vs. time [hours]; annotation: “mean changes over time”]

… and another one
[Plot: frequency vs. time [hours]; annotation: “wouldn’t notice this peak, causes false negatives”]

Well ...
this is just another source of anomalies!
How do we know what’s really broken?
  sort anomalies by time of detection
  only the early anomalies are important
fine-grained anomalies
  detect problem earlier
  earlier warning
  likely root cause
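
Sorting anomalies by time of detection and keeping only the early ones might look like this. The `(metric, first_detected_at_seconds)` format, the metric names, and the two-minute cutoff are assumptions for illustration; the slides do not say how "early" is defined.

```python
def earliest_anomalies(detections, window_seconds=120):
    """Return detections that fall within window_seconds of the first one.

    detections: iterable of (metric, first_detected_at_seconds)
                (hypothetical format).
    The result is sorted by detection time, earliest first.
    """
    ordered = sorted(detections, key=lambda d: d[1])
    if not ordered:
        return []
    first_time = ordered[0][1]
    # Later anomalies are treated as downstream effects and dropped.
    return [d for d in ordered if d[1] - first_time <= window_seconds]


# Made-up detections, for illustration only.
detections = [("B.operation", 300), ("A.call", 100), ("C.operation", 150)]
print(earliest_anomalies(detections))
```

The metrics at the head of the list are the candidates for the likely root cause.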

Library of failures
signature of a failure:
  not just which metrics are anomalous
  but also when they became anomalous
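
One possible way to represent such a signature is a mapping from each anomalous metric to the delay between the first detection and the moment that metric became anomalous. The sketch below is an assumption-heavy illustration, not a method described in the slides: the distance function, the `missing_penalty` value, and the failure names are all hypothetical.

```python
def failure_signature(detections):
    """Map each anomalous metric to its onset delay after the first detection.

    detections: {metric: first_detected_at_seconds}
    """
    if not detections:
        return {}
    first = min(detections.values())
    return {metric: t - first for metric, t in detections.items()}


def signature_distance(sig_a, sig_b, missing_penalty=600.0):
    """Average onset-delay difference; metrics present in only one signature
    cost missing_penalty seconds (an arbitrary choice for this sketch)."""
    metrics = set(sig_a) | set(sig_b)
    if not metrics:
        return 0.0
    total = 0.0
    for metric in metrics:
        if metric in sig_a and metric in sig_b:
            total += abs(sig_a[metric] - sig_b[metric])
        else:
            total += missing_penalty
    return total / len(metrics)


# Hypothetical library of past failures and a new, unlabeled failure.
library = {
    "failure-x": failure_signature({"A.call": 100, "B.operation": 130}),
    "failure-y": failure_signature({"C.operation": 50}),
}
new = failure_signature({"A.call": 900, "B.operation": 940})
closest = min(library, key=lambda name: signature_distance(new, library[name]))
print(closest)
```

Using delays relative to the first detection makes signatures comparable across failures that happened at different times of day.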
