Application-level logs: visualization and anomaly detection. Peter Bodík, UC Berkeley.
Introduction. Previous work: visualization of HTTP access logs; automatic detection and localization of anomalies. Can we extend this to application-level logs? Preliminary work on application logs from Amazon.com.
Overview: review of the work with Ebates.com; capturing application behavior at Amazon.com: application logs; visualization of application logs; anomaly detection.
Work with Ebates.com: HTTP access logs; analyzed the top 40 pages (98% of all traffic). Detection of anomalies: compare current traffic with normal traffic (Naïve Bayes, chi-square test). Visualization: easy to notice anomalies.
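Below is a minimal sketch of the chi-square part of such a detector, assuming hits per page are counted in fixed time windows and compared against proportions learned from normal traffic; the page counts, proportions, and threshold are illustrative, not the actual Ebates.com setup.

# Compare the current per-page hit distribution against the learned "normal" one.
from scipy.stats import chisquare

current_hits  = [120, 95, 40, 10]          # hits per top page in the current window
expected_frac = [0.45, 0.35, 0.15, 0.05]   # normal traffic proportions learned from history

total = sum(current_hits)
expected_hits = [f * total for f in expected_frac]

stat, p_value = chisquare(current_hits, f_exp=expected_hits)
if p_value < 0.01:
    print(f"traffic distribution anomalous (chi2={stat:.1f}, p={p_value:.3g})")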
Sample anomaly. warning #3: detection time: Sun Nov 16 19:27:00 PST 2003; start: Sun Nov 16 19:24:00 PST 2003; end: Sun Nov 16 21:05:00 PST 2003; significance = 7.05. Most anomalous pages: /landing.jsp 19.55, /landing_merchant.jsp 19.50, /mall_ctrl.jsp 3.69. How long did it take you to read this?
Visualization of the same anomaly
Conclusion. Pros: anomaly detection worked; visualization worked even better; makes false positives “cheaper”; able to detect/localize problems earlier. Cons: only looks at web pages; won’t tell you enough about the problem; night anomalies.
Anomaly score in one dataset [plot of anomaly score over time; recurring anomalies appear at night]
Application-level logs: can we use the same approach on app-level logs? Analyzed logs from 3 failures at Amazon.com.
Amazon.com [diagram: a user request fans out across services A–H and an HTML page is returned]
Application logs: every request is recorded in an application log; the request calls operation() in service B; service B makes remote calls to other services C, D, E.
How to visualize application logs? Is this similar to HTTP access logs? HTTP logs: a user requests a web page; count the number of hits to a page. Application logs: a request calls remote methods; count the number of calls to a method and the number of requests to an operation.
Summary of features, for every method M: #requests that called M; #calls to M per request; average execution time of M; #requests for operation O; #requests to host H; average Time, UserTime, SystemTime.
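A hypothetical sketch of extracting these per-method features from parsed log records follows; the record fields (request_id, operation, method, host, exec_time) are assumed names, not the actual Amazon.com log schema.

from collections import defaultdict

def summarize(records):
    """records: iterable of dicts, one per remote method call (hypothetical schema)."""
    calls_per_method    = defaultdict(int)    # total #calls to M
    requests_per_method = defaultdict(set)    # requests that called M
    exec_times          = defaultdict(list)   # execution times of M
    requests_per_op     = defaultdict(set)    # requests for operation O
    requests_per_host   = defaultdict(set)    # requests to host H

    for r in records:
        m = r["method"]
        calls_per_method[m] += 1
        requests_per_method[m].add(r["request_id"])
        exec_times[m].append(r["exec_time"])
        requests_per_op[r["operation"]].add(r["request_id"])
        requests_per_host[r["host"]].add(r["request_id"])

    features = {}
    for m in calls_per_method:
        n_req = len(requests_per_method[m])
        features[m] = {
            "requests_calling_m": n_req,
            "calls_per_request":  calls_per_method[m] / n_req,
            "avg_exec_time":      sum(exec_times[m]) / len(exec_times[m]),
        }
    per_op   = {o: len(s) for o, s in requests_per_op.items()}
    per_host = {h: len(s) for h, s in requests_per_host.items()}
    return features, per_op, per_host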
Failure 1 from Amazon.com operators: “problem from 12:37 to 12:41, affected most services, applications recovered at 12:52”
Failure 2 from Amazon.com operators: “10 minute outage from 12:52 to 13:02”
Failure 3: a misconfiguration; we have logs only from one service; no anomalies visible.
Modeling method frequencies: same as for page frequencies in HTTP logs. Assumptions: the frequency of a page/method doesn’t change during the day, and frequencies are independent. Model each frequency as a Gaussian: compute its mean and variance; the anomaly score is the negative log likelihood.
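A minimal sketch of this model for a single method, assuming its call frequency is sampled per fixed interval; the numbers are illustrative.

import math

def fit_gaussian(freqs):
    """Fit mean and variance of per-interval call frequency on normal data."""
    mu = sum(freqs) / len(freqs)
    var = sum((f - mu) ** 2 for f in freqs) / len(freqs)
    return mu, max(var, 1e-6)                 # guard against zero variance

def anomaly_score(freq, mu, var):
    """Negative log likelihood of freq under N(mu, var)."""
    return 0.5 * math.log(2 * math.pi * var) + (freq - mu) ** 2 / (2 * var)

normal = [102, 98, 110, 95, 101, 99]          # calls/minute during normal operation
mu, var = fit_gaussian(normal)
print(anomaly_score(240, mu, var))            # a large spike gets a high score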
Negative log likelihood/anomaly score over time
Anomalous method [plot of frequency vs. time in hours; the anomaly is visible, but the higher variance causes false positives]
... another method [plot of frequency vs. time in hours; the mean changes over time]
… and another one [plot of frequency vs. time in hours; the model wouldn’t notice this peak, causing false negatives]
Well ... this is just another source of anomalies! How do we know what’s really broken? Sort anomalies by time of detection; only the early anomalies are important. Fine-grained anomalies detect the problem earlier; the earliest warnings point at the likely root cause.
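An illustrative sketch of this ranking step; the metric names and timestamps are invented for the example.

anomalies = [
    {"metric": "serviceB.calls_per_request", "first_anomalous": 752.0},
    {"metric": "serviceC.avg_exec_time",     "first_anomalous": 750.5},
    {"metric": "frontend.error_rate",        "first_anomalous": 755.0},
]
for a in sorted(anomalies, key=lambda a: a["first_anomalous"]):
    print(a["first_anomalous"], a["metric"])   # earliest first: likely root cause on top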
Library of failures. The signature of a failure is not just which metrics are anomalous, but also when they became anomalous.
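One hypothetical way to represent and match such signatures is sketched below; the representation (onsets relative to the first anomaly) and the similarity measure are assumptions, not a method described in the talk.

def signature(anomalies):
    """anomalies: {metric: first_anomalous_time}; onsets normalized to t=0."""
    t0 = min(anomalies.values())
    return {m: t - t0 for m, t in anomalies.items()}

def similarity(sig_a, sig_b, window=60.0):
    """Fraction of metrics anomalous in both signatures with onsets within `window` seconds."""
    shared = [m for m in sig_a if m in sig_b and abs(sig_a[m] - sig_b[m]) <= window]
    return len(shared) / max(len(set(sig_a) | set(sig_b)), 1)

past = signature({"B.calls": 751.0, "C.latency": 750.0})
new  = signature({"B.calls": 901.5, "C.latency": 900.0, "D.errors": 930.0})
print(similarity(past, new))                   # 2 of 3 metrics match closely -> ~0.67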