Published by Sugiarto Jayadi. Modified over 6 years ago.
1
Application-level logs: visualization and anomaly detection
Peter Bodík UC Berkeley
2
Introduction
previous work:
  visualization of HTTP access logs
  automatic detection and localization of anomalies
can we extend this to application-level logs?
preliminary work on application logs from Amazon.com
3
Overview
review work with Ebates.com
capturing application behavior at Amazon.com: application logs
visualization of application logs
anomaly detection
4
Work with Ebates.com
HTTP access logs
  analyzed the top 40 pages (98% of all traffic)
detection of anomalies
  compare current traffic with normal traffic
  Naïve Bayes, chi-square test
visualization
  easy to notice anomalies
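The comparison of current traffic with normal traffic can be illustrated with a Pearson chi-square statistic over per-page hit counts. This is only a minimal sketch: the function name, the dictionary layout, and all hit counts are assumptions made for illustration, not the detector or data used with Ebates.com.

```python
def chi_square_statistic(normal_hits, current_hits):
    """Pearson chi-square statistic comparing current page hits with the
    page mix seen in normal traffic (both arguments: page -> hit count).

    Assumes every page in normal_hits has a non-zero count; pages that
    appear only in current_hits are ignored in this sketch.
    """
    normal_total = sum(normal_hits.values())
    current_total = sum(current_hits.values())
    statistic = 0.0
    for page, normal_count in normal_hits.items():
        # Expected hits if current traffic followed the normal page mix.
        expected = current_total * normal_count / normal_total
        observed = current_hits.get(page, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts, not data from the talk.
normal = {"/landing.jsp": 600, "/landing_merchant.jsp": 300, "/mall_ctrl.jsp": 100}
same_mix = {"/landing.jsp": 60, "/landing_merchant.jsp": 30, "/mall_ctrl.jsp": 10}
shifted = {"/landing.jsp": 20, "/landing_merchant.jsp": 30, "/mall_ctrl.jsp": 50}

print(chi_square_statistic(normal, same_mix))  # same page mix: statistic is zero
print(chi_square_statistic(normal, shifted))   # shifted page mix: large statistic
```

A larger statistic means the current page mix deviates more from normal traffic; the threshold at which a warning fires is a separate tuning choice.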
5
Sample anomaly
warning #3:
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
significance = 7.05
Most anomalous pages: /landing.jsp /landing_merchant.jsp /mall_ctrl.jsp
How long did it take you to read this?
6
Visualization of the same anomaly
7
Conclusion
Pros:
  anomaly detection worked
  visualization worked even better
  makes false positives “cheaper”
  able to detect/localize problems earlier
Cons:
  only looks at web pages
  won’t tell you enough about the problem
  night anomalies
8
Anomaly score in one dataset
night anomalies
9
Application-level logs
Can we use the same approach on app-level logs?
analyzed logs from 3 failures in Amazon.com
10
Amazon.com
[diagram: a user request passes through services A, B, C, D, E, F, G, H and comes back as an HTML page]
11
Application logs
request calls operation() in service B
service B makes remote calls to other services C, D, E
every request recorded in an application log
12
How to visualize application logs?
Is this similar to HTTP access logs?
HTTP logs:
  user requests a web page
  count number of hits to a page
application logs:
  request calls remote methods
  count number of calls to a method
  count number of requests to an operation
13
Summary of features
for every method M:
  count #requests that called M
  #calls to M per request
  average execution time of M
#requests for operation O
#requests to host H
average Time, UserTime, SystemTime
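The features listed above could be computed from parsed log records roughly as follows. This is a sketch under stated assumptions: the record layout and field names (`operation`, `host`, `time`, `calls`) are hypothetical, not the Amazon.com log format, and "#calls to M per request" is interpreted here as calls per request that called M. UserTime and SystemTime would be averaged the same way as Time.

```python
from collections import defaultdict

# Hypothetical log records, one dict per request. Field names are
# assumptions for illustration only.
requests = [
    {"operation": "O1", "host": "H1", "time": 0.30,
     "calls": [("M1", 0.10), ("M1", 0.05), ("M2", 0.10)]},
    {"operation": "O1", "host": "H2", "time": 0.20,
     "calls": [("M2", 0.15)]},
]


def summarize(requests):
    requests_calling = defaultdict(int)   # method -> #requests that called it
    total_calls = defaultdict(int)        # method -> total #calls
    total_exec_time = defaultdict(float)  # method -> summed execution time
    per_operation = defaultdict(int)      # operation -> #requests
    per_host = defaultdict(int)           # host -> #requests
    for req in requests:
        per_operation[req["operation"]] += 1
        per_host[req["host"]] += 1
        # Count each method once per request, however often it was called.
        for method in {m for m, _ in req["calls"]}:
            requests_calling[method] += 1
        for method, exec_time in req["calls"]:
            total_calls[method] += 1
            total_exec_time[method] += exec_time
    methods = {
        m: {
            "requests": requests_calling[m],
            "calls_per_request": total_calls[m] / requests_calling[m],
            "avg_exec_time": total_exec_time[m] / total_calls[m],
        }
        for m in total_calls
    }
    avg_time = sum(r["time"] for r in requests) / len(requests)
    return methods, dict(per_operation), dict(per_host), avg_time


methods, per_operation, per_host, avg_time = summarize(requests)
print(methods["M1"], per_operation, per_host, avg_time)
```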
14
Failure 1
from Amazon.com operators: “problem from 12:37 to 12:41, affected most services, applications recovered at 12:52”
15
Failure 2
from Amazon.com operators: “10 minute outage from 12:52 to 13:02”
16
Failure 3
misconfiguration
have logs only from one service
no anomalies visible
17
Modeling method frequencies
same as for page frequencies in HTTP logs
assumption:
  frequency of a page/method doesn’t change during the day
  frequencies are independent
model the frequency as a Gaussian
  compute mean and variance
anomaly score
  negative log likelihood
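A minimal sketch of this model: fit one Gaussian per method from its past frequencies, then score a new observation by its negative log likelihood. Because the frequencies are assumed independent, the per-method terms simply add up. The function names and all numbers are hypothetical, and the sketch assumes every method has non-zero variance in its history.

```python
import math


def fit_gaussian(samples):
    """Mean and (population) variance of one method's frequency samples."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, variance


def anomaly_score(observed, models):
    """Negative log likelihood of the observed frequencies.

    Per method: 0.5 * log(2 * pi * var) + (x - mean)^2 / (2 * var).
    Independence between methods lets the terms be summed.
    """
    score = 0.0
    for method, x in observed.items():
        mean, variance = models[method]
        score += 0.5 * math.log(2 * math.pi * variance)
        score += (x - mean) ** 2 / (2 * variance)
    return score


# Hypothetical frequency history (fraction of requests calling each method).
history = {"M1": [0.50, 0.52, 0.48, 0.50], "M2": [0.20, 0.22, 0.18, 0.20]}
models = {m: fit_gaussian(xs) for m, xs in history.items()}

normal_score = anomaly_score({"M1": 0.51, "M2": 0.20}, models)
failure_score = anomaly_score({"M1": 0.10, "M2": 0.20}, models)
print(normal_score, failure_score)  # the second score is much higher
```

Since the mean and variance are fixed for the whole day, a method whose frequency drifts over time or has high variance will score poorly under this model even when nothing is broken.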
18
Negative log likelihood/anomaly score over time
19
Anomalous method
[plot: frequency vs. time [hours]; annotations: anomaly; higher variance, causes false positives]
20
... another method
[plot: frequency vs. time [hours]; annotation: mean changes over time]
21
… and another one
[plot: frequency vs. time [hours]; annotation: wouldn’t notice this peak, causes false negatives]
22
Well ...
this is just another source of anomalies!
How do we know what’s really broken?
sort anomalies by time of detection
  only the early anomalies are important
fine-grained anomalies detect problem earlier
  earlier warning
  likely root cause
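Sorting anomalies by time of detection might look like the sketch below. The threshold, the metric names, and the score series are all hypothetical; the only idea taken from the slide is that the metrics which became anomalous first are listed first.

```python
def first_detection_times(scores, threshold):
    """scores: metric -> list of (time, anomaly_score) pairs in time order.

    Returns (time, metric) pairs for every metric whose score exceeded the
    threshold, ordered by the time of first detection.
    """
    detections = []
    for metric, series in scores.items():
        for time, score in series:
            if score > threshold:
                detections.append((time, metric))
                break  # only the first crossing matters
    return sorted(detections)


# Hypothetical anomaly scores per minute.
scores = {
    "calls to M1": [(0, 1.0), (1, 9.0), (2, 12.0)],
    "requests to host H1": [(0, 0.5), (1, 1.0), (2, 8.0)],
    "calls to M2": [(0, 1.0), (1, 1.2), (2, 0.9)],
}
ranked = first_detection_times(scores, threshold=5.0)
print(ranked)  # "calls to M1" is listed before "requests to host H1"
```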
23
Library of failures
signature of a failure:
  not just which metrics are anomalous
  but also when they became anomalous
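One possible way to store such a signature is as the list of anomalous metrics together with the delay of each one relative to the first anomaly. This representation is an assumption made for illustration; the slides do not specify how signatures are encoded or compared.

```python
def failure_signature(onsets):
    """onsets: metric -> time at which the metric became anomalous.

    Returns (metric, delay) pairs, where delay is measured from the first
    anomalous metric, ordered from earliest to latest.
    """
    start = min(onsets.values())
    pairs = ((metric, time - start) for metric, time in onsets.items())
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


# Hypothetical onset times in minutes since midnight.
onsets = {"avg Time": 762, "calls to M2": 757, "requests to host H1": 759}
signature = failure_signature(onsets)
print(signature)
```

Using relative delays rather than absolute timestamps keeps the signature comparable across failures that happen at different times of day.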