Using HTTP Access Logs To Detect Application-Level Failures In Internet Services Peter Bodík, UC Berkeley Greg Friedman, Lukas Biewald, Stanford University.

Presentation transcript:

Using HTTP Access Logs To Detect Application-Level Failures In Internet Services Peter Bodík, UC Berkeley Greg Friedman, Lukas Biewald, Stanford University HT Levine, Ebates.com George Candea, Stanford University

2 Motivation
problem:
  – takes weeks/months to detect some failures in Internet services
assumption:
  – users change their behavior in response to failures
  – e.g., can’t follow a link from /shopping_cart to /checkout
goal:
  – quickly detect changes/anomalies in users’ access patterns
  – localize the cause of the change: which page is causing problems? did the page transitions change?

3 Outline
online algorithms for anomaly detection
demo of a GUI tool for real-time detection
questions we have
future work

4 Anomalies in user access patterns
why this approach to failure detection?
  – leverages aggregate intelligence of people using the site
  – identifying page access patterns can help localize failures
  – don’t need any instrumentation
types of anomalies
  – unexpected: signify failures/problems
  – expected: verify the changes/updates to the website
what types of user patterns we can observe
  – frequencies of individual pages
  – page transitions
  – user sessions
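
The first two kinds of pattern listed above (page frequencies and page transitions) can be counted directly from an access log. The sketch below is only an illustration: it assumes the log has already been parsed into time-ordered `(session_id, path)` pairs, and the function name `page_patterns` and the sample hits are hypothetical, not taken from the presentation.

```python
from collections import Counter

def page_patterns(hits):
    """Count page hits and page-to-page transitions.

    hits: iterable of (session_id, path) pairs in time order
          (an assumed, already-parsed form of the HTTP access log).
    Returns (page_counts, transition_counts).
    """
    page_counts = Counter()
    transition_counts = Counter()
    last_page = {}  # session_id -> most recent path seen in that session
    for session_id, path in hits:
        page_counts[path] += 1
        previous = last_page.get(session_id)
        if previous is not None:
            transition_counts[(previous, path)] += 1
        last_page[session_id] = path
    return page_counts, transition_counts

# Hypothetical sample: two interleaved sessions.
hits = [
    ("s1", "/shopping_cart"),
    ("s2", "/landing.jsp"),
    ("s1", "/checkout"),
    ("s2", "/shopping_cart"),
]
pages, transitions = page_patterns(hits)
```

A drop in the count for a transition such as `("/shopping_cart", "/checkout")` is the kind of change the motivation slide describes.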

5 Real-world failures from Ebates.com
Ebates.com
  – mid-sized e-commerce site
  – provided 5 sets of HTTP logs (1-2 week period)
  – have access to emails, chat logs from periods of problems
each data set contains one or more failures
  – mostly site crash examples
  – problem with survey pages
  – broken signup page
  – bad DB query

6 Normal traffic: 11am – 3am
[figure; time axis 11am to 3am; legend: 1 hit / 5 minutes, 10 hits / 5 minutes, 100 hits / 5 minutes]

7 Anomaly: 7am – 1pm
[figure; time axis 7am to 1pm; legend: 1 hit / 2 minutes, 10 hits / 2 minutes, 100 hits / 2 minutes]

8 Online detection of anomalies
assign anomaly score to the current time interval
handling anomalous intervals in the past
  1. use all intervals
  2. less weight on the anomalous intervals
  3. ignore anomalous intervals
localization of problems
  – most anomalous pages
  – changes in page transitions
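
The three options for handling past anomalous intervals can be read as three weighting schemes applied when the baseline is learned. The sketch below is an assumption about how that might look, not the presenters' implementation: the policy names, the `down_weight` value of 0.2, and the sample numbers are all hypothetical.

```python
def interval_weights(scores, threshold, policy, down_weight=0.2):
    """Weight each past interval according to one of three policies.

    scores: anomaly score of each past interval.
    policy: "all" (use every interval equally), "downweight" (less
            weight on anomalous intervals), or "ignore" (drop them).
    down_weight is a hypothetical value chosen for illustration.
    """
    if policy not in ("all", "downweight", "ignore"):
        raise ValueError(f"unknown policy: {policy}")
    weights = []
    for score in scores:
        anomalous = score > threshold
        if not anomalous or policy == "all":
            weights.append(1.0)
        elif policy == "downweight":
            weights.append(down_weight)
        else:  # policy == "ignore"
            weights.append(0.0)
    return weights

def weighted_mean(values, weights):
    """Baseline estimate: weighted mean of past hit counts."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all intervals were excluded")
    return sum(v * w for v, w in zip(values, weights)) / total

# Hypothetical history: the third interval was flagged as anomalous.
counts = [100, 110, 10, 90]
scores = [0.1, 0.2, 0.99, 0.1]
mean_all = weighted_mean(counts, interval_weights(scores, 0.9, "all"))
mean_down = weighted_mean(counts, interval_weights(scores, 0.9, "downweight"))
mean_ignore = weighted_mean(counts, interval_weights(scores, 0.9, "ignore"))
```

With these sample numbers the baseline moves from 77.5 (all intervals) to 100.0 (anomalous interval ignored), which shows how much a single past failure can distort what counts as normal.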

9 Two algorithms
chi-square test
  – count hits to top 40 pages in the past 6 hours and the past 10 minutes
  – compare relative frequencies using the chi-square test
  – more sensitive to frequent pages
  – compare page transitions before and during the anomaly
Naive Bayes anomaly detection
  – assume that page frequencies are independent
  – model frequency of each page as a Gaussian
  – learn mean and variance from the past
  – anomaly score = 1 - Prob(current interval is normal)
  – more sensitive to infrequent pages
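
As a rough illustration of the chi-square step, the sketch below computes the standard chi-square statistic for a 2 x k table (long window vs. short window, one column per page). It uses only the Python standard library and returns the statistic and degrees of freedom rather than a p-value. The function name, the page names, and the counts are hypothetical; how the presenters turned the statistic into an anomaly score is not stated in the slides.

```python
from collections import Counter

def chi_square_statistic(history, current, top_n=40):
    """Chi-square statistic comparing two page-hit distributions.

    history: dict page -> hits in the long window (e.g. past 6 hours).
    current: dict page -> hits in the short window (e.g. past 10 minutes).
    Returns (statistic, degrees_of_freedom).
    """
    # Restrict the comparison to the most frequent pages of the long window.
    pages = [page for page, _ in Counter(history).most_common(top_n)]
    hist = [history[p] for p in pages]
    curr = [current.get(p, 0) for p in pages]
    hist_total, curr_total = sum(hist), sum(curr)
    if hist_total == 0 or curr_total == 0:
        return 0.0, 0  # nothing to compare
    grand_total = hist_total + curr_total

    statistic = 0.0
    columns_used = 0
    for h, c in zip(hist, curr):
        column_total = h + c
        if column_total == 0:
            continue  # page unseen in both windows
        columns_used += 1
        for observed, row_total in ((h, hist_total), (c, curr_total)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, max(columns_used - 1, 0)

# Hypothetical counts: same proportions vs. a shifted distribution.
six_hours = {"/mall.go": 600, "/landing.jsp": 300, "/signup.jsp": 100}
normal = {"/mall.go": 60, "/landing.jsp": 30, "/signup.jsp": 10}
broken = {"/mall.go": 90, "/landing.jsp": 9, "/signup.jsp": 1}
normal_stat, dof = chi_square_statistic(six_hours, normal)
broken_stat, _ = chi_square_statistic(six_hours, broken)
```

When the short window has the same relative frequencies as the long one, the statistic is zero; the shifted window produces a clearly larger value.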

10 Two Anomalies
[figure; panels: number of hits to the top 10 pages, anomaly score with anomaly threshold; x-axis: time (hours); legend: 1 hit / 5 minutes, 10 hits / 5 minutes, 100 hits / 5 minutes]

11 GUI tool for real-time detection
why need GUI tool?
  – build trust of the operators
  – why should the operator believe the algorithm? “picture is worth a thousand words”
  – manual monitoring/inspection of traffic by operators
  – make SLT usable in real life
report 1 warning instead of anomalies every minute
compare:
Most anomalous pages:
  /landing.jsp
  /landing_merchant.jsp
  /mall_ctrl.jsp 3.69
  /malltop.go 2.63
  /mall.go 2.18
warning #3:
  detection time: Sun Nov 16 19:27:00 PST 2003
  start: Sun Nov 16 19:24:00 PST 2003
  end: Sun Nov 16 21:05:00 PST 2003
  significance = 7.05
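
Reporting one warning instead of an anomaly every minute amounts to merging consecutive anomalous intervals into a single record with a start and an end. The sketch below shows one way this could be done. It is an assumption, not the tool's actual logic: `group_warnings`, `max_gap`, and the `peak` field are hypothetical, and `peak` is not the "significance" value shown in the warning above.

```python
def group_warnings(scores, threshold, max_gap=1):
    """Merge consecutive anomalous minutes into warnings.

    scores: list of (minute, anomaly_score) pairs sorted by minute.
    max_gap: largest distance in minutes between two anomalous
             intervals that still belong to the same warning.
    Returns a list of dicts with "start", "end", and "peak".
    """
    warnings = []
    current = None
    for minute, score in scores:
        if score <= threshold:
            continue  # interval looks normal
        if current is not None and minute - current["end"] <= max_gap:
            # Extend the warning that is already open.
            current["end"] = minute
            current["peak"] = max(current["peak"], score)
        else:
            # Start a new warning.
            current = {"start": minute, "end": minute, "peak": score}
            warnings.append(current)
    return warnings

# Hypothetical per-minute anomaly scores.
scores = [(0, 0.1), (1, 0.95), (2, 0.97), (3, 0.99), (4, 0.2), (10, 0.96)]
warnings = group_warnings(scores, threshold=0.9)
```

Here four anomalous minutes collapse into two warnings, one covering minutes 1 to 3 and one for minute 10.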

12 Summary of successful results
October 2003 – broken signup page:
  – noticed the problem 7 days earlier + correctly localized!
November 2003 – account page problem:
  – 1st warning: 16 hours earlier
  – 2nd warning: 1 hour earlier + correctly localized the bad page!
July 2001 – landing looping problem:
  – warning 2 days earlier + correctly localized
detected a failure they didn’t tell us about
detected three other significant anomalies
  – feedback: “these might have been important, but we didn’t know about them. definitely useful if detected in real-time.”

13 Oct 2003 – broken signup page (1)

14 Oct 2003 – broken signup page (2)

15 Oct 2003 – broken signup page (3)

16 Oct 2003 – broken signup page (4)

17 Nov 2003 – account page problem (1)

18 Nov 2003 – account page problem (2)
[figure; time axis 9am to 1pm]

19 How to evaluate?
information from HT Levine:
  – time + root cause of major failures (site down, ...)
  – time of minor problems (DB alarm, ...)
  – harmless updates (code push, page update)
scenario:
  1. page A pushed at 3:30pm, Monday
  2. anomaly on page A at 6pm, Monday
  3. mostly ok for next 48 hours
  4. site down at 6pm, Wednesday
  – would detecting the anomaly on Monday help?

20 What is a true/false positive?
detected a major/minor problem: GREAT
detected a regular site update: ???
detected a significant anomaly, BUT
  – Ebates knows nothing about it
  – no major problems at that time
  – ???
detected anomalies almost every night
  – certainly a false positive

21 Build a simulator?
site: PetStore, Rubis?
failures: try failures from Ebates
user simulator: based on real logs from Ebates
cons:
  – less realistic (how to build a realistic simulator of users?)
pros:
  – know exactly what happened in the system (measure TTD)
  – try many different failures
  – use for evaluating TCQ-based preprocessing

22 Localization
Naive Bayes better at localization
  – likely reason: more sensitive to infrequent pages
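
To make the localization idea concrete, the sketch below scores each page separately under a Gaussian learned from past intervals and ranks pages by that score. It is a sketch under stated assumptions: the per-page score is taken to be the probability mass of a Gaussian lying closer to the mean than the observed count (computed with `math.erf`), which is one plausible reading of "1 - Prob(current interval is normal)" but is not confirmed by the slides. The `min_std` floor, the function names, and the sample counts are hypothetical.

```python
import math
from statistics import mean, pstdev

def page_anomaly_scores(history, current, min_std=1.0):
    """Per-page anomaly score under an independent Gaussian model.

    history: dict page -> list of hit counts in past intervals.
    current: dict page -> hit count in the current interval.
    min_std: assumed floor on the standard deviation, to avoid
             dividing by zero for pages with constant traffic.
    """
    scores = {}
    for page, past in history.items():
        mu = mean(past)
        sigma = max(pstdev(past), min_std)
        z = abs(current.get(page, 0) - mu) / sigma
        # Fraction of the Gaussian that is closer to the mean than z.
        scores[page] = math.erf(z / math.sqrt(2))
    return scores

def most_anomalous(scores, n=3):
    """Pages ranked from most to least anomalous."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical counts: a busy page with a small wobble and an
# infrequent page whose traffic drops to zero.
history = {
    "/mall.go": [100, 104, 96, 100],
    "/signup.jsp": [5, 6, 4, 5],
}
current = {"/mall.go": 102, "/signup.jsp": 0}
scores = page_anomaly_scores(history, current)
ranking = most_anomalous(scores)
```

In this example the infrequent page ranks first even though its absolute change is small, which matches the slide's point that this model is more sensitive to infrequent pages.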

23 Future work
develop better quantitative measures for analysis
GUI tool
  – deploy at Ebates
  – make available as open source
  – could help convince other companies to provide failure data
detect more subtle problems
  – harder to detect using current methods
explore HCI aspects of our approach

24 Conclusions
very simple algorithms can detect serious failures
visualization helps understand the anomaly
have almost-perfect source of failure data
  – complete HTTP logs
  – operators willing to cooperate
  – emails, chat logs from periods of problems
still hard to evaluate