UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel.

Presentation transcript:

Online System Problem Detection by Mining Console Logs
Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*
*UC Berkeley, †Intel Labs Berkeley

Why console logs?
- Detecting problems in large-scale Internet services often requires detailed instrumentation
- Instrumentation can be costly to insert and maintain
  - High code churn
  - Services often combine open-source building blocks that are not all instrumented
- Can we use console logs in lieu of instrumentation?
  + Easy for developers, so nearly all software has them
  - Imperfect: not originally intended for instrumentation

Problems we are looking for
- The easy case: rare messages
- Harder but useful: abnormal sequences
Example:
  NORMAL: receiving blk_1, received blk_1
  ERROR: receiving blk_2 (what is wrong with blk_2?)

Overview and Contributions
- Accurate online detection with small latency
[Diagram: free text logs (200 nodes) -> parsing* -> frequent pattern based filtering -> PCA detection. Dominant cases leave the filter as OK; non-pattern events go to PCA detection, which reports normal cases as OK and real anomalies as ERROR.]
* Large-scale system problem detection by mining console logs (SOSP '09)

Constructing event traces from console logs
- Parse: message type + variables
- Group messages by identifiers (automatically discovered)
  - Group ~= event trace
Example messages: receiving blk_1, received blk_1, reading blk_1, receiving blk_2, received blk_2, receiving blk_2
Grouped by identifier:
  blk_1: receiving, received, reading
  blk_2: receiving, received, receiving
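The parse-and-group step can be sketched in a few lines of Python. This is only an illustration: the message format (a verb followed by a `blk_<id>` token) is modeled on the slide's examples, the names `MESSAGE_RE`, `parse`, and `group_by_identifier` are hypothetical, and the identifier field is hard-coded here, whereas the slide says identifiers are discovered automatically.

```python
import re
from collections import defaultdict

# Assumed message format, modeled on the slide's examples: "<verb> blk_<id>".
MESSAGE_RE = re.compile(r"^(?P<type>\w+) (?P<block>blk_\d+)$")

def parse(line):
    """Split a log line into (message type, identifier); None if it does not match."""
    match = MESSAGE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("type"), match.group("block")

def group_by_identifier(lines):
    """Group message types by identifier, keeping arrival order within each group."""
    traces = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed is not None:
            msg_type, block = parsed
            traces[block].append(msg_type)
    return dict(traces)

# Interleaved messages for two blocks, as they might appear in a console log.
log = [
    "receiving blk_1",
    "receiving blk_2",
    "received blk_1",
    "received blk_2",
    "reading blk_1",
    "receiving blk_2",
]
traces = group_by_identifier(log)
```

Each value in `traces` plays the role of one event trace: the ordered message types that share an identifier.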

Online detection: when to make a detection?
- Cannot wait for the entire trace
  - It can last an arbitrarily long time
- How long do we have to wait?
  - Long enough to keep correlations
  - Wrong cut = false positive
- Difficulties
  - No obvious boundaries
  - Inaccurate message ordering
  - Variations in session duration
[Timeline figure: receiving blk_1, received blk_1, reading blk_1, deleting blk_1, deleted blk_1, receiving blk_1, received blk_1, plotted against time]

Frequent patterns help determine session boundaries
- Key insight: most messages/traces are normal
  - Strong patterns
  - "Make common paths fast"
- Tolerate noise

Two-stage detection overview
[Diagram: free text logs (200 nodes) -> parsing -> frequent pattern based filtering -> PCA detection. Dominant cases leave the filter as OK; non-pattern events go to PCA detection, which reports normal cases as OK and real anomalies as ERROR.]

Stage 1 - Frequent patterns (1): Frequent event sets
- Coarse cut by time
- Find frequent item set
- Refine time estimation
- Repeat until all patterns found
- Events that match no pattern (e.g. reading blk_1, error blk_1) go to PCA detection
[Timeline figure: receiving blk_1, received blk_1, reading blk_1, deleting blk_1, deleted blk_1, receiving blk_1, received blk_1, plotted against time]
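A rough sketch of the first two steps (coarse cut by time, then counting frequent event sets) is given below. It makes several simplifying assumptions that are not from the slides: events are `(timestamp, identifier, message type)` tuples, a segment ends once an event arrives more than `window` seconds after the segment started, and whole event sets are counted rather than mining every frequent subset as a full frequent-itemset algorithm would. The function names and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

def coarse_cut(events, window):
    """Cut each identifier's events into segments spanning at most `window` seconds."""
    by_id = defaultdict(list)
    for ts, ident, msg_type in sorted(events):
        by_id[ident].append((ts, msg_type))
    segments = []
    for items in by_id.values():
        start = items[0][0]
        current = []
        for ts, msg_type in items:
            if ts - start > window:
                # Too far from the segment start: close it and open a new one.
                segments.append(current)
                start, current = ts, []
            current.append(msg_type)
        segments.append(current)
    return segments

def frequent_event_sets(segments, min_support):
    """Return event sets whose share of all segments is at least `min_support`."""
    counts = Counter(frozenset(segment) for segment in segments)
    total = len(segments)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}

# Made-up events: (timestamp in seconds, identifier, message type).
events = [
    (0, "blk_1", "receiving"), (1, "blk_1", "received"),
    (0, "blk_2", "receiving"), (2, "blk_2", "received"),
    (5, "blk_3", "receiving"), (6, "blk_3", "received"),
    (7, "blk_4", "receiving"), (8, "blk_4", "error"),
    (100, "blk_1", "deleting"), (101, "blk_1", "deleted"),
]
segments = coarse_cut(events, window=10)
patterns = frequent_event_sets(segments, min_support=0.5)
```

With these made-up events, only the set {receiving, received} reaches the support threshold; the remaining segments would be treated as non-pattern events. The refinement of the time estimate and the repeat loop from the slide are not shown.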

Stage 1 - Frequent patterns (2): Modeling session duration time
- Assuming Gaussian?
  - […] th percentile estimation is off by half
  - 45% more false alarms
- Mixture distribution
  - Power-law tail + histogram head
[Figure: session duration distribution; axes: Duration, Count, Pr(X>=x)]
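To see why a Gaussian assumption can misplace a high percentile when durations are heavy-tailed, the sketch below compares a percentile taken from a fitted normal distribution with the nearest-rank empirical percentile. The durations are invented for illustration and do not reproduce the figures on the slide; the mixture model (power-law tail plus histogram head) is not implemented here.

```python
import math
import statistics

def empirical_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def gaussian_percentile(values, pct):
    """Percentile of a normal distribution fitted to the sample mean and standard deviation."""
    fitted = statistics.NormalDist(statistics.mean(values), statistics.stdev(values))
    return fitted.inv_cdf(pct / 100)

# Invented session durations in seconds: mostly short, with a few very long sessions.
durations = [1] * 90 + [2] * 5 + [5] * 3 + [60, 300]

empirical_p95 = empirical_percentile(durations, 95)
gaussian_p95 = gaussian_percentile(durations, 95)
```

On this sample the two estimates differ by more than an order of magnitude, because the few long sessions inflate the fitted standard deviation. A timeout derived from the wrong estimate would either cut sessions too early or wait far longer than needed.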

Stage 2 - Handling noise with PCA detection
- More tolerant to noise
- Principal Component Analysis (PCA) based detection
[Diagram: free text logs (200 nodes) -> parsing -> frequent pattern based filtering -> PCA detection. Dominant cases leave the filter as OK; non-pattern events go to PCA detection, which reports normal cases as OK and real anomalies as ERROR.]
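A minimal sketch of PCA-based detection follows, assuming each trace is summarized as a vector of message counts and that a trace is flagged when its squared distance from the dominant principal direction exceeds a threshold. Everything specific here is an assumption for illustration: the two-dimensional count vectors, the use of a single component found by power iteration, and the fixed `THRESHOLD` value. The slides do not say how the threshold is chosen.

```python
import math

def column_means(rows):
    """Mean of each column of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def top_component(centered, iterations=50):
    """Dominant principal direction of centered rows, found by power iteration."""
    dim = len(centered[0])
    # Unnormalized covariance matrix; scaling does not change the direction.
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(dim)] for i in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def squared_prediction_error(row, means, component):
    """Squared length of the part of `row` not explained by the component."""
    centered = [x - m for x, m in zip(row, means)]
    score = sum(c * p for c, p in zip(centered, component))
    residual = [c - score * p for c, p in zip(centered, component)]
    return sum(r * r for r in residual)

# Invented training traces: counts of [receiving, received] messages per trace.
training = [[1, 1], [2, 2], [3, 3], [4, 4], [2, 2], [3, 3]]
means = column_means(training)
centered = [[x - m for x, m in zip(row, means)] for row in training]
component = top_component(centered)

THRESHOLD = 0.5  # hypothetical cutoff, chosen only for this example

def is_anomaly(row):
    return squared_prediction_error(row, means, component) > THRESHOLD

normal_error = squared_prediction_error([3, 3], means, component)
abnormal_error = squared_prediction_error([3, 1], means, component)
```

In the training data every block that is received was also being received, so the counts move together. A trace with three `receiving` messages but only one `received` breaks that correlation and lands far from the principal direction, while small noise along the direction does not raise the error.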

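Frequent pattern matching filters most of the normal events
[Diagram: free text logs (200 nodes) -> parsing* -> frequent pattern based filtering -> PCA detection, annotated with event shares: 100% enter the filter; 86% are dominant cases (OK); 14% are non-pattern events sent to PCA detection; of those, 13.97% are normal cases (OK) and 0.03% are real anomalies (ERROR).]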

Evaluation setup
- Hadoop file system (HDFS)
  - Experiment on Amazon's EC2 cloud
  - 203 nodes x 48 hours
  - Running standard map-reduce jobs
  - ~24 million lines of console logs
  - 575,000 traces, ~680 distinct ones
- Manual labels from previous work
  - Normal/abnormal + why it is abnormal
  - "Eventually normal": did not consider time
  - For evaluation only

Frequent patterns in HDFS

Frequent pattern               99.95th percentile duration   % of messages
Allocate, begin write          13 sec                        20.3%
Done write, update metadata    8 sec                         44.6%
Delete                         -                             12.5%
Serving block                  -                             3.8%
Read exception                 -                             3.2%
Verify block                   -                             1.1%
Total                                                        85.6%

(Total events ~20 million)
- Covers most messages
- Short durations

Detection latency
- Detection latency is dominated by the wait time
[Figure: detection latency by event category: single event pattern, frequent pattern (matched), frequent pattern (timed out), non-pattern events]

Detection accuracy

          True Positives   False Positives   False Negatives   Precision   Recall
Online    16,916           2,…               …                 …%          100.0%
Offline   16,808           1,…               …                 …%          99.3%

(Total traces = 575,319)
- Ambiguity on "abnormal"
  - Manual labels: "eventually normal"
  - More than 600 FPs in online detection are sessions with very long latency
  - E.g. a write session takes > 500 sec to complete (99.99th percentile is 20 sec)
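For reference, precision and recall follow from the true positive, false positive, and false negative counts as sketched below. The counts used in the example are hypothetical round numbers, not the values from the table.

```python
def precision(true_positives, false_positives):
    """Share of raised alarms that correspond to real problems."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Share of real problems that raised an alarm."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts for illustration only.
example_precision = precision(true_positives=90, false_positives=10)
example_recall = recall(true_positives=90, false_negatives=0)
```

A detector that misses no real problem has a recall of 100% regardless of how many false alarms it raises; false alarms only lower precision.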

Future work
- Distributed log stream processing
  - Handle large-scale clusters + partial failures
- Clustering alarms
- Allowing feedback from operators
- Correlation on logs from multiple applications / layers

Summary
Wei Xu

          Precision   Recall
Online    86.0%       100.0%

[Diagram: free text logs (200 nodes, 24 million lines) -> parsing -> frequent pattern based filtering -> PCA detection. Dominant cases leave the filter as OK; non-pattern events go to PCA detection, which reports normal cases as OK and real anomalies as ERROR.]