Fingerprinting the Datacenter
Marcel Flores, Shih-Chi Chen

Motivation
- Large datacenters often encounter complex performance crises
- Crises typically appear as dips below SLA targets
- They are often difficult to diagnose
- They can be costly to operators

Approach
- Quantify the state of the datacenter in a compact form: a fingerprint
- Fingerprints can be compared against those of past crises
- Enables quick identification and diagnosis of new crises

Fingerprints
- Track quantiles of each metric
- Determine a hot/normal/cold status for each metric
- Include only the relevant metrics
- Use a similarity metric for comparison

Fingerprint details
- Track quantiles of each metric; quantiles are resistant to outliers
- Measure the 25th, 50th, and 95th percentile of each metric
- Label each measured quantile Hot (above the 98th percentile of its past values), Cold (below the 2nd percentile), or Normal
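A minimal sketch of this step in Python (assuming NumPy; the function names, the per-epoch sample and history inputs, and the numeric state encoding are illustrative, not from the paper):

```python
import numpy as np

HOT, NORMAL, COLD = 1, 0, -1

def summarize_epoch(samples, quantiles=(0.25, 0.50, 0.95)):
    """Summarize one epoch of a metric (samples across all servers)
    with a few quantiles, which are robust to outliers."""
    return np.quantile(samples, quantiles)

def hot_cold_state(value, history):
    """Label one quantile value against its own past values:
    hot above the 98th percentile, cold below the 2nd."""
    lo, hi = np.quantile(history, [0.02, 0.98])
    if value > hi:
        return HOT
    if value < lo:
        return COLD
    return NORMAL

def fingerprint_epoch(metric_samples, histories):
    """Build the epoch fingerprint: the hot/normal/cold state of
    each tracked quantile of each relevant metric."""
    states = []
    for name, samples in metric_samples.items():
        for value, past in zip(summarize_epoch(samples), histories[name]):
            states.append(hot_cold_state(value, past))
    return np.array(states)
```

Using quantiles rather than means keeps the summary robust to a handful of misbehaving servers, which is exactly the "resistant to outliers" property the slide calls out.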

Relevant Metrics
- Select metrics via feature selection and classification
- The technique comes from statistical machine learning
- Eliminates noise from the fingerprints
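The slides don't name the selector; one plausible realization of "feature selection and classification" is L1-regularized logistic regression, which zeroes out the coefficients of uninformative metrics. A hedged sketch (scikit-learn; the function name, the crisis/normal labeling, and the regularization strength C are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_relevant_metrics(X, y, metric_names, C=0.1):
    """Classify epochs as crisis (y=1) vs. normal (y=0) and keep
    only the metrics whose coefficients survive the L1 penalty."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    keep = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)
    return [metric_names[i] for i in keep]
```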

Identification
- Define a similarity metric
- Compare the current state fingerprint against known crisis fingerprints
- An identification threshold determines when two fingerprints are considered the same crisis
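A sketch of identification, under the assumption that similarity is (negative) Euclidean distance between fingerprint vectors; the slides do not specify the metric, and the threshold value would be tuned from data:

```python
import numpy as np

def identify(current_fp, known_crises, ident_threshold):
    """Compare the current fingerprint against each known crisis
    fingerprint; return the closest label, or "unknown" if no known
    crisis lies within the identification threshold."""
    best_label, best_dist = None, np.inf
    for label, fp in known_crises.items():
        dist = np.linalg.norm(current_fp - fp)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= ident_threshold else "unknown"
```

For example, identify(fp_now, {"crisis-A": fp_a, "crisis-B": fp_b}, ident_threshold=2.0) would return "crisis-A" only if fp_a is the nearest known fingerprint and lies within distance 2.0; otherwise the crisis is flagged as new.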

Evaluation
- Used data gathered from a production datacenter with hundreds of servers
- 240 days of data
- About 100 metrics per server

Evaluation Criteria
- Discrimination: when are two crises different?
- Identification stability: when does it provide a consistent suggestion?
- Identification accuracy: when does it provide the correct label?

Offline
- Uses all known data and attempts to recall the crises it saw
- Provides a baseline: the best possible performance if everything were known
- Dominates existing methods, with near-perfect accuracy

Quasi-Online
- More realistic: it doesn't know the future, but still computes the thresholds offline
- Accuracy of 85% on both known and unknown crises

Online
- Everything is computed on the fly, including the identification threshold
- Achieved about 80% accuracy on both known and unknown crises with 10 seeding crises
- 78% known and 74% unknown with only 2 seeding crises
- Does well even with a small seeding set!

A note on Thresholds
- The hot/cold thresholds (2nd and 98th percentiles) were selected arbitrarily
- Evaluations with alternative values derived from other statistical methods showed reduced discriminative power (95%, down from 99%)
- In short: why mess with what works?