Fingerprinting the Datacenter: Automated Classification of Performance Crises
Kenneth Wade, Ling Su
Reliable System
  Backup fingerprint system
  Safe and tolerant structure
  Any alarm when fingerprinting fails?
Evaluation
  100 metrics sampled over 15 min periods, but only 3 key performance indicators (KPIs) are designated
    o very small subset of collected metrics
  crisis declared when 10% of machines violate a KPI's service-level agreement (SLA)
    o if there are hundreds of machines running 24x7, many machines may be violating a KPI before a crisis is declared
    o why not have warnings as machines start to violate SLAs?
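The warning idea raised above can be made concrete with a minimal sketch. Only the 10% crisis threshold comes from the paper as summarized on this slide. The 5% warning level, the function names, and the assumption that a KPI violates its SLA when its value exceeds a limit (for example, a latency) are all hypothetical choices for illustration.

```python
def violation_fraction(kpi_values, sla_limit):
    """Fraction of machines whose KPI value exceeds the SLA limit.

    Assumption: higher is worse (e.g. latency), so a violation is value > limit.
    """
    if not kpi_values:
        return 0.0
    violating = sum(1 for value in kpi_values if value > sla_limit)
    return violating / len(kpi_values)


def classify_epoch(kpi_values, sla_limit, crisis_fraction=0.10, warning_fraction=0.05):
    """Label one 15 min epoch as 'crisis', 'warning', or 'ok'.

    crisis_fraction=0.10 is the threshold quoted on the slide;
    warning_fraction=0.05 is a hypothetical early-warning level.
    """
    fraction = violation_fraction(kpi_values, sla_limit)
    if fraction >= crisis_fraction:
        return "crisis"
    if fraction >= warning_fraction:
        return "warning"
    return "ok"
```

With 200 machines, a warning at 5% would fire once 10 machines violate the SLA, instead of waiting for 20.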
Evaluation
  5 identifications performed per crisis, starting when the crisis is detected and continuing for 4 subsequent 15 min epochs
    o in each epoch, the identification is a known label or 'x'
    o "stable" if 0 or more 'x's followed by 0 or more identical labels...
  accuracy is only determined if the sequence is "stable"
  if the sequence is unstable, the accuracy of individual labels is not evaluated!
  what about one crisis causing another?
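The stability rule above is short enough to check in code. This is a minimal sketch of the rule as stated on this slide, not the authors' implementation; the function name and the list-of-strings representation are assumptions.

```python
def is_stable(labels, unknown="x"):
    """True if labels is 0 or more unknowns followed by 0 or more identical labels."""
    # Skip the leading run of unknown identifications.
    start = 0
    while start < len(labels) and labels[start] == unknown:
        start += 1
    rest = labels[start:]
    # After the leading unknowns, no further unknown may appear
    # and every remaining label must be the same.
    return unknown not in rest and len(set(rest)) <= 1
```

For example, `["x", "x", "A", "A", "A"]` is stable, while `["x", "A", "x", "A", "A"]` and `["A", "B", "B", "B", "B"]` are not, so under the paper's evaluation the last two sequences would receive no accuracy score at all.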
Evaluation
  hot & cold quantile thresholds change over time due to changes in the workload & performance of the application, and are periodically re-computed based on new data
    o what if the workload or application performance changes significantly before new thresholds are computed?
    o how often are the thresholds re-computed?
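The concern about stale thresholds can be illustrated with a small sketch. Everything here is an assumption made for illustration: the 2nd and 98th percentile cut-offs, the nearest-rank percentile method, the function names, and the -1/0/+1 encoding are not taken from this slide.

```python
import math


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a non-empty sorted list, pct in (0, 100]."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def hot_cold_thresholds(history, cold_pct=2, hot_pct=98):
    """Return (cold, hot) thresholds from a window of past metric values.

    cold_pct and hot_pct are hypothetical defaults.
    """
    values = sorted(history)
    return nearest_rank(values, cold_pct), nearest_rank(values, hot_pct)


def discretize(value, cold, hot):
    """Map a metric value to -1 (cold), 0 (normal), or +1 (hot)."""
    if value < cold:
        return -1
    if value > hot:
        return 1
    return 0


def hot_fraction(values, cold, hot):
    """Fraction of values flagged hot under the given thresholds."""
    return sum(1 for v in values if discretize(v, cold, hot) == 1) / len(values)
```

If the workload doubles after the thresholds were computed, roughly half of the new values in this toy setup are flagged hot until the thresholds are re-computed, which is exactly the window the questions above ask about.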
Automation
  Unidentified crises
  A lot of manual effort is still required
    o Their goal is to automate the crisis identification process
    o The name of this paper is "Automated Classification of Performance Crises"
Scaling
  They did not mention anything about how fingerprinting scales
  What if there are thousands of known crises?
  Time to compare against each crisis
Experiment Environment
  Basic information about the data center
    o Server type
    o Crisis frequency
  Configuration information for the fingerprinting system
Lack of Significance
  By far, the most referenced paper (14 times) was a 2005 paper called "Capturing, Indexing, Clustering, and Retrieving System History"... whose authors include Armando Fox and Moises Goldszmidt
    o a method for extracting an indexable signature for characterizing system state is presented
    o they essentially adapted this signatures work to a datacenter
Lack of Significance
  This paper is 2 years old and has only been cited by 12 others
    o a quarter of the papers that cited this paper were also authored by Armando Fox (with one of those also authored with Moises Goldszmidt)
  If this fingerprinting is actually useful for datacenter operators, wouldn't people be using this methodology by now?