Budapest University of Technology and Economics
Department of Measurement and Information Systems
Fault Tolerant Systems Research Group
OUTLIER DETECTION IN IT DATA
Agnes Salanki, salanki@mit.bme.hu

Small motivational example (1949) Hadlum vs. Hadlum

Small motivational example (1949)
Average pregnancy length: 280 days (40 weeks)
Mrs. Hadlum: 349 days
Source: http://www.siam.org/meetings/sdm10/tutorial3.pdf

From the system modelling aspect
• Goals of infrastructure data analysis: identify
  o Operational domains
  o Domain boundaries
  o Transition effects
  o Without one specific high-level QoS metric
• Carrier grade in telecommunication: 99.999% availability, so faults are rare
So far: typical clustering. Next: outlier detection

Terminology…
anomaly, surprise, rare event, novelty, outlier, exception, aberration, peculiarity, discordant observation
1. Rare: low relative frequency of occurrence
2. Anomalous: "deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980)

Outline
• Taxonomy
  o Point vs. Collective
  o Behavioral vs. Contextual
  o Effect aspect
• Approaches
  o Visual methods
  o Distance based
  o Density based
• Temporal
  o Time series
  o Stream data

TAXONOMY

Point vs. Collective?
• Point anomaly
  o an individual data instance
  o E.g., low service throughput for a short time interval
• Collective anomaly
  o a collection of related data instances
  o relationship?
  o E.g., continuously high CPU usage

 Point anomaly o an individual data instance  Collective anomaly o collection of related data instances o relation?

Point vs. Collective?
A single value of -5.7 is typical; 20 of them right after each other are rather interesting
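The -5.7 example can be illustrated with a small run-length check. This is only a sketch: the series is made up, and the function name `long_runs` and the threshold of 20 are assumptions chosen to mirror the slide.

```python
from itertools import groupby

def long_runs(values, min_length):
    """Return (start_index, value, length) for every run of equal
    consecutive values that is at least min_length long."""
    runs = []
    index = 0
    for value, group in groupby(values):
        length = len(list(group))
        if length >= min_length:
            runs.append((index, value, length))
        index += length
    return runs

# Hypothetical measurements: -5.7 occurs once on its own (unremarkable),
# then 20 times in a row (a collective anomaly candidate).
series = [1.2, -5.7, 0.3, 2.1] + [-5.7] * 20 + [0.8, 1.5]
print(long_runs(series, min_length=20))
```

The isolated -5.7 at index 1 is not reported; only the run starting at index 4 is.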

Point vs. Collective?
Figure labels: point anomaly, collective anomaly

Contextual vs. Behavioral
• Contextual anomaly
  o an instance is anomalous in a specific context but not otherwise
  o only with respect to contextual variables: time, location, benchmark configuration, etc.
  o E.g., continuously high CPU usage is acceptable only during workdays
• Behavioral anomaly
  o an instance is anomalous without any context information
  o E.g., continuously high CPU usage is never acceptable

Contextual vs. Behavioral
• Contextual anomaly
  o a data instance is anomalous in a specific context
  o but not otherwise
• Behavioral anomaly
  o a data instance is anomalous independently of the context?
Contextual: -4°C in the middle of March
Behavioral: simply too low
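A minimal sketch of the temperature example, assuming a small reference history per context. All temperature values, the month names used as context keys, and the function names are hypothetical; the score is a plain z-score, which is only one of many possible choices.

```python
from statistics import mean, stdev

# Hypothetical reference temperatures (°C), grouped by context (month).
history = {
    "January": [-6.0, -4.5, -3.0, -5.0, -2.5, -4.0],
    "March": [8.0, 10.5, 9.0, 12.0, 7.5, 11.0],
}

def contextual_score(value, context):
    """z-score of value relative to the reference data of its own context."""
    reference = history[context]
    return abs(value - mean(reference)) / stdev(reference)

def global_score(value):
    """z-score of value relative to all reference data, ignoring context."""
    everything = [v for reference in history.values() for v in reference]
    return abs(value - mean(everything)) / stdev(everything)

print(contextual_score(-4.0, "January"))  # small: normal for January
print(contextual_score(-4.0, "March"))    # large: anomalous for March
print(global_score(-4.0))                 # small: unremarkable without context
```

The same reading of -4°C is flagged only when the context variable is taken into account.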

Contextual vs. Behavioral
Without the time axis, a cpu_idle value of 600 cannot be considered suspicious

Time Series Outliers

Categorization by system effects
• Additive outlier
  o Subsequent observations are unaffected
• Level shift outlier
  o It has a permanent effect
• Innovational outlier
  o Initial impact + increasing effect as time proceeds
• Transient change outlier
  o Similar to innovational outliers, but the effect diminishes exponentially, and later the series returns to normal

Basic types
• Additive: subsequent observations are unaffected
• Level shift: permanent effect
• Transient change: the effect diminishes exponentially, later the series returns to normal
• Innovational: initial impact + increasing effect as time proceeds

Basic types
• Additive outlier
  o Subsequent observations are unaffected
• Level shift outlier
  o In contrast to additive outliers, a level shift outlier affects many observations and has a permanent effect
• Innovational outlier
  o Characterized by an initial impact with effects lingering over subsequent observations. The influence of the outlier may increase as time proceeds
• Transient change outlier
  o Similar to level shift outliers, but the effect of the outlier diminishes exponentially over the subsequent observations. Eventually, the series returns to its normal level
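The effect of three of these types on a series can be sketched as simple disturbance patterns added to the undisturbed signal. The function names and the decay factor of 0.7 are assumptions for illustration; the innovational outlier is left out because its shape depends on the time-series model fitted to the data.

```python
def additive(n, t0, size):
    """Additive outlier: only the observation at t0 is affected."""
    return [size if t == t0 else 0.0 for t in range(n)]

def level_shift(n, t0, size):
    """Level shift: every observation from t0 onwards is shifted permanently."""
    return [size if t >= t0 else 0.0 for t in range(n)]

def transient_change(n, t0, size, decay=0.7):
    """Transient change: the shift at t0 decays exponentially back to zero."""
    return [size * decay ** (t - t0) if t >= t0 else 0.0 for t in range(n)]

baseline = [10.0] * 8  # hypothetical undisturbed series
for effect in (additive, level_shift, transient_change):
    disturbed = [b + e for b, e in zip(baseline, effect(8, 3, 5.0))]
    print(effect.__name__, [round(v, 2) for v in disturbed])
```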

GENERAL APPROACHES

Multi-dimensional data
• Distance based
  o Deviation from the rest of the dataset
• Density based
  o Deviation from the neighborhood

Time series
• Distance based
  o Deviation from the rest of the dataset
• Density based
  o Deviation from the neighborhood

Ranking aspects of methods
• Complexity
  o Resource requirements
• Incremental maintainability
  o Support for online analysis
• Required domain-specific knowledge
  o Number of parameters
• Sensitivity
  o Sensitivity of the method to its parameters
• Semi-supervised approach
  o Support for an initial "typical" or "faulty" subset

OUTLIER DETECTION IN EDA

1D
• Aggregation suppresses outliers
Unimodal distributions: boxplot/rugplot

1D
• Aggregation suppresses outliers
Real outliers can be masked on a boxplot
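A small sketch of the usual boxplot rule (points beyond 1.5 × IQR from the quartiles) shows both cases: it works on a unimodal sample and masks a point sitting between two modes. The data are invented, and the 1.5 whisker factor is the conventional default rather than something stated on the slide.

```python
from statistics import quantiles

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside the boxplot whiskers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical unimodal sample with one extreme value.
unimodal = [9, 10, 10, 11, 11, 11, 12, 12, 13, 30]
# Hypothetical bimodal sample: 50 belongs to neither mode,
# but the wide interquartile range hides it.
bimodal = [1, 2, 2, 3, 3, 50, 97, 98, 98, 99, 99]

print(boxplot_outliers(unimodal))
print(boxplot_outliers(bimodal))
```

In the bimodal case the value 50 is far from both modes, yet it falls inside the box, so the rule reports nothing.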

• Generalization of 1D density

Support of multiple operational modes, e.g. READs and WRITEs in a database
Everything else can be marked as "outlier" or "transition"
2D density plot, mosaic plot, heatmap, etc.: often over-aggregation without concrete points

Multi-dimensional plots
Good for detecting 1D outliers in one go

Multi-dimensional plots: parallel coordinates
Aggregation suppresses outliers

Multi-dimensional plots: biplots
• Biplot
  o Data points and variables
• PCA-based
  o Goal: maximum of variability
  o Indicator variables
  o Modified point-point distance

DISTANCE BASED APPROACHES

Convex hull
• Convex hull method: Tukey, 1974
• Extreme points
• Median: only at the end

Convex hull
• Convex hull method: Tukey, 1974
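One way to read the convex hull idea is as hull peeling: remove the current hull, assign its points the current layer number, and repeat, so extreme points come off first and the most central point is reached only at the end. The sketch below assumes 2D points and uses the standard monotone chain hull algorithm; the function names and the sample points are hypothetical, and this is not necessarily the exact procedure shown on the slide.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone chain convex hull; collinear edge points are not included."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peeling_depth(points):
    """Map each point to the number of the hull layer it is removed with."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
        layer += 1
    return depth

# Hypothetical data: an outer square, an inner square and a centre point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
print(peeling_depth(points))
```

Points with depth 1 are the outlier candidates; the last layer plays the role of the median.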

MCD
• Minimum Covariance Determinant
• Idea
  o "Normal domain": the most compact subset
• What does compact mean?

MCD
• Minimum Covariance Determinant
• Idea
  o "Normal domain": the most compact subset
• What does compact mean?
Figure values: 0.0014, 0.00041, 0.00011
Exhaustive search?
choose(n = 1000, k = 900)
[1] 6.385051e+139
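To make "most compact subset" concrete, the sketch below runs the exhaustive search on a tiny 2D data set and then counts how many subsets the same search would need for n = 1000, k = 900. The data points and function names are invented for illustration; compactness is measured as the determinant of the sample covariance matrix, following the method's name.

```python
from itertools import combinations
from math import comb

def cov_det_2d(points):
    """Determinant of the 2x2 sample covariance matrix of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx * syy - sxy ** 2

def exhaustive_mcd(points, h):
    """Try every subset of size h and keep the one with the smallest determinant."""
    return min(combinations(points, h), key=cov_det_2d)

# Hypothetical data: a tight cluster plus two distant points.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.0, 0.8), (0.8, 1.0)]
far = [(5.0, 5.0), (6.0, 0.0)]
best = exhaustive_mcd(cluster + far, h=6)
print(sorted(best))

# 8 points and h = 6 means only 28 subsets; 1000 points and h = 900 is hopeless.
print(comb(8, 6), float(comb(1000, 900)))
```

The search returns the six cluster points, but its cost grows with the binomial coefficient, which is why a faster approximation is needed in practice.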

FAST-MCD

Distance function
• In real data: different ranges
  o E.g., memory in bytes/MB/GB
  o Comparison of CPU and memory values?
• Normalization into [0, 1]
• Other distance functions
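A minimal sketch of normalization into [0, 1] using min-max scaling, which is one common way to do it. The metric values and the function name are hypothetical.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]

# Hypothetical measurements of the same hosts in different units.
memory_bytes = [2_000_000_000, 4_000_000_000, 8_000_000_000]
memory_gb = [2.0, 4.0, 8.0]
cpu_percent = [10.0, 55.0, 100.0]

print(min_max_normalize(memory_bytes))
print(min_max_normalize(memory_gb))
print(min_max_normalize(cpu_percent))
```

After scaling, the choice of bytes or GB no longer matters, and CPU and memory values can enter the same distance computation.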

Mahalanobis distance
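The Mahalanobis distance of a point x from data with mean μ and covariance matrix Σ is sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)). The sketch below spells this out for the 2D case, where the inverse of Σ can be written by hand; the sample data and the function name are assumptions for illustration.

```python
from math import sqrt

def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2D point from the mean of 2D data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Inverse of [[sxx, sxy], [sxy, syy]] is 1/det * [[syy, -sxy], [-sxy, sxx]].
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(squared)

# Hypothetical strongly correlated data with mean (3, 3).
data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
along_trend = (4.5, 4.5)  # Euclidean distance from the mean: about 2.12
off_trend = (4.0, 2.0)    # Euclidean distance from the mean: about 1.41

print(mahalanobis_2d(along_trend, data))
print(mahalanobis_2d(off_trend, data))
```

The point that breaks the correlation is closer to the mean in Euclidean terms but much farther in Mahalanobis terms.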

BACON
• Blocked Adaptive Computationally Efficient Outlier Nominators
• Initial set in semi-supervised mode
• New set: based on a threshold

DENSITY BASED APPROACHES

DB
• Distance-based approach
Being in the middle is irrelevant if we have no neighbors

LOF motivation: when is DB good?

LOF
• Local Outlier Factor
• Basic idea: comparison only with the neighbors
  o Local density
• Outlier criterion
  o The local density is much lower than that of the neighbors

LOF
Local outlier factor: if my neighbors are lonely too, then everything is all right
LOF in R: DMwR::lofactor
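The local density comparison can be sketched in a few lines of plain Python. This follows the usual LOF definition (k-distance, reachability distance, local reachability density) and is not the DMwR implementation; the sample points and k = 3 are hypothetical, and the sketch assumes there are no duplicated points.

```python
from math import dist

def lof_scores(points, k):
    """Local outlier factor of every point, using its k nearest neighbours."""
    n = len(points)
    k_distance, neighbors = [], []
    for i, p in enumerate(points):
        others = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        kd = others[k - 1][0]
        k_distance.append(kd)
        # Ties at the k-distance are all counted as neighbours.
        neighbors.append([j for d, j in others if d <= kd])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_distance[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [len(neighbors[i]) / sum(reach_dist(i, j) for j in neighbors[i])
           for i in range(n)]
    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

# Hypothetical data: a 3x3 grid of points and one isolated point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(6.0, 6.0)], k=3)
print([round(s, 2) for s in scores])
```

Grid points score close to 1 because their neighbours are about as dense as they are; the isolated point scores well above 1.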

TIME SERIES

Time Series Outliers
• Time as a context attribute is relevant
• The outlier to find can be
  o a single deviant point
  o a whole subsequence
• Groups of points without relationships are irrelevant
  o No need for collective outliers
• Usually in 1D
  o direct generalization to more dimensions

Outliers among sequences

• Ideas
  o project the subsequence into one value
  o store only the similarity matrix
• From that point: any clustering method works
• Typical distance metrics
  o Numerical: simple Euclidean distance
  o Discrete: length of the longest common subsequence

Distance functions of time series
• Euclidean
  o Offset on axis x?
• Dynamic time warping
  o Comparison based on outliers
• Length of common subsequence

Dynamic time warping
• Points are compared regardless of their exact position
  o Motivation: voice recognition

Dynamic time warping

Sakoe-Chiba band
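A minimal dynamic time warping sketch with an optional Sakoe-Chiba band, which limits how far the alignment may drift from the diagonal. The absolute difference as local cost, the function name and the two series are assumptions; with a band and series of very different lengths no valid path may exist, in which case the result is infinity.

```python
from math import inf

def dtw_distance(a, b, band=None):
    """DTW distance of two numeric sequences.

    band: maximum allowed |i - j| between aligned indices (Sakoe-Chiba band);
    None means no restriction.
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        if band is None:
            lo, hi = 1, m
        else:
            lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Hypothetical series: the same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]

print(dtw_distance(a, b))          # warping absorbs the shift
print(dtw_distance(a, b, band=1))  # a band of 1 is still wide enough
print(dtw_distance(a, b, band=0))  # band 0 forces a point-by-point comparison
```

A narrow band also reduces the number of table cells to fill, which is the practical reason for using it.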

Longest common subsequence
• Exact time is abstracted
• Focus on the order of values

Longest common subsequence
• Exact time is abstracted
• Focus on the order of values
• Generalization to numerical values
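The length of the longest common subsequence can be computed with the classic dynamic programming table. One possible generalization to numerical values, assumed here, is to treat two numbers as matching when they differ by at most a tolerance; the `match` parameter and the sample sequences are illustrative.

```python
def lcs_length(a, b, match=lambda x, y: x == y):
    """Length of the longest common subsequence of a and b.

    match decides whether two elements count as equal.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if match(a[i - 1], b[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]

# Discrete symbols: exact equality.
print(lcs_length("ABCBDAB", "BDCABA"))

# Numerical values: hypothetical tolerance of 0.25.
close = lambda x, y: abs(x - y) <= 0.25
print(lcs_length([1.0, 2.0, 3.0, 4.0], [1.1, 9.0, 2.9, 4.2], match=close))
```

Only the order of matching values matters; where exactly in time they occur is ignored.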

Outliers in sequences

Outliers in sequences
Original data set: 5 different values
With removal of this point: 4 values are enough
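The "5 values versus 4 values" observation can be sketched as a search for elements whose removal shrinks the set of distinct values needed to describe the sequence. The sequence and the function name are hypothetical, and this covers only the simplest case of removing a single element.

```python
from collections import Counter

def removal_candidates(sequence):
    """Return (index, value) of elements whose removal reduces the
    number of distinct values in the sequence by one."""
    counts = Counter(sequence)
    return [(i, v) for i, v in enumerate(sequence) if counts[v] == 1]

# Hypothetical sequence: the pattern 1, 2, 3, 4 with a single 7 in it.
sequence = [1, 2, 3, 4, 1, 2, 3, 4, 7, 1, 2, 3, 4]
print(len(set(sequence)))            # distinct values in the original data
print(removal_candidates(sequence))  # removing this element leaves 4 values
```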

STREAM PROCESSING

Stream processing
1. Many sources
2. With unknown sampling frequency
Resource requirements
• Once per stream: "Local maximum?"
• About the stream at all times: "Report each new maximum"

Typically sliding window approaches
• Autocorrelation methods
  o Where do we differ from the predicted value?
  o Where does the autocorrelation model change?
• Storm application in the lab
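A minimal sliding window sketch: the mean of the last few observations serves as a naive prediction, and a value is reported when it deviates from that prediction by more than a few standard deviations. The window size, the threshold, the stream values and the function name are all assumptions; a real autocorrelation model would replace the window mean.

```python
from collections import deque
from statistics import mean, stdev

def sliding_window_outliers(stream, window=5, threshold=3.0):
    """Yield (index, value) for values far from the mean of the previous window."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    for t, value in enumerate(stream):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield t, value
        recent.append(value)

# Hypothetical metric stream with one spike.
stream = [10, 11, 10, 12, 11, 10, 11, 40, 11, 10, 12]
print(list(sliding_window_outliers(stream)))
```

Memory use is bounded by the window size, so the same detector can run on many streams at once. Note that the spike itself enters the window afterwards and temporarily widens the tolerance.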

Summary
• Hybrid methods can be extremely successful
  o Visual + automated combinations: precision, speed
• Usually from business outliers to platform metrics
  o Triggered sliding window
• Once we have an outlier
  o boundaries
  o real-time analysis
