Published by Brent Lyons. Modified over 9 years ago.
1
OUTLIER DETECTION IN IT DATA
Agnes Salanki, salanki@mit.bme.hu
Budapest University of Technology and Economics, Department of Measurement and Information Systems, Fault Tolerant Systems Research Group
2
Small motivational example (1949) Hadlum vs. Hadlum
3
Small motivational example (1949)
Source: http://www.siam.org/meetings/sdm10/tutorial3.pdf
Average gestation period: 280 days (40 weeks)
Mrs. Hadlum: 349 days
4
From the system modelling aspect
Goals of infrastructure data analysis: identify
  o Operational domains
  o Domain boundaries
  o Transition effects
  o Without one specific high-level QoS metric
Carrier grade in telecommunication: 99.999% availability, so faults are rare
Outlier detection
So far: typical clustering
5
Terminology…
anomaly, surprise, rare event, novelty, outlier, exception, aberration, peculiarity, discordant observation
1. Rare: low relative frequency of occurrence
2. Anomalous: "deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980)
6
Outline
Taxonomy
  o Point vs. Collective
  o Behavioral vs. Contextual
  o Effect aspect
Approaches
  o Visual methods
  o Distance based
  o Density based
Temporal
  o Time series
  o Stream data
7
TAXONOMY
8
Point vs. Collective?
Point anomaly
  o an individual data instance
  o E.g., low service throughput for a short time interval
Collective anomaly
  o collection of related data instances
  o relationship?
  o E.g., continuously high CPU usage
9
Point anomaly o an individual data instance Collective anomaly o collection of related data instances o relation?
10
Point vs. Collective?
Point anomaly
  o an individual data instance
Collective anomaly
  o collection of related data instances
  o relation?
A single value of -5.7 is typical; 20 of them right after each other is rather interesting
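The -5.7 example can be sketched as a run-length check: a value is unremarkable on its own, but a long uninterrupted run of it is reported. This is only a minimal illustration; the function name `long_runs`, the sample series, and the choice of "identical consecutive values" as the relationship between instances are assumptions, not something the slides specify.

```python
from itertools import groupby


def long_runs(values, min_length):
    """Return (start_index, value, length) for each run of identical
    consecutive values that is at least min_length long."""
    runs = []
    index = 0
    for value, group in groupby(values):
        length = len(list(group))
        if length >= min_length:
            runs.append((index, value, length))
        index += length
    return runs


# Hypothetical series: isolated -5.7 values are ignored, the run of 20 is reported.
series = [1.2, -5.7, 0.3] + [-5.7] * 20 + [2.1, -5.7]
print(long_runs(series, min_length=20))
```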
11
Point vs. Collective? Point anomaly Collective anomaly
12
Contextual vs. Behavioral
Contextual anomaly
  o an instance is anomalous in a specific context but not otherwise
  o Only with respect to contextual variables: time, location, benchmark configuration, etc.
  o E.g., continuously high CPU usage is acceptable only during workdays
Behavioral anomaly
  o an instance is anomalous without any context information
  o E.g., continuously high CPU usage is never acceptable
13
Contextual vs. Behavioral
Contextual anomaly
  o a data instance is anomalous in a specific context
  o but not otherwise
Behavioral anomaly
  o a data instance is anomalous independently of the context?
Contextual: -4°C in the middle of March
Behavioral: simply too low
14
Contextual vs. Behavioral
Without the time axis, a cpu_idle value of 600 cannot be considered suspicious
15
Time Series Outliers
16
Categorization by system effects
Additive Outlier
  o Subsequent observations are unaffected
Level Shift Outlier
  o It has a permanent effect
Innovational Outlier
  o Initial impact + increasing effect as time proceeds
Transient Change Outlier
  o ~Innovational outliers, but the effect diminishes exponentially, later the series returns to normal
17
Basic types
Additive: subsequent observations are unaffected
Level Shift: permanent effect
Transient change: the effect diminishes exponentially, later the series returns to normal
Innovational: initial impact + increasing effect as time proceeds
18
Basic types
Additive Outlier
  o Subsequent observations are unaffected
Level Shift Outlier
  o In contrast to an additive outlier, it affects many observations and has a permanent effect
Innovational Outlier
  o An initial impact with effects lingering over subsequent observations; the influence may increase as time proceeds
Transient Change Outlier
  o Similar to a level shift outlier, but the effect diminishes exponentially over the subsequent observations; eventually, the series returns to its normal level
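The difference between three of these types can be sketched by generating the effect that each adds to an otherwise normal series. This is a minimal illustration under assumptions: the function name `outlier_effect`, the `size` and `delta` parameters, and the geometric decay used for the transient change are chosen here for illustration. The innovational outlier is left out because its shape depends on the underlying time series model.

```python
def outlier_effect(kind, length, start, size=1.0, delta=0.7):
    """Return the additive effect of one outlier on a series of the given length.

    kind  -- "additive", "level_shift" or "transient_change"
    start -- index at which the outlier occurs
    size  -- initial impact (assumed parameter)
    delta -- decay rate of the transient change, 0 < delta < 1 (assumed parameter)
    """
    effect = [0.0] * length
    for t in range(start, length):
        k = t - start  # steps elapsed since the outlier occurred
        if kind == "additive":
            effect[t] = size if k == 0 else 0.0   # one observation only
        elif kind == "level_shift":
            effect[t] = size                      # permanent effect
        elif kind == "transient_change":
            effect[t] = size * delta ** k         # exponentially diminishing
        else:
            raise ValueError(f"unknown outlier kind: {kind}")
    return effect


for kind in ("additive", "level_shift", "transient_change"):
    print(kind, outlier_effect(kind, length=6, start=2, delta=0.5))
```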
19
GENERAL APPROACHES
20
Multi-dimensional data Distance based o Deviation from the rest of the dataset Density based o Deviation from the neighborhood
21
Timeseries Distance based o Deviation from the rest of the dataset Density based o Deviation from the neighborhood
22
Ranking aspects of methods
Complexity
  o Resource requirements
Incremental maintainability
  o Support for online analysis
Required domain-specific knowledge
  o Number of parameters
Sensitivity
  o Sensitivity of the method to its parameters
Semi-supervised approach
  o Support for an initial "typical" or "faulty" subset
23
OUTLIER DETECTION IN EDA
24
1D Aggregation suppresses outliers Unimodal distributions: boxplot/rugplot
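The boxplot rule for unimodal data can be sketched as follows: points beyond 1.5 times the interquartile range from the quartiles are drawn individually. The slides do not state the whisker factor; 1.5 is the usual convention and is an assumption here, as are the function name `boxplot_outliers` and the sample values.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values that fall outside the boxplot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical response times in milliseconds.
response_times = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(boxplot_outliers(response_times))
```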
25
1D Aggregation suppresses outliers Real outliers can be masked on a boxplot
28
Generalization of 1D density
29
Support of multiple operational modes, e.g. READs and WRITEs in a database
Everything else can be marked as "outlier" or "transition"
2D density plot, mosaic plot, heatmap, etc.: often overaggregation without concrete points
30
Multi-dimensional plots Good for detecting 1D outliers in one go
31
Multi-dimensional plots: parallel coordinates
Aggregation suppresses outliers
32
Multi-dimensional plots: biplots
Biplot
  o Data points and variables
PCA-based
  o Goal: maximum of variability
  o Indicator variables
  o Modified point-point distance
33
DISTANCE BASED APPROACHES
34
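Convex hull
Convex hull method: Tukey, 1974
Extreme points
Median: only at the end
[Figure: eight points on a line, ranked 1 to 8 from the left and 8 to 1 from the right; the minimum of the two ranks is 1 2 3 4 4 3 2 1]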
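The one-dimensional case in the figure can be sketched in a few lines: each point gets the smaller of its rank from the left and its rank from the right, so the extreme points have depth 1 and the median has the largest depth. The function name `depth_ranks` and the sample values are assumptions for illustration; the multi-dimensional convex hull peeling itself is not implemented here.

```python
def depth_ranks(values):
    """Return, for each value, min(rank from the left, rank from the right)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    depth = [0] * n
    for rank_from_left, i in enumerate(order, start=1):
        rank_from_right = n - rank_from_left + 1
        depth[i] = min(rank_from_left, rank_from_right)
    return depth


# Hypothetical sample of eight sorted measurements.
sample = [1, 3, 4, 5, 6, 7, 9, 20]
print(depth_ranks(sample))
```

Points with the smallest depth are the first candidates for outliers; the deepest points approximate the median.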
35
Convex hull Convex hull method: Tukey, 1974
36
Convex hull Convex hull method: Tukey, 1974
37
DB
38
MCD Minimum Covariance Determinant Idea o „Normal domain”: the most compact subset What does compact mean?
39
MCD
Minimum Covariance Determinant
Idea
  o "Normal domain": the most compact subset
What does compact mean?
[Figure: values 0.0014, 0.00041, 0.00011]
Exhaustive search?
choose(n = 1000, k = 900)
[1] 6.385051e+139
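A minimal sketch of the exhaustive search, assuming that "compact" means the smallest determinant of the sample covariance matrix, as the method's name suggests. It is feasible only for tiny data sets, which is exactly the point of the `choose` figure above. The helper names `cov_det_2d` and `brute_force_mcd` and the six sample points are hypothetical.

```python
import math
from itertools import combinations


def cov_det_2d(points):
    """Determinant of the sample covariance matrix of 2D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return sxx * syy - sxy ** 2


def brute_force_mcd(points, k):
    """Try every subset of size k and keep the one with the smallest determinant."""
    return min(combinations(points, k), key=cov_det_2d)


# Hypothetical data: a tight cluster plus one distant point.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (1.0, 0.8), (8.0, 9.0)]
subset = brute_force_mcd(points, k=5)
print(subset)

# The number of candidate subsets for n = 1000, k = 900, as on the slide.
print(math.comb(1000, 900))
```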
40
FAST-MCD
41
Distance function
With real data: different ranges
  o E.g., memory in bytes/MB/GB
  o Comparison of CPU and memory values?
Normalization into [0, 1]
Other distance functions
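Normalization into [0, 1] can be sketched as min-max scaling, which makes the result independent of the unit of measurement. The slides do not say which normalization is meant; min-max scaling is an assumption, as are the function name `min_max_normalize` and the sample values.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant metric carries no distance information.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# The same hypothetical memory readings in bytes and in GB.
memory_bytes = [1_000_000_000, 2_000_000_000, 4_000_000_000]
memory_gb = [1, 2, 4]
print(min_max_normalize(memory_bytes))
print(min_max_normalize(memory_gb))
```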
42
Mahalanobis distance
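A minimal two-dimensional sketch of the Mahalanobis distance, which measures the distance from a mean in units of the data's own covariance, so that differently scaled or correlated metrics become comparable. The function name `mahalanobis_2d` and the sample covariance matrix are assumptions for illustration.

```python
import math


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of a 2D point, given a mean and a 2x2 covariance matrix."""
    (a, b), (b2, c) = cov
    det = a * c - b * b2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T, written out for the 2x2 case.
    squared = (c * dx * dx - (b + b2) * dx * dy + a * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical strongly correlated metrics: moving against the correlation is "farther".
cov = ((1.0, 0.9), (0.9, 1.0))
print(mahalanobis_2d((1.0, 1.0), (0.0, 0.0), cov))
print(mahalanobis_2d((1.0, -1.0), (0.0, 0.0), cov))
```

With the identity matrix as covariance, the result reduces to the Euclidean distance.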
43
BACON
Blocked Adaptive Computationally Efficient Outlier Nominators
Initial set in semi-supervised mode
New set: based on a threshold
44
DENSITY BASED APPROACHES
45
DB
Distance-based approach
It is irrelevant that we are in the middle if we do not have neighbors
46
LOF motivation: when does the distance-based (DB) approach work well?
47
LOF
Local Outlier Factor
Basic idea: comparison only with the neighbors
  o Local density
Outlier criterion
  o The local density is much lower than that of the neighbors
48
LOF
Local outlier factor
If my neighbors are lonely too, then everything is all right
LOF: DMwR::lofactor
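The idea can be sketched in plain Python: a point's local density is compared with the local densities of its k nearest neighbors, and a ratio well above 1 marks an outlier. This is a minimal sketch of the standard LOF definition, not the implementation behind `DMwR::lofactor`; the function name `lof_scores` and the sample points are hypothetical, and the code assumes that there are no duplicate points.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, using its k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(others[:k])
        k_distance.append(dist[i][others[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a 3x3 grid of points plus one lonely point far away.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Points inside the grid have neighbors that are about as "lonely" as they are, so their scores stay close to 1; only the distant point scores much higher.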
49
TIME SERIES
50
Time Series Outliers
Time as a context attribute is relevant
The outlier to find can be
  o A single deviant point
  o A whole subsequence
Groups of points without relationships are irrelevant
  o No need for collective outliers
Usually in 1D
  o direct generalization to more dimensions
51
Outliers among sequences
52
Ideas
  o project the subsequence into one value
  o store only the similarity matrix
From that point: any clustering method works
Typical distance metrics
  o Numerical: simple Euclidean distance
  o Discrete: length of the longest common subsequence
53
Distance functions of time series
Euclidean
  o Offset on axis x?
Dynamic time warping
  o Comparison based on outliers
Length of common subsequence
54
Dynamic time warping
The points are compared regardless of their exact position
  o Motivation: voice recognition
55
Dynamic time warping
56
Sakoe-Chiba band
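Dynamic time warping restricted to a Sakoe-Chiba band can be sketched as a dynamic program in which point i of one series may only be matched with points at most `window` steps away in the other. This is a minimal sketch; the function name `dtw_distance`, the absolute difference as local cost, and the sample series are assumptions.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance of two numeric sequences, optionally limited to a band.

    window -- half-width of the Sakoe-Chiba band; None means no restriction.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to connect the two end points.
    window = max(window, abs(n - m))

    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Hypothetical series: the same peak, shifted by one time step.
first = [0, 0, 1, 2, 1, 0]
second = [0, 1, 2, 1, 0, 0]
print(dtw_distance(first, second, window=1))  # the shift is absorbed by the warping
print(dtw_distance(first, second, window=0))  # point-by-point comparison
```

A band of width 0 degenerates to a point-by-point comparison, which is why an offset on the time axis is penalized there but not under warping.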
57
Longest common subsequence Exact time is abstracted Focus on the order of values
58
Longest common subsequence Exact time is abstracted Focus on the order of values Generalization to numerical values
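The generalization to numerical values can be sketched by treating two values as equal when they differ by at most a tolerance. This is a minimal sketch of that idea; the function name `lcss_length`, the parameter `epsilon`, and the sample series are assumptions, and the code handles numeric sequences only.

```python
def lcss_length(a, b, epsilon=0.0):
    """Length of the longest common subsequence of two numeric sequences.

    Two values match when they differ by at most epsilon; only the order
    of the values matters, not their exact position in time.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= epsilon:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


print(lcss_length([1, 2, 3, 4], [2, 3, 4, 5]))                     # exact matching
print(lcss_length([1.0, 2.0, 3.0], [1.1, 2.1, 2.9], epsilon=0.2))  # tolerant matching
```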
59
Outliers in sequences
60
Outliers in sequences
61
Outliers in sequences Original data set: 5 different values With removal of this point: 4 values are enough
62
STREAM PROCESSING
63
Stream processing
1. Many sources
2. With unknown sampling frequency
Resource requirements
Once per stream: "Local maximum?"
About stream at all times: "Report each new maximum"
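A standing query such as "Report each new maximum" can be sketched as a generator that keeps only one value of state, so its resource requirements do not grow with the stream. The function name `new_maxima` and the sample stream are hypothetical.

```python
def new_maxima(stream):
    """Yield every value that exceeds all values seen so far."""
    current_max = None
    for value in stream:
        if current_max is None or value > current_max:
            current_max = value
            yield value


# Hypothetical stream of measurements.
print(list(new_maxima([3, 1, 4, 1, 5, 9, 2, 6])))
```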
64
Typically sliding window approaches
Autocorrelation methods
  o Where do we differ from the predicted value?
  o Where does the autocorrelation model change?
Storm application in the lab
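A sliding window check can be sketched as follows: each new value is compared with a prediction made from the last few values, and a large deviation is reported. The window mean stands in for the prediction here, which is a simplification of the autocorrelation models mentioned above; the function name `sliding_window_outliers`, its parameters, and the sample stream are assumptions.

```python
from collections import deque
import statistics


def sliding_window_outliers(stream, window_size=5, threshold=3.0):
    """Yield (index, value) for values far from the mean of the preceding window."""
    window = deque(maxlen=window_size)
    for index, value in enumerate(stream):
        if len(window) == window_size:
            predicted = statistics.fmean(window)   # naive prediction
            spread = statistics.pstdev(window)
            if spread > 0 and abs(value - predicted) > threshold * spread:
                yield index, value
        window.append(value)  # the oldest value drops out automatically


# Hypothetical metric stream with one spike.
metric = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
print(list(sliding_window_outliers(metric)))
```

Only the window is kept in memory, so the check works on a stream of any length.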
65
Summary
Hybrid methods can be extremely successful
  o Visual + automated combinations: precision, speed
Usually from business outliers to platform metrics
  o Triggered sliding window
Once we have an outlier
  o boundaries
  o real time analysis