
1 Budapest University of Technology and Economics, Department of Measurement and Information Systems, Fault Tolerant Systems Research Group. OUTLIER DETECTION IN IT DATA. Agnes Salanki, salanki@mit.bme.hu

2 Small motivational example (1949) Hadlum vs. Hadlum

3 Small motivational example (1949) Source: http://www.siam.org/meetings/sdm10/tutorial3.pdf Average pregnancy length: 280 days (40 weeks) Mrs. Hadlum: 349 days

4 From the system modelling aspect
• Goals of infrastructure data analysis: identify
  o Operational domains
  o Domain boundaries
  o Transition effects
  o Without one specific high-level QoS metric
• Carrier grade in telecommunication: 99.999% availability, so faults are rare
• Outlier detection (so far: typical clustering)

5 Terminology
Related terms: anomaly, surprise, rare event, novelty, outlier, exception, aberration, peculiarity, discordant observations
1. Rare: low relative frequency of occurrence
2. Anomalous: "deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980)

6 Outline
• Taxonomy
  o Point vs. Collective
  o Behavioral vs. Contextual
  o Effect aspect
• Approaches
  o Visual methods
  o Distance based
  o Density based
• Temporal
  o Time series
  o Stream data

7 TAXONOMY

8 Point vs. Collective?
• Point anomaly
  o an individual data instance
  o E.g., low service throughput for a short time interval
• Collective anomaly
  o collection of related data instances
  o relationship?
  o E.g., continuously high CPU usage

9 Point vs. Collective?
• Point anomaly
  o an individual data instance
• Collective anomaly
  o collection of related data instances
  o relation?

10 Point vs. Collective?
• Point anomaly
  o an individual data instance
• Collective anomaly
  o collection of related data instances
  o relation?
One single value of -5.7 is typical; 20 right after each other is rather interesting
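The "20 right after each other" remark can be illustrated with a small run detector. This is only a sketch: the function name `find_runs`, the sample series, and the minimum run length of 20 are assumptions made for illustration, not part of the original slides.

```python
def find_runs(values, predicate, min_len):
    """Return (start_index, length) for every maximal run of consecutive
    values that satisfy `predicate` and is at least `min_len` long."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if predicate(v):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Close a run that reaches the end of the series.
    if start is not None and len(values) - start >= min_len:
        runs.append((start, len(values) - start))
    return runs


# Hypothetical series: -5.7 appears alone twice and once as a run of 20.
series = [0.3, -5.7, 1.1] + [-5.7] * 20 + [0.8, -5.7]
runs = find_runs(series, lambda v: abs(v + 5.7) < 1e-9, min_len=20)
print(runs)
```

The isolated occurrences are ignored; only the long run is reported as a candidate collective anomaly.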

11 Point vs. Collective? Point anomaly Collective anomaly

12 Contextual vs. Behavioral
• Contextual anomaly
  o an instance is anomalous in a specific context but not otherwise
  o Only with respect to contextual variables: time, location, benchmark configuration, etc.
  o E.g., continuously high CPU usage is acceptable only during workdays
• Behavioral anomaly
  o an instance is anomalous without any context information
  o E.g., continuously high CPU usage is never acceptable

13 Contextual vs. Behavioral
• Contextual anomaly
  o a data instance is anomalous in a specific context
  o but not otherwise
• Behavioral anomaly
  o a data instance is anomalous independently from the context?
Contextual: -4°C in the middle of March
Behavioral: simply too low

14 Contextual vs. Behavioral Without the time axis, a cpu_idle value of 600 cannot be considered suspicious
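A minimal sketch of contextual detection follows: each value is compared only with other values from the same context. The context labels ("workday", "night"), the sample numbers, and the 3-sigma threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def contextual_outliers(records, threshold=3.0):
    """records: list of (context, value) pairs.
    Returns indices of values lying more than `threshold` standard
    deviations from the mean of their own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    stats = {c: (mean(v), pstdev(v)) for c, v in by_context.items()}

    flagged = []
    for i, (context, value) in enumerate(records):
        mu, sigma = stats[context]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical cpu_idle readings: about 100 on workdays, about 600 at night.
workday = [100, 110, 90, 105, 95, 100, 108, 92, 101, 99, 600]
night = [600, 590, 610, 605, 595]
records = [("workday", v) for v in workday] + [("night", v) for v in night]
flagged = contextual_outliers(records)
print(flagged)
```

The value 600 is flagged only where it occurs during a workday; the same value at night is unremarkable, which matches the point of the slide.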

15 Time Series Outliers

16 Categorization by system effects
• Additive outlier
  o Subsequent observations are unaffected
• Level Shift Outlier
  o It has a permanent effect
• Innovational Outlier
  o Initial impact + increasing effect as time proceeds
• Transient Change Outlier
  o Similar to innovational outliers, but the effect diminishes exponentially, later the series returns to normal

17 Basic types
• Additive: subsequent observations are unaffected
• Level Shift: permanent effect
• Transient change: the effect diminishes exponentially, later the series returns to normal
• Innovational: initial impact + increasing effect as time proceeds

18 Basic types
• Additive outlier
  o Subsequent observations are unaffected
• Level Shift Outlier
  o In contrast to additive outliers, a level shift outlier affects many observations and has a permanent effect
• Innovational Outlier
  o Characterized by an initial impact with effects lingering over subsequent observations. The influence of the outlier may increase as time proceeds
• Transient Change Outlier
  o Transient change outliers are similar to level shift outliers, but the effect of the outlier diminishes exponentially over the subsequent observations. Eventually, the series returns to its normal level.
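The effect patterns above can be sketched as the offset each outlier type adds to an otherwise normal series. The function name `outlier_effect`, the unit magnitude, and the decay factor `delta` are assumptions for illustration. The innovational type is left out because its shape depends on the dynamics of the underlying series model.

```python
def outlier_effect(kind, n, t0, magnitude=1.0, delta=0.7):
    """Offset added to a series of length n by an outlier occurring at t0.

    kind: "additive", "level_shift" or "transient_change".
    delta: assumed decay factor in (0, 1) for the transient change.
    """
    effect = [0.0] * n
    for t in range(t0, n):
        if kind == "additive":
            # Only the observation at t0 is affected.
            effect[t] = magnitude if t == t0 else 0.0
        elif kind == "level_shift":
            # Permanent shift from t0 onwards.
            effect[t] = magnitude
        elif kind == "transient_change":
            # The shift decays exponentially back towards zero.
            effect[t] = magnitude * delta ** (t - t0)
        else:
            raise ValueError(f"unknown outlier kind: {kind}")
    return effect


for kind in ("additive", "level_shift", "transient_change"):
    print(kind, outlier_effect(kind, n=5, t0=2, delta=0.5))
```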

19 GENERAL APPROACHES

20 Multi-dimensional data
• Distance based
  o Deviation from the rest of the dataset
• Density based
  o Deviation from the neighborhood

21 Time series
• Distance based
  o Deviation from the rest of the dataset
• Density based
  o Deviation from the neighborhood

22 Ranking aspects of methods
• Complexity
  o Resource requirements
• Incremental maintainability
  o Support for online analysis
• Required domain-specific knowledge
  o Number of parameters
• Sensitivity
  o Sensitivity of the method to its parameters
• Semi-supervised approach
  o Support for an initial "typical" or "faulty" subset

23 OUTLIER DETECTION IN EDA

24 1D
• Aggregation suppresses outliers
Unimodal distributions: boxplot/rugplot

25 1D
• Aggregation suppresses outliers
Real outliers can be masked on a boxplot
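The boxplot rule can be sketched with the common 1.5 × IQR whisker convention. The function name and both sample data sets are assumptions for illustration. The second data set shows masking: a point sitting between two modes falls inside the whiskers and is not reported.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values outside the whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Unimodal data: the extreme value is reported.
unimodal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_outliers(unimodal))

# Bimodal data: the lone point 5 between the two modes is masked.
bimodal = [1, 1, 1, 1, 1, 5, 10, 10, 10, 10, 10]
print(boxplot_outliers(bimodal))
```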

26

27

28
• Generalization of 1D density

29 Support of multiple operational modes, e.g. READs and WRITEs in a database
• Everything else can be marked as "outlier" or "transition"
• 2D density plot, mosaic plot, heatmap, etc.
• Drawback: often overaggregation without concrete points

30 Multi-dimensional plots Good for detecting 1D outliers in one go

31 Multi-dimensional plots: parallel coordinates Aggregation suppresses outliers

32 Multi-dimensional plots: biplots
• Biplot
  o Data points and variables
• PCA-based
  o Goal: maximum of variability
  o Indicator variables
  o Modified point-point distance

33 DISTANCE BASED APPROACHES

34 Convex hull
• Convex hull method: Tukey, 1974
• Extreme points
• Median: only at the end
[Figure: per-point labels 2 3 1 5 4 6 7 8 and 7 6 8 4 5 3 2 1; Min.: 2 3 1 4 4 3 2 1]

35 Convex hull
• Convex hull method: Tukey, 1974

36 Convex hull
• Convex hull method: Tukey, 1974
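One way to read the convex hull slides is as hull peeling: the extreme points on the outermost hull are removed first, and the most central points are reached only at the end. The sketch below assumes that reading. The function names, the use of Andrew's monotone chain for the hull, and the sample points are choices made for illustration.

```python
def cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel_depths(points):
    """Map each point to the layer (1 = outermost) at which it is peeled."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
        layer += 1
    return depth


# Hypothetical data: two nested squares and a centre point.
pts = [(0, 0), (4, 0), (4, 4), (0, 4),
       (1, 1), (3, 1), (3, 3), (1, 3),
       (2, 2)]
depth = peel_depths(pts)
print(depth)
```

Points with a low layer number are the candidates for outliers; the point peeled last plays the role of a multivariate median.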

37 DB

38 MCD
• Minimum Covariance Determinant
• Idea
  o "Normal domain": the most compact subset
• What does compact mean?

39 MCD
• Minimum Covariance Determinant
• Idea
  o "Normal domain": the most compact subset
• What does compact mean? (Values shown in the figure: 0.0014, 0.00041, 0.00011)
• Exhaustive search? In R, choose(n = 1000, k = 900) returns 6.385051e+139
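To make the exhaustive search concrete, the sketch below tries every subset of size k on a tiny 2D data set and keeps the one whose sample covariance matrix has the smallest determinant. The function names and the data are assumptions for illustration. The final line reproduces the subset count from the slide and shows why this does not scale.

```python
from itertools import combinations
from math import comb


def cov_det_2d(points):
    """Determinant of the 2x2 sample covariance matrix of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx * syy - sxy ** 2


def exhaustive_mcd(points, k):
    """Most compact k-subset: minimal covariance determinant."""
    return min(combinations(points, k), key=cov_det_2d)


# Hypothetical data: six clustered points and two far-away ones.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
           (1.1, 1.2), (0.8, 0.8), (1.0, 1.3)]
far = [(8.0, 9.0), (9.0, -7.0)]
best = exhaustive_mcd(cluster + far, k=6)
print(sorted(best))

# Number of subsets for n = 1000, k = 900, as on the slide.
n_subsets = comb(1000, 900)
print(f"{n_subsets:.6e}")
```

With 8 points there are only 28 subsets to check; for the slide's n = 1000 the count is astronomically large, which motivates approximate procedures such as FAST-MCD.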

40 FAST-MCD X

41 Distance function
• In real data: different ranges
  o E.g., memory in bytes/MB/GB
  o Comparison of CPU and memory values?
• Normalization into [0, 1]
• Other distance functions
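Normalization into [0, 1] is commonly done with min-max scaling per variable. A minimal sketch, with the function name and sample columns assumed for illustration:

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical metrics on very different scales.
cpu_percent = [10, 55, 100]
memory_bytes = [2_000_000_000, 4_000_000_000, 6_000_000_000]
print(min_max_normalize(cpu_percent))
print(min_max_normalize(memory_bytes))
```

After scaling, a distance computed over both columns is no longer dominated by the variable with the larger unit. Note that min-max scaling is itself sensitive to extreme values, since a single outlier defines the range.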

42 Mahalanobis distance
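The Mahalanobis distance measures how far a point is from the data mean in units that account for variance and correlation. A minimal 2D sketch follows; the function name and sample data are assumptions, and the 2x2 covariance matrix is inverted by hand to stay dependency-free.

```python
from math import sqrt


def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the mean of 2D `data`,
    using the sample covariance matrix of `data`."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T for a 2x2 matrix.
    d2 = (dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det
    return sqrt(d2)


# Hypothetical positively correlated data centred at the origin.
data = [(-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2), (-1, 1), (1, -1)]
along = mahalanobis_2d((1, 1), data)    # follows the correlation
across = mahalanobis_2d((1, -1), data)  # goes against it
print(along, across)
```

Both test points have the same Euclidean distance from the mean, but the one lying against the correlation direction gets the larger Mahalanobis distance.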

43 BACON
• Blocked Adaptive Computationally Efficient Outlier Nominators
• Initial set in semi-supervised mode
• New set: based on a threshold

44 DENSITY BASED APPROACHES

45 DB
• Distance-based approach
It is irrelevant that we are in the middle, if we do not have neighbors

46 LOF motivation: when is DB good?

47 LOF
• Local Outlier Factor
• Basic idea: comparison only with the neighbors
  o Local density
• Outlier criterion
  o The local density is much lower than that of the neighbors

48 LOF (Local outlier factor) If my neighbors are lonely too, then everything is all right. In R: DMwR::lofactor
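The slide points to the R function DMwR::lofactor. The Python sketch below is a separate, simplified illustration of the same idea, following the usual definitions of k-distance, reachability distance, and local reachability density. The function name and test data are assumptions, and it assumes the data contain no more than k identical points.

```python
from math import dist


def lof_scores(points, k):
    """Local Outlier Factor of every point, using k nearest neighbours.
    Scores near 1 mean a density similar to that of the neighbours;
    scores well above 1 indicate a locally sparse point."""
    n = len(points)
    k_dist, neighbours = [], []
    for i in range(n):
        d = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        kd = d[k - 1][0]  # distance to the k-th nearest neighbour
        k_dist.append(kd)
        # Ties at the k-distance are all included in the neighbourhood.
        neighbours.append([j for dij, j in d if dij <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist(points[i], points[j])) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    return [sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]


# Hypothetical data: a 3x3 grid plus one isolated point.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(10, 10)]
scores = lof_scores(points, k=3)
print([round(s, 2) for s in scores])
```

The grid points score close to 1 because their neighbors are about as "lonely" as they are, while the isolated point scores far higher.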

49 TIME SERIES

50 Time Series Outliers
• Time as a context attribute is relevant
• The outlier to find can be
  o A single deviant point
  o A whole subsequence
• Groups of points without relationships are irrelevant: no need for collective outliers
• Usually in 1D
  o direct generalization to more dimensions

51 Outliers among sequences

52
• Ideas
  o project the subsequence into one value
  o store only the similarity matrix
• From that point: any clustering method works
• Typical distance metrics
  o Numerical: simple Euclidean distance
  o Discrete: length of the longest common subsequence

53 Distance functions of time series
• Euclidean
  o Offset on axis x?
• Dynamic time warping
  o Comparison based on outliers
• Length of common subsequence

54 Dynamic time warping
• Points are matched regardless of their exact position in time
  o Motivation: voice recognition

55 Dynamic time warping

56 Sakoe-Chiba band
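Dynamic time warping with a Sakoe-Chiba band can be sketched as a dynamic program in which point i of one series may only be matched with points j of the other where |i - j| stays within a window. The function name, the absolute-difference cost, and the sample series are assumptions for illustration.

```python
def dtw_distance(a, b, window=None):
    """DTW distance between numeric sequences a and b.
    `window` is the Sakoe-Chiba band half-width; None means no constraint."""
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the final cell.
    window = max(window, abs(n - m))

    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match
                                 cost[i][j - 1],      # b[j-1] repeats a match
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical series: the same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw_distance(a, b))            # unconstrained warping
print(dtw_distance(a, b, window=1))  # narrow band still absorbs the shift
print(dtw_distance(a, b, window=0))  # no warping: point-by-point comparison
```

A window of 0 reduces DTW to a point-by-point comparison, which penalizes the shift along the time axis. A small band tolerates the shift while bounding the computation to a strip around the diagonal.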

57 Longest common subsequence
• Exact time is abstracted
• Focus on the order of values

58 Longest common subsequence
• Exact time is abstracted
• Focus on the order of values
• Generalization to numerical values
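The length of the longest common subsequence can be computed with the standard dynamic program. One possible generalization to numerical values, assumed here for illustration, treats two numbers as matching when they differ by at most a tolerance `eps`. The function name and the examples are hypothetical.

```python
def lcs_length(a, b, eps=None):
    """Length of the longest common subsequence of a and b.
    With eps=None elements must be equal; otherwise two numbers match
    when their absolute difference is at most eps."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            same = (x == y) if eps is None else abs(x - y) <= eps
            if same:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


# Discrete sequences: only the order of symbols matters.
discrete = lcs_length("ABCBDAB", "BDCABA")
# Numerical sequences: values within 0.2 of each other count as equal.
numeric = lcs_length([1.0, 2.0, 3.0, 4.0], [1.1, 2.9, 4.05], eps=0.2)
print(discrete, numeric)
```

A short common subsequence relative to the sequence lengths indicates dissimilar sequences, so the length can be turned into a distance for any clustering or outlier method.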

59 Outliers in sequences

60 Outliers in sequences

61 Outliers in sequences Original data set: 5 different values With removal of this point: 4 values are enough

62 STREAM PROCESSING

63 Stream processing
1. Many sources
2. With unknown sampling frequency
Resource requirements
• Once per stream: "Local maximum?"
• About the stream at all times: "Report each new maximum"

64 Typically sliding window approaches
• Autocorrelation methods
  o Where do we differ from the predicted value?
  o Where does the autocorrelation model change?
• Storm application in the lab
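A minimal sliding-window sketch follows. As a simple stand-in for an autocorrelation model, it uses the mean of the last `window` values as the prediction and flags values that deviate from it by more than a multiple of the window's standard deviation. The function name, window size, threshold, and sample stream are assumptions for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def sliding_window_outliers(stream, window=10, threshold=3.0):
    """Yield (index, value) for values deviating from the mean of the
    previous `window` values by more than `threshold` standard deviations.
    Only the last `window` values are kept in memory."""
    recent = deque(maxlen=window)
    for t, x in enumerate(stream):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(x - mu) > threshold * sigma:
                yield t, x
        recent.append(x)


# Hypothetical metric stream with one spike.
stream = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 50, 10, 11]
found = list(sliding_window_outliers(stream, window=5))
print(found)
```

Because the function is a generator with bounded memory, it can consume an unbounded stream. Note that the spike itself enters the window afterwards and temporarily widens the accepted range.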

65 Summary
• Hybrid methods can be extremely successful
  o Visual + automated combinations: precision, speed
• Usually from business outliers to platform metrics
  o Triggered sliding window
• Once we have an outlier
  o boundaries
  o real-time analysis

