Published by Jean Lamb. Modified over 9 years ago.
1
Anomaly Detection Using Symmetric Compression Benjamin Arai & Chris Baron Computer Science and Engineering Department University of California - Riverside barai@cs.ucr.edu cbaron@cs.ucr.edu
2
What are anomalies? Something that is peculiar, irregular, abnormal, and difficult to classify with the surrounding data Anomalies are subject to interpretation Two anomalies can look completely different from one another
3
Why is this important? Security –Abnormal activity –Intrusion detection Health –Atypical rhythmic patterns (e.g. heartbeat, breathing) Equities and Financial Data Detection in general
4
Motivation Searching for a specific pattern is relatively easy for a computer (exact matching can run in linear time) and has been well researched (e.g. KMP, Boyer-Moore, edit distance) How does a computer detect surprising patterns without being told in advance what they look like? Utilize Kolmogorov complexity with compression!
5
Kolmogorov complexity and information distance K(x) – Length of the smallest program that prints out x K(x|y) – Length of the smallest program that prints out x given y as an input Information distance – How different are x and y? –Edit distance? –Normalize
6
Normalized information distance d(x, y) = (K(x|y) + K(y|x)) / K(xy) –Close to 0: x and y are very similar –Close to 1: x and y are very different Kolmogorov complexity is not computable, but compression does a good job of estimating it We use compression to find anomalies
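As a rough illustration of the distance on this slide, each Kolmogorov term can be replaced by a compressed size. The sketch below assumes bz2 as the compressor and approximates K(x|y) by C(yx) - C(y), where C is the compressed length in bytes. The helper names `csize` and `info_distance` are made up for this example and do not come from the slides.

```python
import bz2
import random

def csize(data: bytes) -> int:
    """Compressed length in bytes, used as a stand-in for K(data)."""
    return len(bz2.compress(data))

def info_distance(x: bytes, y: bytes) -> float:
    """Approximate (K(x|y) + K(y|x)) / K(xy) with compressed sizes."""
    cxy = csize(x + y)
    cyx = csize(y + x)
    k_x_given_y = cyx - csize(y)  # assumption: K(x|y) ~ C(yx) - C(y)
    k_y_given_x = cxy - csize(x)  # assumption: K(y|x) ~ C(xy) - C(x)
    return (k_x_given_y + k_y_given_x) / cxy

rng = random.Random(0)
text = b"the quick brown fox jumps over the lazy dog. " * 200
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

d_same = info_distance(text, text)   # identical inputs: expected near 0
d_diff = info_distance(text, noise)  # unrelated inputs: expected near 1
```

Real compressors add fixed overhead, so the estimate is noisy on very short inputs and the values are only approximately bounded by 0 and 1.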
7
How compression works Create a dictionary that maps long sequences to short codes The more often these long sequences occur, the better the compression (works well with text) e.g.: –the = 01 –and = 10 –algorithm = 11
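The dictionary idea on this slide can be tried out on a toy scale. The sketch below reuses the three codes from the slide and substitutes whole words only. The function name `encode` and the sample sentence are invented for illustration, and a real compressor works on bits and must keep the output unambiguously decodable, which this toy does not attempt.

```python
codes = {"the": "01", "and": "10", "algorithm": "11"}  # codes from the slide

def encode(words, table):
    """Replace every word found in the table by its short code."""
    return [table.get(w, w) for w in words]

words = "the algorithm and the data and the algorithm".split()
encoded = encode(words, codes)

raw_len = sum(len(w) for w in words)      # characters before substitution
short_len = sum(len(w) for w in encoded)  # characters after substitution
```

The more often the dictionary words occur in the input, the larger the gap between `raw_len` and `short_len`.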
8
Compression dictionary example
9
How compression works Bzip2 –Burrows-Wheeler transform –Huffman Encoding –Compressed with dictionary These methods combined give a practical estimate of Kolmogorov complexity
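A quick way to see the compressed size acting as a complexity estimate is to compare regular and irregular input of the same length. This is a minimal sketch using Python's standard `bz2` module; the function name `complexity_estimate` and the sample inputs are assumptions made for the example.

```python
import bz2
import random

def complexity_estimate(data: bytes) -> float:
    """Compressed size divided by raw size (lower means more regular)."""
    return len(bz2.compress(data)) / len(data)

rng = random.Random(0)
regular = b"abcabcabc" * 1000
irregular = bytes(rng.getrandbits(8) for _ in range(len(regular)))

r_regular = complexity_estimate(regular)      # highly repetitive input
r_irregular = complexity_estimate(irregular)  # pseudo-random input
```

The repetitive input shrinks to a small fraction of its size, while the pseudo-random input barely compresses at all.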
10
Our algorithm Split the input string into equal sections –How many sections? Compress each section; sections containing anomalies should appear as outliers when their normalized compressed sizes are compared For each section containing an anomaly, split it and compare the pieces against the section least likely to contain an anomaly
11
Pseudo code
Initial_cuts(data) {
    do {
        split(data, number of splits);
        compress splits;
        number of splits++;
    } while (no normalized split > threshold);
    base_check = minimal normalized coefficient;
    for each normalized split > threshold {
        drill_down(normalized split);
    }
}
Normalized split x =
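The pseudo code above could be turned into runnable Python roughly as follows. The slide's normalization formula is missing, so this sketch assumes each compressed size is divided by the mean compressed size of all splits. The starting split count of 2, the `threshold` and `max_splits` defaults, and the helper names are likewise assumptions, not values from the presentation.

```python
import bz2
import random

def split_equal(data: bytes, k: int):
    """Cut data into k contiguous pieces of (nearly) equal length."""
    step = len(data) // k
    bounds = [i * step for i in range(k)] + [len(data)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(k)]

def normalized_sizes(parts):
    """Compressed size of each piece divided by the mean compressed size.
    Assumption: the slide's own normalization formula is not available."""
    sizes = [len(bz2.compress(p)) for p in parts]
    mean = sum(sizes) / len(sizes)
    return [s / mean for s in sizes]

def initial_cuts(data: bytes, threshold: float = 1.3, max_splits: int = 16):
    """Increase the number of splits until some piece exceeds threshold."""
    for k in range(2, max_splits + 1):
        parts = split_equal(data, k)
        scores = normalized_sizes(parts)
        suspects = [i for i, s in enumerate(scores) if s > threshold]
        if suspects:
            # base_check: the piece with the smallest normalized size
            baseline = min(range(k), key=lambda i: scores[i])
            return parts, scores, baseline, suspects
    return None  # nothing stood out within max_splits

# Example: a periodic signal with a burst of noise in its second half.
rng = random.Random(0)
signal = bytearray(b"0123456789abcdef" * 1024)
signal[12288:14336] = bytes(rng.getrandbits(8) for _ in range(2048))
result = initial_cuts(bytes(signal))
```

Here `result` holds the pieces, their normalized scores, the index of the baseline piece, and the indices of the pieces that would be handed to the drill-down step.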
12
How it works (Example) Initial split 1.0
13
How it works Second split 0.752880921895 0.747119078105 1.5
14
How it works Final split 0.678651685393 0.718202247191 0.603146067416 2.0
15
Preliminary results 0.165094 1.367925 0.04717 2.287736
16
Preliminary results 0.141892 0.628378 1.682432 0.02027 1.317568 2.209459
17
Preliminary results 0.169903 1.019417 0.776699 2.5
18
Preliminary results 0.909091 0.181818 2
19
Results (Partial Epilepsy 1)
21
0.678652 0.718202 0.603146 2
22
Results (Partial Epilepsy 2)
24
0.630324 2 0.739353
25
Multi-anomaly detection
26
Future research Extending the tests to binary data (e.g. pictures, video) Finding anomalies in pairs of data –It is hot out –Chris is wearing a coat –It is hot out, and Chris is wearing a coat Anomaly detection refinement?
27
Drill down
Drill_down(data) {
    a = data(0…n/2);
    b = data(n/2+1…n);
    if (size of data < size_threshold) {
        add data’s coordinates to linked list and return;
    } else if (a is similar to b) {
        Drill_down(a);
        Drill_down(b);
    } else if (a is closer to mean) {
        Drill_down(b);
    } else {
        Drill_down(a);
    }
}
Drills down into splits containing anomalies to get a closer approximation
Mean = slices, of size data/2, of the split most likely not to contain an anomaly
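One possible Python reading of the drill-down step is sketched below. The slides do not say how "similar" and "closer to mean" are measured, so this version assumes a compression-based distance to a reference slice of the anomaly-free baseline, treats the two halves as similar when their distances differ by less than `tol`, and returns a Python list instead of a linked list. The names `csize`, `distance`, `min_size`, and `tol` are all hypothetical.

```python
import bz2
import random

def csize(data: bytes) -> int:
    return len(bz2.compress(data))

def distance(x: bytes, y: bytes) -> float:
    """Compression-based stand-in for (K(x|y) + K(y|x)) / K(xy)."""
    cxy, cyx = csize(x + y), csize(y + x)
    return ((cyx - csize(y)) + (cxy - csize(x))) / cxy

def drill_down(data, baseline, offset=0, min_size=2048, tol=0.1, found=None):
    """Recursively halve data, following the half that looks less like
    the baseline; record (start, end) once a piece is small enough."""
    if found is None:
        found = []
    n = len(data)
    if n <= min_size:
        found.append((offset, offset + n))
        return found
    half = n // 2
    a, b = data[:half], data[half:]
    ref = baseline[:half]  # "mean": anomaly-free slice of matching size
    da, db = distance(a, ref), distance(b, ref)
    if abs(da - db) < tol:   # a is similar to b: keep both
        drill_down(a, baseline, offset, min_size, tol, found)
        drill_down(b, baseline, offset + half, min_size, tol, found)
    elif da < db:            # a is closer to the mean: anomaly is in b
        drill_down(b, baseline, offset + half, min_size, tol, found)
    else:                    # b is closer to the mean: anomaly is in a
        drill_down(a, baseline, offset, min_size, tol, found)
    return found

# Example: periodic data with 2048 noisy bytes starting at offset 12288.
rng = random.Random(0)
clean = b"0123456789abcdef" * 1024
signal = bytearray(clean)
signal[12288:14336] = bytes(rng.getrandbits(8) for _ in range(2048))
regions = drill_down(bytes(signal), baseline=clean)
```

Because bz2 carries a fixed per-stream overhead, `min_size` should stay well above a few hundred bytes; below that the distances become too noisy to compare.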
28
Questions? If you have any questions, please visit http://www.google.com
29
References M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, The Similarity Metric, 2002. M. Burrows and D.J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, Palo Alto, CA, 1994. E. Keogh, S. Lonardi, and B. Chiu, Finding Surprising Patterns in a Time Series Database in Linear Time and Space, University of California Riverside, Riverside, CA, 2002.