Anomaly Detection Using Symmetric Compression Benjamin Arai & Chris Baron Computer Science and Engineering Department University of California - Riverside.

Anomaly Detection Using Symmetric Compression Benjamin Arai & Chris Baron Computer Science and Engineering Department University of California - Riverside barai@cs.ucr.edu cbaron@cs.ucr.edu

What are anomalies? Something peculiar, irregular, abnormal, or difficult to classify with the surrounding data. Anomalies are subject to interpretation. Two anomalies can look completely different from one another.

Why is this important? Security –Abnormal activity –Intrusion detection. Health –Atypical rhythmic patterns (e.g., heartbeat, breathing). Equities and financial data. Detection in general.

Motivation Searching for a specific pattern is relatively easy for a computer (often in linear time) and has been well researched (e.g., KMP, Boyer-Moore, edit distance). How does a computer detect surprising patterns without being told in advance what they look like? Utilize Kolmogorov complexity with compression!

Kolmogorov complexity and information distance K(x) – Length of the smallest program that prints out x. K(x|y) – Length of the smallest program that prints out x given y as an input. Information distance – How different are x and y? –Edit distance? –Normalize

Normalized information distance (K(x|y) + K(y|x)) / K(xy) –Close to 0: x and y are very similar –Close to 1: x and y are very different. Compression does a good job of estimating Kolmogorov complexity. We use compression to find anomalies.
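The distance above can be approximated with an off-the-shelf compressor. The following is a minimal sketch, not the authors' implementation: it assumes K(x|y) can be approximated by C(yx) - C(y), where C is the bzip2-compressed size, and the helper names `csize` and `compression_distance` are hypothetical.

```python
import bz2
import random


def csize(data: bytes) -> int:
    """Compressed size in bytes, used as a stand-in for K(data)."""
    return len(bz2.compress(data))


def compression_distance(x: bytes, y: bytes) -> float:
    """Approximate (K(x|y) + K(y|x)) / K(xy) with compressed sizes.

    Assumption: K(x|y) ~ C(yx) - C(y) and K(y|x) ~ C(xy) - C(x).
    """
    cxy = csize(x + y)
    cyx = csize(y + x)
    return ((cyx - csize(y)) + (cxy - csize(x))) / cxy


# Repetitive text compared with itself, and with random bytes of equal length.
rng = random.Random(0)
text = b"the quick brown fox jumps over the lazy dog " * 50
noise = bytes(rng.randrange(256) for _ in range(len(text)))

d_same = compression_distance(text, text)
d_diff = compression_distance(text, noise)
print(d_same, d_diff)
```

Identical inputs should score well below unrelated ones; the exact values depend on the compressor and its fixed overhead, so they are only meaningful relative to each other.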

How compression works Create a dictionary that maps long sequences to short ones. The more often these long sequences are used, the better the compression (works well with text), e.g.: –the = 01 –and = 10 –algorithm = 11
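As a toy illustration of the dictionary idea, the sketch below substitutes the three codes from the slide for whole words. It is deliberately simplified: real compressors build the dictionary from the data and emit bits, not text tokens, and the names `CODES`, `encode` and `decode` are hypothetical.

```python
# Word-to-code table taken from the slide; codes are kept as text for readability.
CODES = {"the": "01", "and": "10", "algorithm": "11"}
DECODE = {code: word for word, code in CODES.items()}


def encode(text: str) -> str:
    """Replace every dictionary word with its short code."""
    return " ".join(CODES.get(word, word) for word in text.split(" "))


def decode(text: str) -> str:
    """Invert encode(); assumes the input never contains a literal code token."""
    return " ".join(DECODE.get(token, token) for token in text.split(" "))


sample = "the algorithm reads the input and the algorithm writes the output"
packed = encode(sample)
print(packed)
print(len(sample), "->", len(packed))
```

The more often "the", "and" and "algorithm" occur, the larger the saving, which is the point the slide makes about frequently used sequences.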

Compression dictionary example

How compression works Bzip2 –Burrows-Wheeler transform –Huffman encoding –Compressed with dictionary. Combined, these methods give an efficient estimate of Kolmogorov complexity.
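A quick way to see bzip2 acting as a complexity estimate is to compare the compressed size of regular and irregular data of the same length. This sketch uses Python's standard `bz2` module; the function name `estimated_complexity` is an assumption made for illustration.

```python
import bz2
import random


def estimated_complexity(data: bytes) -> int:
    """Length of the bzip2 output, a computable stand-in for K(data)."""
    return len(bz2.compress(data))


regular = b"0123456789" * 1000            # highly regular: a short program could print it
rng = random.Random(1)
irregular = bytes(rng.randrange(256) for _ in range(len(regular)))  # no structure to exploit

print(estimated_complexity(regular), estimated_complexity(irregular))

# The compressed form is a complete description: decompressing restores the input.
restored = bz2.decompress(bz2.compress(regular))
```

Because the original can always be recovered from the compressed bytes, the compressed length behaves like an upper estimate of the true complexity rather than an exact value.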

Our algorithm Split the input string into equal sections –How many sections? Compress each section; sections containing anomalies should appear as outliers (judged by their normalized compressed size). For each section containing an anomaly, split it and compare against the section least likely to contain an anomaly.

Pseudo code
Initial_cuts(data) {
    do {
        split(data, number of splits);
        compress splits;
        number of splits++;
    } while (no normalized split > threshold)
    base_check = minimal normalized coefficient
    for each normalized split > threshold {
        drill_down(normalized split);
    }
}
Normalized split x = [formula not preserved in the transcript]
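The initial-cut loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the normalization formula is missing from the transcript, so the sketch assumes a section's score is its bzip2-compressed size divided by the mean compressed size of all sections. The `threshold` and `max_splits` defaults are likewise hypothetical.

```python
import bz2
import random


def normalized_sizes(data: bytes, k: int):
    """Split data into k near-equal sections; return (start, end, score) for each.

    Assumption: score = compressed size / mean compressed size of the k sections.
    """
    n = len(data)
    bounds = [(i * n // k, (i + 1) * n // k) for i in range(k)]
    sizes = [len(bz2.compress(data[s:e])) for s, e in bounds]
    mean = sum(sizes) / k
    return [(s, e, size / mean) for (s, e), size in zip(bounds, sizes)]


def initial_cuts(data: bytes, threshold: float = 1.5, max_splits: int = 16):
    """Increase the number of splits until some section's score exceeds threshold."""
    for k in range(2, max_splits + 1):
        flagged = [sec for sec in normalized_sizes(data, k) if sec[2] > threshold]
        if flagged:
            return flagged
    return []  # nothing stood out within max_splits


# Repetitive background with 200 random bytes planted at offset 3000.
rng = random.Random(0)
series = bytearray(b"abcdefgh" * 500)
series[3000:3200] = bytes(rng.randrange(256) for _ in range(200))

flagged = initial_cuts(bytes(series))
print(flagged)
```

Each flagged `(start, end, score)` section would then be handed to the drill-down step to narrow the anomaly's location.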

How it works (Example) Initial split 1.0

How it works Second split 0.752880921895 0.747119078105 1.5

How it works Final split 0.678651685393 0.718202247191 0.603146067416 2.0

Preliminary results 0.165094 1.367925 0.04717 2.287736

Preliminary results 0.141892 0.628378 1.682432 0.02027 1.317568 2.209459

Preliminary results 0.169903 1.019417 0.776699 2.5

Preliminary results 0.909091 0.181818 2

Results (Partial Epilepsy 1)

0.678652 0.718202 0.603146 2

Results (Partial Epilepsy 2)

0.630324 2 0.739353

Multi-anomaly detection

Future research Extend tests to binary data (e.g., pictures, video). Finding anomalies in pairs of data –It is hot out –Chris is wearing a coat –It is hot out, and Chris is wearing a coat. Anomaly detection refinement?

Drill down
Drill_down(data) {
    a = data(0…n/2);
    b = data(n/2+1…n);
    if (size of data < size_threshold) {
        add data's coordinates to linked list and return;
    } else if (a is similar to b) {
        Drill_down(a);
        Drill_down(b);
    } else if (a is closer to mean) {
        Drill_down(b);
    } else {
        Drill_down(a);
    }
}
Drills down into splits containing anomalies to get a closer approximation.
Mean = slices, of size data/2, of the split most likely not to contain an anomaly.
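A runnable sketch of the drill-down recursion is given below. Several details are assumptions, since the slide leaves them open: "closer to mean" is taken to mean a smaller compression-based distance to an equally long slice of an anomaly-free `baseline`, "similar" means the two distances differ by less than a hypothetical `tolerance`, and a Python list replaces the linked list.

```python
import bz2
import random


def csize(data: bytes) -> int:
    return len(bz2.compress(data))


def distance(x: bytes, y: bytes) -> float:
    """Compression-based distance (assumed approximation of the normalized distance)."""
    cxy, cyx = csize(x + y), csize(y + x)
    return ((cxy - csize(x)) + (cyx - csize(y))) / cxy


def drill_down(data, offset, baseline, size_threshold=256, tolerance=0.1, found=None):
    """Recursively halve data, following the half that looks less like baseline."""
    if found is None:
        found = []
    n = len(data)
    if n < size_threshold:
        found.append((offset, offset + n))  # record coordinates and stop
        return found
    half = n // 2
    a, b = data[:half], data[half:]
    da = distance(a, baseline[:len(a)])
    db = distance(b, baseline[:len(b)])
    if abs(da - db) < tolerance:            # halves look alike: keep both
        drill_down(a, offset, baseline, size_threshold, tolerance, found)
        drill_down(b, offset + half, baseline, size_threshold, tolerance, found)
    elif da < db:                           # a is closer to the baseline: follow b
        drill_down(b, offset + half, baseline, size_threshold, tolerance, found)
    else:                                   # b is closer to the baseline: follow a
        drill_down(a, offset, baseline, size_threshold, tolerance, found)
    return found


# Anomaly-free baseline, and a copy with 64 random bytes planted at offset 1600.
baseline = b"abcdefgh" * 256
rng = random.Random(0)
damaged = bytearray(baseline)
damaged[1600:1664] = bytes(rng.randrange(256) for _ in range(64))

found = drill_down(bytes(damaged), 0, baseline)
print(found)
```

The returned ranges are only as fine as `size_threshold` allows; very small slices are dominated by compressor overhead, so the threshold should not be pushed too low.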

Questions? If you have any questions, please visit http://www.google.com

References M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, The Similarity Metric, 2002. M. Burrows and D. J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, Palo Alto, CA, 1994. E. Keogh, S. Lonardi, and B. Chiu, Finding Surprising Patterns in a Time Series Database in Linear Time and Space, University of California Riverside, Riverside, CA, 2002.
