An Investigation of Subspace Outlier Detection
Alex Wiegand
Supervisor: Jiuyong Li


Contents
– Research Question
– Introduction to Outliers
– The problem with many dimensions
– Subspace outlier detection
– Techniques considered
– Evaluation
– Achievements
– New framework
– Left to Do
– End

Research Question
What is the best way to find outliers in high-dimensional data?

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the Copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE). Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

The attributes-are-dimensions metaphor
– Central to the concept of outliers
– Each attribute is considered to be a dimension
– A database is a dataset
– Each object, or tuple, is a point
– The schema is a space
– So finding unusual objects is a geometric problem

Outliers
– “an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” (Hawkins, 1980)
– Outliers point to interesting phenomena
– Being able to explain outliers adds strength to a model
– Outliers can signify important events
  – network intrusions
  – credit card fraud
  – disease outbreaks

The Low Dimensional Case
– For 1 to about 4 attributes, outlier detection is a solved problem
– A few techniques exist
– The most popular is LOF (Local Outlier Factor)
– But they are less reliable as the number of dimensions increases
  – Because of the curse of dimensionality
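
For readers who have not met LOF, the following is a minimal sketch of the standard definition, not code from this project. It compares the local density around each point with the density around its k nearest neighbours. It assumes Euclidean distance, takes exactly k neighbours (ties are broken arbitrarily) and assumes there are no duplicate points. The function name `lof_scores` is made up for this illustration.

```python
import math


def lof_scores(data, k):
    """Return a Local Outlier Factor score for every point in data.

    Scores near 1 mean the point is about as dense as its neighbours;
    scores well above 1 suggest an outlier.
    """
    n = len(data)
    neighbours = []  # indices of the k nearest neighbours of each point
    k_dist = []      # distance from each point to its k-th nearest neighbour
    for i in range(n):
        ranked = sorted(
            (math.dist(data[i], data[j]), j) for j in range(n) if j != i
        )
        neighbours.append([j for _, j in ranked[:k]])
        k_dist.append(ranked[k - 1][0])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to point j.
        return max(k_dist[j], math.dist(data[i], data[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Four points on a unit square plus one point far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

With this toy data the four square corners score about 1 and the distant point scores much higher.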

The Curse of Dimensionality
– Consider a dataset with d dimensions
– For any three points, with pairwise distances a, b and c:
  – As d → ∞, a, b and c → ∞, but a/b and a/c → 1
  – i.e. the distances become more similar as the number of dimensions increases
– This happens under most common conditions
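
The effect can be checked with a small simulation. This is only a sketch under assumed conditions: points are drawn uniformly from the unit cube, and the function name `distance_ratio` is invented for the illustration. It reports the ratio of the farthest to the nearest distance from one query point, which should shrink towards 1 as d grows.

```python
import math
import random


def distance_ratio(d, n_points=200, seed=0):
    """Ratio of the largest to the smallest Euclidean distance from one
    query point to the other points, all drawn uniformly from [0, 1]^d."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return max(dists) / min(dists)


# The ratio falls towards 1 as the number of dimensions increases.
for d in (2, 10, 100, 1000):
    print(d, round(distance_ratio(d), 2))
```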

The Curse of Dimensionality (continued)
– So it becomes like this (figure omitted), for some large distance h
– Traditional approaches can't find outliers
  – No points are relatively far away
– But what if there are some unusual points to be found, and some attributes are distracting us?

Subspaces
– A space within another space
– Fewer dimensions
– e.g. a 2D plane crossing 3D space
– Can be created by selecting a subset of the set of attributes
  – This is called feature selection
  – Equivalent to database projection
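
Selecting a subset of attributes is simple to express in code. The sketch below uses a made-up three-attribute dataset; the function name `project` and the attribute meanings are assumptions for illustration only.

```python
def project(rows, attribute_indices):
    """Project each row onto the chosen attributes, i.e. onto a subspace."""
    return [tuple(row[i] for i in attribute_indices) for row in rows]


# Hypothetical dataset with attributes (age, income, height_cm).
rows = [(25, 40000, 180), (31, 52000, 165), (47, 61000, 172)]

# Keep only age and height: a 2D subspace of the 3D space.
subspace = project(rows, [0, 2])
print(subspace)
```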

Subspace Outlier Detection
– Outlier detection in the subspaces
– Actually looking for “subspace outliers”
  – “point x is an outlier in subspace S”
  – i.e. object x has unusual values for some attributes
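
A toy example of a point that is only visible as an outlier in a subspace. This is a sketch, not one of the algorithms studied here: it scores each point by the Euclidean distance to its nearest neighbour, and the data and the names `project`, `nearest_neighbour_distances` and `top_outlier` are invented. Row 4 has an unusual value for attribute 0, but the wide spread of attribute 1 hides it in the full space.

```python
import math


def project(rows, attrs):
    """Keep only the attributes listed in attrs."""
    return [tuple(row[a] for a in attrs) for row in rows]


def nearest_neighbour_distances(points):
    """Distance from each point to its nearest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def top_outlier(rows, attrs):
    """Index of the row with the largest nearest-neighbour distance
    in the subspace spanned by attrs."""
    scores = nearest_neighbour_distances(project(rows, attrs))
    return max(range(len(rows)), key=scores.__getitem__)


rows = [
    (1.0, 10.0, 5.0),
    (2.0, 50.0, 9.0),
    (1.5, 90.0, 1.0),
    (2.5, 30.0, 7.0),
    (9.0, 60.0, 4.0),  # unusual value for attribute 0 only
]

print(top_outlier(rows, (0, 1, 2)))  # full space: reports row 2
print(top_outlier(rows, (0,)))       # subspace {attribute 0}: reports row 4
```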

Existing Techniques
– Four subspace outlier detection algorithms were looked at
  – Aggarwal Evolutionary Search
  – Subspace Outlier Degree (SOD)
  – Lazarevic Feature Bagging (LazFB)
  – Most Interesting Subspace Top N Outlier Detection (MOIS)
– Not much evaluation has been done
  – Three of them have results for some test data in the papers that define them
  – Different test data for each one
  – No comparisons between them

Distance Metrics
– Normal distance is Euclidean distance
– A couple of other distance metrics have been found to increase the contrast between distances when there are many dimensions
  – Nearest neighbour ranking: dist(x, y) = k, where y is the k-th nearest point to x
  – Fractional L_p norm: $\mathrm{dist}(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$, where p < 1
– These have not been tried in outlier detection
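
Both metrics are short to write down. The sketch below assumes the usual fractional L_p form, (sum of |x_i − y_i|^p)^(1/p) with p < 1, and an arbitrary default of p = 0.5. The nearest-neighbour rank is computed with Euclidean distance and counts only strictly closer points. The function names are invented for illustration.

```python
import math


def fractional_lp(x, y, p=0.5):
    """Fractional L_p distance between x and y; p < 1 is assumed."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def nn_rank(x, y, data):
    """Nearest-neighbour rank: the k such that y is the k-th nearest
    point to x among the points in data (x itself is not counted)."""
    d_xy = math.dist(x, y)
    closer = sum(
        1 for z in data if z != x and z != y and math.dist(x, z) < d_xy
    )
    return closer + 1


data = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.5, 0.0)]
print(fractional_lp((0.0, 0.0), (1.0, 1.0)))    # (1 + 1) ** 2
print(nn_rank((0.0, 0.0), (3.0, 0.0), data))    # rank of (3, 0) seen from (0, 0)
print(nn_rank((3.0, 0.0), (0.0, 0.0), data))    # rank of (0, 0) seen from (3, 0)
```

Note that the rank is not symmetric: the last two calls give different values, so it is not a metric in the strict mathematical sense.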

Research Plan
– Compare
  – MOIS
  – Lazarevic Feature Bagging
  – Two benchmark algorithms (these are non-subspace outlier detection algorithms)
    – LOF
    – Distance-based outlier
– Try new distance metrics
  – Nearest neighbour rank
  – Fractional L_p norm
  – Use Lazarevic Feature Bagging and LOF
  – Replace the Euclidean distance function with the chosen metric
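
The idea of swapping the distance function can be sketched as follows. This is an illustration, not the planned implementation, and it rests on several assumptions: the base detector is a simple k-th-nearest-neighbour distance standing in for LOF, each round samples between d/2 and d − 1 attributes, and the per-round scores are combined by summing. All function and parameter names are hypothetical.

```python
import math
import random


def knn_score(data, index, k, dist):
    """Distance from data[index] to its k-th nearest neighbour under dist."""
    dists = sorted(dist(data[index], q) for j, q in enumerate(data) if j != index)
    return dists[k - 1]


def feature_bagging_scores(data, rounds=10, k=2, dist=math.dist, seed=0):
    """Sum outlier scores over several randomly chosen attribute subsets.

    dist is the pluggable distance function; the default is Euclidean.
    """
    rng = random.Random(seed)
    n_attrs = len(data[0])
    totals = [0.0] * len(data)
    for _ in range(rounds):
        # Assumed sampling rule: subset size between d/2 and d - 1.
        size = rng.randint(max(1, n_attrs // 2), max(1, n_attrs - 1))
        attrs = rng.sample(range(n_attrs), size)
        projected = [tuple(row[a] for a in attrs) for row in data]
        for i in range(len(data)):
            totals[i] += knn_score(projected, i, k, dist)
    return totals


def manhattan(x, y):
    """An alternative metric, used here only to show the swap."""
    return sum(abs(a - b) for a, b in zip(x, y))


data = [(0, 0, 0, 0), (1, 1, 0, 1), (0, 1, 1, 0), (1, 0, 1, 1), (10, 10, 10, 10)]
euclidean_scores = feature_bagging_scores(data)
manhattan_scores = feature_bagging_scores(data, dist=manhattan)
```

Passing the metric as a parameter means the detector code stays unchanged when a new distance function is tried.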

Evaluation
– ROC curve
  – Find a parameter that controls sensitivity
    – Number of reported outliers (positives) per number of points
  – Run the algorithm for many values of that parameter
  – Draw a scatter plot of the true positive vs false positive rates
  – Connect the dots
– The area under the curve (AUC) is the quality of the algorithm for the test data set
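
The procedure above can be sketched in a few lines. The code assumes each point has an outlier score and a known true label, that both outliers and normal points are present, and that scores are not tied. The sensitivity parameter is the number of top-scoring points reported as outliers. The names `roc_points` and `auc` are illustrative.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one for
    each possible number of reported outliers from 0 to len(scores).

    labels[i] is True when point i is a known outlier.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_outlier in ranked:
        if is_outlier:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the connected ROC points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores: two true outliers among five points.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [True, False, True, False, False]
print(auc(roc_points(scores, labels)))  # 5/6, about 0.83
```

An AUC of 1 means every true outlier is ranked above every normal point; 0.5 is what a random ranking would give on average.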

Achievements
– Implementation
– New framework

Implementation
– Only one existing algorithm, MOIS, had an available implementation
– The others need new implementations
– Implementing all the algorithms involves much repetition
  – Loading datasets
  – Accessing data points
  – Calculating distances
– A system for code reuse is desirable
– Variations of the algorithms must be easy to create
  – For any improved algorithms I design
– Running many tests should be easy

Framework
– I decided to use a software framework
  – Standardised API for algorithms
  – Inversion of control
    – User commands framework
    – Framework commands algorithms
– Existing frameworks considered
  – Weka
  – RapidMiner
  – ELKI
– Weka and ELKI (0.1) don't natively support outlier detection algorithms
– RapidMiner carries a high implementation overhead
  – Due to its architecture

Framework (continued)
– Looking for the quickest way to implement the algorithms
– The drawbacks mentioned made those frameworks unsuitable for my task
  – Unless they were extended
– My decision: create a new framework, just for subspace outlier detection

The New Framework
Some design decisions:
– Use of the Weka API
  – To use functionality already available in Weka
– Interactive command-line interface
  – Scriptability, e.g. easy to run tests on an arbitrary set of datasets
– Inheritance-friendly design
  – For quick creation of modified algorithms, metrics and data structures
– The Metric class
  – Some functions are implemented as subclasses of Metric
  – Makes those functions easier to replace with new ones
  – Used for distance metrics
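
The Metric idea might look roughly like the sketch below. This is not the framework's actual code or API: the framework builds on the Weka API, whereas this illustration is in Python, and every class and method name other than `Metric` is an assumption. It shows how a detector written against a base class can have its distance function replaced by passing in a different subclass.

```python
import math
from abc import ABC, abstractmethod


class Metric(ABC):
    """Base class for replaceable distance functions (hypothetical API)."""

    @abstractmethod
    def distance(self, x, y):
        """Return the distance between points x and y."""


class EuclideanMetric(Metric):
    def distance(self, x, y):
        return math.dist(x, y)


class FractionalLpMetric(Metric):
    def __init__(self, p=0.5):
        self.p = p  # p < 1 is assumed

    def distance(self, x, y):
        return sum(abs(a - b) ** self.p for a, b in zip(x, y)) ** (1.0 / self.p)


class KnnDistanceDetector:
    """Toy detector: scores a point by its k-th nearest-neighbour distance."""

    def __init__(self, metric, k=1):
        self.metric = metric
        self.k = k

    def score(self, data, index):
        dists = sorted(
            self.metric.distance(data[index], q)
            for j, q in enumerate(data)
            if j != index
        )
        return dists[self.k - 1]


data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(KnnDistanceDetector(EuclideanMetric()).score(data, 0))
print(KnnDistanceDetector(FractionalLpMetric(0.5)).score(data, 0))
```

The detector never names a concrete distance function, so trying a new metric only needs a new subclass.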

Left to Do
– Complete LOF implementation
– Complete Lazarevic Feature Bagging implementation
– Implement new distance metrics
– Run tests
– Analyse results

Thank you for listening
Questions?