University of Crete, Department of Computer Science, CS-562


University of Crete, Department of Computer Science, CS-562. LSH Based Outlier Detection and Its Application in Distributed Setting. Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C. V. Jawahar, IIIT Hyderabad, India. Dimitris Linaritis, csdp1104

Table of Contents
- Introduction
- Algorithmic families for Outlier Detection
- Approximate Detection of Outliers
- Distance-based Outliers
- Modified Distance-based Outliers
- Problem Statement
- Locality Sensitive Hashing (LSH)
- Centralized LSH-based outlier detection
- Quality of Approximation
- Distributed LSH-based outlier detection
- Analysis
- Conclusion
- Assessment

Algorithmic families for Outlier Detection (1/2): Techniques for detecting outliers in datasets
- Statistical methods
- Geometric methods
- Density-based approaches
- Distance-based approaches

Algorithmic families for Outlier Detection (2/2): Why the distance-based approach?
- Statistical methods are appropriate only if one has a good sense of the background distribution, but they typically do not scale well to large datasets.
- Geometric methods rely on variants of the convex-hull algorithm, whose complexity is exponential, so they are often impractical.
- Density-based approaches rely on computing the local neighborhood density of a point, which can be expensive, particularly in high dimensions.
- Distance-based approaches rely on a well-defined notion of outlier based on the nearest-neighbor principle; smart data structures and pruning techniques ensure that they scale quite well to large datasets of moderate dimensionality.

Approximate Detection of Outliers: the need for approximate algorithms
- Exact algorithms have quadratic computational complexity, O(n²). They are more effective and accurate than approximate algorithms, but not appropriate for "big data".
- Approximate algorithms have "near-linear" complexity, O(n^(1+ε)) with ε > 0. They are not as effective as exact algorithms but more efficient, and appropriate for "big data" when combined with smart data structures and pruning techniques.

Distance-based Outliers: Knorr, VLDB 1998. An object o is an outlier if a very large fraction pt of the total objects in the dataset D lie outside the radius dt from o. (Figure: when pt is very large, o is labeled an Outlier; the objects within distance dt of o are its neighbors, the rest its non-neighbors.)
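Knorr's definition can be checked directly with a naive quadratic scan; this is a minimal sketch for intuition (not the paper's algorithm — it is exactly the exact check that the later LSH-based slides set out to avoid), with illustrative data and thresholds:

```python
import math

def is_knorr_outlier(o, data, dt, pt):
    """Knorr's DB(pt, dt) definition: o is an outlier if at least a
    fraction pt of the objects in data lie farther than dt from o."""
    far = sum(1 for x in data if math.dist(o, x) > dt)
    return far >= pt * len(data)

# A small cluster plus one distant point.
data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(is_knorr_outlier((10, 10), data, dt=3.0, pt=0.7))  # → True
print(is_knorr_outlier((0, 0), data, dt=3.0, pt=0.7))    # → False
```

Testing one object against the whole dataset is O(n), so testing all objects is O(n²) — the quadratic cost discussed above.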

Distance-based Outliers: Ramaswamy, SIGMOD 2000. Given k and n, a point p is an outlier if no more than n − 1 other points in the dataset have a higher value for Dk than p, where Dk denotes the distance from a point to its k-th nearest neighbor. (Figure example with k = 3, n = 7: points with D3 values 4, 5, 6, 7, 8, 9 and 15; the point with the largest value, D3 = 15, is labeled an Outlier.)
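Ramaswamy's definition amounts to ranking points by their k-th nearest-neighbor distance and taking the top ones. A brute-force sketch (illustrative only, and O(n²) like the naive Knorr check; the function names are mine):

```python
import math

def dk(p, data, k):
    """D^k(p): distance from p to its k-th nearest neighbor in data."""
    dists = sorted(math.dist(p, q) for q in data if q is not p)
    return dists[k - 1]

def top_n_outliers(data, k, n):
    """Ramaswamy et al.: the n points with the largest D^k values."""
    return sorted(data, key=lambda p: dk(p, data, k), reverse=True)[:n]

data = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (20, 20)]
print(top_n_outliers(data, k=3, n=1))  # → [(20, 20)]
```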

Modified Distance-based Outliers: the converse of Knorr's definition. An object is a non-outlier if it has enough neighbors within the radius dt. (Figure: the object's neighbors within dt make it a Non-Outlier; the required neighbor fraction p't is very low, since pt is very large.)

Problem Statement: To find all the non-outliers in the data, i.e., those objects that have many neighbors, we must calculate the distances from each object to every other object in the dataset, so the number of queries equals the number of objects in the database. Databases are very large, so we need approximate algorithms.

Locality Sensitive Hashing (LSH) (1/2): Very efficient for finding near neighbors. Similar objects are hashed to the same bin with high probability, so most of the non-outliers in the dataset can be easily pruned; fewer than 1% of the objects in the dataset need to be processed. (Figure: objects hashed into the LSH bin structure.)

Locality Sensitive Hashing (LSH) (2/2): Definitions. A family of hash functions is (r1, r2, p1, p2)-sensitive if, for any two points p and q, d(p, q) ≤ r1 implies Pr[h(p) = h(q)] ≥ p1, and d(p, q) ≥ r2 implies Pr[h(p) = h(q)] ≤ p2. The conditions to be satisfied are r1 < r2 and p1 > p2. To amplify the gap between the probabilities, k hash functions are concatenated into g(p) = (h1(p), h2(p), ..., hk(p)), and L such functions g are used.
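The concatenation g(p) = (h1(p), ..., hk(p)) can be sketched with the standard p-stable (random-projection) hash family for Euclidean distance due to Datar et al.; this is an illustrative sketch, not the paper's code, and the parameter values (w, k, L, the seed) are arbitrary choices for the demo:

```python
import math
import random

def make_h(dim, w, rng):
    """One p-stable hash h(p) = floor((a·p + b) / w) for Euclidean LSH;
    nearby points land in the same slot more often than distant ones."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return lambda p: math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / w)

def make_g(dim, k, w, rng):
    """g(p) = (h1(p), ..., hk(p)): concatenating k hashes amplifies
    the gap between the collision probabilities p1 and p2."""
    hs = [make_h(dim, w, rng) for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

rng = random.Random(42)
L, k, w = 10, 4, 4.0
gs = [make_g(2, k, w, rng) for _ in range(L)]  # L independent functions g

p, near, far = (0.0, 0.0), (0.2, 0.1), (50.0, 50.0)
near_hits = sum(g(p) == g(near) for g in gs)  # collisions with a close point
far_hits = sum(g(p) == g(far) for g in gs)    # collisions with a far point
print(near_hits, far_hits)
```

With these settings the close pair collides in most of the L tables while the far pair collides in almost none, which is exactly the gap the bin structure exploits.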

Centralized LSH-based outlier detection: Three phases. Phase 1: compute the parameters and generate the LSH bin structure. Phase 2: find near neighbors and prune non-outliers, using a bin threshold to remove false negatives. Phase 3: reduce the false positives.

Centralized LSH-based outlier detection: Compute the parameters: distance R = r1 < dt and fraction p't. Apply the LSH scheme to the dataset, hashing each object o into the bins of the L tables. The neighbors of o are the objects that fall in the same bins. If the number of neighbors of o exceeds p't (as a fraction of the dataset), o is a non-outlier. Pruning: the neighbors of a non-outlier are also non-outliers.
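The steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it omits the bin threshold, and a plain shifted-grid hash (`grid_g`) stands in for a real LSH family; all names and parameter values are mine:

```python
import math
from collections import defaultdict

def grid_g(w, offset):
    """Stand-in for an LSH function g: bucket points by a shifted grid
    of cell width w, so that close points usually share a cell."""
    return lambda p: tuple(math.floor((x + offset) / w) for x in p)

def probable_outliers(data, g_funcs, R, min_neighbors):
    """Hash every object into the L bin structures, then prune: an object
    with at least min_neighbors R-neighbors among its bin-mates is a
    non-outlier, and so are those neighbors. What remains is the set of
    probable outliers (it may still contain false positives)."""
    tables = []
    for g in g_funcs:
        bins = defaultdict(list)
        for i, p in enumerate(data):
            bins[g(p)].append(i)
        tables.append((g, bins))
    non_outlier = [False] * len(data)
    for i, p in enumerate(data):
        if non_outlier[i]:
            continue  # already pruned as a neighbor of a non-outlier
        cands = {j for g, bins in tables for j in bins[g(p)] if j != i}
        neigh = [j for j in cands if math.dist(p, data[j]) <= R]
        if len(neigh) >= min_neighbors:
            non_outlier[i] = True
            for j in neigh:  # neighbors of a non-outlier are non-outliers
                non_outlier[j] = True
    return [data[i] for i in range(len(data)) if not non_outlier[i]]

data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (30.0, 30.0)]
gs = [grid_g(2.0, o) for o in (0.0, 0.7, 1.3)]
print(probable_outliers(data, gs, R=1.0, min_neighbors=2))  # → [(30.0, 30.0)]
```

Note how pruning one clustered point immediately removes its neighbors from further consideration, which is why so few objects need a full neighbor query.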

Quality of Approximation (1/4): Objects farther than dt can still be reported as near neighbors, with probability at most p2^K, so false near neighbors may cause an actual outlier to be pruned. The probable set may therefore contain false negatives, where a false negative means labeling an outlier as a non-outlier.

Quality of Approximation (2/4): To reduce false negatives, the algorithm introduces a bin threshold bt: an object is counted as a neighbor only if it appears in at least bt of the L bins. (Figure: with bt = 3, the purple object is not counted as a neighbor, since it shares only 1 bin with object o.)

Quality of Approximation (3/4): A high bin-threshold value decreases the efficiency of the pruning technique, because it removes actual neighbors; removing neighbors can cause a non-outlier to be labeled an outlier, i.e., a false positive. The optimal value of bt is the one that removes the false negatives at the cost of introducing minimal false positives.

Quality of Approximation (4/4): To reduce false positives, for each object in the set of probable outliers, calculate its distance to every object in the dataset, counting the objects at distance greater than dt; if the count is at least pt (of the dataset size), report the object as an outlier.

Distributed LSH-based outlier detection: Horizontal distribution. Each player holds a subset of the total objects, with the same attributes.

Name             | Segment   | Country       | City            | State
Claire Gute      | Consumer  | United States | Henderson       | Kentucky
Darrin Van Huff  | Corporate |               | Los Angeles     | California
Sean O'Donnell   |           |               | Fort Lauderdale | Florida

(The rows are split between Player A and Player B.)

Distributed LSH-based outlier detection: Local outlier. An object held by player Pi is a local outlier if the number of objects in the local dataset Di lying at a distance greater than dt from it is at least a fraction pt of the total dataset. (Figure: a local outlier within player A's local dataset.)

Distributed LSH-based outlier detection: Global outlier. An object o is a global outlier if a very large fraction pt of the total objects in the dataset D lie outside the radius dt from o. (Figure: a global outlier, with its neighbors and non-neighbors across the whole dataset.)

Distributed LSH-based outlier detection: Three phases. (1) Generate the LSH bin structure and compute the local probable outliers by running the centralized algorithm on the local dataset. (2) Find the global approximate outliers using the generated bin structure. (3) Reduce the false positives in the returned global set of outliers.

Distributed LSH-based outlier detection: The protocol between Player A and Player B. Each player shares the size of its local dataset, so that the size of the entire dataset can be computed. Player A computes the parameters and the LSH bin structure, and runs the centralized algorithm to obtain its probable local outliers M'. Player A sends the hash labels of M' to Player B, who finds the neighbors of each object of M' and returns the set of neighbor counts. Player A adds its own counts to obtain the global probable outliers M'' and sends M'' to Player B. For each object in M'', Player B computes the distance to every object in its database, counts the objects at distance greater than dt, and returns the counts. Player A performs the same computation on its own dataset, sums the two counts, and marks an object as an outlier if the total count is at least pt (of the entire dataset).
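The final verification round amounts to merging per-player counts; a minimal sketch, assuming two players and ignoring the privacy/communication machinery of the actual protocol (`local_far_count` and `is_global_outlier` are hypothetical helper names):

```python
import math

def local_far_count(o, local_data, dt):
    """Each player counts, on its own local dataset, the objects lying
    at distance greater than dt from the probable outlier o."""
    return sum(1 for x in local_data if math.dist(o, x) > dt)

def is_global_outlier(o, players, dt, pt):
    """Sum the counts received from all players: o is a global outlier
    if the total is at least a fraction pt of the entire dataset."""
    total = sum(local_far_count(o, d, dt) for d in players)
    n = sum(len(d) for d in players)
    return total >= pt * n

A = [(0, 0), (1, 0), (0, 1)]   # Player A's local data
B = [(1, 1), (9, 9)]           # Player B's local data
print(is_global_outlier((9, 9), [A, B], dt=3.0, pt=0.7))  # → True
print(is_global_outlier((0, 0), [A, B], dt=3.0, pt=0.7))  # → False
```

Only the probable outliers M'' are checked this way, which is why the communicated fraction of objects stays so small.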

Analysis: Complexities (1/2). Centralized algorithm: the overall computational complexity is O(ndL), where n = |D|, d is the dimensionality of the dataset, and L = n^(1/(1+ε)) with ε > 0 an approximation factor.

Analysis: Complexities (2/2). Distributed algorithm: the computational complexity for a player A is O(nA·d·L), where nA = |DA|, d is the dimensionality of the dataset, and L = n^(1/(1+ε)) with ε > 0 an approximation factor. The overall communication complexity for player A is O(m'A·L + m''A·d), where m'A = |M'A| (probable local outliers) and m''A = |M''A| (probable global outliers).

Analysis: Experiments (1/3). Centralized algorithm: the optimal bin threshold was computed for the datasets used; the algorithm achieved a very low false-positive rate, a zero false-negative rate, and a 100% detection rate at the optimal bin threshold.

Analysis: Experiments (2/3). Centralized algorithm: percentage of objects for which the FindNeighbors algorithm is invoked. For small datasets the percentage of queried objects is close to 1%; as the dataset gets larger, this value decreases rapidly.

Analysis: Experiments (3/3). Distributed algorithm: in the experiments, objects were uniformly distributed among the players. The experiment was repeated varying the number of players from 2 to 5, studying the effect on the communication cost; the percentage of communicated objects is indeed very small (<1%).

Conclusion: Approximate distance-based outlier detection based on the converse of Knorr's definition; an efficient pruning technique using LSH, with the percentage of queried objects close to 1%; a horizontally distributed setting that is highly efficient in terms of communication cost.

Assessment: The paper tackles the problem of distance-based outlier detection using smart structures and a pruning technique. The approximate algorithm is a very efficient way to find outliers in large datasets using the LSH bin structure, and it builds on the definition introduced by Knorr. The experiments, carried out over 5 datasets, look realistic given the reported hardware. No special background knowledge is needed to follow the paper. The algorithms are described in detail, but there is a lack of examples to help the reader understand them.

THANK YOU! Dimitris Linaritis, 1104, dimilin@csd.uoc.gr