University of Crete Department Computer Science CS-562

LSH Based Outlier Detection and Its Application in Distributed Setting
Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C. V. Jawahar
IIIT Hyderabad, India
Dimitris Linaritis, csdp1104

Table of Contents
- Introduction
- Algorithmic families for Outlier Detection
- Approximate Detection of Outliers
- Distance-based Outliers
- Modified Distance-based Outliers
- Problem Statement
- Locality Sensitive Hashing (LSH)
- Centralized LSH-based outlier detection
- Quality of Approximation
- Distributed LSH-based outlier detection
- Analysis
- Conclusion
- Assessment

Algorithmic families for Outlier Detection (1/2)
Techniques for detecting outliers in datasets:
- Statistical methods
- Geometric methods
- Density-based approaches
- Distance-based approaches

Why a distance-based approach?
- Statistical methods are appropriate only if one has a good sense of the background distribution, and they typically do not scale well to large datasets.
- Geometric methods rely on variants of the convex hull algorithm, whose complexity is exponential in the dimensionality, so they are often impractical.
- Density-based approaches rely on computing the local neighborhood density of a point, which can be expensive, particularly in high dimensions.
- Distance-based approaches rely on a well-defined notion of outlier based on the nearest-neighbor principle. Smart data structures and pruning techniques let them scale quite well to large datasets of moderate dimensionality.

Approximate Detection of Outliers
The need for approximate algorithms
Exact algorithms:
- Computational complexity is quadratic, O(n^2)
- More effective and accurate than approximate algorithms
- Not appropriate for "big data"
Approximate algorithms:
- Computational complexity is near-linear, O(n^(1+ε)) with ε > 0
- Less effective than exact algorithms, but more efficient
- Appropriate for "big data", using smart data structures and pruning techniques

Distance-based Outliers (1/12)
Knorr, VLDB 1998: an object o is an outlier if a very large fraction pt of the total objects in the dataset D lie outside the radius dt from o.
Figure: objects inside the radius dt around o are its neighbors and the rest are non-neighbors; when pt is very large, o is an outlier.
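A minimal brute-force sketch of this definition may help fix the roles of pt and dt. The function name, the toy points, and the choice to count the fraction over the other objects (excluding o itself) are assumptions made for illustration, not details taken from the paper.

```python
import math

def is_db_outlier(o, dataset, pt, dt):
    """True if at least a fraction pt of the other objects lie farther than dt from o."""
    # Assumption: the fraction is taken over all objects except o itself.
    others = [x for x in dataset if x is not o]
    far = sum(1 for x in others if math.dist(o, x) > dt)
    return far >= pt * len(others)

# Hypothetical toy data: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [p for p in points if is_db_outlier(p, points, pt=0.9, dt=1.0)]
print(outliers)
```

Checking every object this way takes one distance computation per pair of objects, which is the quadratic cost the approximate approach tries to avoid.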

Ramaswamy, SIGMOD 2000: given k and n, a point p is an outlier if no more than n - 1 other points in the dataset have a higher value of Dk than p, where Dk denotes the distance from a point to its k-th nearest neighbor.
Figure: example with k = 3 and n = 7, in which the points have D3 values 8, 4, 6, 7, 5, 9 and 15 and one point is marked as the outlier.
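The ranking behind this definition can be sketched as follows: compute Dk for every point and keep the n points with the largest values. The helper names and the one-dimensional toy data are hypothetical, and ties are broken by input order, which the definition itself does not specify.

```python
import math

def kth_nn_distance(p, dataset, k):
    """Dk: distance from p to its k-th nearest neighbor (p itself excluded)."""
    dists = sorted(math.dist(p, q) for q in dataset if q is not p)
    return dists[k - 1]

def top_n_outliers(dataset, k, n):
    """The n points with the largest Dk values, largest first."""
    ranked = sorted(dataset, key=lambda p: kth_nn_distance(p, dataset, k), reverse=True)
    return ranked[:n]

# Hypothetical toy data on a line: four close points and one far away.
points_1d = [(0.0,), (1.0,), (2.0,), (3.0,), (20.0,)]
print(top_n_outliers(points_1d, k=2, n=1))
```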

Modified Distance-based Outliers (1/3)
Converse of Knorr's definition: an object is a non-outlier if it has enough neighbors within the radius dt.
Figure: an object o with its neighbors inside the radius dt and the non-neighbors outside; o is a non-outlier, and the fraction p't is very low.

Problem Statement (1/2) The problem
To find all the non-outliers in the data, that is, the objects that have many neighbors, we must calculate the distance from each object to every other object in the dataset.
- The number of queries grows with the number of objects in the database.
- Databases are very large!
- We therefore need approximate algorithms.

Locality Sensitive Hashing (LSH) (1/2)
- Very efficient for finding near neighbors
- Similar objects are hashed to the same bin with high probability
- Most of the non-outliers in the dataset can be easily pruned
- Fewer than 1% of the total objects in the dataset need to be processed
Figure: objects are hashed into the LSH bin structure.

Locality Sensitive Hashing (LSH) (2/2)
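Definitions: a family H of hash functions is (r1, r2, p1, p2)-sensitive if, for any two objects p and q,
- if d(p, q) <= r1, then Pr[h(p) = h(q)] >= p1
- if d(p, q) >= r2, then Pr[h(p) = h(q)] <= p2
Conditions to be satisfied: r1 < r2 and p1 > p2.
To amplify the gap between the probabilities, k hash functions are concatenated: g(p) = (h1(p), h2(p), ..., hk(p)).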
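As a rough illustration of the concatenated hash g(p) and of a bin structure with L tables, here is a sketch that uses random-projection hashes of the form floor((a·p + b) / w). That particular hash family, the bucket width w, and all function names are assumptions chosen for the example; the slides do not say which family is used.

```python
import math
import random

def make_hash(dim, w, rng):
    """One hash h(p) = floor((a . p + b) / w) with Gaussian a and uniform b (assumed family)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(p):
        return math.floor((sum(ai * xi for ai, xi in zip(a, p)) + b) / w)
    return h

def make_g(dim, k, w, rng):
    """g(p) = (h1(p), ..., hk(p)): the label of the bin that p falls into."""
    hs = [make_hash(dim, w, rng) for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

def build_bins(dataset, L, k, w, seed=0):
    """Build L hash tables, each mapping a bin label to the indices of its objects."""
    rng = random.Random(seed)
    gs = [make_g(len(dataset[0]), k, w, rng) for _ in range(L)]
    tables = []
    for g in gs:
        table = {}
        for idx, p in enumerate(dataset):
            table.setdefault(g(p), []).append(idx)
        tables.append(table)
    return gs, tables

# Hypothetical toy data and parameters.
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
gs, tables = build_bins(data, L=3, k=2, w=4.0, seed=1)
```

Raising k makes each bin more selective, while raising L gives nearby objects more chances to share at least one bin.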

Centralized LSH-based outlier detection (1/10)
Overview of the algorithm (flow diagram):
- Phase 1, Pruning: compute parameters, generate bin structure, prune non-outliers
- Phase 2, Bin Threshold: find near neighbors, removing false negatives
- Phase 3: reducing false positives

- Compute parameters: distance R = r1 < dt and fraction p't.
- Apply the LSH scheme to the dataset to build the LSH bin structure.
- Hash an object o into its bins.
- Find the neighbors of o: the objects that fall in the same bins.
- If the number of neighbors exceeds p't, o is a non-outlier.
- Pruning: the neighbors of a non-outlier are also non-outliers, so they are pruned without being queried.
- Objects left after pruning are the probable outliers.
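The pruning step can be sketched over a bin structure that has already been built. In this sketch the bin structure is written by hand, `min_neighbors` stands in for the p't threshold expressed as a count, and the function name is hypothetical; it illustrates the idea rather than the paper's exact procedure.

```python
def prune_non_outliers(n_objects, tables, min_neighbors):
    """Return the indices of objects that survive pruning (the probable outliers).

    tables: list of L dicts, each mapping a bin label to the indices stored in that bin.
    """
    # Look up, for every table, which bin each object fell into.
    label_of = [{} for _ in tables]
    for t, table in enumerate(tables):
        for label, members in table.items():
            for i in members:
                label_of[t][i] = label

    pruned = set()
    for o in range(n_objects):
        if o in pruned:
            continue  # already known to be a non-outlier, so it is never queried
        neighbors = set()
        for t, table in enumerate(tables):
            neighbors.update(table[label_of[t][o]])
        neighbors.discard(o)
        if len(neighbors) > min_neighbors:
            pruned.add(o)             # o has enough neighbors: non-outlier
            pruned.update(neighbors)  # its neighbors are pruned along with it
    return [o for o in range(n_objects) if o not in pruned]

# Hypothetical bin structure with L = 2 tables over 6 objects.
table0 = {"a": [0, 1, 2, 3], "b": [4], "c": [5]}
table1 = {"x": [0, 1, 2], "y": [3, 4], "z": [5]}
probable_outliers = prune_non_outliers(6, [table0, table1], min_neighbors=2)
print(probable_outliers)
```

In this toy run only objects 0, 4 and 5 are ever queried, because objects 1, 2 and 3 are pruned as neighbors of object 0.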

Quality of Approximation (1/4)
- An object that is not a true near neighbor of o within distance dt can still share a bin with o, with probability at most p2^k.
- Such false near neighbors may cause an outlier to be pruned.
- The result may therefore contain false negatives.
- False negative: an outlier labeled as a non-outlier.

- To reduce false negatives, the algorithm introduces a bin threshold bt.
- An object is counted as a neighbor only if it appears in at least bt of the L bins of object o.
- Figure: with bt = 3, the purple object is not pruned as a non-outlier, because it shares only 1 bin with object o.
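The effect of the bin threshold can be sketched by counting, for each candidate, how many of the L tables place it in the same bin as o. The per-object label lists and the function names below are hypothetical, and the linear scan over all objects is for clarity only; a real implementation would read the candidates from the bins.

```python
def shared_bins(labels_a, labels_b):
    """Number of tables in which two objects fall into the same bin."""
    return sum(1 for a, b in zip(labels_a, labels_b) if a == b)

def find_neighbors(o, labels, bt):
    """Indices of objects that share at least bt bins with object o."""
    return [j for j in range(len(labels))
            if j != o and shared_bins(labels[o], labels[j]) >= bt]

# Hypothetical bin labels for 4 objects over L = 4 tables.
labels = [
    ["a", "x", "m", "u"],  # object 0: the query object o
    ["a", "x", "m", "v"],  # object 1: shares 3 bins with o
    ["a", "y", "n", "v"],  # object 2: shares only 1 bin with o
    ["b", "y", "n", "w"],  # object 3: shares no bin with o
]
print(find_neighbors(0, labels, bt=3))
print(find_neighbors(0, labels, bt=1))
```

With bt = 1, object 2 would be accepted as a neighbor after a single collision; with bt = 3 it is rejected, which is the behavior the slide's example describes.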

- A high bin threshold decreases the efficiency of the pruning technique because it removes actual neighbors.
- Removing actual neighbors can cause a non-outlier to be labeled as an outlier (a false positive).
- The optimal value of bt removes the false negatives at the cost of introducing minimal false positives.

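To reduce false positives, for each object in the set of probable outliers, calculate its distance to every object in the dataset:
- each time the distance is greater than dt, increment the object's count;
- if the final count is at least the fraction pt of the dataset, the object is an outlier.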

Distributed LSH-based outlier detection (1/17)
Horizontal distribution: each player holds a subset of the total objects, all with the same attributes. Example (rows split between Player A and Player B):

Name             Segment    Country        City             State
Claire Gute      Consumer   United States  Henderson        Kentucky
Darrin Van Huff  Corporate                 Los Angeles      California
Sean O'Donnell                             Fort Lauderdale  Florida

Local outlier: an object held by player Pi is a local outlier if the number of objects in the local dataset Di lying at a distance greater than dt from it is at least a fraction pt of the total dataset.
Figure: the dataset, the local dataset of Player A, and a local outlier.

Global outlier: an object o is a global outlier if a very large fraction pt of the total objects in the dataset D lie outside the radius dt from o.
Figure: the dataset, with a global outlier, its neighbors and its non-neighbors.

Three phases:
1. Generate the LSH bin structure and compute the local probable outliers by running the centralized algorithm on the local dataset.
2. Find the global approximate outliers using the generated bin structure.
3. Reduce false positives in the returned set of global outliers.

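Protocol between Player A and Player B (sequence diagram):
1. The players exchange the size of the local dataset, so that the size of the entire dataset can be computed and shared.
2. Parameters are computed and the LSH bin structure is generated.
3. Player A runs the centralized algorithm to get its probable local outliers M'.
4. Player A sends the hash labels of M' to Player B.
5. Player B finds the neighbors of each object of M' in its own data and returns the set of neighbor counts.
6. Player A adds its own neighbor counts and obtains the global probable outliers M''.
7. Player A sends M'' to Player B.
8. For each object in M'', Player B computes the distance to each object in its database, counts the objects at distance greater than dt, and returns the counts.
9. Player A does the same on its own dataset and sums the two counts; an object whose total count is at least the fraction pt of the entire dataset is marked as an outlier.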
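The last phase of the protocol, in which each player counts far-away objects locally and the counts are summed, can be sketched in a single process as follows. The two players are simulated with two plain lists, the function names and toy data are hypothetical, and no real communication or privacy mechanism is modeled.

```python
import math

def count_far(o, local_dataset, dt):
    """Number of objects in one player's local dataset farther than dt from o."""
    return sum(1 for x in local_dataset if math.dist(o, x) > dt)

def global_outliers(candidates, dataset_a, dataset_b, pt, dt):
    """Confirm candidates (the set M'') using the sum of both players' counts."""
    total = len(dataset_a) + len(dataset_b)  # size of the entire dataset
    # Player B computes its counts and sends them back to Player A.
    counts_b = [count_far(o, dataset_b, dt) for o in candidates]
    confirmed = []
    for o, cb in zip(candidates, counts_b):
        # Player A adds its own count and applies the global threshold.
        if count_far(o, dataset_a, dt) + cb >= pt * total:
            confirmed.append(o)
    return confirmed

# Hypothetical local datasets of the two players.
dataset_a = [(0.0, 0.0), (0.2, 0.0), (9.0, 9.0)]
dataset_b = [(0.0, 0.2), (0.1, 0.1), (0.3, 0.3)]
candidates = [(9.0, 9.0), (0.2, 0.0)]
print(global_outliers(candidates, dataset_a, dataset_b, pt=0.8, dt=1.0))
```

Only the candidate objects and one integer per candidate cross between the players in this phase, which is why the communication cost depends on the number of probable outliers rather than on the dataset size.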

Analysis: Complexities (1/2)
Centralized algorithm: the overall computational complexity is O(ndL), where
- n = |D| is the size of the dataset
- d is the dimensionality of the dataset
- L = n^(1/(1+ε)), and ε > 0 is an approximation factor
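To get a feel for these quantities, the short sketch below plugs hypothetical values into the formulas and compares the result with a brute-force cost of n^2 * d, which is an assumed baseline of one d-dimensional distance per pair of objects. Constants are ignored, so the numbers indicate orders of magnitude only.

```python
def num_hash_tables(n, eps):
    """L = n^(1/(1+eps)), as given in the complexity analysis."""
    return n ** (1.0 / (1.0 + eps))

def centralized_cost(n, d, eps):
    """Operation count proportional to n * d * L (constants ignored)."""
    return n * d * num_hash_tables(n, eps)

# Hypothetical values chosen only for illustration.
n, d, eps = 1_000_000, 10, 1.0
L = num_hash_tables(n, eps)
ratio = (n * n * d) / centralized_cost(n, d, eps)
print(L, ratio)
```

For these values L is roughly 1000, and the brute-force count is roughly 1000 times larger than n * d * L.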

Analysis: Complexities (2/2)
Distributed algorithm: the computational complexity of a player A is O(nA d L), where
- nA = |DA| is the size of player A's local dataset
- d is the dimensionality of the dataset
- L = n^(1/(1+ε)), and ε > 0 is an approximation factor
The overall communication complexity of a player A is O(m'A L + m''A d), where
- m'A = |M'A| (probable local outliers)
- m''A = |M''A| (probable global outliers)

Analysis: Experiments (1/3)
Centralized algorithm:
- The optimal bin threshold was computed for each dataset.
- A very low false positive rate and a zero false negative rate were achieved.
- A 100% detection rate was achieved at the optimal bin threshold.

Centralized algorithm, percentage of objects for which the FindNeighbors algorithm is invoked:
- For small datasets, the percentage of queried objects is close to 1%.
- As the dataset gets larger, this value decreases rapidly.

Distributed algorithm:
- In the experiments, objects were uniformly distributed among the players.
- The experiment was repeated with the number of players varied from 2 to 5, to study the effect on the communication cost.
- The percentage of communicated objects is very low (< 1%).

Conclusion
Approximate distance-based outlier detection:
- Based on the converse of Knorr's definition
- Efficient pruning technique using LSH
- The percentage of queried objects is close to 1%
- Works in a horizontally distributed setting
- Highly efficient in terms of communication cost

Assessment
- The paper tackles the problem of distance-based outlier detection using smart data structures and a pruning technique.
- Its approximate algorithm is a very efficient way to find outliers in large datasets using the LSH bin structure.
- The algorithm is based on the definition introduced by Knorr.
- The experiments, run over 5 datasets, look realistic given the reported hardware.
- No background knowledge is needed to read the paper.
- The algorithms are described in detail, but the paper lacks examples that would help the reader understand them.

THANK YOU! Dimitris Linaritis, 1104
