Published by Oscar Hermansen. Modified over 6 years ago.
1
University of Crete, Department of Computer Science, CS-562
LSH Based Outlier Detection and Its Application in Distributed Setting
Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C. V. Jawahar (IIIT Hyderabad, India)
Dimitris Linaritis, csdp1104
2
Table of Contents
  Introduction
  Algorithmic families for Outlier Detection
  Approximate Detection of Outliers
  Distance-based Outliers
  Modified Distance-based Outliers
  Problem Statement
  Locality Sensitive Hashing (LSH)
  Centralized LSH-based outlier detection
  Quality of Approximation
  Distributed LSH-based outlier detection
  Analysis
  Conclusion
  Assessment
3
Algorithmic families for Outlier Detection (1/2)
Techniques for detecting outliers in datasets:
  Statistical methods
  Geometric methods
  Density-based approaches
  Distance-based approaches
7
Algorithmic families for Outlier Detection (2/2)
Why a distance-based approach?
  Statistical methods are appropriate only if one has a good sense of the background distribution, and they typically do not scale well to large datasets.
  Geometric methods rely on variants of the convex hull algorithm, whose complexity is exponential in the number of dimensions, so they are often impractical.
  Density-based approaches rely on computing the local neighborhood density of a point, which can be expensive, particularly in high dimensions.
  Distance-based approaches rely on a well-defined notion of outlier based on the nearest-neighbor principle; smart data structures and pruning techniques have ensured that they scale quite well to large datasets of moderate dimensionality.
8
Approximate Detection of Outliers
The need for approximate algorithms
  Exact algorithms
    Computational complexity is quadratic, O(n^2)
    More effective and accurate than approximate algorithms
    Not appropriate for “big data”
  Approximate algorithms
    Computational complexity is "near-linear", O(n^(1+ε)), ε > 0
    Not as effective as exact algorithms, but more efficient
    Appropriate for “big data”, using smart data structures and pruning techniques
11
Distance-based Outliers (3/12)
Knorr (VLDB 1998): An object o is an outlier if a very large fraction pt of the total objects in the dataset D lie outside the radius dt from o.
[Figure: object o with a circle of radius dt around it; labels: Neighbors, Non Neighbors, Outlier; annotation: pt is very large.]
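The definition above can be sketched as a brute-force check. This is only an illustration, assuming Euclidean distance and that "a fraction pt of the total objects" means at least pt * |D| objects; the function name `knorr_outliers` and the sample points are made up for the example.

```python
import math

def knorr_outliers(points, pt, dt):
    """Return indices of points that satisfy the distance-based outlier test.

    Brute force, O(n^2) distance computations. Assumption: an object is an
    outlier when at least pt * len(points) other objects lie farther than dt.
    """
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        # Count the objects that lie outside the radius dt from o.
        far = sum(1 for j, q in enumerate(points) if j != i and math.dist(o, q) > dt)
        if far >= pt * n:
            outliers.append(i)
    return outliers

# Four points close together and one isolated point (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = knorr_outliers(points, pt=0.7, dt=1.0)
```

Every object is compared with every other object, which is the quadratic cost that motivates the approximate approach.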
20
Distance-based Outliers (12/12)
Ramaswamy (SIGMOD 2000): Given k and n, a point p is an outlier if no more than n - 1 other points in the data set have a higher value for Dk than p, where Dk denotes the distance from a point to its kth nearest neighbor.
[Figure: example with k = 3, n = 7; the plotted points have D3 = 8, 4, 6, 7, 5, 9 and 15, and one point is labeled Outlier.]
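A small sketch of this ranking-based definition, assuming Euclidean distance: compute Dk for every point and keep the n points with the largest values. The function names and the sample data are illustrative, not taken from the paper.

```python
import math

def kth_nn_distance(points, i, k):
    """Dk of point i: distance to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_outliers(points, k, n):
    """Indices of the n points with the largest Dk, largest first."""
    scored = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]

# A unit square of four points plus one distant point (illustrative data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
top = top_n_outliers(points, k=3, n=1)
```

Unlike the radius-based definition, this one needs no distance threshold dt, only k and the number of outliers to report.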
24
Modified Distance-based Outliers (3/3)
Converse of Knorr’s definition: An object is a non-outlier if it has enough neighbors within the radius dt.
[Figure: object with a circle of radius dt around it; labels: Non-Outlier, Neighbors, Non Neighbors; annotation: p’t is very low.]
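How the two definitions relate can be written out as follows. This is a sketch under two assumptions that the slides do not state explicitly: that p’t = 1 - pt, and that fractions are taken over |D| regardless of whether o itself is counted.

```latex
\begin{align*}
o \text{ is an outlier} &\iff \bigl|\{q \in D : d(o,q) > d_t\}\bigr| \ge p_t\,|D| \\
o \text{ is a non-outlier} &\iff \bigl|\{q \in D : d(o,q) \le d_t\}\bigr| > (1 - p_t)\,|D| = p'_t\,|D|
\end{align*}
```

Since every object is either within dt or outside it, the second line is the negation of the first, which is why a very large pt corresponds to a very low p’t.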
27
Problem Statement (2/2)
The problem: To find all the non-outliers in the data that have many neighbors, we must calculate the distances to every other object in the dataset.
[Figure labels: Number of queries; Number of objects in database.]
Databases are very large!
Need to find approximate algorithms.
28
Locality Sensitive Hashing (LSH) (1/2)
Very efficient for finding near neighbors
Similar objects are hashed to the same bin with high probability
Most of the non-outliers in the dataset can be easily pruned
< 1% of the total objects in the dataset need to be processed
[Figure: objects are hashed into the LSH bin structure.]
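A minimal sketch of how such a bin structure could be built, assuming Euclidean data and a random-projection hash of the form floor((a . p + b) / w). That is a common LSH choice, but the slides do not say which hash family is used. The names `make_hash`, `build_bins` and the bucket width `w` are illustrative.

```python
import math
import random
from collections import defaultdict

def make_hash(dim, k, w, rng):
    """Return a hash made by concatenating k random-projection hashes."""
    # Each elementary hash has a Gaussian direction a and an offset b in [0, w).
    funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]

    def g(p):
        return tuple(
            math.floor((sum(a_i * x for a_i, x in zip(a, p)) + b) / w)
            for a, b in funcs
        )
    return g

def build_bins(points, g):
    """Map each bin key to the indices of the points hashed into it."""
    bins = defaultdict(list)
    for idx, p in enumerate(points):
        bins[g(p)].append(idx)
    return bins

rng = random.Random(0)
g = make_hash(dim=2, k=4, w=4.0, rng=rng)
points = [(0.0, 0.0), (0.0, 0.0), (100.0, 100.0)]
bins = build_bins(points, g)
```

Identical points always share a bin; nearby points share one with high probability and distant points only rarely, which is what makes pruning by bin membership possible.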
29
Locality Sensitive Hashing (LSH) (2/2)
Definitions
Conditions to be satisfied
To amplify the gap between the probabilities: g(p) = (h1(p), h2(p), ..., hk(p))
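The conditions themselves did not survive in this transcript. For reference, the standard definition of a locality-sensitive family from the LSH literature is sketched below; it is not recovered from the slide, and r_1 < r_2, p_1 > p_2 are the usual assumptions.

```latex
A family $\mathcal{H}$ is $(r_1, r_2, p_1, p_2)$-sensitive if, for any objects $p, q$
and a hash function $h$ drawn uniformly from $\mathcal{H}$:
\begin{align*}
d(p,q) \le r_1 &\implies \Pr[h(p) = h(q)] \ge p_1 \\
d(p,q) \ge r_2 &\implies \Pr[h(p) = h(q)] \le p_2
\end{align*}
For $g(p) = (h_1(p), \ldots, h_k(p))$ with independently chosen $h_i$,
the two bounds become $p_1^k$ and $p_2^k$.
```

Because p_1 > p_2, raising both to the k-th power widens the ratio between the collision probability of near objects and that of far objects.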
30
Centralized LSH-based outlier detection (1/10)
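[Figure: pipeline of the centralized algorithm in three phases (Phase 1, Phase 2, Phase 3). Box labels: Compute Parameters, Generate Bin Structure, Find Near Neighbors, Prune Non Outliers, Removing False Negatives, Reducing False Positives. Annotations: Pruning, Bin Threshold.]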
39
Centralized LSH-based outlier detection (10/10)
Compute parameters: distance R = r1 < dt, fraction p’t
Apply the LSH scheme to the dataset to build the LSH bin structure
Hash object o into the bins
Find the neighbors of o: the objects in the same bins
If the number of neighbors > p’t, o is a non-outlier
Pruning: the neighbors of a non-outlier are also non-outliers
[Figure: LSH bin structure; labels: Neighbors, Non-Outliers, Outlier.]
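The steps above can be sketched roughly as follows. This is not the authors' implementation: it assumes a random-projection hash, treats the threshold as an absolute neighbor count (`min_neighbors`) rather than a fraction, and omits the bin threshold and the final exact check. All names and default parameters are illustrative.

```python
import math
import random
from collections import defaultdict

def make_hash(dim, k, w, rng):
    """Composite hash g(p) built from k random-projection hashes (assumed family)."""
    funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
    return lambda p: tuple(
        math.floor((sum(a_i * x for a_i, x in zip(a, p)) + b) / w) for a, b in funcs
    )

def probable_outliers(points, min_neighbors, k=2, L=5, w=4.0, seed=0):
    """Return indices that were not pruned as non-outliers."""
    rng = random.Random(seed)
    hashes = [make_hash(len(points[0]), k, w, rng) for _ in range(L)]

    # Generate the bin structure: one table per composite hash.
    tables = []
    for g in hashes:
        table = defaultdict(set)
        for idx, p in enumerate(points):
            table[g(p)].add(idx)
        tables.append(table)

    pruned, candidates = set(), []
    for idx, p in enumerate(points):
        if idx in pruned:
            continue  # already known to be a non-outlier, no query needed
        # Neighbors of p: objects sharing a bin with p in any of the L tables.
        neighbors = set()
        for g, table in zip(hashes, tables):
            neighbors |= table[g(p)]
        neighbors.discard(idx)
        if len(neighbors) > min_neighbors:
            pruned.add(idx)
            pruned |= neighbors  # neighbors of a non-outlier are pruned too
        else:
            candidates.append(idx)
    return candidates

# Six identical points form a dense cluster; the last point is far away.
points = [(0.0, 0.0)] * 6 + [(1000.0, 1000.0)]
cands = probable_outliers(points, min_neighbors=3)
```

The first cluster point finds all the others in its bins, so the remaining cluster points are pruned without ever being queried, which is where the savings come from.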
40
Quality of Approximation (1/4)
With distance threshold dt, an object that is not a true near neighbor can still be reported as one, with probability at most p2^k.
Such false near neighbors may cause an outlier to be pruned, so the resulting set may contain false negatives.
False negative: an outlier labeled as a non-outlier.
41
Quality of Approximation (2/4)
To reduce false negatives, the algorithm introduces a bin threshold bt: an object is counted as a neighbor only if it appears in at least bt bins out of the L bins.
In the figure, bt = 3, so the purple object is not labeled a non-outlier, as it appears in only 1 bin with object o.
[Figure: LSH bin structure with bt = 3; labels: Neighbors, Non-Outliers, Outlier.]
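The bin threshold rule can be sketched as a simple count of shared bins. The data layout here (one dictionary per table, mapping a bin key to a set of object ids) and the function name are assumptions made for the example.

```python
from collections import Counter

def neighbors_with_bin_threshold(o_bins, tables, bt):
    """Objects that share a bin with o in at least bt of the tables.

    o_bins[i] is the bin key of o in table i; tables[i] maps a bin key to the
    set of object ids stored in that bin.
    """
    counts = Counter()
    for key, table in zip(o_bins, tables):
        counts.update(table.get(key, ()))
    return {obj for obj, c in counts.items() if c >= bt}

# Three tables (L = 3); "x" shares all three bins with "o", "y" and "z" only one.
tables = [
    {"a": {"o", "x", "y"}},
    {"b": {"o", "x"}},
    {"c": {"o", "x", "z"}},
]
o_bins = ["a", "b", "c"]
result = neighbors_with_bin_threshold(o_bins, tables, bt=3) - {"o"}
loose = neighbors_with_bin_threshold(o_bins, tables, bt=1) - {"o"}
```

With bt = 1 every co-located object counts as a neighbor; raising bt keeps only objects that collide with o repeatedly.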
42
Quality of Approximation (3/4)
A high bin threshold decreases the efficiency of the pruning technique, because it removes actual neighbors.
Removing neighbors can cause a non-outlier to be labeled as an outlier: a false positive.
The optimal value of bt is one that removes the false negatives at the cost of introducing minimal false positives.
[Figure: high value of bt; label: False positive.]
43
Quality of Approximation (4/4)
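To reduce false positives, for each object in the set of probable outliers, calculate the distances to all objects of the dataset.
[Flowchart: for each object in the set of probable outliers and each object of the dataset, if Distance > dt then count + 1; if count >= pt then the object is an outlier.]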
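A sketch of this verification step, assuming Euclidean distance and reading "count >= pt" as count >= pt * |D|. The early exit is an addition for the example, not something shown on the slide.

```python
import math

def is_outlier(obj, dataset, pt, dt):
    """Exact check: True once enough objects farther than dt have been counted."""
    needed = pt * len(dataset)
    count = 0
    for q in dataset:
        if math.dist(obj, q) > dt:
            count += 1
            if count >= needed:
                return True  # no need to scan the rest of the dataset
    return False

def reduce_false_positives(probable, dataset, pt, dt):
    """Keep only the probable outliers that pass the exact distance check."""
    return [o for o in probable if is_outlier(o, dataset, pt, dt)]

dataset = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (9.0, 9.0)]
probable = [(0.0, 0.0), (9.0, 9.0)]  # illustrative set of probable outliers
confirmed = reduce_false_positives(probable, dataset, pt=0.7, dt=1.0)
```

This exact pass costs one scan of the dataset per probable outlier, so it stays cheap only when the pruning phase leaves a small candidate set.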
44
Distributed LSH-based outlier detection (1/17)
Horizontal distribution: each player holds a subset of the total number of objects, all with the same attributes.

Name             Segment    Country        City             State
Claire Gute      Consumer   United States  Henderson        Kentucky
Darrin Van Huff  Corporate                 Los Angeles      California
Sean O'Donnell                             Fort Lauderdale  Florida

[Figure labels: Player A, Player B. The transcript does not preserve which rows belong to which player, nor the contents of the blank cells.]
45
Distributed LSH-based outlier detection (2/17)
Local outlier: An object held by player Pi is a local outlier if the number of objects in the local dataset Di lying at a distance greater than dt is at least a fraction pt of the total dataset.
[Figure labels: Dataset; Local dataset of player A; Local outlier.]
46
Distributed LSH-based outlier detection (3/17)
Global outlier: An object o is an outlier if a very large fraction pt of the total objects in the dataset D lie outside the radius dt from o.
[Figure labels: Dataset; Global Outlier; Neighbors; Non Neighbors.]
47
Distributed LSH-based outlier detection (4/17)
Three phases
  1. Generate the LSH bin structure and compute the local probable outliers by running the centralized algorithm on the local dataset
  2. Find the global approximate outliers using the generated bin structure
  3. Reduce false positives in the returned set of global outliers
60
Distributed LSH-based outlier detection (17/17)
Protocol between Player A and Player B (step and message labels, in the order they appear in the diagram):
  1. Size of local dataset
  2. Compute size of entire dataset
  3. Size of entire dataset
  4. Parameters computation, LSH bin structure
  5. Run the centralized algorithm to get the probable local outliers M’
  6. Hash labels of M’
  7. Find the neighbors of each object of M’
  8. Set of counts of neighbors of M’
  9. Compute own count of neighbors and get the global probable outliers M”
  10. Probable outliers M”
  11. For each object in M”, compute the distance to each object in the DB and count the objects at distance > dt
  12. Counts of objects
  13. The other player performs the same actions as PB on its own dataset, computes the sum of the two counts, and marks the object as an outlier if count >= pt
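The last steps of the protocol (local counts that are summed and compared against the threshold) can be simulated in a single process. This is a sketch only: it assumes two players, Euclidean distance, and a threshold of pt times the size of the entire dataset, and it ignores hashing and any privacy machinery. All names and data are illustrative.

```python
import math

def local_far_count(candidate, local_data, dt):
    """Number of objects in one player's dataset farther than dt from candidate."""
    return sum(1 for q in local_data if math.dist(candidate, q) > dt)

def global_outliers(candidates, data_a, data_b, pt, dt):
    """Combine per-player counts to decide which candidates are global outliers."""
    total = len(data_a) + len(data_b)  # size of the entire dataset
    # Counts computed by player B and sent to player A (simulated as a list).
    counts_b = [local_far_count(c, data_b, dt) for c in candidates]
    result = []
    for c, count_b in zip(candidates, counts_b):
        # Player A does the same on its own dataset and sums the two counts.
        if local_far_count(c, data_a, dt) + count_b >= pt * total:
            result.append(c)
    return result

data_a = [(0.0, 0.0), (0.2, 0.0), (8.0, 8.0)]
data_b = [(0.0, 0.2), (0.2, 0.2), (0.1, 0.1)]
candidates = [(8.0, 8.0), (0.0, 0.0)]  # stands in for the set M”
result = global_outliers(candidates, data_a, data_b, pt=0.7, dt=1.0)
```

Only the candidates and one count per candidate cross between the players, which is consistent with the low communication cost the slides report.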
61
Analysis: Complexities (1/2)
Centralized algorithm
The overall computational complexity is O(ndL), where
  n = |D|
  d is the dimensionality of the dataset
  L = n^(1/(1+ε)), where ε > 0 is an approximation factor
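Substituting L into the bound gives the exponent of n directly. The numbers below are a worked illustration for one arbitrarily chosen setting (ε = 1, n = 10^6), not figures from the paper.

```latex
\[
O(n d L) = O\!\left(d \, n^{\,1 + \frac{1}{1+\varepsilon}}\right),
\qquad
\varepsilon = 1,\ n = 10^{6}:\quad L = n^{1/2} = 10^{3},\quad nL = 10^{9}
\ \text{ versus }\ n^{2} = 10^{12}.
\]
```

For any ε > 0 the exponent stays below 2, so the bound is subquadratic in n.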
62
Analysis: Complexities (2/2)
Distributed algorithm
The computational complexity of a player A is O(nA d L), where
  nA = |DA|
  d is the dimensionality of the dataset
  L = n^(1/(1+ε)), where ε > 0 is an approximation factor
The overall communication complexity of a player A is O(m’A L + m”A d), where
  m’A = |M’A| (probable local outliers)
  m”A = |M”A| (probable global outliers)
63
Analysis: Experiments (1/3)
Centralized algorithm
  Computed the optimal bin threshold for each dataset (the table of datasets is not reproduced in this transcript)
  Achieved a very low false positive rate and a zero false negative rate
  Achieved a 100% detection rate at the optimal bin threshold
64
Analysis: Experiments (2/3)
Centralized algorithm
  Measured the percentage of objects for which the FindNeighbors algorithm is invoked
  For small datasets, the percentage of queried objects is close to 1%
  As the dataset gets larger, this value decreases rapidly
65
Analysis: Experiments (3/3)
Distributed algorithm
  In the experiments, objects were uniformly distributed among the players
  The experiment was repeated with the number of players varied from 2 to 5, and the effect on the communication cost was studied
  The percentage of communicated objects is indeed very small (< 1%)
66
Conclusion
Approximate distance-based outlier detection
  Converse of Knorr’s definition
  Efficient pruning technique using LSH
  The percentage of queried objects is close to 1%
Horizontally distributed setting
  Highly efficient in terms of communication cost
67
Assessment
  The paper tackles the problem of distance-based outlier detection using a smart data structure and a pruning technique.
  The approximate algorithm in the paper is a very efficient way to find outliers in large datasets using the LSH bin structure.
  The algorithm is based on the definition introduced by Knorr.
  The experiments, run on 5 datasets, look realistic given the reported hardware.
  No background knowledge is needed to read the paper.
  The algorithms are described in detail, but the paper lacks examples that would help the reader understand them.
68
THANK YOU! Dimitris Linaritis, 1104,