University of Crete Department Computer Science CS-562

1 University of Crete Department Computer Science CS-562
LSH Based Outlier Detection and Its Application in Distributed Setting
Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C. V. Jawahar (IIIT Hyderabad, India)
Dimitris Linaritis, csdp1104

2 Table of Contents
Introduction
Algorithmic families for Outlier Detection
Approximate Detection of Outliers
Distance-based Outliers
Modified Distance-based Outliers
Problem Statement
Locality Sensitive Hashing (LSH)
Centralized LSH-based outlier detection
Quality of Approximation
Distributed LSH-based outlier detection
Analysis
Conclusion
Assessment

3 Algorithmic families for Outlier Detection (1/2)
Techniques for detecting outliers in datasets:
Statistical methods
Geometric methods
Density-based approaches
Distance-based approaches

7 Algorithmic families for Outlier Detection (2/2)
Why a distance-based approach?
Statistical methods are appropriate only if one has a good sense of the background distribution, and they typically do not scale well to large datasets.
Geometric methods rely on variants of the convex hull algorithm, whose complexity is exponential, so they are often impractical.
Density-based approaches rely on computing the local neighborhood density of a point, which can be expensive, particularly in high dimensions.
Distance-based approaches rely on a well-defined notion of outlier based on the nearest-neighbor principle; smart data structures and pruning techniques let them scale quite well to large datasets of moderate dimensionality.

8 Approximate Detection of Outliers
The need for approximate algorithms
Exact algorithms: the computational complexity is quadratic, O(n^2). They are more effective and accurate than approximate algorithms, but not appropriate for "big data".
Approximate algorithms: the computational complexity is "near-linear", O(n^(1+ε)) with ε > 0. They are less effective than exact algorithms but more efficient, and appropriate for "big data" when they use smart data structures and pruning techniques.

11 Distance-based Outliers (3/12)
Knorr, VLDB 1998
An object o is an outlier if a very large fraction pt of the objects in the dataset D lie outside the radius dt from o, that is, at a distance greater than dt.
[Figure labels: Neighbors (inside radius dt), Non Neighbors (outside radius dt), pt is very large, Outlier.]
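The definition above can be checked directly with a brute-force scan. The following is a minimal sketch, not the authors' code: the function name `is_db_outlier`, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import math

def is_db_outlier(o, dataset, pt, dt):
    """True if at least a fraction pt of the dataset lies farther than dt from o."""
    far = sum(1 for x in dataset if math.dist(o, x) > dt)
    return far >= pt * len(dataset)

# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [p for p in points if is_db_outlier(p, points, pt=0.75, dt=1.0)]
```

Each call scans the whole dataset, so labeling every object this way costs a quadratic number of distance computations, which is the cost the later slides try to avoid.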

20 Distance-based Outliers (12/12)
Ramaswamy, SIGMOD 2000
Dk denotes the distance from a point to its kth nearest neighbor.
Given k and n, a point p is an outlier if no more than n - 1 other points in the dataset have a higher value of Dk than p.
Example: k = 3, n = 7. The D3 values shown in the figure are 8, 4, 6, 7, 5, 9 and 15, and one point is marked as the outlier.
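The ranking by Dk can be sketched as follows. This is an illustrative brute-force version under assumptions: Euclidean distance, hypothetical function names, and made-up sample points that are not the ones drawn on the slide.

```python
import math

def kth_nn_distance(i, dataset, k):
    """D^k of dataset[i]: distance to its k-th nearest neighbor, excluding itself."""
    dists = sorted(math.dist(dataset[i], q) for j, q in enumerate(dataset) if j != i)
    return dists[k - 1]

def top_n_outliers(dataset, k, n):
    """The n points with the largest D^k, largest first."""
    ranked = sorted(range(len(dataset)),
                    key=lambda i: kth_nn_distance(i, dataset, k),
                    reverse=True)
    return [dataset[i] for i in ranked[:n]]

# Hypothetical data: four corners of a unit square and one far point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
top = top_n_outliers(points, k=2, n=1)
```

Here every corner of the square has D2 = 1, while the far point has a much larger D2, so it is ranked first.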

24 Modified Distance-based Outliers (3/3)
Converse of Knorr's definition
An object is a non-outlier if it has enough neighbors within the radius dt.
[Figure labels: Non-Outlier, Neighbors, Non Neighbors, p't is very low.]

27 Problem Statement (2/2)
The problem
To find all the non-outliers in the data that have many neighbors, we must compute the distance from each object to every other object in the dataset.
The number of queries grows with the number of objects in the database, and databases are very large!
We therefore need approximate algorithms.

28 Locality Sensitive Hashing (LSH) (1/2)
Very efficient for finding near neighbors.
Similar objects are hashed to the same bin with high probability.
Most of the non-outliers in the dataset can be easily pruned.
Fewer than 1% of the objects in the dataset need to be processed.
[Figure: objects are hashed into the LSH bin structure.]

29 Locality Sensitive Hashing (LSH) (2/2)
Definitions
Conditions to be satisfied
To amplify the gap between the probabilities, hash functions are concatenated: g(p) = (h1(p), h2(p), ..., hk(p))
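In the standard formulation, a hash family is locality sensitive when Pr[h(p) = h(q)] >= p1 for points with d(p, q) <= r1 and Pr[h(p) = h(q)] <= p2 for points with d(p, q) >= r2, with r1 < r2 and p1 > p2. The sketch below shows one common way to build g by concatenating k hashes. It assumes the p-stable (random projection) scheme for Euclidean distance, h(v) = floor((a . v + b) / w); the transcript does not say which family the paper uses, and the names `make_hash`, `make_g` and the bucket width `w` are illustrative.

```python
import math
import random

def make_hash(dim, w, rng):
    """One hash h(v) = floor((a . v + b) / w), with Gaussian a and b drawn from [0, w]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def make_g(dim, k, w, rng):
    """Concatenate k hashes: g(p) = (h1(p), ..., hk(p)); the tuple is the bin label."""
    hs = [make_hash(dim, w, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hs)

rng = random.Random(0)          # fixed seed so the example is repeatable
g = make_g(dim=2, k=4, w=4.0, rng=rng)
label = g((1.0, 2.0))           # a 4-tuple of integers naming the bin
```

Two points land in the same bin only if all k component hashes agree, which is what drives the collision probability of far points down to at most p2^k.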

30 Centralized LSH-based outlier detection (1/10)
[Flowchart of the three phases (Phase 1, Phase 2, Phase 3). Box labels: Compute Parameters, Generate Bin Structure, Find Near Neighbors, Prune Non Outliers (Pruning), Bin Threshold (Removing False Negatives), Reducing False Positives.]

39 Centralized LSH-based outlier detection (10/10)
Compute parameters: distance R = r1 < dt and fraction p't.
Apply the LSH scheme to the dataset to build the LSH bin structure.
Hash object o into its bins.
Find the neighbors of o: the objects in the same bins.
If the number of neighbors exceeds p't, o is a non-outlier.
Pruning: the neighbors of a non-outlier are also non-outliers.
[Figure labels: LSH Bin Structure, Neighbors, Non-Outliers, Outlier.]
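The pruning loop described on this slide can be sketched as below. To keep the example deterministic, a shifted grid cell stands in for the LSH function; that substitution, the function names, the `min_neighbors` count (standing in for the p't threshold) and the sample data are all assumptions, not the paper's implementation.

```python
import math
from collections import defaultdict

def bin_label(p, offset, cell):
    """Toy stand-in for an LSH function g: the shifted grid cell containing p."""
    return tuple(math.floor((x + offset) / cell) for x in p)

def build_tables(data, offsets, cell):
    """One table per offset, mapping each bin label to the indices hashed there."""
    tables = []
    for offset in offsets:
        table = defaultdict(list)
        for i, p in enumerate(data):
            table[bin_label(p, offset, cell)].append(i)
        tables.append(table)
    return tables

def probable_outliers(data, offsets, cell, min_neighbors):
    """Indices that survive pruning, i.e. objects with too few bin-mates."""
    tables = build_tables(data, offsets, cell)
    pruned, candidates = set(), []
    for i, p in enumerate(data):
        if i in pruned:
            continue  # already labeled non-outlier as a neighbor of a non-outlier
        neighbors = set()
        for offset, table in zip(offsets, tables):
            neighbors.update(table[bin_label(p, offset, cell)])
        neighbors.discard(i)
        if len(neighbors) > min_neighbors:
            pruned.add(i)
            pruned.update(neighbors)  # neighbors of a non-outlier are pruned too
        else:
            candidates.append(i)
    return candidates

data = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (9.0, 9.0)]
candidates = probable_outliers(data, offsets=[0.0, 0.5], cell=1.0, min_neighbors=2)
```

In this run only the first cluster member is actually queried; its three bin-mates are pruned without a query, and the isolated point is the only candidate left.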

40 Quality of Approximation (1/4)
Within distance dt, the probability of being a near neighbor is at most p2^K, so false near neighbors may cause an outlier to be pruned.
The resulting set may therefore contain false negatives.
False negative: an outlier labeled as a non-outlier.

41 Quality of Approximation (2/4)
To reduce false negatives, the algorithm introduces a bin threshold bt.
An object is counted as a neighbor only if it appears with the query object in at least bt of the L bins.
In the figure, bt = 3, so the purple object is not counted as a non-outlier: it shares only 1 bin with object o.
[Figure labels: LSH Bin Structure, bt = 3, Neighbors, Non-Outliers, Outlier.]
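The bin threshold amounts to counting, per candidate neighbor, how many of the L bins it shares with the query object. A minimal sketch, assuming the bin contents are already available as sets of object ids (the function name and the ids are hypothetical):

```python
from collections import Counter

def neighbors_with_threshold(query_bins, bt):
    """query_bins: for each of the L tables, the ids sharing the query's bin.
    Keep only the ids that co-occur with the query in at least bt tables."""
    counts = Counter()
    for bin_members in query_bins:
        counts.update(bin_members)
    return {obj for obj, c in counts.items() if c >= bt}

# Hypothetical bins of object o across L = 4 tables.
bins_of_o = [{"a", "b", "purple"}, {"a", "b"}, {"a", "b"}, {"a"}]
kept = neighbors_with_threshold(bins_of_o, bt=3)
```

With bt = 3 the object "purple", which shares a single bin with o, is dropped; with bt = 1 every bin-mate would be kept.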

42 Quality of Approximation (3/4)
A high bin threshold decreases the efficiency of the pruning technique because it removes actual neighbors.
Removing actual neighbors can label a non-outlier as an outlier: a false positive.
The optimal value of bt removes the false negatives at the cost of introducing minimal false positives.
[Figure: a high value of bt produces a false positive.]

43 Quality of Approximation (4/4)
To reduce false positives, compute, for each object in the set of probable outliers, its distance to every object in the dataset.
For each dataset object at distance > dt, increment count by 1.
If count >= pt, the object is an outlier.

44 Distributed LSH-based outlier detection (1/17)
Horizontal Distribution
Each player has a subset of the total objects, all with the same attributes.
Example table, with rows split between Player A and Player B (blank cells are not given):
Name             Segment    Country        City             State
Claire Gute      Consumer   United States  Henderson        Kentucky
Darrin Van Huff  Corporate                 Los Angeles      California
Sean O'Donnell                             Fort Lauderdale  Florida

45 Distributed LSH-based outlier detection (2/17)
Local Outlier
An object held by player Pi is a local outlier if the number of objects in the local dataset Di lying at a distance greater than dt from it is at least a fraction pt of the total dataset.
[Figure labels: Dataset, Local dataset of player A, Local outlier.]

46 Distributed LSH-based outlier detection (3/17)
Global Outlier
An object o is a global outlier if a very large fraction pt of the objects in the entire dataset D lie outside the radius dt from o.
[Figure labels: Dataset, Global Outlier, Neighbors, Non Neighbors.]

47 Distributed LSH-based outlier detection (4/17)
Three phases
Phase 1: Generate the LSH bin structure and compute the local probable outliers by running the centralized algorithm on the local dataset.
Phase 2: Find the global approximate outliers using the generated bin structure.
Phase 3: Reduce false positives in the returned set of global outliers.

60 Distributed LSH-based outlier detection (17/17)
Message flow between Player A and Player B:
A to B: size of A's local dataset.
B: compute the size of the entire dataset.
B to A: size of the entire dataset.
Parameter computation and generation of the LSH bin structure.
A: run the centralized algorithm to get the probable local outliers M'.
A to B: hash labels of M'.
B: find the neighbors of each object of M'.
B to A: set of counts of neighbors of M'.
A: compute its own counts of neighbors and get the global probable outliers M''.
A to B: probable outliers M''.
B: for each object in M'', compute the distance to each object in DB and count the objects at distance > dt.
B to A: counts of objects.
A: perform the same actions as PB on its own dataset, sum the two counts, and mark the object as an outlier if count >= pt.
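The last exchange of the protocol, where each player counts far objects locally and Player A sums the two counts, can be sketched as follows. This is a single-process simulation under assumptions: Euclidean distance, hypothetical function names and data, and no modeling of the actual message passing.

```python
import math

def count_far(candidates, local_data, dt):
    """For each candidate, the number of local objects farther than dt from it."""
    return [sum(1 for x in local_data if math.dist(o, x) > dt) for o in candidates]

def global_outliers(candidates, data_a, data_b, pt, dt):
    counts_b = count_far(candidates, data_b, dt)  # computed by Player B and sent to A
    counts_a = count_far(candidates, data_a, dt)  # Player A repeats it on its own data
    n = len(data_a) + len(data_b)                 # size of the entire dataset
    return [o for o, ca, cb in zip(candidates, counts_a, counts_b)
            if ca + cb >= pt * n]

# Hypothetical local datasets and probable global outliers M''.
data_a = [(0.0, 0.0), (0.1, 0.0), (8.0, 8.0)]
data_b = [(0.0, 0.1), (0.1, 0.1), (0.2, 0.2)]
m2 = [(8.0, 8.0), (0.0, 0.0)]
result = global_outliers(m2, data_a, data_b, pt=0.8, dt=1.0)
```

Only one count per candidate crosses the network in this step, which is consistent with the low communication cost reported later in the deck.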

61 Analysis: Complexities (1/2)
Centralized algorithm
The overall computational complexity is O(ndL), where
n = |D|,
d is the dimensionality of the dataset,
L = n^(1/(1+ε)), and
ε > 0 is an approximation factor.
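A quick numeric reading of the bound may help. The sketch below plugs hypothetical values of n, d and ε into the expressions on the slide and compares n * d * L with the n * n * d cost of computing all pairwise distances; the numbers are operation counts up to constant factors, not measured running times.

```python
def num_tables(n, eps):
    """L = n^(1/(1+eps)), as stated on the slide."""
    return n ** (1.0 / (1.0 + eps))

n, d, eps = 1_000_000, 10, 1.0   # hypothetical dataset size, dimensionality, factor
L = num_tables(n, eps)           # about 1000 for these values
lsh_cost = n * d * L             # order of n * d * L
pairwise_cost = n * n * d        # order of the exact quadratic scan
```

For these values the LSH-based bound is about a thousand times smaller than the quadratic scan, and the gap widens as n grows.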

62 Analysis: Complexities (2/2)
Distributed algorithm
The computational complexity of a player A is O(nA d L), where
nA = |DA|,
d is the dimensionality of the dataset,
L = n^(1/(1+ε)), and
ε > 0 is an approximation factor.
The overall communication complexity of a player A is O(m'A L + m''A d), where
m'A = |M'A| (probable local outliers) and
m''A = |M''A| (probable global outliers).

63 Analysis: Experiments (1/3)
Centralized algorithm
The optimal bin threshold was computed for each of the evaluated datasets.
The algorithm achieved a very low false positive rate and a zero false negative rate.
At the optimal bin threshold it achieved a 100% detection rate.

64 Analysis: Experiments (2/3)
Centralized algorithm
Measured: the percentage of objects for which the FindNeighbors algorithm is invoked.
For small datasets the percentage of queried objects is close to 1%.
As the dataset gets larger, this value decreases rapidly.

65 Analysis: Experiments (3/3)
Distributed algorithm
In the experiments, objects were uniformly distributed among the players.
The experiment was repeated with the number of players varied from 2 to 5 to study the effect on the communication cost.
The percentage of communicated objects is indeed very low (< 1%).

66 Conclusion
Approximate distance-based outlier detection
Based on the converse of Knorr's definition
Efficient pruning technique using LSH
The percentage of queried objects is close to 1%
Works in a horizontally distributed setting
Highly efficient in terms of communication cost

67 Assessment
The paper tackles the problem of distance-based outlier detection using smart data structures and a pruning technique.
The approximate algorithm is a very efficient way to find outliers in large datasets using the LSH bin structure.
The algorithm is based on the definition introduced by Knorr.
The experiments, run over 5 datasets, look realistic given the reported hardware.
No background knowledge is needed to read the paper.
The algorithms are described in detail, but there are too few examples to help the reader understand them.

68 THANK YOU! Dimitris Linaritis, 1104,

