IIIT Hyderabad Private Outlier Detection and Content based Encrypted Search Nisarg Raval MS by Research, CSE Advisors : Prof. C. V. Jawahar & Dr. Kannan.


1 IIIT Hyderabad Private Outlier Detection and Content based Encrypted Search Nisarg Raval MS by Research, CSE Advisors : Prof. C. V. Jawahar & Dr. Kannan Srinathan IIIT – Hyderabad Thesis Defense

2 IIIT Hyderabad Need for Privacy

3 IIIT Hyderabad Need for Privacy

4 IIIT Hyderabad Privacy Preserving Data Mining (PPDM) Sharing information leads to mutual gain – Patient records help medical research Data confidentiality – Sensitive information Privacy with Utility – Randomization – Cryptographic

5 IIIT Hyderabad Privacy in Information Retrieval Cloud based solutions – Storage and Processing Loss of control over data Private database – Encrypted Search Public database – Private Information Retrieval

6 IIIT Hyderabad Contributions Distributed Outlier Detection using Locality Sensitive Hashing Privacy Preserving Outlier Detection using Locality Sensitive Hashing Private Content Based Search on Encrypted Data using Hierarchical Index Structures Private Content Based Image Retrieval

7 IIIT Hyderabad Private Outlier Detection

8 IIIT Hyderabad Motivation Trusted Third Party (TTP)

9 IIIT Hyderabad Motivation Trusted Third Party (TTP) Can we avoid TTP ?

10 IIIT Hyderabad Motivation Simulate Trusted Third Party

11 IIIT Hyderabad Previous Results Vaidya et al. ICDM 2004 – Secure Distance and Secure Comparison Protocol Zhou et al. EBISS 2009 – Homomorphic Encryption and Randomization Quadratic Cost How do we reduce Quadratic Cost ?

12 IIIT Hyderabad Approximation No crisp definition of Outliers Approximation is as good as exact results Reducing Quadratic cost by approximation Trade off between Accuracy and Cost

13 IIIT Hyderabad Outlier Detection Distance based outlier [Knorr et al. VLDB 1998] – An object is an outlier if a very large fraction of the total objects lie outside the specified radius. Neighbors Outlier Non Neighbors
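The distance based definition above can be checked directly with a quadratic scan, which is the cost the rest of the talk tries to avoid. The following is a minimal sketch, not the authors' code: the function name is hypothetical, the ten points and the thresholds (radius 2, fraction 0.8) are borrowed from the PPOD Example slide later in this deck, and Euclidean distance is an assumption.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def distance_based_outliers(points, radius, fraction):
    """Return names of points for which at least `fraction` of all
    points lie farther away than `radius` (brute force, O(n^2))."""
    n = len(points)
    outliers = []
    for name, p in points.items():
        # Count the other objects that fall outside the radius.
        far = sum(
            1
            for other, q in points.items()
            if other != name and dist(p, q) > radius
        )
        if far >= fraction * n:
            outliers.append(name)
    return outliers


# Points of Player A and Player B from the PPOD Example slide.
points = {
    "a1": (1, 1), "a2": (1, 3), "a3": (2, 1), "a4": (2, 3),
    "a5": (5, 1), "a6": (4, 5),
    "b1": (3, 1), "b2": (4, 2), "b3": (5, 2), "b4": (4, 1),
}

print(distance_based_outliers(points, radius=2, fraction=0.8))  # ['a6']
```

Every object is compared with every other object, so the running time grows quadratically with the dataset size.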

14 IIIT Hyderabad Our Approach Converse of the definition – An object is a non-outlier if it has enough neighbors within the specified radius. Non Neighbors Non - Outlier Neighbors Easy to find a small number of neighbors!

15 IIIT Hyderabad Locality Sensitive Hashing (LSH) Property Condition Hash Family Similar objects are hashed to the same bin
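The formulas on this slide did not survive in the transcript, so the sketch below does not reproduce them. It illustrates one common LSH family for Euclidean distance (random projections cut into buckets of a fixed width), which may differ from the family used in the thesis. All names (`make_hash`, `make_table_key`, `build_bins`) and the parameter values are hypothetical.

```python
import math
import random


def make_hash(dim, width, rng):
    """One hash function h(v) = floor((a . v + b) / width), with a drawn
    from a Gaussian and b uniform in [0, width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / width)

    return h


def make_table_key(dim, k, width, rng):
    """Concatenate k hash functions into the bin label of one hash table."""
    hashes = [make_hash(dim, width, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hashes)


def build_bins(points, key_fn):
    """Group object names by bin label."""
    bins = {}
    for name, v in points.items():
        bins.setdefault(key_fn(v), []).append(name)
    return bins


rng = random.Random(0)  # fixed seed so repeated runs give the same bins
key_fn = make_table_key(dim=2, k=3, width=4.0, rng=rng)
bins = build_bins({"p": (1.0, 1.0), "q": (1.2, 0.9), "r": (40.0, -35.0)}, key_fn)
print(bins)
```

Nearby objects collide in a bin with higher probability than distant ones, which is why several independent tables are typically used together.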

16 IIIT Hyderabad Centralized Outlier Detection Phase I (LSH): Compute Parameters, Generate Bin Structure. Phase II (Pruning): Find Near Neighbors, Prune Non Outliers. Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar; LSH Based Outlier Detection and Its Application in Distributed Setting; CIKM 2011

17 IIIT Hyderabad Distributed Setting Horizontally partitioned data – Each player has the same attributes for a subset of the total objects

Account Number  First Name  Last Name  Email               Phone
AB2305          David       Alexander  david@gmail.com     9083895393
AB2303          Jarred      Carison    jaredc@duke.edu     3944952345
BY3456          Sneha       Charles    snehac@contoso.com  8938989849
BY2343          Mihir       Dave       mihir@yahoo.com     9384543593
BY0910          Richard     Zeng       zeng@outlook.com    2458924589

The rows are divided between players B and A.

18 IIIT Hyderabad Distributed Outlier Detection Global LSH parameters Local Pruning – Centralized Algorithm Local Outliers – May have non-outliers What about the non-outliers which have neighbours in other player’s dataset? Player A Player B

19 IIIT Hyderabad Distributed System Overview Local Pruning Global Pruning Exhaustive Pruning Global Parameters

20 IIIT Hyderabad Private Outlier Detection Three Phases – Global parameter computation – Local pruning – Secure global pruning Secure Exhaustive Pruning – Secure distance computation – High cost – Minimal increase in False Positives Trade off between Accuracy and Cost

21 IIIT Hyderabad Private System Overview Local Pruning Construct global LSH bin structure using secure union and secure sum protocols Secure Exhaustive Pruning is costly One player computes LSH parameters and publishes Secure Sum Secure Union Secure Sum

22 IIIT Hyderabad Analysis Computational Cost: – L is sub-linear in n Communication Cost: – N_b << n – Independent of dimensionality Round Complexity : constant Security of the algorithm depends on the security of the secure union and secure sum protocols – Honest But Curious Model (HBC) n : Number of objects d : Number of dimensions L : Number of hash functions N_b : Number of bins

23 IIIT Hyderabad Experimental Results

Dataset    Objects  Dim  False Positives [%]         DA [%]  PT+NN (digits run together)
                         DistributedOD  PrivateOD
Corel      68040    32   0.011          0.041        0.628   2147
Landsat    275465   60   0.014          0.025        0.363   31929
Darpa      458301   23   0.031          0.059        0.685   501103
Household  1000000  3    0.009          0.02         0.141   2001326

False Positives can be considered as borderline outliers! Only 0.02% increase in False Positives of PrivateOD.
DistributedOD : Distributed Outlier Detection PrivateOD : Private Outlier Detection DA : Processed Points PT : Point Threshold NN : Near Neighbors

24 IIIT Hyderabad Comparison Corel Landsat Darpa HouseHold Less than Quadratic Superior to the previously best known results Up to 10000 times less communication on datasets of size 10^6!

25 IIIT Hyderabad Private Content Based Search

26 IIIT Hyderabad Motivation Content Based Image Retrieval Google Goggles

27 IIIT Hyderabad Motivation Query Image Potential Privacy Breach!

28 IIIT Hyderabad Existing Solutions Download the entire database and search at client side – Trivial but Impractical Kuzu et al. ICDE 2012 – Multiple rounds of Similarity SSE (CSL) – Low accuracy and High Cost Shashank et al. CVPR 2008 – Private Content Based Image Retrieval (PCBIR) – Complexity linear in database size Single server PIR has to access every element in the database!

29 IIIT Hyderabad Our Approach Two server solution – Content Owner and Database Server Hierarchical Index Structures – Client and Server jointly perform secure search Improve Accuracy – Bag of Words Reduce Complexity – Multi-round Protocol

30 IIIT Hyderabad Vocabulary Tree Bag of Visual Words – Visual Words = Vector quantization of feature vectors
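As an illustration of the vector quantization step, the sketch below maps each feature vector to its nearest visual word and counts the words per image. It uses a flat vocabulary for brevity, whereas a vocabulary tree performs the same nearest-centre assignment level by level. The function names, the three 2-D centres and the four descriptors are made-up values, not data from the thesis.

```python
from collections import Counter
from math import dist


def quantize(descriptor, vocabulary):
    """Index of the visual word (cluster centre) nearest to a descriptor."""
    return min(range(len(vocabulary)), key=lambda i: dist(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Histogram of visual-word occurrences for one image."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    return [counts[i] for i in range(len(vocabulary))]


# Hypothetical 2-D vocabulary and image descriptors.
vocabulary = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 0.5), (0.5, 8.0), (0.2, 0.1)]

print(bag_of_words(descriptors, vocabulary))  # [2, 1, 1]
```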

31 IIIT Hyderabad CS-SSE using Hierarchical Indexing Secure Index – Content Owner – Encryption and Permutation Private Search – Authorized Users – Oblivious Traversal

32 IIIT Hyderabad Secure Index Construction

33 IIIT Hyderabad Private Searching on Encrypted Data

34 IIIT Hyderabad Analysis Computational cost : O(m log_k n) – Optimal cost Communication cost : O(m log_k n) Round complexity : O(log_k n) – Vocabulary tree with one million leaf nodes : 6 Adaptively semantically secure against a polynomial time adversary – Honest But Curious adversary model (HBC) m : size of a node k : branching factor n : number of leaf nodes
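The figure of 6 rounds for one million leaf nodes is consistent with a branching factor of k = 10, since log_10(10^6) = 6. The slide does not state k, so the value 10 below is an assumption, and `tree_depth` is a hypothetical helper used only to illustrate the arithmetic.

```python
def tree_depth(num_leaves, branching):
    """Smallest depth d such that branching**d >= num_leaves, i.e. the
    number of levels visited on a root-to-leaf traversal."""
    depth, capacity = 0, 1
    while capacity < num_leaves:
        capacity *= branching
        depth += 1
    return depth


rounds = tree_depth(1_000_000, 10)  # assumed branching factor k = 10
print(rounds)  # 6
```

With one round per level, the communication cost is m values per round, which gives the O(m log_k n) total quoted on the slide.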

35 IIIT Hyderabad Comparison with CSL Communication is independent of database size! 30% improvement in accuracy!

36 IIIT Hyderabad Comparison with PCBIR Caltech256 – Var20 Scene15 PCBIR : + CS-SSE : x Our algorithm is O(10^5) times faster than PCBIR!

37 IIIT Hyderabad Conclusions Addressed issues of privacy in the domain of Data Mining and Information Retrieval Private Outlier Detection – Use LSH to achieve less than quadratic cost – Distributed and Private algorithms Private Content based Encrypted Search – Use Hierarchical Indexing for efficient encrypted search – Private Content based Image Search

38 IIIT Hyderabad Related Publications M Pillutla, N Raval, P Bansal, K Srinathan, C.V. Jawahar, LSH based outlier detection and its application in distributed setting, CIKM 2011. N Raval, M Pillutla, P Bansal, K Srinathan, C.V. Jawahar, Privacy Preserving Outlier Detection using Locality Sensitive Hashing, ICDMW 2011. N Raval, M Pillutla, P Bansal, K Srinathan, C.V. Jawahar, Efficient Content Similarity Search on Encrypted Data using Hierarchical Index Structures, TDP (Under Review)

39 IIIT Hyderabad nisarg.raval@research.iiit.ac.in

40 IIIT Hyderabad LSH Example (figure, courtesy: Fergus et al.; bit labels 0/1 on three splits, example hash code 101)

41 IIIT Hyderabad Cryptographic Primitives Secure Union Secure Sum

42 IIIT Hyderabad PPOD Example
Player A : a1 = (1,1), a2 = (1,3), a3 = (2,1), a4 = (2,3), a5 = (5,1), a6 = (4,5)
Player B : b1 = (3,1), b2 = (4,2), b3 = (5,2), b4 = (4,1)
Total Dataset Size N = 10
Point Threshold PT = 0.8 (PT’ = (1 – PT) x N = 2)
Distance Threshold DT = 2
Approximation Factor AF = 0
LSH Radius R = DT/(1 + AF) = 2
Player A’s LSH Bin Structure:
{a1,a2,a3,a4} {a6} {a5}
{a6} {a1,a2,a3,a4} {a5}
{a1,a2,a3} {a4,a5} {a6}
Player B’s LSH Bin Structure:
{b1,b4} {} {b2,b3}
{} {b4} {b1,b2,b3}
{b1,b4} {b2,b3} {}
Local Probable Outliers of A = {a5,a6}
Global Probable Outliers of A = {a6}
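The numbers in this example can be reproduced with a short script. This is a sketch under stated assumptions rather than the protocol itself: bins in the same position of the same row are assumed to share a hash label across the two players, an object is treated as a probable outlier when it shares bins with fewer than PT’ other objects, and the global step merges the bins in the clear, whereas the private protocol exchanges only securely computed bin information. The function name `probable_outliers` is hypothetical.

```python
N, PT, DT, AF = 10, 0.8, 2, 0
pt_prime = round((1 - PT) * N)   # round() guards against float error; gives 2
lsh_radius = DT / (1 + AF)       # 2.0

# One row per hash table, three bins per table, as listed on the slide.
bins_a = [
    [{"a1", "a2", "a3", "a4"}, {"a6"}, {"a5"}],
    [{"a6"}, {"a1", "a2", "a3", "a4"}, {"a5"}],
    [{"a1", "a2", "a3"}, {"a4", "a5"}, {"a6"}],
]
bins_b = [
    [{"b1", "b4"}, set(), {"b2", "b3"}],
    [set(), {"b4"}, {"b1", "b2", "b3"}],
    [{"b1", "b4"}, {"b2", "b3"}, set()],
]


def probable_outliers(candidates, tables, threshold):
    """Candidates that share bins with fewer than `threshold` other objects."""
    result = set()
    for obj in candidates:
        neighbors = set()
        for table in tables:
            for bin_ in table:
                if obj in bin_:
                    neighbors |= bin_
        neighbors.discard(obj)
        if len(neighbors) < threshold:
            result.add(obj)
    return result


all_a = set().union(*bins_a[0])
local_outliers = probable_outliers(all_a, bins_a, pt_prime)

# Assumed global view: union of the two players' bins, position by position.
merged = [[a | b for a, b in zip(ta, tb)] for ta, tb in zip(bins_a, bins_b)]
global_outliers = probable_outliers(local_outliers, merged, pt_prime)

print(sorted(local_outliers))   # ['a5', 'a6']
print(sorted(global_outliers))  # ['a6']
```

Locally a5 shares a bin only with a4, but once Player B's bins are taken into account it gains b1, b2 and b3 as neighbors and is pruned, leaving a6 as the only probable outlier.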

43 IIIT Hyderabad CS-SSE Example Vocabulary Tree Secure Index

44 IIIT Hyderabad Index Construction Courtesy: Nister et al. Img1, 1; Img2, 1; Img1, 2

45 IIIT Hyderabad Query Courtesy: Nister et al.

46 IIIT Hyderabad PPOD Results Bins << Data Points Communication Cost increases with number of players The rate of increase in communication cost is slow

47 IIIT Hyderabad CS-SSE Results Retrieval quality of CS-SSE Search results on Ukbench dataset 30% improvement in accuracy over previous methods!

48 IIIT Hyderabad Comparison with CSL Caltech256 Scene15 CSL : + CS-SSE : x Communication is independent of database size! 30% improvement in accuracy!

49 IIIT Hyderabad Our Approach Outlier Detection Pruning Non Outliers Near Neighbor Queries LSH LSH is efficient for near neighbor queries!

50 IIIT Hyderabad LSH Hash Objects LSH Bin Structure

51 IIIT Hyderabad Find Near Neighbours Hash Objects Count Neighbors Outlier Non Outlier No Yes Unlike traditional LSH queries, no explicit distance calculation is needed. We only need a rough estimate of the number of neighbors. Pruning using Bin Threshold

52 IIIT Hyderabad Near neighbor query for all objects Running time depends on the size of the database Databases are very large Need for Pruning We need an algorithm independent of the database size!

53 IIIT Hyderabad Pruning Hash Objects Neighbors Outlier No Yes Non Outliers Neighbors of a non outlier are also non outliers < 1 % of total database needs to be processed!

54 IIIT Hyderabad Error in Pruning LSH is probabilistic Probability of being near neighbor is at least False neighbors may cause pruning of an outlier False Negatives How do we reduce False Negatives ?

55 IIIT Hyderabad Reducing False Negatives Hash Objects Count Neighbors Outlier Non Outlier No Yes Bin Threshold Neighbor only if it appears in at least bins Increasing Bin Threshold decreases False Negatives
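The bin threshold idea can be sketched as follows: an object counts as a neighbor only if it shares a bin with the query object in at least a given number of hash tables. The threshold value is missing from the transcript, so it is left as a parameter here. The function name and the toy bin structure are hypothetical.

```python
from collections import Counter


def neighbors_with_bin_threshold(obj, tables, bin_threshold):
    """Objects that share a bin with `obj` in at least `bin_threshold` tables."""
    hits = Counter()
    for table in tables:
        for bin_ in table:
            if obj in bin_:
                hits.update(x for x in bin_ if x != obj)
    return {x for x, count in hits.items() if count >= bin_threshold}


# Hypothetical bin structure: three hash tables over objects p, q, r, s.
tables = [
    [{"p", "q", "r"}, {"s"}],
    [{"p", "q"}, {"r", "s"}],
    [{"p", "q", "s"}, {"r"}],
]

print(sorted(neighbors_with_bin_threshold("p", tables, 1)))  # ['q', 'r', 's']
print(sorted(neighbors_with_bin_threshold("p", tables, 2)))  # ['q']
```

Raising the threshold drops r and s, which collide with p in only one table each, so chance collisions are less likely to be counted as neighbors.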

56 IIIT Hyderabad Bin Threshold Bin Threshold may remove actual neighbors High Bin Threshold reduces pruning efficiency False Positives How do we reduce False Positives without increasing False Negatives?

57 IIIT Hyderabad Reducing False Positives Multiple Runs: Iteration 1, Iteration 2, ..., Iteration n, each consisting of Compute Parameters, Generate Bin Structure (LSH), then Find Near Neighbors, Prune Non Outliers (Pruning). Output: Intersection of Results, Final Set of Outliers

58 IIIT Hyderabad Global Pruning Communicate to obtain global neighbour information Each player sends hash labels of local outliers and receives corresponding bin counts Exhaustive pruning – Communicate to compute distances with global outliers

59 IIIT Hyderabad Secure Global Pruning Secure Union – Global bin labels Secure Sum – Global LSH parameters – Global bin count Any protocols for secure union and secure sum can be used
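Since any secure sum protocol can be plugged in, the sketch below simulates one well-known choice, the ring-based protocol with a random mask, in a single process. It is not necessarily the protocol used in the thesis. The function name, the modulus and the example bin counts are hypothetical, and the loop stands in for messages passed from player to player.

```python
import random


def secure_sum(private_values, modulus, rng=None):
    """Simulate ring-based secure sum: the first player adds a random mask,
    every player adds its own value to the running total modulo `modulus`,
    and the first player removes the mask at the end. The modulus must
    exceed the true sum."""
    rng = rng or random.SystemRandom()
    mask = rng.randrange(modulus)                     # known only to player 1
    running = (mask + private_values[0]) % modulus    # player 1 starts the ring
    for value in private_values[1:]:                  # each later player adds its value
        running = (running + value) % modulus
    return (running - mask) % modulus                 # player 1 unmasks the total


# Hypothetical per-player counts for one global LSH bin.
bin_counts = [3, 0, 2]
print(secure_sum(bin_counts, modulus=1_000))  # 5
```

Each intermediate total is masked, so a player following the protocol sees only a value that looks uniformly random. Players who collude can still recover a neighbor's input, which is consistent with the Honest But Curious model assumed in the talk.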

60 IIIT Hyderabad Previous Results Goh et al. Cryptology 2003 – Secure Searchable Encryption (SSE) – Exact matching only Kuzu et al. ICDE 2012 – Similarity SSE using LSH – Only keyword based search Shashank et al. CVPR 2008 – Private Content Based Image Retrieval (PCBIR) – Complexity linear in database size

