1
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes
Bilal Hawashin, Farshad Fotouhi (Department of Computer Science, Wayne State University)
Traian Marius Truta (Northern Kentucky University)
2
Outline
- What is Similarity Join
- Long String Values
- Our Contribution
- Privacy Preserving Protocol for Long String Values
- Experiments and Results
- Conclusions / Future Work
- Contact Information
3
Motivation
Source 1:
Name        Address         Major          …
John Smith  4115 Main St.   Biology
Mary Jones  2619 Ford Rd.   Chemical Eng.

Source 2:
Name         Address           Monthly Sal.  …
Smith, John  4115 Main Street  1645
Mary Jons    2619 Ford Rd.     2100

Is Natural Join always suitable?
4
Similarity Join
Joining a pair of records if they have SIMILAR values in the join attribute. Formally, similarity join consists of grouping pairs of records whose similarity is greater than a threshold, T. It has been studied widely in the literature and is referred to as record linkage, entity matching, duplicate detection, citation resolution, …
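As an illustration of the definition above (not part of the original slides), the sketch below joins two toy record lists with a nested-loop scan, keeping pairs whose similarity exceeds the threshold T. The character-bigram Jaccard measure and the sample names are illustrative choices, not the similarity method the paper uses:

```python
def bigrams(s):
    """Set of lowercase character bigrams of a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of the two strings' bigram sets."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def similarity_join(left, right, threshold):
    """Group pairs of records whose similarity exceeds the threshold T."""
    return [(a, b, jaccard(a, b))
            for a in left for b in right
            if jaccard(a, b) > threshold]
```

With T = 0.5, "John Smith" matches "Smith, John" and "Mary Jones" matches "Mary Jons", while the cross pairs fall well below the threshold; a natural join would have found neither match.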
5
Our Previous Contribution: Long String Values (ICDM MMIS10)
The term long string refers to the data type representing any string value with unlimited length; a long attribute is any attribute of long string data type. Most tables contain at least one attribute with long string values; examples are Paper Abstract, Product Description, Movie Summary, User Comment, … Most of the previous work studied similarity join on short fields. In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance similarity join performance.
6
Example
P1: Title, Keywords, Authors, Abstract, …
P2: Title, Keywords, Authors, Abstract, …
P3: Title, Keywords, Authors, Abstract, …
…
P10: Title, Keywords, Authors, Abstract, …
P11: Title, Keywords, Authors, Abstract, …
…
7
Our Paper (Motivation)
Some sources may not allow sharing their whole data in the similarity join process. Solution: privacy preserving similarity join. Using long attributes as join attributes can increase similarity join accuracy. To the best of our knowledge, all current privacy preserving SJ algorithms use short attributes, and most of them ignore the semantic similarities among the values.
8
Problem Formulation
Our goal is a privacy preserving similarity join algorithm for the case where the join attribute is a long attribute, one that considers the semantic similarities among such long values.
9
Our Work Plan
Phase 1: Compare multiple similarity methods for long attributes when similarity thresholds are used.
Phase 2: Use the best method as part of the privacy preserving SJ protocol.
10
Phase 1: Finding the Best SJ Method for Long Strings with a Threshold
Candidate methods:
- Diffusion Maps
- Latent Semantic Indexing
- Locality Preserving Projection
11
Performance Measurements
F1 measure: the harmonic mean of recall R and precision P, F1 = 2PR / (P + R), where recall is the fraction of the relevant pairs that are retrieved and precision is the fraction of the retrieved pairs that are relevant.
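The F1 computation above can be sketched directly; the true/predicted match pairs below are made-up toy data, not results from the paper:

```python
def precision_recall_f1(true_pairs, predicted_pairs):
    """Precision, recall, and their harmonic mean over match-pair sets."""
    true_pairs, predicted_pairs = set(true_pairs), set(predicted_pairs)
    tp = len(true_pairs & predicted_pairs)          # correctly retrieved pairs
    precision = tp / len(predicted_pairs)           # relevant among retrieved
    recall = tp / len(true_pairs)                   # retrieved among relevant
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For example, with 3 true matches of which 2 appear among 4 predictions, P = 0.5, R = 2/3, and F1 = 4/7.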
12
Performance Measurements (Cont.)
Preprocessing time: the time needed to read the dataset and generate the matrices used later as input to the semantic operation.
Operation time: the time needed to apply the semantic method.
Matching time: the time required by the third party, C, to find the cosine similarity among the records provided by both A and B in the reduced space and compare the similarities with the predefined similarity threshold.
13
Datasets
IMDB Internet Movies Dataset: Movie Summary field.
Amazon Dataset: Product Title, Product Description.
14
Phase 1 Results
Finding the best dimensionality reduction method using Movie Summary from the IMDB dataset (left) and Product Descriptions from Amazon (right).
15
Phase 2 Results
Preprocessing time:
Read dataset (1000 movie summaries)       12 sec
TF.IDF weighting                           1 sec
Reduce dimensionality using Mean TF.IDF    0.5 sec
Find shared features                       negligible
16
Phase 2 Results
Operation time for the best performing methods from Phase 1. Matching time is negligible.
17
Our Protocol
Both sources A and B share the threshold value T, used later to decide similar pairs.
18
Our Protocol
Source A: P1 (Title, Authors, Abstract, …), P2 (Title, Authors, Abstract, …), P3 (Title, Authors, Abstract, …)
Source B: Px (Title, Authors, Abstract, …), …
19
Find the Term_LSV Frequency Matrix for Each Source (M_a)
            LSV1  LSV2  LSV3
Image         4     0     0
Classify      5     0     0
Similarity    0     6     5
Join          0     6     4
20
Find the TD_Weighted Matrix Using TF.IDF Weighting (WeightedM_a)
            LSV1  LSV2  LSV3
Image       0.9    0     0
Classify    0.7    0     0
Similarity   0    0.85   0.9
Join         0    0.7    0.85
21
TF.IDF Weighting
The TF.IDF weight of a term w in a long string value x is given as TF.IDF(w, x) = tf_{w,x} × idf_w, where tf_{w,x} is the frequency of the term w in the long string value x and idf_w = log(N / n_w), where N is the number of long string values in the relation and n_w is the number of long string values in the relation that contain the term w.
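A minimal sketch of this weighting, assuming each long string value has already been tokenized into a list of terms (the tokenization itself is outside the slide's scope):

```python
import math

def tfidf_matrix(docs):
    """docs: list of token lists, one per long string value.
    Returns one {term: tf * log(N / n_w)} dict per long string value."""
    N = len(docs)
    df = {}                                  # n_w: documents containing w
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weighted = []
    for doc in docs:
        tf = {}                              # tf_{w,x}: raw term frequency
        for w in doc:
            tf[w] = tf.get(w, 0) + 1
        weighted.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weighted
```

Note that a term appearing in every long string value gets idf = log(1) = 0, so it carries no weight, which matches the intent of the formula.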
22
MeanTF.IDF Feature Selection
MeanTF.IDF is an unsupervised feature selection method: every feature (term) is assigned a value according to its importance. The value of a term feature w is given as Value(w) = (1/N) Σ_x TF.IDF(w, x), where TF.IDF(w, x) is the weight of feature w in long string value x and N is the total number of long string values.
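The selection step can be sketched as follows, taking the weighted matrix as a list of per-LSV term-weight dicts; the `top_k` cutoff is an illustrative way to pick "important" features, since the slides do not state the exact selection criterion:

```python
def mean_tfidf(weighted, top_k):
    """weighted: list of {term: tf-idf weight} dicts, one per long string value.
    Returns the top_k terms ranked by their mean TF.IDF over all values."""
    N = len(weighted)
    totals = {}
    for row in weighted:
        for term, w in row.items():
            totals[term] = totals.get(term, 0.0) + w
    scores = {term: total / N for term, total in totals.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that are heavily weighted across many long string values rank highest; rare low-weight terms drop out.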
23
Apply MeanTF.IDF on WeightedM_a and get the important features Imp_Fe_a. Add random features to Imp_Fe_a to get Rand_Imp_Fe_a. Rand_Imp_Fe_a and Rand_Imp_Fe_b are returned to C. C finds the intersection and returns the shared important features SF to both A and B.
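This padding-and-intersection step can be sketched as below. The vocabulary pool, decoy count, and seeds are illustrative assumptions; the slides do not specify how the random features are drawn:

```python
import random

def pad_with_random(features, vocabulary, n_random, seed=None):
    """Hide a source's true important-feature set (Imp_Fe) by mixing in
    random decoy terms, yielding Rand_Imp_Fe."""
    rng = random.Random(seed)
    candidates = [t for t in vocabulary if t not in features]
    return set(features) | set(rng.sample(candidates, n_random))

def shared_features(rand_imp_fe_a, rand_imp_fe_b):
    """Third party C: intersect the padded sets to get SF."""
    return rand_imp_fe_a & rand_imp_fe_b
```

C sees only the padded sets, so it cannot tell which members are genuine; the intersection SF still contains every feature important to both sources.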
24
Reduce WeightedM Dimensions in Both Sources Using SF (SF_a)
            LSV1  LSV2  LSV3
Image       0.9    0     0
Similarity   0    0.85   0.9
…
25
Add Random Vectors to SF (Rand_Weighted_a)
            LSV1  LSV2  LSV3  Random Cols
Image       0.9    0     0     0.6
Similarity   0    0.85   0.9   0.2
…
26
Find W_a (The Kernel)
1−Cos_Sim(LSV1,LSV1)=0     1−Cos_Sim(LSV1,LSV2)=0.2    1−Cos_Sim(LSV1,LSV3)=0.3   …
1−Cos_Sim(LSV2,LSV1)=0.2   1−Cos_Sim(LSV2,LSV2)=0      1−Cos_Sim(LSV2,LSV3)=0.87  …
1−Cos_Sim(LSV3,LSV1)=0.3   1−Cos_Sim(LSV3,LSV2)=0.87   1−Cos_Sim(LSV3,LSV3)=0     …
…
|W_a| = D × D, where D is the total number of columns in Rand_Weighted_a.
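The kernel construction can be sketched with NumPy; the input is assumed to be the term × column matrix Rand_Weighted_a, so the kernel is taken among its columns:

```python
import numpy as np

def cosine_kernel(weighted):
    """weighted: terms x columns matrix (Rand_Weighted).
    Returns the D x D kernel W with W[i, j] = 1 - cos_sim(col_i, col_j)."""
    unit = weighted / np.linalg.norm(weighted, axis=0, keepdims=True)
    return 1.0 - unit.T @ unit              # cosine similarity -> distance
```

Identical columns give kernel value 0 and orthogonal columns give 1, so W behaves as a distance matrix, as the worked numbers in the slide show.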
27
Use Diffusion Maps to Find Red_Rand_Weighted_a
[Red_Rand_Weighted_a, S_a, V_a, A_a] = Diffusion_Map(W_a, 10, 1, red_dim), red_dim < D.
Red_Rand_Weighted_a holds the diffusion map representation of each row of W_a:
         Col1  Col2  …  Col red_dim
Row 1:   0.4   0.1
Row 2:   0.8   0.6
Row 3:   0.75  0.5
…
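The four-output `Diffusion_Map` call above appears to be a toolbox routine; as a rough illustration of the underlying idea, a minimal diffusion-map embedding of a distance kernel can be sketched as below. The Gaussian bandwidth `eps`, the use of `numpy.linalg.eig`, and the single-output signature are simplifying assumptions, not the authors' implementation:

```python
import numpy as np

def diffusion_map(W, eps, t, red_dim):
    """Minimal diffusion-map embedding of a symmetric distance kernel W."""
    K = np.exp(-W ** 2 / eps)               # distances -> Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)    # row-stochastic Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)          # sort eigenpairs, largest first
    vals = vals.real[order]
    vecs = vecs.real[:, order]
    # drop the trivial constant eigenvector; scale coordinates by lambda^t
    return vecs[:, 1:red_dim + 1] * vals[1:red_dim + 1] ** t
```

Rows of W that are close in diffusion distance land close together in the reduced space, which is what lets C later compare records by cosine similarity there.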
28
C Finds Pairwise Similarity Between Red_Rand_Weighted_a and Red_Rand_Weighted_b
Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
1                    1                    0.77
1                    2                    0.3
…                    …                    …
2                    1                    0.9
29
If Cos_Sim > T, Insert the Tuple in Matched
Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
1                    1                    0.77
2                    1                    0.9
2                    7                    0.85
…                    …                    …
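C's matching step can be sketched as follows, taking the reduced-space rows from each source as vectors; the 1-based record indices mirror the table above, and the toy vectors are illustrative:

```python
import math

def cos_sim(u, v):
    """Cosine similarity of two reduced-space record vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def match_pairs(red_a, red_b, threshold):
    """Third party C: keep every (i, j, Cos_Sim) with Cos_Sim > T."""
    return [(i, j, cos_sim(u, v))
            for i, u in enumerate(red_a, 1)
            for j, v in enumerate(red_b, 1)
            if cos_sim(u, v) > threshold]
```

Only the tuples exceeding the shared threshold T enter Matched; all other pairs are discarded without either source learning about them.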
30
Matched is returned to both A and B. A and B remove random vectors from Matched and share their matrices.
31
Our Protocol (Part 1)
32
Our Protocol (Part 2)
33
Phase 2 Results
Effect of adding random columns on the accuracy.
34
Phase 2 Results
Effect of adding random columns on the number of suggested matches.
35
Conclusions
An efficient, secure semantic SJ protocol for long string attributes is proposed. Diffusion maps is the best method (among those compared) to semantically join long string attributes when threshold values are used. Mapping into the diffusion maps space and adding random records can hide the original data without affecting the accuracy.
36
Future Work
Potential further work:
- Compare diffusion maps with more candidate semantic methods for joining long string attributes.
- Study the performance of the protocol on huge databases.
37
Thank You …
Dr. Farshad Fotouhi. Dr. Traian Marius Truta.