1
ESTEEM: Quality- and Privacy-Aware Data Integration
Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano
Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”
2
Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia
Outline
Privacy-aware integration
– Privacy risk assessment
– Private record linkage
Quality-aware integration
– Flexible and fully automatic record linkage
Summary
New!!!
3
Linkage of Anonymous Data

T1
PrivateID  SSN  DOB       ZIP    Health_Problem
a               11/20/67  00198  Shortness of breath
b               02/07/81  00159  Headache
c               02/07/81  00156  Obesity
d               08/07/76  00198  Shortness of breath

T2
PrivateID  SSN  DOB       ZIP    Employment        Marital Status
1          A    11/20/67  00198  Researcher        Married
5          E    08/07/76  00114  Private Employee  Married
3          C    02/07/81  00156  Public Employee   Widow

QUASI-IDENTIFIER: DOB, ZIP
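The linkage risk shown by T1 and T2 can be sketched as a join on the quasi-identifier. This is a minimal illustration, not part of the original slides. The field names, the `link` function, and the assignment of the flattened row values to columns (T1 values under PrivateID with SSN empty) are assumptions.

```python
# Rows transcribed from the slide. Column assignment of the flattened
# values is an assumption: T1 carries no SSN, T2 does.
T1 = [
    {"private_id": "a", "dob": "11/20/67", "zip": "00198", "health_problem": "Shortness of breath"},
    {"private_id": "b", "dob": "02/07/81", "zip": "00159", "health_problem": "Headache"},
    {"private_id": "c", "dob": "02/07/81", "zip": "00156", "health_problem": "Obesity"},
    {"private_id": "d", "dob": "08/07/76", "zip": "00198", "health_problem": "Shortness of breath"},
]
T2 = [
    {"private_id": "1", "ssn": "A", "dob": "11/20/67", "zip": "00198",
     "employment": "Researcher", "marital_status": "Married"},
    {"private_id": "5", "ssn": "E", "dob": "08/07/76", "zip": "00114",
     "employment": "Private Employee", "marital_status": "Married"},
    {"private_id": "3", "ssn": "C", "dob": "02/07/81", "zip": "00156",
     "employment": "Public Employee", "marital_status": "Widow"},
]

QUASI_IDENTIFIER = ("dob", "zip")  # assumed to be the attributes marked on the slide


def link(t1, t2, keys=QUASI_IDENTIFIER):
    """Join two tables on the quasi-identifier and return matching row pairs."""
    index = {}
    for row in t2:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for row in t1:
        for other in index.get(tuple(row[k] for k in keys), []):
            matches.append((row, other))
    return matches


linked = link(T1, T2)
for anonymous, identified in linked:
    print(identified["ssn"], "->", anonymous["health_problem"])
```

With these rows, two of the four anonymous records are re-identified: the health problem of the persons with SSN A and SSN C becomes known even though T1 holds no SSN.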
4
Our Proposal
A framework for assessing privacy risk that takes into account both facets of privacy
– based on statistical decision theory
Definition and analysis of:
– disclosure policies, modelled by disclosure rules
– several privacy risk functions
Estimated risk as an upper bound of the true risk, and related complexity analysis
Algorithm for finding the disclosure rule that minimizes the privacy risk
5
The Formal Framework
Disclosure rule δ
Loss function l(δ, θ), with θ representing the attacker’s knowledge
Risk R(δ, θ) = f(l(δ, θ))
Facets of privacy: identification, sensitivity
6
K-anonymity
K-anonymity is SIMPLY a special case of our framework in which:
1. θ_true = relation T. This is a stricter assumption on the attacker’s knowledge. We proved that, under some assumptions, the true risk can be bounded by our “more general” risk.
2. […] is a constant. This is questionable: the loss does not depend on the type of disclosed attribute (an HIV result incurs the same loss as a last doctor visit).
3. […] is underspecified: the set of disclosure rules can be specified in several ways.
Our framework highlights some questionable hypotheses of k-anonymity!
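For reference, the standard k-anonymity check that the slide argues is a special case can be sketched as follows. This is an illustration only: `k_of` and `generalize` are hypothetical helper names, and the generalization scheme (suppress DOB, keep a three-digit ZIP prefix) is an assumption chosen for the example.

```python
from collections import Counter


def k_of(rows, quasi_identifier):
    """Return the largest k for which the rows are k-anonymous:
    the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[attr] for attr in quasi_identifier) for row in rows)
    return min(groups.values()) if groups else 0


def generalize(row):
    """Hypothetical disclosure rule: suppress DOB, keep only a 3-digit ZIP prefix."""
    return {**row, "dob": "*", "zip": row["zip"][:3] + "**"}


T1 = [
    {"dob": "11/20/67", "zip": "00198", "health_problem": "Shortness of breath"},
    {"dob": "02/07/81", "zip": "00159", "health_problem": "Headache"},
    {"dob": "02/07/81", "zip": "00156", "health_problem": "Obesity"},
    {"dob": "08/07/76", "zip": "00198", "health_problem": "Shortness of breath"},
]

raw_k = k_of(T1, ("dob", "zip"))                                   # every row is unique
generalized_k = k_of([generalize(r) for r in T1], ("dob", "zip"))  # one group of four
print(raw_k, generalized_k)  # prints "1 4"
```

Note that `k_of` never looks at `health_problem`: the check treats every disclosed attribute alike, which is the hypothesis the slide calls questionable.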
7
Private Record Linkage
Let P and Q be two peers owning the relations R_P(A1,…,An) and R_Q(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between R_P and R_Q such that, at the end of the process:
– P will know only a set P_Match, consisting of the records in R_P that match records in R_Q. Similarly, Q will know only the set Q_Match.
Of particular importance is that no information is revealed to P and Q about records that do not match each other.
Published at SIGMOD 07
8
Key Ideas and Solutions (1)
We cannot simply encrypt the data and then compute distances among the encrypted values: by definition, encryption functions do not preserve distances
Let’s work on numbers instead of records!
Mapping of records into a vector space, with record matching performed in that space
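The point about distances can be illustrated with a small experiment. This sketch is not from the slides, and it uses a SHA-256 digest as a stand-in for "a transformation that hides the plaintext"; a hash is not an encryption function, so this is an assumption made only to keep the example dependency-free.

```python
import hashlib


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def digest(value):
    """Hex SHA-256 digest, used here as a stand-in for an opaque encoding."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


name, typo = "Scannapieco", "Scanapieco"
plain_distance = edit_distance(name, typo)                    # one missing letter
hidden_distance = edit_distance(digest(name), digest(typo))   # unrelated to the typo
print(plain_distance, hidden_distance)
```

The two plaintexts differ by a single character, while their digests look unrelated, so approximate matching on the transformed values is not possible.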
9
Key Ideas and Solutions (2)
Third-party based protocol in which:
– The two parties together build the embedding space, using a method (SparseMap) with “secure” features
– Each of the two parties embeds its own dataset and sends it to the third party
– The third party W performs the intersection and sends the result back to the parties
Mapping of records into a vector space, with record matching performed in that space
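The three protocol steps can be sketched with a plain Lipschitz-style embedding, the family of embeddings SparseMap belongs to. This is a simplified illustration under stated assumptions: it is not SparseMap itself, it has none of the "secure" features, and the reference sets, names, and threshold are hypothetical.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def embed(value, reference_sets):
    """One coordinate per reference set: distance to the closest reference string."""
    return [min(edit_distance(value, ref) for ref in refs) for refs in reference_sets]


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def third_party_match(vectors_p, vectors_q, threshold):
    """What W does: compare vectors only, never the original strings."""
    return [(i, j)
            for i, u in enumerate(vectors_p)
            for j, v in enumerate(vectors_q)
            if euclidean(u, v) <= threshold]


# Step 1: P and Q agree on the embedding space (hypothetical reference sets).
REFERENCE_SETS = [["rossi"], ["bianchi", "verdi"], ["catarci", "milano", "aiello"]]

# Step 2: each party embeds its own records locally and sends only the vectors.
names_p = ["scannapieco", "rossi", "ferrari"]
names_q = ["scanapieco", "russo", "rossi"]
vectors_p = [embed(n, REFERENCE_SETS) for n in names_p]
vectors_q = [embed(n, REFERENCE_SETS) for n in names_q]

# Step 3: W intersects in the vector space and returns index pairs.
matches = third_party_match(vectors_p, vectors_q, threshold=2.0)
print(matches)
```

Each coordinate changes by at most the edit distance between two strings, so near-duplicates stay close in the embedded space. The converse does not hold: distinct strings can land close together, so some returned pairs may be false positives.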
10
Key Ideas and Solutions (3)
Th1: Given the two relations R_P(D1,…,Ds) and R_Q(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the record matching protocol finds the matched records between the two relations with the following assurance:
– RecMatch is not disclosed to W
– R_P - RecMatch is not disclosed to Q
– R_Q - RecMatch is not disclosed to P
– DBSize is disclosed to W and bounded by P and Q
11
Schema Matching Features
Th2: Given the schemas R_P and R_Q, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurance:
– AttrMatch is not disclosed to W
– AttrMatch is not disclosed to P and Q
– AttrMatchSize is not disclosed to P and Q
12
How good are we?
Time: better than record linkage without privacy preservation
Effectiveness: comparable with respect to recall and precision
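For readers unfamiliar with the two effectiveness measures, a minimal sketch of how precision and recall are computed for a set of matched pairs follows. The pair values are made up for illustration and are not results from the talk; returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted match pairs against the true pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical record-pair identifiers, not data from the experiments.
predicted_pairs = {(1, 1), (2, 2), (3, 4)}
true_pairs = {(1, 1), (2, 2), (5, 5), (6, 6)}
precision, recall = precision_recall(predicted_pairs, true_pairs)
print(round(precision, 3), recall)  # prints "0.667 0.5"
```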
13
Flexible and Automatic RL
P2P systems are loosely coupled, dynamic, and open
Manual phases of record linkage can be problematic:
– Time-consuming steps vs. dynamic and open systems
– Synchronous interactions vs. loosely coupled systems
Need for flexible and automatic RL
14
Background: Record Linkage Techniques
Search Space Reduction:
– Sorted Neighborhood Method
– Blocking
– Hierarchical grouping
– …
Decision Rules:
– Probabilistic: Fellegi & Sunter
– Empirical
– Knowledge-based
Comparison Functions:
– Edit distance
– Smith-Waterman
– Q-grams
– Jaro string comparator
– Soundex code
– TF-IDF
– …
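Two of the search space reduction methods and one comparison function from the lists above can be sketched as follows. These are textbook-style simplifications with hypothetical function names, written for a single list of strings, and are not the implementations used in the project.

```python
from itertools import combinations


def blocking_pairs(records, key):
    """Blocking: only records sharing the same blocking key are compared."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Sorted Neighborhood Method: sort by key, compare within a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def qgrams(value, q=2):
    """Set of q-grams of a lower-cased string padded with '#'."""
    padded = "#" * (q - 1) + value.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


records = ["Catarci", "Milano", "Catarcy", "Aiello", "Milan"]
blocked = blocking_pairs(records, key=lambda r: r[0])
windowed = sorted_neighborhood_pairs(records, key=str.lower, window=2)
print(sorted(blocked))   # [(0, 2), (1, 4)]
print(sorted(windowed))  # [(0, 2), (0, 3), (1, 4), (2, 4)]
print(qgram_similarity("Catarci", "Catarcy"))  # 0.6
```

Both reduction methods avoid the ten comparisons a full pairwise scan of five records would need; the comparison function is then applied only to the surviving candidate pairs.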
15
Key Idea
Record linkage is a complex process and should be decomposed as much as possible into its constituent phases
For each phase, the most appropriate technique should be chosen depending on application and data requirements
The goal is to dynamically build ad hoc record linkage workflows
RELAIS: a toolkit serving this purpose, developed at Istat
– UNIROMA contribution on data profiling (wait a couple of slides)
16
RELAIS Toolkit
[Diagram: Application Constraints and Database Features feed RELAIS, which produces a Record Linkage Workflow]
Application Constraints:
– Admissible error rates
– Privacy issues
– Cost
– …
Database Features:
– Size
– Quality
– Domain features
– …
17
RL Workflows
[Diagram: available techniques per phase, and two example workflows]
Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
Search Space Reduction: Blocking, SNM
Comparison Function: Edit Distance, Jaro, Equality
Decision Model: Probabilistic, Empirical
RecLink WF Appl2: SNM, Probabilistic
RecLink WF Appl1: Normalization, UpperLowerCase, Blocking, Jaro, Empirical, Equality
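The idea of assembling a workflow by picking one technique per phase can be sketched as a small composition of functions. This is not the RELAIS API: the `Workflow` class, the phase functions, and the threshold rule standing in for a decision model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def normalize(value: str) -> str:
    """Preprocessing: collapse repeated whitespace and trim."""
    return " ".join(value.split())


def upper_case(value: str) -> str:
    """Preprocessing: case folding."""
    return value.upper()


def blocking(left: List[str], right: List[str]) -> List[Pair]:
    """Search space reduction: keep pairs that share the first character."""
    return [(i, j) for i, a in enumerate(left) for j, b in enumerate(right)
            if a[:1] == b[:1]]


def equality(a: str, b: str) -> float:
    """Comparison function: 1.0 on exact match, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def threshold_rule(threshold: float) -> Callable[[float], bool]:
    """Decision model stand-in: declare a match above a fixed threshold."""
    return lambda score: score >= threshold


@dataclass
class Workflow:
    preprocessing: List[Callable[[str], str]]
    search_space_reduction: Callable[[List[str], List[str]], List[Pair]]
    comparison: Callable[[str, str], float]
    decision: Callable[[float], bool]

    def run(self, left: List[str], right: List[str]) -> List[Pair]:
        for step in self.preprocessing:
            left = [step(v) for v in left]
            right = [step(v) for v in right]
        return [(i, j) for i, j in self.search_space_reduction(left, right)
                if self.decision(self.comparison(left[i], right[j]))]


workflow = Workflow([normalize, upper_case], blocking, equality, threshold_rule(1.0))
result = workflow.run(["  monica  scannapieco", "Diego Milano"],
                      ["MONICA SCANNAPIECO", "Carola Aiello"])
print(result)  # [(0, 0)]
```

Swapping `blocking` for another reduction method, or `equality` for another comparator, yields a different workflow without touching the other phases.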
18
Making Some Phases Automatic
Data profiling for choosing matching keys
Automatic extraction of:
– Completeness
– Consistency
– Identification power
Ongoing
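One way profiling could drive the choice of matching keys is sketched below. The slides do not define the metrics, so the definitions here are assumptions: completeness is taken as the fraction of non-missing values, identification power as the ratio of distinct to non-missing values, and the ranking score as their product. Consistency is left out because it needs domain rules. The sample rows are hypothetical.

```python
def is_present(value):
    return value not in (None, "")


def completeness(rows, attribute):
    """Assumed definition: fraction of rows with a non-missing value."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if is_present(r.get(attribute))) / len(rows)


def identification_power(rows, attribute):
    """Assumed definition: distinct values divided by non-missing values."""
    values = [r[attribute] for r in rows if is_present(r.get(attribute))]
    return len(set(values)) / len(values) if values else 0.0


def rank_matching_keys(rows, attributes):
    """Rank candidate keys by completeness times identification power."""
    def score(attribute):
        return completeness(rows, attribute) * identification_power(rows, attribute)
    return sorted(attributes, key=score, reverse=True)


# Hypothetical sample rows for illustration only.
rows = [
    {"dob": "11/20/67", "zip": "00198", "gender": "F"},
    {"dob": "02/07/81", "zip": "00159", "gender": "M"},
    {"dob": "02/08/81", "zip": None,    "gender": "F"},
    {"dob": "08/07/76", "zip": "00114", "gender": "M"},
]
ranking = rank_matching_keys(rows, ["gender", "zip", "dob"])
print(ranking)  # ['dob', 'zip', 'gender']
```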
19
Status of RELAIS
Currently: guided execution of RL workflows with all phases automatic
Future:
– Definition of RELAIS's architecture as a service-oriented, web-accessible architecture. Formal specification of (i) input/output of services, and (ii) pre/post conditions, by means of semantic Web Services technologies
– Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al., VLDB 2005] or semi-automatic [Bouguettaya et al., VLDBJ 2003] service composition techniques
20
Implementation View
[Diagram: PQ-RELAIS, composed of Q-RELAIS and P-RELAIS around the Record Linkage Workflow]
Q-RELAIS:
– Data source profiling (quality metadata)
– Quality-based trust evaluation
– Automatic and flexible RL
P-RELAIS:
– Privacy risk assessment
– Private RL