Privacy Preserving Schema and Data Matching
Scannapieco, Bertino, Figotin and Elmagarmid
Presented by: Vidhi Thapa

INTRODUCTION
- Record matching: the process of identifying records that represent the same real-world entity
- Can be executed:
  - within a single source
  - across sources
- Goal: record matching that preserves the privacy of both data and schema

RECORD MATCHING
- Record matching involves:
  - sharing and integrating data
  - protecting the privacy of the data
- Two major innovations:
  - approximate matching
  - awareness of schema information

EMBEDDING
- Embed records in a Euclidean space
- Method used: SparseMap
- Comparison function: edit distance
- Matching decision rule: classify record pairs as match / non-match
- Record matching

EXAMPLE: EDIT DISTANCE
- e("Virginia", "Vermont") = 5
- Virginia → Verginia → Verminia → Vermonia → Vermonta → Vermont
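The edit distance in the example above can be checked with a short sketch. This is the standard Levenshtein dynamic program (unit cost for insertion, deletion, and substitution); the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Virginia", "Vermont"))  # 5, as on the slide
```

The five operations correspond to the chain on the slide: four substitutions followed by one deletion.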

HYPOTHESIS
- Parties P and Q store the records to be matched in the relations R_P(A_1, ..., A_n) and R_Q(B_1, ..., B_n) respectively
- Two hypotheses about the relations:
  1. they have identical schemas
  2. they have possible schema-level conflicts
- Record matching is performed between R_P and R_Q
- P learns only a set P_Match, consisting of the records in R_P that match records in R_Q
- Similarly, Q learns only the set Q_Match

SECURE DATA MATCHING
- Pairs of records are compared by means of a comparison function
- A third party is introduced to assure privacy
- SparseMap: reference sets
- Metric space
- Number of subsets = ⌊log_2 N⌋^2
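A minimal sketch of how ⌊log_2 N⌋^2 reference subsets can define an embedding. It assumes the usual Lipschitz-style construction (subsets of size 2^i for i = 1..⌊log_2 N⌋, with ⌊log_2 N⌋ subsets per size) and uses the exact distance d(o, S) = min over x in S of d(o, x) for each coordinate. The names `build_reference_sets` and `embed`, and the toy integer objects, are illustrative assumptions; SparseMap itself replaces the exact distance with the heuristics described on the next slide.

```python
import math
import random


def build_reference_sets(objects, seed=0):
    """Draw floor(log2 N)^2 random subsets; sizes 2, 4, ..., 2^k (assumed layout)."""
    rng = random.Random(seed)
    k = math.floor(math.log2(len(objects)))
    sets = []
    for i in range(1, k + 1):
        size = min(2 ** i, len(objects))
        for _ in range(k):
            sets.append(rng.sample(objects, size))
    return sets


def embed(obj, reference_sets, dist):
    """Coordinate i is the distance from obj to its nearest member of set i."""
    return [min(dist(obj, r) for r in s) for s in reference_sets]


# Toy metric space: 16 integers with absolute difference as the distance.
objects = list(range(0, 160, 10))
dist = lambda a, b: abs(a - b)
reference_sets = build_reference_sets(objects)
print(len(reference_sets))              # floor(log2 16)^2 subsets
print(embed(35, reference_sets, dist))  # one coordinate per subset
```

Each coordinate is 1-Lipschitz: for any two objects, the coordinate difference never exceeds their original distance.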

HEURISTIC
- Distance approximation
  - Input: object o, set S_i
  - Output: approximation of d(o, S_i)
- Greedy sampling
  - Input: m coordinates
  - Output: the t <= m most discriminating coordinates

DATA MATCHING PROTOCOL
- Assume parties P and Q store the records to be matched in the relations R_P(A_1, ..., A_n) and R_Q(B_1, ..., B_n) respectively
- A third-party-based protocol consists of the following three phases:
  - Phase 1: setting of the embedding space
  - Phase 2: embedding of the R_P and R_Q values
  - Phase 3: comparison to decide matching records

Phase 1

Phase 2

ILLUSTRATION
- Stress
- E.g.: Academic (8.0, 5.0, 7.0, 7.0) and usefull (6.0, 6.0, 6.0, 7.0)
  - Using 1st coordinate –
  - Using 2nd coordinate –
  - Using 3rd coordinate –
  - Using 4th coordinate – 1.0
- Choose 1st coordinate
  - Using 1st and 2nd coordinates –
  - Using 1st and 3rd coordinates –
  - Using 1st and 4th coordinates –
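The greedy selection illustrated above can be sketched as follows. The slide does not give the stress formula, so the code assumes the common definition: the sum of (d_emb - d)^2 over all object pairs divided by the sum of d^2, where d is the original edit distance and d_emb is the Euclidean distance restricted to the chosen coordinates. Under that assumption the 4th coordinate, on which the two vectors agree, yields a stress of 1.0, matching the one value that survives on the slide. All function names are illustrative.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance, used as the original (non-embedded) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def stress(pairs, coords):
    """Assumed stress: sum((d_emb - d)^2) / sum(d^2) over (vec_a, vec_b, d) pairs."""
    num = den = 0.0
    for vec_a, vec_b, d in pairs:
        d_emb = math.sqrt(sum((vec_a[c] - vec_b[c]) ** 2 for c in coords))
        num += (d_emb - d) ** 2
        den += d ** 2
    return num / den


def greedy_select(pairs, m, t):
    """Pick t of m coordinates, each time adding the one that minimizes stress."""
    chosen = []
    while len(chosen) < t:
        remaining = [c for c in range(m) if c not in chosen]
        best = min(remaining, key=lambda c: stress(pairs, chosen + [c]))
        chosen.append(best)
    return chosen


d = edit_distance("Academic", "usefull")
pairs = [((8.0, 5.0, 7.0, 7.0), (6.0, 6.0, 6.0, 7.0), d)]
for c in range(4):
    print(f"coordinate {c + 1}: stress = {stress(pairs, [c]):.4f}")
print("first coordinate chosen:", greedy_select(pairs, m=4, t=1)[0] + 1)
```

With a single pair of objects the selection is trivial; in practice the stress would be summed over a sample of many pairs.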

Phase 3
- Given a vector v in P_str and a vector w in Q_str, the Euclidean distance between them is calculated
- A decision rule is applied to every record comparison: if it holds, the records of P_str and Q_str are inserted into the two sets P_Match and Q_Match respectively
- The final sets are sent to the respective parties
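A minimal sketch of the comparison phase as the third party might run it. The slides do not state the decision rule, so a simple distance threshold is assumed here; the record identifiers, vectors, and the name `match_embedded` are hypothetical.

```python
import math


def match_embedded(p_str, q_str, threshold):
    """Compare every embedded P record with every embedded Q record.

    p_str and q_str map record identifiers to embedded vectors.
    Assumed decision rule: a pair matches if its Euclidean distance
    is at most `threshold`.
    """
    p_match, q_match = set(), set()
    for pid, v in p_str.items():
        for qid, w in q_str.items():
            if math.dist(v, w) <= threshold:
                p_match.add(pid)
                q_match.add(qid)
    return p_match, q_match


p_str = {"p1": (8.0, 5.0, 7.0, 7.0), "p2": (1.0, 1.0, 1.0, 1.0)}
q_str = {"q1": (8.0, 5.0, 6.0, 7.0), "q2": (20.0, 20.0, 20.0, 20.0)}
p_match, q_match = match_embedded(p_str, q_str, threshold=1.5)
print(p_match, q_match)  # P receives only p_match, Q only q_match
```

The third party sees only embedded vectors, never the original strings, and each party receives only its own match set.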

SECURE SCHEMA MATCHING
- S_W: global schema owned by the third party W
- L_W: language
- α_W: alphabet
- S_P and S_Q: the source schemas owned by the two parties
- Example: if S_W is Customer(Name, DateofBirth, ResidenceAddress) and S_P is Cust(FirstName, LastName, DateofBirth), the mapping is concatenate(Cust.FirstName, Cust.LastName) = Customer.Name

SECURE SCHEMA MATCHING (contd)
- P generates S_P'(D_1, ..., D_s) from the mapping of S_P with S_W(D_1, ..., D_L)
- Q generates S_Q'(D_1, ..., D_x) from the mapping of S_Q with S_W(D_1, ..., D_L)
- P and Q negotiate:
  - a secret key k
  - the embedding parameters (Lx, N, dist)
  - a hash function h
- P sends H_P = (h(D_1, k), ..., h(D_s, k)) to W
- Q sends H_Q = (h(D_1, k), ..., h(D_x, k)) to W
- W computes the intersection H_P ∩ H_Q
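The exchange above can be sketched with a keyed hash. The slides do not name a concrete h, so HMAC-SHA256 from the Python standard library is assumed; the attribute lists for P and Q and the key value are hypothetical, with P's attributes taken from the Customer example on the previous slide.

```python
import hashlib
import hmac


def keyed_hashes(attributes, key: bytes):
    """Map h(D, k) -> D for each global-schema attribute D (h assumed to be HMAC-SHA256)."""
    return {
        hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest(): name
        for name in attributes
    }


# P and Q agree on a secret key that W never sees (hypothetical value).
k = b"secret negotiated by P and Q"

# Each party expresses its schema in terms of the global schema S_W.
p_table = keyed_hashes(["Name", "DateofBirth"], k)       # S_P'
q_table = keyed_hashes(["Name", "ResidenceAddress"], k)  # S_Q' (hypothetical)

# W receives only the hash values and intersects them.
h_p, h_q = set(p_table), set(q_table)
common = h_p & h_q

# Back at P, the common hashes are translated to attribute names again.
print([p_table[h] for h in common])
```

Because W lacks the key, it learns how many attributes match but cannot test guesses about which attributes they are.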

SECURITY ANALYSIS
- Length of the database
- Database size
- Set of matching records
- Set of matching attributes
- Number of matching attributes

EXPERIMENTAL EVALUATION

CONCLUSION
- Privacy-preserving record matching between two parties that can have different schemas
- Requires privacy at the schema level
- Privacy is obtained by embedding records in a vector space
- Applications: DNA sequences, images, proteins, etc.