Database Laboratory Regular Seminar 2013-08-05 TaeHoon Kim.

Presentation transcript:


Contents
1. Introduction
2. Related Work
3. Problem Statement
4. Distributed Anonymization
5. R-Tree Generalization
6. Performance Analysis
7. Conclusion

1. Introduction
- Cloud computing is a long-dreamed vision of computing
  - Cloud consumers can remotely store their data in the cloud
  - They can then enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources
- Successful third-party cases
  - Success cases on EC2 include Nimbus Health[2], which manages patient medical records
  - Another success case is ShareThis[3], a social content-sharing network that has shared 340 million items across 30,000 web sites

1. Introduction
- Vulnerable data privacy
  - Unfortunately, such data sharing is subject to constraints imposed by the privacy of individuals
    - This is consistent with related work on cloud security[4][6][7][8]
  - Researchers have shown that attackers can effectively target and observe information in third-party clouds[9]
- To protect data privacy, the sensitive information of individuals should be preserved
  - Partition-based privacy-preserving data publishing techniques
    - k-anonymity, (a,k)-anonymity, l-diversity, t-closeness, m-invariance, etc.

1. Introduction
- Privacy-preserving data publishing for a single dataset has been extensively studied
  - Generalization, suppression, perturbation
- Xiong et al.[5]
  - Data anonymization for horizontally partitioned datasets
  - A distributed anonymization protocol
    - It gives only a uniform approach that exerts the same level of protection for all data providers
- How to design a new distributed anonymization protocol over cloud servers
  - We propose a new distributed anonymization protocol
  - We design an algorithm that inserts data objects into an R-tree for anonymization, on top of the k-anonymity and l-diversity principles

2. Related Work
- Privacy-preserving data publishing
  - k-anonymity[11], (a,k)-anonymity[12], l-diversity[13], t-closeness[30], and m-invariance[14] each define a criterion for judging whether a published dataset provides a certain level of privacy preservation
  - In this study, our distributed anonymization protocol is built on top of the k-anonymity and l-diversity principles
  - We propose a new anonymization algorithm that inserts all data objects into an R-tree to achieve high-quality generalization

2. Related Work
- Distributed anonymization solutions
  - Naïve solution
    - Each data provider implements data anonymization independently
    - Since the data is anonymized before integration, the main drawback of this solution is low data utility
  - Alternative solution: assume the existence of a third party that can be trusted by all data providers
    - A trusted third party is not always feasible
    - Compromise of the server by attackers could lead to a complete privacy loss for all participating parties and data subjects

2. Related Work
- Jiang et al.[26] presented a two-party framework along with an application
- Zhong et al.[27] proposed provably private solutions that do not disclose data from one site to the other
- Xiong et al.[25] presented a distributed anonymization protocol
- In contrast to the above work, our work aims at outsourcing data providers' private datasets to cloud servers for data sharing

3. Problem Statement
- The union of all local databases is denoted as the microdata set D, as given in Definition 1
- Each site produces a local anonymized database d_i*
  - d_i* meets the site's own privacy principle k_i, since data providers have different privacy requirements for publishing

3. Problem Statement
[Figure: each site (Node1, Node2, Node3) produces a local anonymized database d_i*]

3. Problem Statement (Goal)
- Privacy for data objects based on anonymity
  - k-anonymity[11][19]
    - Each record is indistinguishable from at least k-1 other records with respect to the quasi-identifier (QI) attributes; such a set of records forms a QI group (equivalence class)
  - l-diversity[13]
    - Each equivalence class contains at least l diverse sensitive values
- Privacy between data providers
  - Our second privacy goal is to avoid attacks between data providers: an individual dataset should reveal nothing about its data to the other data providers apart from the virtual anonymized database
- We use a distributed anonymization algorithm to build a virtual K-anonymous database and to ensure that each locally anonymized table d_i* is k_i-anonymous
  - Use an R-tree
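The two anonymity goals above can be checked mechanically on a generalized table. The following is a minimal sketch, not the presented protocol: the attribute names (`age`, `zip`, `disease`) and the sample rows are hypothetical, and l-diversity is read in its simplest "distinct values" form.

```python
from collections import defaultdict

def equivalence_classes(records, qi_keys):
    """Group records that share the same (generalized) quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[key] for key in qi_keys)].append(rec)
    return list(groups.values())

def is_k_anonymous(records, qi_keys, k):
    """True if every equivalence class holds at least k records."""
    return all(len(g) >= k for g in equivalence_classes(records, qi_keys))

def is_l_diverse(records, qi_keys, sensitive_key, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    return all(
        len({rec[sensitive_key] for rec in g}) >= l
        for g in equivalence_classes(records, qi_keys)
    )

# Hypothetical generalized table (values are illustrative only).
table = [
    {"age": "[11-30]", "zip": "130**", "disease": "flu"},
    {"age": "[11-30]", "zip": "130**", "disease": "cancer"},
    {"age": "[65-80]", "zip": "148**", "disease": "flu"},
    {"age": "[65-80]", "zip": "148**", "disease": "flu"},
]

k_ok = is_k_anonymous(table, ["age", "zip"], 2)
l_ok = is_l_diverse(table, ["age", "zip"], "disease", 2)
```

In this sample, the table is 2-anonymous, but the second class contains only one distinct disease, so it is not 2-diverse.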

4. Distributed Anonymization
- Protocol
  - The main idea of the distributed anonymization protocol is to use secure multi-server computation protocols to realize the R-tree generalization method in the cloud setting
- Notation
  - I: d-dimensional rectangle that is the bounding box of the QI group's QI values
  - Num: the total number of data objects in the equivalence class

4. Distributed Anonymization
- Example of generalization
  - Equivalence class (QI group) of Node0: from [11-13][ ] to [11-30][ ]
  - Equivalence class (QI group) of Node1: from [73-80][ ] to [65-80][ ]
  - Equivalence class (QI group) of Node2: from [65-76][ ] to [65-80][ ]
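The Node1 and Node2 rows in the example above end at the same interval, which is what one would get by taking the smallest interval covering both. A minimal sketch of that one-dimensional step follows; it assumes the two classes are generalized together, and it uses only the first dimension because the second dimension's values are not present in the transcript. The function name `generalize` is hypothetical.

```python
def generalize(a, b):
    """Return the smallest closed interval that covers both intervals a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

node1 = (73, 80)  # first QI dimension of Node1's equivalence class
node2 = (65, 76)  # first QI dimension of Node2's equivalence class
merged = generalize(node1, node2)
```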

4. Distributed Anonymization
- Example of the split process
  - When e3 is inserted, the R-tree node splits into two groups, with e1 and e3 in one group
  - When e4 arrives, e1 and e3 form one group, and e2 and e4 form the other
  - Finally, when e5 arrives, e2 and e4 form one group and e5 the other

5. R-Tree Generalization
- Index structure
  - Leaf node: (I, SI)
    - I: d-dimensional rectangle that is the bounding box of the QI group's QI values
    - SI: sensitive information for a tuple
  - Non-leaf node: (I, childPointer)
    - I: covers all rectangles in the lower node's entries
    - childPointer: the address of a lower node in the R-tree
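The index structure above could be represented roughly as follows. This is an illustrative sketch only: the class names, the `Node` container, the `bounding_box` helper, and the sample values are assumptions, while the field names `I`, `SI`, and `childPointer` follow the slide.

```python
from dataclasses import dataclass, field

# A rectangle is a tuple of (low, high) pairs, one pair per QI dimension.

@dataclass
class LeafEntry:
    I: tuple   # bounding box of the QI values
    SI: str    # sensitive information for the tuple

@dataclass
class NonLeafEntry:
    I: tuple               # covers all rectangles in the child node's entries
    childPointer: "Node"   # reference to the lower node

@dataclass
class Node:
    is_leaf: bool
    entries: list = field(default_factory=list)

def bounding_box(rects):
    """Smallest rectangle covering every rectangle in rects."""
    dims = len(rects[0])
    return tuple(
        (min(r[d][0] for r in rects), max(r[d][1] for r in rects))
        for d in range(dims)
    )

# Hypothetical two-dimensional leaf with two tuples.
leaf = Node(is_leaf=True, entries=[
    LeafEntry(I=((11, 13), (1, 2)), SI="flu"),
    LeafEntry(I=((20, 30), (4, 5)), SI="cancer"),
])
parent_entry = NonLeafEntry(
    I=bounding_box([e.I for e in leaf.entries]),
    childPointer=leaf,
)
```

Here `parent_entry.I` is the generalized QI range that would be published for the tuples stored in the leaf.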

5. R-Tree Generalization
- Insertion
  - At the root level, the algorithm chooses the entry whose rectangle needs the least area enlargement to cover the new object a; R1 is selected because its rectangle does not need to be enlarged, while the rectangle of R2 would need to expand considerably
- Node splitting (when a leaf node overflows)
  - Pick as seeds the two entries that would waste the most area if covered by a single rectangle
  - The remaining entries are then assigned, one at a time, to one of the two groups
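The insertion and seed-picking rules above can be sketched as below. This follows the usual quadratic-split reading of "largest area enlargement" (the pair whose joint box wastes the most area), which is an assumption about the presented algorithm; all function names and rectangles are hypothetical.

```python
from itertools import combinations

def area(r):
    """Area (volume) of a rectangle given as (low, high) pairs per dimension."""
    result = 1.0
    for low, high in r:
        result *= (high - low)
    return result

def union(r, s):
    """Smallest rectangle covering both r and s."""
    return tuple((min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(r, s))

def enlargement(r, obj):
    """Extra area r needs in order to cover obj."""
    return area(union(r, obj)) - area(r)

def choose_entry(rects, obj):
    """Index of the rectangle needing least enlargement (ties: smaller area)."""
    return min(range(len(rects)),
               key=lambda i: (enlargement(rects[i], obj), area(rects[i])))

def pick_seeds(rects):
    """Index pair whose joint bounding box wastes the most area."""
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

R1 = ((0, 10), (0, 10))
R2 = ((20, 25), (20, 25))
a = ((2, 3), (2, 3))            # new object lies inside R1
chosen = choose_entry([R1, R2], a)

overflow = [((0, 1), (0, 1)), ((1, 2), (1, 2)), ((9, 10), (9, 10))]
seeds = pick_seeds(overflow)
```

With these values, `chosen` refers to R1 (no enlargement needed), and the seeds are the two entries that lie farthest apart.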

6. Performance Analysis
- Experimental environment
  - Amazon's EC2 platform
  - Implemented in Java and run on a set of EC2 computing units
  - Each computing unit is a small EC2 instance with a 1.7GHz Xeon processor, 1.7GB memory, and a 160GB hard disk
  - Computing units are connected via 250Mbps network links
- We use three different datasets, with Uniform, Gaussian, and Zipf distributions, to evaluate our distributed anonymization scheme

6. Performance Analysis
- Dataset and setup
  - Centralized setting: all 100K tuples are located in one centralized database
  - Distributed setting: data are distributed among the 10 nodes, and we use the distributed anonymization approach presented in Section 4
  - The R-tree generalization algorithm is used to generalize the database to be K-anonymous
  - DM (discernibility metric) assigns each tuple r_i* in D* a penalty that is determined by the size of the equivalence class containing it
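As an illustration of the discernibility metric, the sketch below assumes the common form in which each tuple's penalty equals the size of its equivalence class, so a class of size s contributes s * s in total. The slide does not give the exact formula, so this form, the function name, and the class sizes are assumptions.

```python
def discernibility(class_sizes):
    """Sum of per-tuple penalties, where each tuple is charged its class size."""
    return sum(size * size for size in class_sizes)

# Hypothetical anonymized table with three equivalence classes.
dm = discernibility([2, 2, 3])
```

A lower value indicates smaller equivalence classes and therefore less information loss from generalization.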

6. Performance Analysis
- Absolute error = | actual - estimate |
  - Actual is the correct number of answers to the range query
  - Estimate is the size of the candidate set computed from the anonymous table
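The error measure above can be sketched for a single range query as follows. The way the candidate set is formed here (counting every tuple of each equivalence class whose bounding box intersects the query) is an assumption, as are the function names and the sample data; each class is represented as the pair (I, Num) from the notation in Section 4.

```python
def intersects(box, query):
    """True if two rectangles overlap in every dimension."""
    return all(b[0] <= q[1] and q[0] <= b[1] for b, q in zip(box, query))

def contains(query, point):
    """True if the point lies inside the query rectangle."""
    return all(low <= x <= high for (low, high), x in zip(query, point))

def actual_count(points, query):
    """Correct number of range query answers on the original data."""
    return sum(1 for p in points if contains(query, p))

def estimate_count(classes, query):
    """Candidate set size: Num of every class whose box I intersects the query."""
    return sum(num for box, num in classes if intersects(box, query))

def absolute_error(actual, estimate):
    return abs(actual - estimate)

# Hypothetical original points and their anonymized equivalence classes (I, Num).
points = [(12, 1), (20, 3), (70, 2), (78, 4)]
classes = [(((11, 30), (1, 3)), 2), (((65, 80), (2, 4)), 2)]
query = ((10, 15), (0, 5))

actual = actual_count(points, query)
estimate = estimate_count(classes, query)
error = absolute_error(actual, estimate)
```

In this sample, only one original point falls inside the query, but the whole first class becomes a candidate, so the estimate overcounts by one.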

7. Conclusion
- Two directions have been presented
  - A distributed anonymization protocol for privacy-preserving data publishing from multiple data providers in a cloud system
  - A new anonymization algorithm using the R-tree index structure
- Future work
  - Developing a protocol toolkit that incorporates more privacy principles, such as differential privacy
  - Building indexes on anonymized cloud data to offer more efficient and reliable data analysis

Q/A
- Thank you for listening to my presentation

References

- Differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying their records