1 Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation Chris Giannella cgiannel AT acm DOT org

2 Talk Outline
● Introduction
  – Privacy preserving data mining: what problem does it aim to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary

3 An Example Problem
● The U.S. Census Bureau collects a great deal of data.
● If released in raw form, this data would provide:
  – a wealth of valuable information regarding broad population patterns
  – access to private information regarding individuals
→ How can analysts be allowed to extract population patterns without learning private information?

4 Privacy-Preserving Data Mining
“The study of how to produce valid mining models and patterns without disclosing private information.”
- F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006.
Several broad approaches exist… this talk → data transformation (the “census model”).

5 Data Transformation (the “Census Model”)
[Diagram: Private Data → Transformed Data → Data Miner / Researcher]

6 DT Objectives
● Minimize the risk of disclosing private information.
● Maximize the analytical utility of the transformed data.
DT is also studied in the field of Statistical Disclosure Control.

7 Some things DT does not address…
● Preventing unauthorized access to the private data (e.g., hacking).
● Securely communicating private data.
→ DT and cryptography are quite different. (Moreover, standard encryption does not solve the DT problem.)

8 Assessing Transformed Data Utility
How accurately does a transformation preserve certain kinds of patterns? For example:
● data mean and covariance
● Euclidean distance between data records
● the underlying generating distribution
How useful are the preserved patterns for drawing conclusions/inferences? (A simple check of the first two pattern types is sketched below.)
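
As a rough illustration (my own sketch, not from the talk), the snippet below compares how well a transformed data set Y preserves the mean, covariance, and pairwise Euclidean distances of the original data X; the function names are hypothetical.

```python
import numpy as np

def pairwise_dists(X):
    # Euclidean distances between all pairs of rows of X.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def utility_report(X, Y):
    """Rough utility check: X and Y are (n_records, n_attributes) arrays,
    Y being the transformed version of X."""
    return {
        "mean_error": np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)),
        "cov_error": np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False)),
        "avg_distance_error": np.abs(pairwise_dists(X) - pairwise_dists(Y)).mean(),
    }
```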

9 Assessing Privacy Disclosure Risk
● There have been some efforts in the literature to develop rigorous definitions of disclosure risk
  – no widely accepted definition has emerged
● This talk takes an ad hoc approach:
  – for a specific attack, how closely can any private data record be estimated?

10 Talk Outline
● Introduction
  – Privacy preserving data mining: what problem does it aim to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary

11 Some DT approaches
● Discussed in this talk:
  – Additive independent noise
  – Euclidean distance preserving transformation (my current research)
● Others:
  – Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.

12 Additive Independent Noise
For each private data record (x_1,…,x_n), add independent random noise to each entry:
  – (y_1,…,y_n) = (x_1 + e_1,…,x_n + e_n)
  – e_i is generated independently as N(0, d·Var(i)), where Var(i) is the variance of attribute i
  – Increasing d reduces privacy disclosure risk (a sketch follows)
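
A minimal sketch of this perturbation in Python (my own illustration, not code from the talk); the function name and the use of the per-attribute sample variance are assumptions consistent with the description above.

```python
import numpy as np

def add_independent_noise(X, d, rng=None):
    """Additive independent noise: each entry of each record gets
    e_i ~ N(0, d * Var(attribute i)) added to it.
    X : (n_records, n_attributes) array of private data."""
    rng = np.random.default_rng() if rng is None else rng
    var = X.var(axis=0)                                    # per-attribute variance
    noise = rng.normal(0.0, np.sqrt(d * var), size=X.shape)
    return X + noise
```

Larger d means more distortion: disclosure risk drops, but so does the utility of the released data (the trade-off noted on slide 14).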

13 Additive Independent Noise
[Figure: original vs. noise-perturbed data, d = 0.5]

14 Additive Independent Noise
● It is difficult to set d so that it produces both low privacy disclosure risk and high data utility.
● Some enhancements of the basic idea exist, e.g., Muralidhar et al.

15 Talk Outline
● Introduction
  – Privacy preserving data mining: what problem does it aim to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation (EDPDT)
● Wrap-up summary

16 EDPDT – High Data Utility!
● Many data clustering algorithms use Euclidean distance to group records, e.g.:
  – K-means clustering, hierarchical agglomerative clustering
● If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data (illustrated in the sketch below).
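
A small demonstration of that claim (my own illustration; it assumes scikit-learn is available and uses a random orthogonal matrix as the distance-preserving transformation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Toy "private" data: 200 records, 3 attributes, 4 natural clusters.
X, _ = make_blobs(n_samples=200, n_features=3, centers=4, random_state=0)

# A random orthogonal matrix M is distance preserving; with records as rows,
# the transformed data is X @ M.T.
Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
M = Q * np.sign(np.diag(R))
Y = X @ M.T

# Pairwise Euclidean distances are identical, so K-means finds the same
# partition on both data sets (cluster labels may be permuted).
labels_X = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_Y = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Y)
print(adjusted_rand_score(labels_X, labels_Y))   # 1.0 => identical partitions
```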

17 EDPDT – High Data Utility!
[Figure: scatter plots of the original data and the transformed data]

18 EDPDT – Unclear Privacy Disclosure Risk
● Assessing this risk is the focus of the research. The approach →
  – Develop attacks combining the transformed data with plausible prior knowledge.
  – Ask how well these attacks can estimate private data records.

19 Two Different Prior Knowledge Assumptions
● Known input: the attacker knows a small subset of the private data records.
  – Focus of this talk.
● Known sample: the attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
  – Happy to discuss “off-line”.

20 Known Input Prior Knowledge
Underlying assumption: individuals know (a) whether there is a record for them among the private data records, and (b) the attributes of the private data records.
→ Each individual knows one private record.
→ A small group of malicious individuals could cooperate to produce a small subset of the private data records.

21 Known Input Attack
Given: {Y_1,…,Y_m} (transformed data records) and {X_1,…,X_k} (known private data records):
1) Determine the transformation constraints, i.e., which transformed records came from which known private records.
2) Choose T randomly from the set of all distance-preserving transformations that satisfy the constraints.
3) Apply T⁻¹ to the transformed data.

22 Known Input Attack – 2D data, 1 known private data record

23 Known Input Attack – General Case
Y = MX
● Each column of X (Y) is a private (transformed) data record.
● M is an orthogonal matrix.
[Y_kn  Y_un] = M [X_known  X_unknown]
Attack: choose T randomly from {T an orthogonal matrix : T X_known = Y_kn}, then produce T⁻¹(Y_un). (A sketch of this construction follows.)
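
Here is a minimal numpy sketch of that construction (my own illustration, not the authors' code). It assumes the correspondence between X_known and Y_kn is already established (step 1 of the attack) and that X_known has full column rank with k < n; the helper names are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    # Random d x d orthogonal matrix via QR of a Gaussian matrix.
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

def complement_basis(Q):
    # Orthonormal basis of the orthogonal complement of col(Q).
    n, k = Q.shape
    U, _, _ = np.linalg.svd(np.eye(n) - Q @ Q.T)
    return U[:, :n - k]

def known_input_attack(Y_un, X_known, Y_kn, rng=None):
    """Estimate the unknown private records as T^{-1} Y_un, where T is a
    randomly chosen orthogonal matrix satisfying T X_known = Y_kn.
    All arguments have records as columns (n attributes x #records)."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X_known.shape

    # Since M is orthogonal, X_known and Y_kn = M X_known share the same
    # R factor in their QR decompositions, up to signs on the diagonal.
    Qx, Rx = np.linalg.qr(X_known)
    Qy, Ry = np.linalg.qr(Y_kn)
    Qy = Qy * (np.sign(np.diag(Rx)) * np.sign(np.diag(Ry)))   # align signs

    # T must map col(X_known) onto col(Y_kn) consistently; its behaviour on
    # the orthogonal complement is filled in at random.
    Nx, Ny = complement_basis(Qx), complement_basis(Qy)
    W = random_orthogonal(n - k, rng)
    T = np.hstack([Qy, Ny @ W]) @ np.hstack([Qx, Nx]).T       # T X_known == Y_kn

    return T.T @ Y_un    # T is orthogonal, so T^{-1} = T^T
```

With few known records the random complement is large and the estimate is loose; as k approaches the number of attributes n, T becomes pinned down and the unknown records are recovered almost exactly, which is consistent with the P_k results on the next slide.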

24 Known Input Attack – Experiments
An 18,000-record, 16-attribute real data set. Given k known private data records, we computed P_k, the probability that the attack estimates one unknown private record with > 85% accuracy:
P_2 = 0.16, P_4 = 1, …, P_16 = 1

25 Wrap-Up Summary
● Introduction
  – Privacy preserving data mining: what problem does it aim to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation

26 Thanks to…
● You: for your attention
● Kun Liu: joint research & some material used in this presentation
● Krish Muralidhar: some material used in this presentation
● Hillol Kargupta: joint research

27 Distance Preserving Perturbation
[Diagram: the data matrix, with rows as attributes and columns as records]

28 Distance Preserving Perturbation
Numerical example of Y = MX (the ID attribute is carried over, not transformed):

X (private data):
  ID       1001     1002
  Tax      2,899    2,578
  Rent     1,889    1,324
  Wages   98,563   83,821

M (orthogonal matrix):
   0.4527   0.8887   0.0726
   0.2559  -0.0514  -0.9653
  -0.8542   0.4556  -0.2507

Y = MX (transformed data):
  ID        1001      1002
  Tax      10,151     8,432
  Rent    -94,502   -80,324
  Wages   -26,326   -22,613
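
A quick check of the numbers above (my own verification snippet, not part of the slides): M is orthogonal to the printed precision, and the Euclidean distance between the two records is unchanged by the transformation.

```python
import numpy as np

M = np.array([[ 0.4527,  0.8887,  0.0726],
              [ 0.2559, -0.0514, -0.9653],
              [-0.8542,  0.4556, -0.2507]])

# Private records as columns, attribute order (Tax, Rent, Wages).
X = np.array([[ 2899.0,  2578.0],
              [ 1889.0,  1324.0],
              [98563.0, 83821.0]])

Y = M @ X

print(np.round(M @ M.T, 3))    # ~ identity matrix => M is orthogonal
print(np.round(Y))             # ~ the slide's transformed values (M is rounded to 4 decimals)
print(np.linalg.norm(X[:, 0] - X[:, 1]),
      np.linalg.norm(Y[:, 0] - Y[:, 1]))   # the two distances agree (up to rounding)
```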

29 Known Sample Attack

30 Known Sample Attack Experiments (backup)
Fig.: Known sample attack on the Adult data set with 32,561 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).
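
For reference, one common way to compute an average relative error like the 10.81% figure is sketched below; the slides do not spell out the exact metric, so this per-record formulation is an assumption.

```python
import numpy as np

def avg_relative_error(X_true, X_est):
    # Average of ||x_est - x|| / ||x|| over all records (rows).
    num = np.linalg.norm(X_est - X_true, axis=1)
    den = np.linalg.norm(X_true, axis=1)
    return float(np.mean(num / den))
```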

