1
A Condensation Approach for Privacy Preserving Data Mining
Charu Aggarwal, IBM T. J. Watson Research Center
Philip S. Yu, University of Illinois at Chicago
2
Introduction
In the era of big data, the large amount of data being collected has made the problem of privacy increasingly important. In many cases, users are not willing to divulge personal information unless its privacy is assured. Techniques are therefore required to process the data in a way that privacy is protected.
3
The Perturbation Approach
An innovative and fundamental approach, called the perturbation approach, has been proposed for privacy preserving data mining (Agrawal et al.). The perturbation approach adds noise to the data in order to create new records with an increased level of privacy. In order to run data mining algorithms, only aggregate distributions are reconstructed instead of individual records.
4
Correlation Preservation
The perturbation approach treats individual dimensions independently. Thus, inter-attribute correlations are not very well preserved by the perturbation approach. In many data mining problems, a lot of important information can be hidden in the inter-attribute correlations.
5
Locality Sensitivity
An additional aspect is the locality sensitivity of the privacy preservation process. Perturbing all parts of the data equally is not necessarily useful when some of the attributes are known publicly. Consider an individual with an outlying age value of 132: small perturbations do not protect the privacy of this record. On the other hand, for an individual with age 30, small perturbations can protect the privacy of the record.
6
The Condensation Approach
The condensation approach partitions the data into several groups. The number of data points in each group is the indistinguishability level of the group: a record cannot be distinguished from any other record in the same group. For each group, first-order and second-order statistical values are maintained.
7
Observation
The condensed statistics are also kept private (like the original data), because the condensed data is susceptible to attacks that exploit partial knowledge about the system. Pseudo-data is not as sensitive to such attacks.
8
Statistical Data Stored with Each Group
For each group, we store d² + d + 1 counts, where d is the dimensionality of the data records. For each pair of dimensions i and j, we store Sc_ij, the sum of the products of attribute values i and j over the group. For each dimension i, we store Fs_i, the sum of the attribute values for dimension i over the group. We also store the number of records n in each group.
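As a concrete illustration, here is a minimal sketch of how these per-group counts could be maintained (plain Python with NumPy; the class name GroupStats and its fields n, Fs and Sc are mine, not from the paper):

```python
import numpy as np

class GroupStats:
    """Condensed statistics for one group of d-dimensional records."""
    def __init__(self, d):
        self.n = 0                   # number of records in the group
        self.Fs = np.zeros(d)        # first-order sums, one per dimension
        self.Sc = np.zeros((d, d))   # second-order sums of pairwise products

    def add(self, x):
        """Fold one record into the group; every statistic is a plain sum."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.Fs += x
        self.Sc += np.outer(x, x)
```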
9
Remarks
Additivity property: each stored value is the sum of the corresponding values over different records. The covariance between attributes i and j can therefore be computed from the stored values alone, as Cov(i, j) = Sc_ij / n - (Fs_i / n) · (Fs_j / n).
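A sketch of that computation, reusing the GroupStats fields from the previous snippet; the identity used is the standard second-moment formula rather than anything quoted from the slides:

```python
import numpy as np

def covariance_matrix(g):
    """Cov(i, j) = Sc_ij / n - (Fs_i / n) * (Fs_j / n), from the condensed values only."""
    mean = g.Fs / g.n
    return g.Sc / g.n - np.outer(mean, mean)
```

No individual record is needed: the group's mean and covariance are fully determined by n, Fs and Sc.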
10
Anonymized Data Generation
Generate the covariance matrix C for each group and decompose it as C = P Λ P^T. The columns of P are the eigenvectors; eigenvectors represent directions of zero correlation. The diagonal entries of Λ are the eigenvalues, which represent the variances along these eigenvectors.
11
Anonymized Data Generation
Properties of eigenvectors and eigenvalues can be used to generate the anonymized data for each group. Since each group represents a data locality, the uniform data distribution assumption is used within each group. For each group, anonymized data records are constructed by assuming that the data is independently and uniformly distributed along each eigenvector, with variance equal to the corresponding eigenvalue.
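A hedged sketch of this generation step (the function name and the use of numpy.linalg.eigh are my choices; the paper itself only specifies the uniform-along-eigenvectors assumption). The mean vector and covariance matrix would come from the condensed statistics of the group, as in the earlier snippets:

```python
import numpy as np

def generate_pseudo_data(mean, cov, n_points, rng=None):
    """Draw pseudo-records that are independent and uniform along each eigenvector,
    with variance equal to the corresponding eigenvalue."""
    if rng is None:
        rng = np.random.default_rng()
    eigvals, eigvecs = np.linalg.eigh(cov)      # columns of eigvecs are the eigenvectors
    eigvals = np.clip(eigvals, 0.0, None)       # guard against tiny negative values
    half_range = np.sqrt(3.0 * eigvals)         # uniform on [-a, a] has variance a^2 / 3
    coords = rng.uniform(-1.0, 1.0, size=(n_points, len(eigvals))) * half_range
    return mean + coords @ eigvecs.T            # rotate back into the original attribute axes
```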
12
Static Group Construction
The static case is relatively straightforward: the entire database is available a priori. In static group construction, clusters of records of size k (or k+1) are created, and the values of Sc_ij and Fs_i are computed over each group.
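One plausible way the static clustering could be realized is a greedy nearest-neighbor grouping, offered only as a sketch; the paper's exact clustering procedure may differ:

```python
import numpy as np

def static_groups(X, k):
    """Greedily carve a data matrix X (rows = records) into groups of about k records."""
    remaining = list(range(len(X)))
    groups = []
    while len(remaining) >= 2 * k:
        seed = remaining[0]
        dist = np.linalg.norm(X[remaining] - X[seed], axis=1)
        members = [remaining[i] for i in np.argsort(dist)[:k]]   # the seed plus its k-1 nearest
        groups.append(members)
        chosen = set(members)
        remaining = [r for r in remaining if r not in chosen]
    groups.append(remaining)   # the last k..2k-1 records form the final group
    return groups
```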
13
Dynamic Case
In contrast to the static case, data points continue to stream in. Challenge: how to apply privacy preservation in one scan, generating groups of size between k and 2k.
14
Stream based Privacy Preservation
Data Stream → Dynamic Condensation-based Privacy Preservation Algorithm → Pseudo Data Stream → Stream Mining Algorithm (e.g., Classifier)
15
Initialization Procedure
In the initialization process, we start off with a small fraction of data points. We first apply the static approach on the initial set of data points. We then apply the dynamic update process.
16
Dynamic Updates
As a new data point arrives, it needs to be added to one of the groups. We determine the closest centroid among all groups and assign the data point to that group; the closest centroid is computable from the first-order statistics. After addition of the data point to the group, the corresponding statistics need to be updated.
17
Updating First and Second Order Values
Since the condensed statistics are additive, a new point can simply be folded into the condensed data of its group. In case of overflow of a group above 2k points, we need to split that group into two. This is not possible to do exactly, since the data has already been condensed; instead, the uniform distribution assumption is used to split the data.
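A sketch that stitches the last two slides together (nearest-centroid assignment, additive update, split on overflow); it reuses the GroupStats sketch from earlier, and split_group is outlined after the next slide:

```python
import numpy as np

def process_point(x, groups, k):
    """groups is a list of GroupStats objects; k is the minimum group size."""
    x = np.asarray(x, dtype=float)

    # 1. the closest centroid is computable from the first-order statistics alone
    centroids = np.array([g.Fs / g.n for g in groups])
    nearest = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

    # 2. additive update of the condensed statistics of that group
    groups[nearest].add(x)

    # 3. if the group has overflowed beyond 2k points, split it into two
    if groups[nearest].n > 2 * k:
        g1, g2 = split_group(groups[nearest])          # see the splitting sketch below
        groups[nearest:nearest + 1] = [g1, g2]
```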
18
Steps in Splitting
Split along the direction with the largest eigenvalue. The eigenvectors of both split groups remain the same, since the directions of correlation are unchanged. The largest eigenvalue gets divided by 4, whereas the others remain unchanged: under the uniform distribution assumption, splitting halves the range along that direction, which reduces the variance along it by a factor of four. Let P and Λ' be the eigenvector matrix and the new diagonal eigenvalue matrix respectively; recompute the covariance matrix of each split group as C' = P Λ' P^T, and from it the condensed statistics.
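A sketch of that splitting step under the uniform distribution assumption. The centroid offsets follow from the fact that a uniform interval with variance λ has length sqrt(12λ); whether the paper places the new centroids exactly this way is my assumption:

```python
import numpy as np

def split_group(g):
    """Split an overfull GroupStats object into two halves along its principal eigenvector."""
    n, mean = g.n, g.Fs / g.n
    cov = g.Sc / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    i = int(np.argmax(eigvals))                       # direction of the largest eigenvalue

    new_eigvals = eigvals.copy()
    new_eigvals[i] = eigvals[i] / 4.0                 # largest eigenvalue divided by 4
    new_cov = eigvecs @ np.diag(new_eigvals) @ eigvecs.T

    # centroids of the two halves of a uniform interval of length sqrt(12 * lambda_max)
    offset = (np.sqrt(12.0 * max(eigvals[i], 0.0)) / 4.0) * eigvecs[:, i]
    halves = []
    for size, m in zip((n // 2, n - n // 2), (mean - offset, mean + offset)):
        h = GroupStats(len(mean))                     # from the earlier sketch
        h.n = size
        h.Fs = size * m
        h.Sc = size * (new_cov + np.outer(m, m))      # Sc_ij = n * (Cov_ij + mu_i * mu_j)
        halves.append(h)
    return halves
```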
19
Empirical Results
Used both static and dynamic condensation methods. Conducted sensitivity analysis with differing group sizes: higher group sizes lead to a greater level of privacy, but also to a greater amount of information loss. Tested on data sets from the UCI machine learning repository.
20
Empirical Results
Tested the level of classification accuracy with varying group size. Tested the level of correlation between the covariance matrices of the original and perturbed data.
21
Ionosphere Data Set
22
Ecoli Data Set
23
Pima Indian Data Set
24
Conclusions and Summary
A new approach for condensation-based privacy preserving data mining. It uses constrained clustering for anonymization, partitioning the data into clusters each of size at least k, and generates synthetic data that mirrors the distribution of each cluster using eigenvector analysis. Synthetic data often provides better privacy, because it is usually more difficult to map synthetic records to their relevant groups of size k.
25
Going Forward
From structured data to unstructured network data: attacks based on different types of network attributes, including the Attack on Lack of Structural Diversity, the Friendship Attack, the Mutual Friend Attack, and the Linkage Covariance Attack.
26
Challenges
Network-related attributes are derived, so it is unclear what information should be protected versus preserved: privacy protection has to be balanced against utility preservation.
27
Attack on Lack of Structural Diversity
C. Tai, P. S. Yu, D. Yang, M.S. Chen, Structural Diversity for Resisting Community Identification in Published Social Network, TKDE 2014.
C. Tai, P. Tseng, P. S. Yu, M.S. Chen, Identity Protection in Sequential Releases of Dynamic Social Networks, TKDE 2014.
28
Privacy in Social Networks
Vertex identification is considered to be an important privacy issue in publishing social networks, and several privacy models (k-degree anonymity, k-neighborhood anonymity, and so on) have been proposed to protect against attacks using various kinds of background knowledge. These models are insufficient, however, because in addition to a vertex identity each individual is also associated with a community identity. The community identity can reveal sensitive information such as political party affiliation or disease status, and unlike general labels it is a kind of structural information, so the earlier privacy models cannot protect it.
29
Privacy in Social Networks
Example (vertex degree attack): the social network of an online disease forum contains an AIDS community and an SLE community. Alice knows that Bob participates in this network and has 5 friends. Even though she does not know which vertex corresponds to Bob, every vertex of degree 5 lies in the AIDS community, so Alice can infer that Bob has AIDS. This shows that k-degree anonymity cannot protect the community identity from vertex degree attacks.
30
Structural Diversity
k-structural diversity: to protect against the vertex degree attack, for each vertex there should be other vertices with the same degree located in at least k-1 other communities; that is, vertices of any given degree are spread over at least k communities, whether the community information is explicit or implicit. If a graph satisfies k-structural diversity, then it also satisfies k-degree anonymity.
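A small sketch of how one might check this property on a labeled graph; it encodes one plausible reading of the definition (vertices of a given degree must be spread over at least k communities), so treat it as illustrative rather than as the paper's formal definition:

```python
from collections import defaultdict

def is_k_structurally_diverse(degrees, communities, k):
    """degrees[v]: degree of vertex v; communities[v]: community id of vertex v."""
    comms_per_degree = defaultdict(set)
    for v, d in degrees.items():
        comms_per_degree[d].add(communities[v])
    # every vertex must share its degree with vertices in at least k distinct communities
    return all(len(comms_per_degree[degrees[v]]) >= k for v in degrees)
```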
31
The Problem in Dynamic Scenarios…
A dynamic social network is released sequentially (G1, G2, ...), and an attacker can monitor a victim for a period of w releases. Example: John has two friends at time 1 and three friends at time 2. The challenges: the anonymization depends not only on the current social network but also on the previous w-1 releases, and searching through all the w-1 releases to eliminate privacy leaks can be time-consuming.
32
Friendship Attacks
C. Tai, P. S. Yu, D. Yang, M.S. Chen, Privacy-Preserving Social Network Publication Against Friendship Attacks, KDD 2011.
33
Privacy Concerns In Data Sharing
Personal information leaked:
Attributes: Name, Salary, …
Links: Degrees, Neighborhood, …
Communities: Interests, Activities, …
34
Friendship Attack
There is yet another type of information that can be exploited for vertex re-identification: the friendship attack.
35
Friendship Attack
Given a target individual A and degree pair information D2 = (d1, d2), a friendship attack (D2, A) exploits D2 to identify a vertex v1 corresponding to A in a published social network, where v1 connects to another vertex v2 with the degree pair (d_v1, d_v2) = (d1, d2). Example: assume that an attacker knows that Alice has 3 connections, Bob has 2 connections, and Alice and Bob are friends. In the example graph, only one edge carries the degree pair (3, 2), so the attacker identifies v9 as Alice with 100% confidence.
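A hedged sketch of how a publisher could measure exposure to this attack before release (the function name and output format are mine): count, for every edge, how many edges share its unordered degree pair; a count of 1 corresponds to the 100%-confidence re-identification in the example.

```python
from collections import Counter

def friendship_attack_exposure(edges, degrees):
    """edges: iterable of (u, v) pairs; degrees: dict mapping vertex -> degree."""
    pair = lambda u, v: tuple(sorted((degrees[u], degrees[v])))
    counts = Counter(pair(u, v) for u, v in edges)
    # edges whose degree pair is shared by fewer than k edges are at risk for that k
    return {(u, v): counts[pair(u, v)] for u, v in edges}
```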
36
Friendship Attack
In the DBLP data set, the percentages of vertices that can be re-identified with probability larger than 1/k by degree and friendship attacks:

k    Degree attack (original)   Friendship attack (original)   Friendship attack (k-degree anonymized)
5    0.28%                      5.37%                          2.89%
10   0.53%                      10.69%                         4.65%
15   0.73%                      14.71%                         5.82%
20   0.93%                      18.44%                         7.23%
37
Mutual Friend Attacks
C. Sun, P. S. Yu, X. Kong, Y. Fu, Privacy Preserving Social Network Publication Against Mutual Friend Attacks, ICDM PADM workshop, 2013.
38
Background
Just anonymizing the node IDs is not sufficient. The example shows the original social network G with vertex identities (Alice, Bob, Carl, Dell, Ed, Frank, Gary) and the naively anonymized network G' with vertices A-G. The adversary can uniquely re-identify the edge (D, E) as (Alice, Bob); likewise, (Alice, Carl) can be uniquely re-identified as (D, A). What attributes to protect? The degree, the degree pair of each edge (friendship attack), and the number of mutual friends.
39
Challenge
Even after anonymization by k-automorphism with k = 2, the edge (3, 4) does not have any mutual friends while all the other edges have one, so it can still be singled out.
40
Problem Definition
The NMF of an edge: for an edge e = (v1, v2) between two vertices v1 and v2 in a graph G(V, E), the number of mutual friends (NMF) of e is the number of mutual friends of v1 and v2. In the example graph, the NMF of edge (A, B) is 2, the NMF of edge (A, D) is 3, and the NMF of edge (D, E) is 4.
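A minimal sketch for computing the NMF of every edge in an undirected graph (illustrative code, not from the paper):

```python
from collections import defaultdict

def mutual_friend_counts(edges):
    """Return {edge: number of mutual friends} for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {(u, v): len(neighbors[u] & neighbors[v]) for u, v in edges}
```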
41
Linkage Covariance Attack
C. Aggarwal, P. S. Yu, On the Hardness of Graph Anonymization, ICDM 2011.
42
Linkage Covariance
Each node p is associated with a vector X_p whose k-th component is 1 if node p is connected to node k, and 0 otherwise. The linkage covariance LinkCov(p, q) between nodes p and q is equal to the covariance between X_p and X_q.
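A sketch of the quantity being described, computed directly from an adjacency matrix; the divide-by-N (population) convention used here is my choice, and the paper may normalize differently:

```python
import numpy as np

def linkage_covariance(adj, p, q):
    """adj: N x N 0/1 adjacency matrix; returns LinkCov(p, q) = Cov(X_p, X_q)."""
    Xp = adj[p].astype(float)
    Xq = adj[q].astype(float)
    # E[X_p X_q] is the fraction of nodes adjacent to both p and q (their mutual neighbors)
    return float(np.mean(Xp * Xq) - np.mean(Xp) * np.mean(Xq))
```

Under this convention the E[X_p X_q] term equals m_pq / N, which is how the number of common neighbors m_pq enters the robustness formulas on the next slide.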
43
Robustness of Linkage Covariance
Linkage covariance is robust to edge additions and deletions for massive, sparse graphs. Let L be the estimated value of the linkage covariance between nodes p and q (with m_pq common neighbors) after the addition of edges with probability f_a, and let N be the number of nodes. Then:
E[L] = LinkCov(p, q) - 2 · m_pq · f_a / N
For a deletion probability f_d, the expected value of the estimated linkage covariance L is related to the true linkage covariance LinkCov(p, q) as follows:
E[L] = LinkCov(p, q) · (1 - 2 · f_d)
44
Summary
Network data is becoming increasingly important, and privacy-preserving publishing of network data is critically needed, balancing privacy protection against utility preservation.