Minimum Spanning Tree Partitioning Algorithm for Microaggregation


Minimum Spanning Tree Partitioning Algorithm for Microaggregation
Gokcen Cilingir, 10/11/2011

Challenge
How do you publicly release a medical-record database (or any database containing record-specific private information) without compromising individual privacy?
The wrong approach: just leave out unique identifiers like name and SSN and hope privacy is preserved.
Why is this not enough? The triple (DOB, gender, zip code) suffices to uniquely identify at least 87% of US citizens in publicly available databases.* Attribute combinations like this triple are called quasi-identifiers.
*Latanya Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5): 557-570 (2002).

A model for protecting privacy: k-anonymity
Definition: A dataset satisfies k-anonymity for k > 1 if, for each combination of quasi-identifier values, at least k records in the dataset share that combination. Equivalently, if no row of a table can be distinguished from at least k-1 other rows by looking only at a given set of attributes, the table is said to be k-anonymized on those attributes.
Example: If you try to identify a person in a k-anonymized table by the triple (DOB, gender, zip code), you will find at least k entries that match that triple.
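A minimal sketch of this definition, using hypothetical column names and toy data: `is_k_anonymous` checks whether every quasi-identifier combination occurs at least k times.

```python
# Sketch: check k-anonymity of a table over a set of quasi-identifier columns.
# Column names ("dob", "gender", "zip") and the sample rows are hypothetical.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """rows: list of dicts; quasi_ids: column names.
    True iff every combination of quasi-identifier values occurs >= k times."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(count >= k for count in combos.values())

records = [
    {"dob": "1980", "gender": "F", "zip": "99163"},
    {"dob": "1980", "gender": "F", "zip": "99163"},
    {"dob": "1975", "gender": "M", "zip": "99164"},
]
# The last record's combination occurs only once, so 2-anonymity fails.
print(is_k_anonymous(records, ["dob", "gender", "zip"], 2))  # False
```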

Statistical Disclosure Control (SDC) methods
SDC methods have two conflicting goals:
- minimize disclosure risk (DR)
- minimize information loss (IL)
Objective: maximize data utility while limiting disclosure risk to an acceptable level.
Many measures of IL exist: mean variation of the data, mean variation of data means, mean variation of data variances, mean variation of data covariances.

One approach to k-anonymity: microaggregation
Microaggregation can be operationally defined in two steps:
- Partition: the original records are partitioned into groups of similar records, each containing at least k elements (the result is a k-partition of the set).
- Aggregation: each record is replaced by its group centroid.
Microaggregation was originally designed for continuous numerical data and was later extended to categorical data, essentially by defining distance and aggregation operators suitable for categorical data types (e.g., the mean as the aggregation operator for numerical data, the median for categorical data).
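A minimal sketch of the aggregation step for numerical data, assuming the k-partition is already given as lists of record indices:

```python
import numpy as np

def microaggregate(data, partition):
    """Replace every record by the centroid (mean) of its group.
    data: (n, d) array; partition: list of index lists, each of size >= k."""
    out = data.astype(float).copy()
    for group in partition:
        out[group] = data[group].mean(axis=0)
    return out

# Toy example: two groups of k = 2; each record becomes its group mean.
data = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 12.0]])
masked = microaggregate(data, [[0, 1], [2, 3]])
```

Within each group all released records are identical, so the released table is k-anonymous on those attributes by construction.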

Optimal microaggregation
Optimal microaggregation: find a k-partition of the set that maximizes total within-group homogeneity. More homogeneous groups mean lower information loss.
How is within-group homogeneity measured? The most common homogeneity measure for clustering is the within-group sum of squares (SSE).
Are there other measures of information loss? A large number of measures quantifying group homogeneity have been reported in the literature, usually based on distance definitions such as the Euclidean, Minkowski, and Chebyshev distances. Analysis-of-variance methods can also be used to investigate the degree of information retained.
For univariate data, polynomial-time optimal microaggregation is possible. For multivariate data, optimal microaggregation is NP-hard!
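A minimal sketch of the SSE measure, computed from a partition of numerical data:

```python
import numpy as np

def sse(data, partition):
    """Within-group sum of squared Euclidean distances to the group centroids.
    data: (n, d) array; partition: list of index lists."""
    total = 0.0
    for group in partition:
        pts = data[group]
        total += float(((pts - pts.mean(axis=0)) ** 2).sum())
    return total

# Two tight groups: each contributes 2.0 (two points at distance 1 from the mean).
data = np.array([[0.0], [2.0], [10.0], [12.0]])
loss = sse(data, [[0, 1], [2, 3]])  # 4.0
```

Lower SSE means more homogeneous groups and hence lower information loss after aggregation.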

Heuristic methods for microaggregation on multivariate data
Approach 1: use univariate projections of the multivariate data.
Approach 2: adapt clustering algorithms to enforce the group-size constraint: each cluster must have at least k and at most 2k-1 elements.
- Fixed-size microaggregation: all groups have size k, except perhaps one group whose size is between k and 2k-1.
- Data-oriented microaggregation: group sizes vary between k and 2k-1.
Since k-anonymization is essentially a search over a space of possible multidimensional solutions, standard heuristic search techniques such as genetic algorithms or simulated annealing can also be used effectively.

Fixed-size microaggregation
Pick a point p and gather its k-1 nearest neighbors to form a cluster; recursively apply the idea to the rest of the data. The key design question: how do we pick p at each step of cluster formation?
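The following sketch uses one plausible choice of p (the point farthest from the centroid of the remaining data); the actual selection rule is what distinguishes the variants of the fixed-size method, and this is only an illustration. It assumes n >= k.

```python
import numpy as np

def fixed_size_partition(data, k):
    """Sketch of fixed-size microaggregation: repeatedly pick the point p
    farthest from the centroid of the remaining data (illustrative choice),
    group p with its k-1 nearest unassigned neighbors, and leave the final
    leftover group with between k and 2k-1 points."""
    remaining = list(range(len(data)))
    groups = []
    while len(remaining) >= 2 * k:
        pts = data[remaining]
        centroid = pts.mean(axis=0)
        # farthest remaining point from the centroid
        p = remaining[int(np.argmax(((pts - centroid) ** 2).sum(axis=1)))]
        # p plus its k-1 nearest remaining neighbors (p itself has distance 0)
        dists = ((data[remaining] - data[p]) ** 2).sum(axis=1)
        group = [remaining[i] for i in np.argsort(dists)[:k]]
        groups.append(group)
        remaining = [i for i in remaining if i not in group]
    groups.append(remaining)  # last group: size between k and 2k-1
    return groups

pts = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
parts = fixed_size_partition(pts, 3)  # one group of 3, one leftover of 4
```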

A data-oriented approach: k-Ward
Ward's algorithm (hierarchical, agglomerative):
- Start by considering every element a singleton group.
- Find the two nearest groups and merge them.
- Stop the recursive merging according to a criterion (such as a distance threshold or a cluster-size threshold).
k-Ward algorithm: apply Ward's method until every element of the dataset belongs to a group containing k or more elements, with one additional merging rule: never merge two groups that both already have k or more elements.
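A toy sketch of the k-Ward idea. For brevity it uses centroid distance in place of the full Ward merging criterion, assumes n >= k, and uses a naive O(n^2)-per-merge pair search:

```python
import numpy as np

def k_ward_sketch(data, k):
    """Agglomerative merging until every group has >= k elements, never
    merging two groups that both already have >= k elements. Centroid
    distance stands in for the Ward criterion (simplifying assumption)."""
    groups = [[i] for i in range(len(data))]
    while any(len(g) < k for g in groups):
        best = None  # (distance, a, b) of the closest mergeable pair
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                if len(groups[a]) >= k and len(groups[b]) >= k:
                    continue  # k-Ward rule: skip pairs of full groups
                ca = data[groups[a]].mean(axis=0)
                cb = data[groups[b]].mean(axis=0)
                d = float(((ca - cb) ** 2).sum())
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

pts = np.array([[0.0], [0.1], [10.0], [10.1]])
result = k_ward_sketch(pts, 2)  # the two natural pairs
```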

Minimum spanning tree (MST)
A minimum spanning tree (MST) of a weighted undirected graph G is a spanning tree (a tree containing all the vertices of G) with minimum total weight.
Prim's algorithm for finding an MST is a greedy algorithm:
- Start by selecting an arbitrary vertex and making it the current tree.
- Grow the current tree by inserting the vertex closest to one of the vertices already in the tree.
Prim's is an exact algorithm: it finds an MST regardless of the starting vertex. Assuming a complete graph on n vertices, Prim's MST construction runs in O(n²) time and space.
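A minimal sketch of the O(n²) version of Prim's algorithm on the complete Euclidean graph over a point set, the setting used here:

```python
import numpy as np

def prim_mst(points):
    """O(n^2) Prim's algorithm on the complete graph over `points`, with
    Euclidean distances as edge weights. Returns MST edges (i, j, weight)."""
    n = len(points)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True  # arbitrary starting vertex
    best = np.linalg.norm(points - points[0], axis=1)  # distance to the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # pick the unvisited vertex closest to the current tree
        v = int(np.argmin(np.where(visited, np.inf, best)))
        edges.append((int(parent[v]), v, float(best[v])))
        visited[v] = True
        # relax distances through the newly added vertex
        d = np.linalg.norm(points - points[v], axis=1)
        closer = ~visited & (d < best)
        best[closer] = d[closer]
        parent[closer] = v
    return edges

pts = np.array([[0.0], [1.0], [2.0], [10.0]])
mst = prim_mst(pts)  # edges of weight 1, 1, and 8
```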

MST-based clustering
When using the MST representation of a dataset, one needs an objective to decide which edge to remove next (the "inconsistent" edges); each edge removal yields one more cluster.
The simplest objective: minimize the total edge length of the resulting N subtrees (each corresponding to a cluster). This has a polynomial-time optimal solution: cut the N-1 longest edges.
More sophisticated objectives can be defined, but globally optimizing them is likely to be costly.

MST partitioning algorithm for microaggregation
The algorithm has three phases, plus an additional phase to further divide oversized clusters:
- MST construction: build the minimum spanning tree over the data points using Prim's algorithm.
- Edge cutting: iteratively visit every MST edge in length order, from longest to shortest, deleting the removable edges* and retaining the rest. This phase produces a forest of irreducible trees+, each of which corresponds to a cluster.
- Cluster formation: traverse the resulting forest to assign each data point to a cluster.
- Further dividing oversized clusters: by either the diameter-based or the centroid-based fixed-size method.
* Removable edge: an edge whose removal leaves clusters that do not violate the minimum size constraint.
+ Irreducible tree: a tree all of whose edges are non-removable.
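A simple, deliberately unoptimized sketch of the edge-cutting phase: it re-traverses the forest to test each edge for removability, where an efficient implementation would maintain subtree sizes incrementally.

```python
def cut_mst_edges(n, mst_edges, k):
    """Edge-cutting sketch: visit edges from longest to shortest, deleting an
    edge only if both resulting subtrees keep at least k nodes (a removable
    edge). mst_edges: list of (i, j, length). Returns the final clusters."""
    adj = {v: set() for v in range(n)}
    for i, j, _ in mst_edges:
        adj[i].add(j)
        adj[j].add(i)

    def reachable(start, banned):
        # nodes reachable from `start` in the current forest, skipping `banned`
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen and {u, w} != banned:
                    seen.add(w)
                    stack.append(w)
        return seen

    for i, j, _ in sorted(mst_edges, key=lambda e: -e[2]):
        side_i = reachable(i, banned={i, j})
        side_j = reachable(j, banned={i, j})
        if len(side_i) >= k and len(side_j) >= k:  # removable: cut it
            adj[i].discard(j)
            adj[j].discard(i)

    clusters, seen = [], set()
    for v in range(n):  # each remaining tree of the forest is one cluster
        if v not in seen:
            comp = reachable(v, banned=set())
            seen |= comp
            clusters.append(sorted(comp))
    return clusters

# Path MST over two well-separated triples; with k = 3 only the long
# bridging edge is removable.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 8.0), (3, 4, 1.0), (4, 5, 1.0)]
clusters = cut_mst_edges(6, edges, 3)
```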

MST partitioning algorithm for microaggregation – experimental results
Methods compared:
- Diameter-based fixed-size method: D
- Centroid-based fixed-size method: C
- MST partitioning alone: M
- MST partitioning followed by D: M-d
- MST partitioning followed by C: M-c
Experiments on the real datasets Tarragona, Census, and Creta: C or D beats the other methods on all of these datasets. D beats C on Tarragona, C beats D on Census, and D beats C marginally on Creta. M-d and M-c achieved comparable information loss.

MST partitioning algorithm for microaggregation – experimental results (2)
Findings of the experiments on 29 simulated datasets:
- M-d and M-c work better on well-separated datasets. In such cases the fixed-size methods are forced to group points belonging to distinct clusters, i.e., points that are well separated in space.
- When the well-separated clusters each contained a fixed number y of data points, M-d and M-c beat the fixed-size methods whenever y is not a multiple of k.
- The MST construction phase is the bottleneck of the algorithm (quadratic time complexity), and the constant factors of the compared methods are very different.
- The dimensionality of the data has little impact on the total running time.

MST partitioning algorithm for microaggregation – strengths
- A simple, well-documented, easy-to-implement approach. Not many clustering approaches existed in this domain at the time, and the paper proposed alternatives; its centroid idea inspired improvements on the diameter-based fixed-size method.
- The effect of dataset properties on performance (natural clustering in the data, dataset size, data dimensionality, and the number of points in the natural clusters) is addressed systematically on simulated datasets.
- Information loss is comparable to that of the existing methods, and better in the case of well-separated clusters.
- Holds a time-efficiency advantage over the existing fixed-size methods: when the dataset must be processed multiple times (perhaps to try different k values), the algorithm is efficient, since only a single MST construction is needed.

MST partitioning algorithm for microaggregation – weaknesses
- Higher information loss than the fixed-size methods on real datasets that are less naturally clustered.
- Still not efficient enough for massive datasets, since it requires MST construction.
- The upper bound on group size cannot be controlled by the MST partitioning algorithm alone; one is forced to combine it with fixed-size methods to control the upper bound.
- The real datasets used for testing were rather small in both cardinality and dimensionality (!).
- Other clustering approaches that might apply to the problem are not discussed, which would have helped establish the merits of the authors' choice.

Discussion on microaggregation
- At what value of k is microaggregated data safe?
- Is one measure of information loss sufficient for comparing algorithms?
- How can we modify an efficient data clustering algorithm to solve the microaggregation problem? What approaches can one take?
- What are the similar problems in other domains (clustering with lower and upper bounds on cluster size)?

Discussion on microaggregation (2)
- Finding benchmarks may be difficult, since the datasets are confidential and therefore protected.
- How reversible are the different SDC methods? If an attacker knows which SDC algorithm was used to create a protected dataset, can they launch an algorithm-specific re-identification attack? Should this be considered in DR measurements?
- How much information loss is it "worth" to use a single algorithm (e.g., MST) across a wider variety of applications?

Discussion on the paper
- How can we make this algorithm more scalable?
- How could we modify this algorithm to put an upper bound on cluster size?
- Was it necessary to consider centroid-based fixed-size microaggregation over the diameter-based method?

References
Microaggregation:
- Michael Laszlo and Sumitra Mukherjee. Minimum Spanning Tree Partitioning Algorithm for Microaggregation. IEEE Trans. Knowl. Data Eng. 17(7): 902-911 (2005)
- J. Domingo-Ferrer and J.M. Mateo-Sanz. Practical Data-Oriented Microaggregation for Statistical Disclosure Control. IEEE Trans. Knowl. Data Eng. 14(1): 189-201 (2002)
- Ebaa Fayyoumi and B. John Oommen. A Survey on Statistical Disclosure Control and Micro-aggregation Techniques for Secure Statistical Databases. Softw. Pract. Exper. 40(12): 1161-1188 (2010)
- Josep Domingo-Ferrer, Francesc Sebe, and Agusti Solanas. A Polynomial-Time Approximation to Optimal Multivariate Microaggregation. Comput. Math. Appl. 55(4): 714-732 (2008)
MST-based clustering:
- C.T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Trans. Computers 20(4): 68-86 (1971)
- Y. Xu, V. Olman, and D. Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Tree. Bioinformatics 18(4): 526-535 (2001)
