1 Privacy Protection with Genetic Algorithms. Presenter: 林惠珍.


2 Outline: Introduction; Basics of Micro-Aggregation and Methods; Privacy Protection Through Genetic-Algorithm-Based Micro-aggregation; A Hybrid Approach; Experimental Results; Conclusions and Future Challenges

4 Privacy!! Privacy vs. data utility. Diagram labels: data collection, statistics, data aggregation, releasing; the respondent must remain safe.

5 Contribution: Micro-aggregation distorts the data to guarantee the respondents' privacy. Optimal micro-aggregation is NP-hard, so the author solves it with a GA plus some modifications. A hybrid method is also proposed for the same problem.

7 SDC (Statistical Disclosure Control), also called Statistical Disclosure Limitation (SDL). Data are transformed before release: the public needs data utility, the respondent needs statistical confidentiality. Goal: enough protection and minimal information loss. Method: micro-aggregation of micro-data (personal data). This is a clustering problem with a constraint on cluster size!

8 Two goals for micro-aggregation Preserving data utility. Protecting the privacy of the respondents.

9 Preserving data utility: Introduce as little noise as possible into the data. So, we should aggregate similar elements rather than dissimilar ones.

10 Protecting the privacy of the respondents Data have to be sufficiently modified to make re-identification difficult. Increasing the number of aggregated elements can increase data privacy.

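11 Whether two elements are similar. Similarity function, e.g. Euclidean distance. Univariate data set $D_{uni}$: $SSE = \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $n$ is the number of elements in $D_{uni}$, $x_i$ is the i-th element in $D_{uni}$ and $\bar{x}$ is the average element. Multivariate data set $D_{multi}$: $SSE = \sum_{i=1}^{n} \sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2$, where $d$ is the number of dimensions of each element, $\bar{x}_j$ is the j-th component of the average element and $x_{ij}$ is the j-th component of the i-th element in $D_{multi}$. Multiple subsets: $SSE = \sum_{i=1}^{s} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)^T (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)$, where $s$ is the number of subsets, $n_i$ is the number of elements in the i-th subset, $\mathbf{x}_{ij}$ is the j-th element in the i-th subset and $\bar{\mathbf{x}}_i$ is the average element of the i-th subset.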
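A minimal Python sketch of the SSE measure described on this slide, assuming records are tuples of numbers. The function names (`centroid`, `sse`, `partition_sse`) are illustrative and do not come from the presentation.

```python
def centroid(points):
    """Average element: component-wise mean of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum((x - cj) ** 2 for p in points for x, cj in zip(p, c))


def partition_sse(groups):
    """SSE of several subsets: the sum of the within-subset SSE values."""
    return sum(sse(g) for g in groups)


print(sse([(3,), (5,), (4,)]))   # univariate case, 1-tuples
print(sse([(0, 0), (2, 2)]))     # multivariate case
```

A univariate data set is written here as a list of 1-tuples so that the same functions cover both cases.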

12 Micro-aggregation problem (k-micro-aggregation problem). Definition: given a data set D (n elements) and a security parameter k, which determines the minimum cardinality of the subsets, obtain a k-partition of D whose homogeneity is maximized, i.e. whose SSE is small. A k-partition of D is a partition whose parts each have at least k elements of D. Example with k = 3: the elements 3, 5, 4 have average element 4 and are released as 4, 4, 4. The problem is NP-hard for multivariate data sets, so use heuristic methods!!
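The definition can be illustrated with a short Python sketch for univariate values. The helper names and the six-value data set are assumptions made for illustration; only the 3, 5, 4 group comes from the slide.

```python
def is_k_partition(groups, n, k):
    """True if the groups cover indices 0..n-1 exactly once and each has >= k."""
    covered = sorted(i for g in groups for i in g)
    return covered == list(range(n)) and all(len(g) >= k for g in groups)


def microaggregate(values, groups):
    """Replace every value by the average of the group it belongs to."""
    released = [None] * len(values)
    for g in groups:
        mean = sum(values[i] for i in g) / len(g)
        for i in g:
            released[i] = mean
    return released


values = [3, 5, 4, 10, 12, 11]          # hypothetical data set
groups = [[0, 1, 2], [3, 4, 5]]         # a 3-partition of the six indices
print(is_k_partition(groups, len(values), k=3))
print(microaggregate(values, groups))   # first group becomes 4, 4, 4
```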

13 Multivariate Micro-Aggregation Methods Minimum Spanning Tree Partitioning (MSTP) Maximum Distance Method (MD) Maximum Distance to Average Vector Method (MDAV) Variable-MDAV

14 Minimum Spanning Tree Partitioning (MSTP). Steps: 1. MST construction 2. Edge cutting 3. Cluster generation. Limitation: because it is founded on the MST, it fails to adapt properly to scattered data points.

15 Maximum Distance Method (MD). Its main advantage is its simplicity, and it achieves very good results on most data sets. Find the two most distant records r and s (by Euclidean distance). Form a group with r and its closest k-1 elements, and another with s and its closest k-1 elements. Check the number of remaining elements (num): 1. num >= 2k: repeat 2. k <= num <= 2k-1: form a new group 3. num <= k-1: assign each element to its closest group. Micro-aggregated data: replace each record by the centroid of the group to which it belongs. Shortcoming: computational complexity.
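The MD steps listed on this slide can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the function names are made up, and "closest group" is assumed to mean the group with the closest centroid.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def md_partition(records, k):
    """Group record indices following the MD steps; each group has >= k indices."""
    if len(records) < k:
        raise ValueError("need at least k records")
    remaining = list(range(len(records)))
    groups = []
    while len(remaining) >= 2 * k:
        # r and s are the two most distant remaining records (O(n^2) search).
        r, s = max(combinations(remaining, 2),
                   key=lambda p: dist(records[p[0]], records[p[1]]))
        remaining.remove(r)
        remaining.remove(s)
        for seed in (r, s):
            # The seed plus its k-1 closest remaining records form a group.
            nearest = sorted(remaining,
                             key=lambda i: dist(records[i], records[seed]))[:k - 1]
            for i in nearest:
                remaining.remove(i)
            groups.append([seed] + nearest)
    if len(remaining) >= k:
        groups.append(remaining)          # k <= num <= 2k-1: one new group
    else:
        # num <= k-1. Assumption: closest group = closest group centroid.
        centres = [centroid([records[i] for i in g]) for g in groups]
        for i in remaining:
            j = min(range(len(groups)), key=lambda g: dist(records[i], centres[g]))
            groups[j].append(i)
    return groups


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(md_partition(points, k=3))
```

The exhaustive search for the most distant pair at every iteration is what makes the method costly on large data sets.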

16 Maximum Distance to Average Vector Method (MDAV) MDAV improves on MD in terms of computational complexity while maintaining the performance in terms of SSE. MDAV is the most popular method used for micro-aggregating data sets.

17 MDAV Algorithm Build two groups at each iteration. When (RR =k a new group

18 MDAV Process: Compute the centroid c of the data set. Using the distance matrix, find the record r most distant from c, then the record s most distant from r. Micro-aggregated data: replace each record by the centroid of the group to which it belongs. Shortcoming: lack of flexibility. It only generates subsets of fixed cardinality k.
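A compact Python sketch of an MDAV-style partition. The centroid, r and s steps follow the process described on this slide; the stopping rules (two groups while at least 3k records remain, one more group if at least 2k remain, then a final group with the rest) are an assumption taken from commonly published descriptions of MDAV, since the algorithm slide is incomplete here. All names are illustrative.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def farthest(records, candidates, point):
    """Index in `candidates` of the record most distant from `point`."""
    return max(candidates, key=lambda i: dist(records[i], point))


def grow(records, remaining, seed, k):
    """Build a group of size k: the seed plus its k-1 nearest remaining records."""
    nearest = sorted(remaining, key=lambda i: dist(records[i], records[seed]))[:k - 1]
    for i in nearest:
        remaining.remove(i)
    return [seed] + nearest


def mdav_partition(records, k):
    """Assumes len(records) >= k. Returns groups of record indices."""
    remaining = list(range(len(records)))
    groups = []
    while len(remaining) >= 3 * k:            # assumed stopping rule
        c = centroid([records[i] for i in remaining])
        r = farthest(records, remaining, c)
        remaining.remove(r)
        s = farthest(records, remaining, records[r])
        remaining.remove(s)
        groups.append(grow(records, remaining, r, k))
        groups.append(grow(records, remaining, s, k))
    if len(remaining) >= 2 * k:               # assumed: one more group around r
        c = centroid([records[i] for i in remaining])
        r = farthest(records, remaining, c)
        remaining.remove(r)
        groups.append(grow(records, remaining, r, k))
    if remaining:
        groups.append(remaining)              # the rest forms the last group
    return groups


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (5, 0), (5, 1), (6, 0)]
print(mdav_partition(points, k=3))
```

Unlike MD, each iteration needs only distances to one centroid and to one record, not all pairwise distances.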

19 Variable-MDAV V-MDAV intends to overcome the limitation by computing a variable-size k-partition with a computational cost similar to the MDAV cost.

20 V-MDAV Process: Compute the centroid c and, using the distance matrix, find the record e most distant from c; form a group around e. Extend the group (by up to k-1 records): let e_min be the unassigned record closest to the group, at distance d_in, and let d_out be the distance from e_min to its closest unassigned record. If (d_in < γ*d_out), assign e_min to the current group. Check the number of remaining records (RR): 1. RR >= k: form groups 2. RR <= k-1: assign each element to its closest group. MDAV is the most popular method, so the authors use it as the reference for comparison.
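The group-extension test (d_in < γ*d_out) can be sketched in Python as below. This covers only the extension step, under the assumptions that d_in is the distance from e_min to its nearest group member and d_out is the distance from e_min to its nearest other unassigned record; a full V-MDAV would also check that enough records remain. The function name and data are hypothetical.

```python
from math import dist


def extend_group(records, group, unassigned, k, gamma):
    """Add up to k-1 extra records to `group` while d_in < gamma * d_out."""
    group, unassigned = list(group), list(unassigned)
    added = 0
    # At least two unassigned records are needed to measure d_out.
    while added < k - 1 and len(unassigned) > 1:
        # e_min: the unassigned record closest to any record in the group.
        e_min, d_in = min(
            ((u, min(dist(records[u], records[g]) for g in group))
             for u in unassigned),
            key=lambda pair: pair[1])
        # d_out: distance from e_min to its closest other unassigned record.
        d_out = min(dist(records[e_min], records[u])
                    for u in unassigned if u != e_min)
        if d_in >= gamma * d_out:
            break
        group.append(e_min)
        unassigned.remove(e_min)
        added += 1
    return group, unassigned


records = [(0,), (1,), (2,), (2.5,), (10,), (11,), (12,)]
print(extend_group(records, [0, 1, 2], [3, 4, 5, 6], k=3, gamma=0.2))
```

In this toy data the record at 2.5 joins the group because it is much closer to the group than to the other unassigned records, while the record at 10 does not.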

21 Outline: Introduction; Basics of Micro-Aggregation and Methods; Privacy Protection Through Genetic-Algorithm-Based Micro-aggregation; A Hybrid Approach; Experimental Results; Conclusions and Future Challenges

22 Coding sequence; Initializing the population; The fitness function; Selection scheme and genetic operators (crossover & mutation)

23 Coding Sequence. Binary codings: 0110100110… N-ary codings: BADFEACCBF. Real-valued codings: 232.31.953.44.52.723.1…

24 Univariate vs. Multivariate. Univariate micro-aggregation: binary codings. Data set: 3 25 1 6 9 8 4 5 10 11 20 17. Sorted data set: 1 3 4 5 6 8 9 10 11 17 20 25. A binary coding may be: 000011001000. But there is no way of sorting multivariate records without giving a higher priority to one of the attributes.

25 Univariate vs. Multivariate (cont.). Multivariate micro-aggregation: N-ary codings. Maximum number of groups: G = ⌊n/k⌋. Each symbol represents one group of the k-partition. Chromosome length: the number of records in the data set. The i-th gene value gives the group of the k-partition to which the i-th record in the data set belongs.

26 Example: n = 11, k = 3, G = ⌊11/3⌋ = 3. 3-character alphabet: A, B, C. Chromosome length: 11. Chromosome: ABCAABBCCAA. 3-partition: group A = {1,4,5,10,11}, group B = {2,6,7}, group C = {3,8,9}.
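A small Python sketch of decoding under the N-ary coding rule (the i-th gene names the group of the i-th record), applied to the chromosome on this slide. The function names are illustrative.

```python
def decode(chromosome):
    """Map each symbol to the 1-based positions of the records it labels."""
    groups = {}
    for record, symbol in enumerate(chromosome, start=1):
        groups.setdefault(symbol, []).append(record)
    return groups


def satisfies_cardinality(groups, k):
    """True if every group has between k and 2k-1 records."""
    return all(k <= len(members) <= 2 * k - 1 for members in groups.values())


partition = decode("ABCAABBCCAA")
print(partition)
print(satisfies_cardinality(partition, k=3))
```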

27 Initializing the Population: Generally a random method is used. With n records and G different alphabet symbols there are G^n possible chromosomes. But only a small fraction meets the cardinality constraints: "In an optimal k-partition, each group has between k and 2k-1 records." (Domingo & Mateo). Since no group may exceed 2k-1 records, the minimum number of groups is ⌈n/(2k-1)⌉.
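As a rough check of how many chromosomes meet the cardinality constraints, the sketch below counts them exactly for small cases. It assumes that a symbol may be left unused (an empty group) and that every used group must hold between k and 2k-1 records; the function name is illustrative.

```python
from itertools import product
from math import factorial


def count_valid_chromosomes(n, k, G):
    """Count length-n strings over G symbols whose used groups have k..2k-1 records."""
    allowed_sizes = [0] + list(range(k, 2 * k))   # 0 means the symbol is unused
    total = 0
    for sizes in product(allowed_sizes, repeat=G):
        if sum(sizes) != n:
            continue
        # Multinomial coefficient: ways to place n records into groups of these sizes.
        ways = factorial(n)
        for size in sizes:
            ways //= factorial(size)
        total += ways
    return total


n, k, G = 11, 3, 3
valid = count_valid_chromosomes(n, k, G)
print(valid, "of", G ** n)   # 62370 of 177147
```

The enumeration loops over 4^G size combinations for k = 3, so it is only meant for small G.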

28 Initializing the Population (cont.): Random initialization is not suitable for obtaining candidate optimal k-partitions. So, the cardinality constraints must be embedded in the initialization procedure (Algorithm 2), which guarantees that each group (part) has at most 2k-1 elements.
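The presentation's Algorithm 2 is not reproduced in the transcript. The following Python sketch is a hypothetical stand-in that only enforces the stated guarantee (no group receives more than 2k-1 records); genes are integers 0..G-1 rather than letters.

```python
import random


def init_chromosome(n, k, rng=random):
    """Random chromosome of length n in which no group exceeds 2k-1 records.

    Hypothetical stand-in, not the presentation's Algorithm 2. Assumes n >= k.
    """
    G = n // k                                  # maximum number of groups
    counts = [0] * G
    genes = []
    for _ in range(n):
        # Only groups that are not yet full may receive another record.
        open_groups = [g for g in range(G) if counts[g] < 2 * k - 1]
        g = rng.choice(open_groups)
        counts[g] += 1
        genes.append(g)
    return genes


rng = random.Random(0)
print(init_chromosome(11, 3, rng))
```

There is always an open group because G(2k-1) >= n whenever G = ⌊n/k⌋ >= 1. The sketch does not enforce the lower bound k; that can be left to the fitness penalty.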

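29 The Fitness Function: Obtain a measure of the homogeneity of the groups in the k-partition represented by a given chromosome through the SSE. The goal is to minimize the SSE. Thus, the fitness value of a chromosome is computed from $SSE = \sum_{i=1}^{s} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)^T (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)$, where s is the total number of groups and n_i is the number of records in the i-th group. Chromosomes that encode a non-optimal k-partition are penalized.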
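A possible fitness function, sketched in Python. The exact formula used in the presentation is not preserved in the transcript, so both the mapping 1 / (1 + SSE) and the zero-fitness penalty for groups outside k..2k-1 are assumptions chosen only so that a lower SSE gives a higher fitness.

```python
def sse(points):
    """Sum of squared distances from each point to the group centroid."""
    c = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - cj) ** 2 for p in points for x, cj in zip(p, c))


def fitness(records, chromosome, k):
    """Higher is better. Assumed form: 1 / (1 + SSE), 0 for invalid partitions."""
    groups = {}
    for record, symbol in zip(records, chromosome):
        groups.setdefault(symbol, []).append(record)
    # Assumed penalty: any group outside k..2k-1 records gets zero fitness.
    if any(not (k <= len(g) <= 2 * k - 1) for g in groups.values()):
        return 0.0
    return 1.0 / (1.0 + sum(sse(g) for g in groups.values()))


records = [(3,), (5,), (4,), (10,), (12,), (11,)]   # hypothetical data
print(fitness(records, "AAABBB", k=3))   # homogeneous groups
print(fitness(records, "ABABAB", k=3))   # mixed groups, lower fitness
print(fitness(records, "AAAABB", k=3))   # group B too small, penalized
```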

30 Selection Scheme and Genetic Operators Selection scheme ： roulette-wheel selection Genetic operators ： one-point crossover and mutation
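The three operators named on this slide can be sketched in Python as follows, assuming chromosomes are strings over a group alphabet. These are textbook versions; the presentation does not give implementation details, so the function names and parameters are illustrative.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if pick < running:
            return individual
    return population[-1]


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length parents at one random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chromosome, alphabet, rate, rng):
    """Replace each gene by a random symbol with probability `rate`."""
    return "".join(rng.choice(alphabet) if rng.random() < rate else gene
                   for gene in chromosome)


rng = random.Random(1)
parent = roulette_select(["AAABBB", "ABABAB"], [0.2, 0.05], rng)
child1, child2 = one_point_crossover("AAAAAA", "BBBBBB", rng)
print(parent, child1, child2, mutate(child1, "AB", 0.1, rng))
```

Crossover and mutation can produce groups that violate the cardinality constraints, which is one reason a penalty in the fitness function is useful.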

32 A Hybrid Approach. GA: good SSE, but low performance on very large data sets. MDAV: adapts to very large data sets, but worse than GA in terms of SSE. Hybrid approach: 1. Good SSE 2. Adapts to very large data sets. Name: Two-step partitioning.

33 Two-step partitioning. k: a small value. K: larger than k, with K % k = 0, and small enough to be suitable for the GA. Example: k = 3, K = 21. 1. Use MDAV to build a 3-partition. 2. Use MDAV to build macro-groups (sets of average vectors) of size K/k (21/3 = 7). 3. Replace each average vector by its k original records to obtain a K-partition. 4. Finally, apply the GA to each macro-group in the K-partition in order to generate an optimal or near-optimal k-partition of the macro-group.
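The control flow of the two-step idea can be sketched in Python with the partitioner and the refiner passed in as functions. In the presentation these roles are played by MDAV and the GA; here a toy sort-and-chunk partitioner (`chunk_partition`, a hypothetical stand-in) fills both roles so that the example runs on its own.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def chunk_partition(points, size):
    """Toy stand-in for MDAV or the GA: sort, then cut into blocks of `size`."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    groups = [order[i:i + size] for i in range(0, len(order), size)]
    if len(groups) > 1 and len(groups[-1]) < size:
        groups[-2].extend(groups.pop())       # merge a short tail block
    return groups


def two_step_partition(records, k, K, partition, refine):
    """Return a k-partition built by refining macro-groups of about K records."""
    assert K > k and K % k == 0
    small = partition(records, k)                       # step 1: k-partition
    averages = [centroid([records[i] for i in g]) for g in small]
    macro = partition(averages, K // k)                 # step 2: group the averages
    result = []
    for macro_group in macro:
        # Step 3: replace each average by the original records behind it.
        ids = [i for g in macro_group for i in small[g]]
        # Step 4: refine the macro-group into a local k-partition.
        local = refine([records[i] for i in ids], k)
        result.extend([ids[p] for p in grp] for grp in local)
    return result


records = [(float(v),) for v in range(12)]
parts = two_step_partition(records, 2, 4, chunk_partition, chunk_partition)
print(parts)
```

Because the refiner only ever sees about K records at a time, its cost stays bounded no matter how large the full data set is.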

34 One-step MDAV V.S. Two-step MDAV Better

36 Experiment. Approaches: GA-based micro-aggregation and hybrid micro-aggregation, compared with MDAV and ES (exhaustive search). ES is only possible with tiny data sets of up to 11 elements. Data sets: 1. The example data set (Table 1) 2. Small data sets 3. Real and large data sets. Each experiment consists of 12,100 runs of the GA. Mutation rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 (11 values). Crossover rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 (11 values). Population size: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 (10 values). The GA was run 10 times for each parameter setting.
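The run count follows from the parameter grid, as this short Python sketch shows. The variable names are illustrative.

```python
from itertools import product

mutation_rates = [i / 10 for i in range(11)]      # 0, 0.1, ..., 1
crossover_rates = [i / 10 for i in range(11)]     # 0, 0.1, ..., 1
population_sizes = list(range(10, 101, 10))       # 10, 20, ..., 100
repeats = 10                                      # GA runs per setting

settings = list(product(mutation_rates, crossover_rates, population_sizes))
total_runs = len(settings) * repeats
print(len(settings), total_runs)   # 1210 settings, 12100 runs
```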

37 Results for the Running Example GA running time depends on the number of generations. Most of the tests converge in less than 5,000 iterations. Although MDAV is faster, the SSE obtained with the GA is better. (90% →14.82)

38 Results in Small Data Sets Mutation rate should be low. Ex ： 0.1 GA-based approach cannot deal with large data sets. Same!!

39 Results in Real and Large Data Sets: Use the hybrid technique. Data set sizes: 1000 x 2, 1080 x 13, 4092 x 11. Result annotation: "Better".

41 Conclusions and Future Challenges: The reported experimental results demonstrate the usefulness of the proposed methods and open the door to an invigorating research line. Many questions remain open: look for better codings; test the efficiency of other selection algorithms; evaluate the importance of genetic operators such as multiple-point crossover or inversion.
