Published by Verity Sharp. Modified over 8 years ago.
1
HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION
Presented by: Michael Cheng
Supervisor: Dr. William Cheung
Co-Supervisor: Dr. Byron Choi
2
Presentation Flow
- Privacy-Preserving Data Publishing
- Introduction to Emerging Patterns (EPs)
- Introduction to Equivalence Class
- Introduction to Generalization
- Proposed Problem and Motivation
- Heuristic for the Problem
- Experimental Results
- Future Research Plan
3
Privacy-Preserving Data Publishing - Introduction
- Organizations often need to publish or share their data for legitimate reasons.
- Sensitive information (e.g. personal identities, restrictive patterns) may be inferred from the published data.
4
Privacy-Preserving Data Publishing - Objective
Transform the dataset before publishing, such that:
1. Sensitive information is hidden. In our case: Emerging Patterns (EPs)
2. Subsequent analysis is still supported. In our case: Frequent Itemset (FIS) mining
5
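Introduction to Emerging Patterns (EPs)
- Emerging Patterns (EPs) are itemsets, defined over a pair of datasets, whose support is significant in one dataset but insignificant in the other.
- The two tables are the two classes, Income >= 50k and Income < 50k.

| Edu | Occup | Marital |
|-----|-------|---------|
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Exec | Married |
| MSE | Worker | Never |

| Edu | Occup | Marital |
|-----|-------|---------|
| MSE | Exec | Married |
| MSE | Exec | Married |
| BA | Exec | Married |
| BA | Manager | Married |
| BA | Repair | Never |

- {MSE, Exec} is an Emerging Pattern.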
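The example can be checked with a few lines of Python. This is a minimal sketch, not the presenter's code: the function names `support` and `growth_rate` and the labels `class_a` and `class_b` are chosen here for illustration, only the Edu and Occup values of the two tables are used, and the growth-rate convention (0 when the itemset is absent from the target, infinite when it is absent only from the background) is the one commonly used in the EP literature.

```python
from math import inf

def support(itemset, dataset):
    """Fraction of tuples in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    return sum(itemset <= row for row in dataset) / len(dataset)

def growth_rate(itemset, background, target):
    """Growth rate of `itemset` from `background` to `target`."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_tg == 0:
        return 0.0   # absent from the target (covers the 0/0 case)
    if s_bg == 0:
        return inf   # present in the target only
    return s_tg / s_bg

# (Edu, Occup) values of the two tables; which income class each table
# belongs to does not matter for the computation.
class_a = [{"MSE", "Exec"}, {"MSE", "Exec"}, {"BA", "Exec"},
           {"BA", "Manager"}, {"BA", "Repair"}]
class_b = [{"BA", "Exec"}] * 4 + [{"MSE", "Worker"}]

ep = {"MSE", "Exec"}
print(support(ep, class_a), support(ep, class_b))
print(growth_rate(ep, class_b, class_a))
```

{MSE, Exec} occurs in two of five tuples in one table and in none of the other, so its growth rate is unbounded and it exceeds any finite threshold.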
6
Introduction to Emerging Patterns (EPs)
Formally, growth rate and EPs are defined as follows:
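For reference, the definition commonly used in the emerging-pattern literature can be written as below. The symbols $X$, $D_1$, $D_2$, $\mathrm{supp}$ and $\rho$ are notation chosen here and are assumptions, not taken from the slide.

```latex
\[
\mathrm{GR}(X) =
\begin{cases}
0, & \text{if } \mathrm{supp}_{D_1}(X) = 0 \text{ and } \mathrm{supp}_{D_2}(X) = 0,\\[4pt]
\infty, & \text{if } \mathrm{supp}_{D_1}(X) = 0 \text{ and } \mathrm{supp}_{D_2}(X) > 0,\\[4pt]
\dfrac{\mathrm{supp}_{D_2}(X)}{\mathrm{supp}_{D_1}(X)}, & \text{otherwise.}
\end{cases}
\]
```

Under this convention, given a growth-rate threshold $\rho > 1$, an itemset $X$ is an EP from $D_1$ to $D_2$ if $\mathrm{GR}(X) \ge \rho$.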
7
Introduction to Equivalence Class
- Tuples are said to be in the same equivalence class w.r.t. a set of attributes A if they take the same values on A.

| ID | Edu | Occup | Marital |
|----|-----|-------|---------|
| 1 | MSE | Exec | Married |
| 2 | MSE | Exec | Married |
| 3 | BA | Exec | Married |
| 4 | BA | Manager | Married |
| 5 | BA | Repair | Never |

- Tuples {1, 2, 3} are in the same equivalence class w.r.t. {Occup, Marital}.
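Grouping tuples into equivalence classes amounts to grouping by the projected attribute values. A minimal sketch, assuming rows are dictionaries; the function name `equivalence_classes` is hypothetical, and only the Occup and Marital columns of the example are used.

```python
from collections import defaultdict

def equivalence_classes(rows, attributes):
    """Map each combination of values on `attributes` to the IDs that share it."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        classes[key].append(row["ID"])
    return dict(classes)

rows = [
    {"ID": 1, "Occup": "Exec", "Marital": "Married"},
    {"ID": 2, "Occup": "Exec", "Marital": "Married"},
    {"ID": 3, "Occup": "Exec", "Marital": "Married"},
    {"ID": 4, "Occup": "Manager", "Marital": "Married"},
    {"ID": 5, "Occup": "Repair", "Marital": "Never"},
]

classes = equivalence_classes(rows, ("Occup", "Marital"))
print(classes[("Exec", "Married")])
```

Changing the attribute set changes the partition: with respect to {Marital} alone, tuples 1 to 4 would fall into one class.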
8
Introduction to Generalization
- Extensively studied for achieving k-anonymity
- Not previously studied for hiding itemsets
- Replaces original values in the dataset with more general values, according to a user-given hierarchy, so that more tuples share the same set of attribute values
- Example: in Adult, “BA” and “MSE” may be generalized to “Degree Holder”
9
Types of Generalization
- Single Dimensional Global Recoding
- Multi Dimensional Global Recoding
- Multi Dimensional Local Recoding
Example hierarchy for Occupation:
- White Collar: Executive, Manager
- Blue Collar: Repair, Worker
10
Single Dimensional Global Recoding
- If we decide to generalize some values to a single value, all tuples that contain these values are affected.
[Figure: the Occup column (Exec, Manager, Repair) before and after single dimensional global recoding; the recoded column shows Occupation.]
11
Multi Dimensional Global Recoding
- If we decide to generalize some values to a single value, all tuples in the same equivalence class that contain those values are affected.
[Figure: the Occup column (Exec, Manager, Repair) before and after multi dimensional global recoding; the recoded column shows Manager, Repair and Occupation.]
12
Multi Dimensional Local Recoding
- Same as Multi Dimensional Global Recoding, except that there is no equivalence class constraint.
[Figure: the Occup column (Exec, Manager, Repair) before and after multi dimensional local recoding; the recoded column shows Manager, Repair, Exec and Occupation.]
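The three recoding types differ only in which tuples a generalization touches. The sketch below illustrates that difference under stated assumptions: the four rows are hypothetical data, the function names are invented for this example, and each call generalizes one level up the Occupation hierarchy from the "Types of Generalization" slide.

```python
# Occupation hierarchy from the slides (child -> parent).
PARENT = {
    "Exec": "White Collar", "Manager": "White Collar",
    "Repair": "Blue Collar", "Worker": "Blue Collar",
}

def generalize(row, attribute):
    """Return a copy of `row` with `attribute` replaced by its parent value."""
    return {**row, attribute: PARENT.get(row[attribute], row[attribute])}

def single_dim_global(rows, attribute, values):
    """Every tuple holding one of `values` is generalized."""
    return [generalize(r, attribute) if r[attribute] in values else dict(r)
            for r in rows]

def multi_dim_global(rows, attribute, class_attrs, class_key):
    """Every tuple of one equivalence class is generalized."""
    return [generalize(r, attribute)
            if tuple(r[a] for a in class_attrs) == class_key else dict(r)
            for r in rows]

def local(rows, attribute, ids):
    """Only the chosen tuples are generalized, even within one class."""
    return [generalize(r, attribute) if r["ID"] in ids else dict(r)
            for r in rows]

def recoded_ids(result):
    return [r["ID"] for r in result if r["Occup"] == "White Collar"]

# Hypothetical rows for illustration.
rows = [
    {"ID": 1, "Edu": "BA", "Occup": "Exec"},
    {"ID": 2, "Edu": "MSE", "Occup": "Exec"},
    {"ID": 3, "Edu": "MSE", "Occup": "Exec"},
    {"ID": 4, "Edu": "BA", "Occup": "Manager"},
]

print(recoded_ids(single_dim_global(rows, "Occup", {"Exec"})))
print(recoded_ids(multi_dim_global(rows, "Occup", ("Edu", "Occup"), ("MSE", "Exec"))))
print(recoded_ids(local(rows, "Occup", {2})))
```

Local recoding is the most fine-grained of the three: tuples 2 and 3 are identical on (Edu, Occup), yet only tuple 2 is generalized.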
13
Proposed Problem - Why EP and FIS?
- Emerging Patterns may reveal sensitive information.
- E.g. in the Adult dataset from the UCI Repository, we found that {Never-Married, Own-Child} is an EP from the class “Income =50k”, with a growth rate of 35.
- Frequent itemset mining is a popular data mining task and is supported by commercial data-mining software.
14
Proposed Problem - Why Generalization?
- Other methods studied in PPDP include adding unknowns, removing tuples, and adding fake tuples randomly.
- These produce either incomplete information or fake information.
- In some applications, completeness and truthfulness of the data are important.
- Generalization preserves both the completeness and the truthfulness of the data.
15
Proposed Problem - Problem Illustration
[Figure: dataset D is transformed by local recoding into D'. Each dataset is shown with its Emerging Patterns and its Frequent Itemsets.]
16
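Intuition of Local Recoding
- Support of FIS = 40%
- Growth Rate of EP = 3
- Frequent Itemset = {Exec, Married}
- Emerging Pattern = {MSE, Exec}
- The two tables are the two classes, Income >= 50k and Income < 50k.

| Edu | Occup | Marital |
|-----|-------|---------|
| MSE | Exec | Married |
| MSE | Exec | Married |
| BA | Exec | Married |
| BA | Manager | Married |
| BA | Repair | Never |

| Edu | Occup | Marital |
|-----|-------|---------|
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Worker | Married |
| MSE | Manager | Never |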
17
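Intuition of Local Recoding
The two classes are Income >= 50k and Income < 50k.

Before local recoding:

| Edu | Occup | Marital |
|-----|-------|---------|
| MSE | Exec | Married |
| MSE | Exec | Married |
| BA | Exec | Married |
| BA | Manager | Married |
| BA | Repair | Never |

| Edu | Occup | Marital |
|-----|-------|---------|
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Worker | Married |
| MSE | Manager | Never |

After local recoding:

| Edu | Occup | Marital |
|-----|-------|---------|
| MSE | White col | Married |
| MSE | White col | Married |
| BA | Exec | Married |
| BA | Manager | Married |
| BA | Repair | Never |

| Edu | Occup | Marital |
|-----|-------|---------|
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Exec | Married |
| BA | Worker | Married |
| MSE | White col | Never |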
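The effect of this recoding on the EP can be reproduced numerically. The sketch below is illustrative only: it uses the (Edu, Occup) values of the tables, the names `table_a`, `table_b`, `recode`, `support` and `growth_rate` are invented here, and the growth-rate value of 3 given on the previous slide is read as the EP threshold, which is an assumption.

```python
from math import inf

def support(itemset, dataset):
    """Fraction of tuples that contain every item of `itemset`."""
    return sum(itemset <= set(row) for row in dataset) / len(dataset)

def growth_rate(itemset, background, target):
    s_bg, s_tg = support(itemset, background), support(itemset, target)
    if s_tg == 0:
        return 0.0
    return inf if s_bg == 0 else s_tg / s_bg

def recode(dataset):
    """Local recoding: only MSE tuples have Exec/Manager generalized."""
    return [(edu, "White col") if edu == "MSE" and occ in {"Exec", "Manager"}
            else (edu, occ)
            for edu, occ in dataset]

# (Edu, Occup) pairs of the two tables before recoding.
table_a = [("MSE", "Exec"), ("MSE", "Exec"), ("BA", "Exec"),
           ("BA", "Manager"), ("BA", "Repair")]
table_b = [("BA", "Exec"), ("BA", "Exec"), ("BA", "Exec"),
           ("BA", "Worker"), ("MSE", "Manager")]

before = growth_rate({"MSE", "Exec"}, table_b, table_a)
after = growth_rate({"MSE", "White col"}, recode(table_b), recode(table_a))
print(before, after)
```

Before recoding, {MSE, Exec} appears in one table only, so its growth rate is unbounded. After recoding, {MSE, White col} appears in two of five tuples in one table and one of five in the other, so the growth rate drops to about 2, below 3. The BA tuples are left untouched, which is what distinguishes local from global recoding.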
18
Heuristic for the Problem - Greedy Approach
[Figure: loop over dataset D. Emerging Patterns Mining produces the EPs (EP 1 to EP 4), then the generalization is applied. Repeat until all Emerging Patterns are removed.]

| Equivalence Class | Utility Gain |
|-------------------|--------------|
| Class 1 | 40 |
| Class 2 | 90 |
| Class 3 | 60 |
| Class 4 | 20 |
| Class 5 | 15 |
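The selection step of the greedy approach reduces to picking the equivalence class with the largest utility gain. A minimal sketch using the gains from the table; the function name `pick_greedy` is hypothetical, and mining the EPs and applying the generalization are outside its scope.

```python
def pick_greedy(utility_gains):
    """Return the equivalence class with the highest utility gain."""
    return max(utility_gains, key=utility_gains.get)

utility_gains = {
    "Class 1": 40,
    "Class 2": 90,
    "Class 3": 60,
    "Class 4": 20,
    "Class 5": 15,
}

print(pick_greedy(utility_gains))
```

Because the choice is always the current maximum, the outcome is deterministic for a given dataset.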
19
Heuristic for the Problem - Greedy Approach
- Drawback: the search can become trapped in a local minimum.
- Solution: a Simulated Annealing Style Approach for choosing the equivalence class.
20
Heuristic for the Problem - Simulated Annealing Style Approach
- Choose the equivalence class probabilistically
- Two parameters: initial temperature ($T_0$) and cooling rate ($\alpha$)
- Acceptance probability: $\exp(\text{Utility Gain} / \text{Temperature})$
- Temperature update: $T_n = \alpha T_{n-1}$

| Utility Gain | T=1000 | T=100 | T=10 |
|--------------|--------|-------|------|
| 90 | 0.209 | 0.302 | 0.945 |
| 60 | 0.203 | 0.223 | 0.047 |
| 40 | 0.199 | 0.183 | 0.006 |
| 20 | 0.195 | 0.150 | 0.0009 |
| 15 | 0.194 | 0.142 | 0.0005 |

Acceptance probability for different utility gains and temperatures
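The tabulated probabilities sum to roughly 1 in each temperature column, which suggests that exp(Utility Gain / Temperature) is normalized over the candidate classes. The sketch below makes that normalization explicit; it is an inference from the numbers, not something the slide states, and the function names `selection_probabilities` and `cool` are hypothetical.

```python
import math

def selection_probabilities(gains, temperature):
    """exp(gain / T) for each class, normalized to sum to 1."""
    weights = [math.exp(g / temperature) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

def cool(temperature, alpha):
    """Temperature update T_n = alpha * T_(n-1)."""
    return alpha * temperature

gains = [90, 60, 40, 20, 15]
for t in (1000, 100, 10):
    print(t, [round(p, 4) for p in selection_probabilities(gains, t)])
```

At a high temperature the classes are chosen almost uniformly, while at T=10 the class with gain 90 is chosen about 94.5% of the time, so the procedure approaches the greedy choice as the temperature falls.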
21
Heuristic for the Problem - Simulated Annealing Style Approach
[Figure: loop over dataset D. Emerging Patterns Mining produces the EPs (EP 1 to EP 4), then the generalization is applied and the temperature is decreased. Repeat until all Emerging Patterns are removed.]

| Equivalence Class | Probability |
|-------------------|-------------|
| Class 1 | 0.2 |
| Class 2 | 0.4 |
| Class 3 | 0.1 |
| Class 4 | 0.25 |
| Class 5 | 0.05 |
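Choosing a class according to these probabilities is a weighted random draw. A minimal sketch using Python's standard `random.choices`; the function name `pick_class`, the fixed seed, and the 10,000-draw check are illustrative choices, not part of the presented method.

```python
import random

def pick_class(probabilities, rng):
    """Draw one equivalence class according to its selection probability."""
    classes = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(classes, weights=weights, k=1)[0]

probabilities = {
    "Class 1": 0.2,
    "Class 2": 0.4,
    "Class 3": 0.1,
    "Class 4": 0.25,
    "Class 5": 0.05,
}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
counts = {c: 0 for c in probabilities}
for _ in range(10_000):
    counts[pick_class(probabilities, rng)] += 1
print(counts)
```

Unlike the greedy choice, a class with a low probability such as Class 5 is still selected occasionally, which is what lets the search leave a local minimum.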
22
Two questions
- How to choose an EP for generalization?
- How to calculate the utility gain?
23
How to choose an EP for generalization?
- Choose the EP that overlaps the most with the remaining EPs.
- This makes it more likely that other EPs are hidden at the same time.
Emerging Patterns:
- {MSE, Never Married}
- {BA, Divorced}
- {BA, Divorced, Worker}
- {BA, Divorced, Repairman}
- {BA, Divorced, Own-Child}
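The slide does not say how overlap is measured. The sketch below assumes one simple measure, the total number of items an EP shares with every other EP, purely for illustration; `overlap_score` and `choose_ep` are hypothetical names, and ties go to the first EP in list order.

```python
def overlap_score(ep, others):
    """Total number of items `ep` shares with the other EPs (assumed measure)."""
    return sum(len(ep & other) for other in others)

def choose_ep(eps):
    """Return the EP with the largest overlap score (first one wins on ties)."""
    return max(eps, key=lambda ep: overlap_score(ep, [o for o in eps if o is not ep]))

eps = [
    frozenset({"MSE", "Never Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]

print(sorted(choose_ep(eps)))
```

Under this measure every EP containing {BA, Divorced} scores the same, so a real implementation would need a tie-breaking rule; {MSE, Never Married} shares nothing with the others and is never chosen first.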
24
How to calculate utility gain?
Utility gain is a function of:
- Recoding Distance (RD)
- Reduction of Growth Rate (RG)
25
How to calculate utility gain? - Recoding Distance (RD)
- The detailed derivation is given in the paper.
- Intuitively, RD measures how many FIS have been generalized and by how much, and how many FIS have disappeared.
- High-level definition: $\mathrm{RD} = \theta_q \times (\text{generalized FIS}) + (1 - \theta_q) \times (\text{disappeared FIS})$, where $\theta_q$ is a user-defined parameter.
- The larger the value of RD, the greater the distortion of the frequent itemsets.
26
How to calculate utility gain? - Reduction of Growth Rate (RG)
- RG is the reduction in the sum of the growth rates of all EPs after a local recoding is applied.

Before local recoding:

| Emerging Pattern | Growth Rate |
|------------------|-------------|
| {Executive, Married} | 10 |
| {BA, Divorced} | 20 |
| {Executive} | 30 |
| Sum of Growth Rate | 60 |

After local recoding:

| Emerging Pattern | Growth Rate |
|------------------|-------------|
| {White col, Married} | 5 |
| {BA, Divorced} | 20 |
| Sum of Growth Rate | 25 |

RG = 60 - 25 = 35
27
How to calculate utility gain?
- Putting these together, utility gain is defined as $\theta_p \times \mathrm{RG} - (1 - \theta_p) \times \mathrm{RD}$, where $\theta_p$ is a user-defined parameter.
- It favors local recodings that reduce the growth rate substantially.
- It penalizes local recodings that cause large distortion of the FIS.
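The high-level formulas can be written directly as functions. This is a sketch of the slide-level definitions only, not the detailed derivation in the paper: the function names are hypothetical, the growth rates are those of the RG example, and the RD inputs and both θ values are made-up numbers for illustration.

```python
def reduction_of_growth_rate(before, after):
    """RG: sum of EP growth rates before recoding minus the sum after."""
    return sum(before.values()) - sum(after.values())

def recoding_distance(generalized_fis, disappeared_fis, theta_q):
    """High-level RD: weighted mix of generalized and disappeared FIS terms."""
    return theta_q * generalized_fis + (1 - theta_q) * disappeared_fis

def utility_gain(rg, rd, theta_p):
    """Reward growth-rate reduction, penalize distortion of the FIS."""
    return theta_p * rg - (1 - theta_p) * rd

# Growth rates from the RG example.
before = {("Executive", "Married"): 10, ("BA", "Divorced"): 20, ("Executive",): 30}
after = {("White col", "Married"): 5, ("BA", "Divorced"): 20}

rg = reduction_of_growth_rate(before, after)
rd = recoding_distance(generalized_fis=4, disappeared_fis=2, theta_q=0.5)  # hypothetical inputs
print(rg, rd, utility_gain(rg, rd, theta_p=0.5))
```

Setting `theta_p` closer to 1 makes the heuristic prioritize hiding EPs, while a value closer to 0 makes it prioritize keeping the frequent itemsets intact.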
28
Experimental Setup
Dataset: Adult dataset from the UCI Repository
- Popular benchmark dataset used for generalization
- Total number of records: 30162
  - Income > 50k: 7508
  - Income <= 50k: 22654
- Only 8 categorical attributes are used in the experiment
- A well-accepted hierarchy is defined
Parameters:
- Support of FIS: 40%
- Growth rate of EP: 5
- Initial temperature: 10
- Cooling rate: 0.4
29
Performance
[Figure: RD / number of FIS that disappeared, for the Greedy Approach and for the Simulated Annealing Style Approach (best of 5). Maximum RD: 623.1]
30
Runtime (in minutes)
[Figure: runtime of the Greedy Approach and of the Simulated Annealing Style Approach (best of 5).]
31
Future Research Plan
- Hide EPs in temporal datasets
- Consider multi-level FIS
- Hide a group of emerging patterns at a time
32
Q & A Any Questions?