HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION Presented by: Michael Cheng Supervisor: Dr. William Cheung Co-Supervisor: Dr. Byron Choi.


Presentation Flow
- Privacy-Preserving Data Publishing
- Introduction to Emerging Patterns (EPs)
- Introduction to Equivalence Class
- Introduction to Generalization
- Proposed Problem and Motivation
- Heuristic for the Problem
- Experimental Results
- Future Research Plan

Privacy Preserving Data Publishing - Introduction
- Organizations often need to publish or share their data for legitimate reasons
- Sensitive information (e.g. personal identities, restrictive patterns) may be inferred from the published data

Privacy Preserving Data Publishing - Objective
Transform the dataset before publishing, such that:
1. Sensitive information is hidden
   - In our case: Emerging Patterns (EPs)
2. Subsequent analysis remains possible
   - In our case: Frequent Itemset (FIS) Mining

Introduction to Emerging Patterns (EPs)
- Emerging Patterns (EPs) are itemsets over a pair of datasets whose support is significant in one dataset but insignificant in the other
- The two datasets in the example are the classes Income >= 50k and Income < 50k
  Edu   Occup    Marital
  BA    Exec     Married
  BA    Exec     Married
  BA    Exec     Married
  BA    Exec     Married
  MSE   Worker   Never
  Edu   Occup    Marital
  MSE   Exec     Married
  MSE   Exec     Married
  BA    Exec     Married
  BA    Manager  Married
  BA    Repair   Never
- {MSE, Exec} is an Emerging Pattern

Introduction to Emerging Patterns (EPs)
- Formally, growth rate and EPs are defined as follows:
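A sketch of the usual definition from the emerging-pattern literature, on the assumption that the slide follows it. The names $D_1$, $D_2$, $\mathrm{supp}_i$ and the threshold $\rho$ are notation introduced here, not taken from the slides.

```latex
\[
GR_{D_1 \to D_2}(X) =
\begin{cases}
0 & \text{if } \mathrm{supp}_1(X) = 0 \text{ and } \mathrm{supp}_2(X) = 0,\\[4pt]
\infty & \text{if } \mathrm{supp}_1(X) = 0 \text{ and } \mathrm{supp}_2(X) > 0,\\[4pt]
\dfrac{\mathrm{supp}_2(X)}{\mathrm{supp}_1(X)} & \text{otherwise,}
\end{cases}
\]
where $\mathrm{supp}_i(X)$ is the fraction of tuples in $D_i$ that contain the itemset $X$.
An itemset $X$ is an emerging pattern from $D_1$ to $D_2$ if $GR_{D_1 \to D_2}(X) \ge \rho$
for a user-given growth-rate threshold $\rho$.
```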

Introduction to Equivalence Class
- Tuples are said to be in the same Equivalence Class w.r.t. a set of attributes A if they take the same values on A
  ID  Edu   Occup    Marital
  1   MSE   Exec     Married
  2   MSE   Exec     Married
  3   BA    Exec     Married
  4   BA    Manager  Married
  5   BA    Repair   Never
- Tuples {1,2,3} are in the same Equivalence Class w.r.t. {Occup, Marital}
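The grouping described on this slide can be sketched in a few lines of Python. This is an illustration only: the function name `equivalence_classes` is made up here, and the Edu values of tuples 2 and 4 are assumptions, since they are not legible in the slide's table.

```python
from collections import defaultdict

# Toy table from the slide. The Edu values of tuples 2 and 4 are assumed.
ROWS = [
    {"ID": 1, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 2, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 3, "Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    {"ID": 4, "Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    {"ID": 5, "Edu": "BA", "Occup": "Repair", "Marital": "Never"},
]


def equivalence_classes(rows, attributes):
    """Group tuple IDs by the values they take on `attributes`."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        classes[key].append(row["ID"])
    return dict(classes)


classes = equivalence_classes(ROWS, ["Occup", "Marital"])
for key, ids in classes.items():
    print(key, ids)
```

Grouping on {Occup, Marital} puts tuples 1, 2 and 3 together, which matches the statement on the slide.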

Introduction to Generalization
- Extensively studied for achieving k-anonymity
- Not studied before for hiding itemsets
- Replaces the original values in the dataset with more general values, according to a user-given hierarchy, so that more tuples share the same set of attribute values
- Example: In Adult, “BA” and “MSE” may be generalized to “Degree Holder”

Types of Generalization
- Single Dimensional Global Recoding
- Multi Dimensional Global Recoding
- Multi Dimensional Local Recoding
Example hierarchy:
  Occupation
    White Collar: Executive, Manager
    Blue Collar: Repair, Worker

Single Dimensional Global Recoding
- If we decide to generalize some values to a single value, all tuples that contain these values are affected
[Figure: an Occup column with the values Exec, Manager and Repair, before and after Single Dimensional Global Recoding; the recoded column shows Occupation.]

Multi Dimensional Global Recoding
- If we decide to generalize some values to a single value, all tuples in the same equivalence class that contain those values are affected
[Figure: an Occup column with the values Exec, Manager and Repair, before and after Multi Dimensional Global Recoding; the recoded column shows Manager, Repair and Occupation.]

Multi Dimensional Local Recoding
- Same as Multi Dimensional Global Recoding, except without the Equivalence Class constraint
[Figure: an Occup column with the values Exec, Manager and Repair, before and after Multi Dimensional Local Recoding; the recoded column shows Manager, Repair, Exec and Occupation.]
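The difference between the three recoding types can be illustrated with a small Python sketch. It reflects one reading of the slides, not the authors' implementation: the one-level hierarchy `PARENT`, the toy rows and all function names are assumptions made for this example.

```python
# Hypothetical one-level hierarchy: every occupation generalizes to "Occupation".
PARENT = {"Exec": "Occupation", "Manager": "Occupation", "Repair": "Occupation"}

# Toy (Edu, Occup) tuples, invented for this illustration.
ROWS = [("BA", "Exec"), ("BA", "Exec"), ("MSE", "Exec"), ("BA", "Manager"), ("BA", "Repair")]


def single_dim_global(rows, value):
    """Generalize `value` in every tuple that contains it."""
    return [(edu, PARENT[occ] if occ == value else occ) for edu, occ in rows]


def multi_dim_global(rows, eq_class):
    """Generalize Occup in every tuple of one (Edu, Occup) equivalence class."""
    return [(edu, PARENT[occ] if (edu, occ) == eq_class else occ) for edu, occ in rows]


def local_recoding(rows, eq_class, count):
    """Generalize Occup in only `count` tuples of the equivalence class."""
    out, remaining = [], count
    for edu, occ in rows:
        if (edu, occ) == eq_class and remaining > 0:
            out.append((edu, PARENT[occ]))
            remaining -= 1
        else:
            out.append((edu, occ))
    return out


print(single_dim_global(ROWS, "Exec"))
print(multi_dim_global(ROWS, ("BA", "Exec")))
print(local_recoding(ROWS, ("BA", "Exec"), 1))
```

Under this reading, only local recoding can leave both Exec and Occupation inside what was a single equivalence class, which is what the last figure shows.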

Proposed Problem - Why EP and FIS?
- Emerging Patterns may reveal sensitive information
- E.g. in the Adult dataset from the UCI Repository, we found that:
  - {Never-Married, Own-Child} is an EP from the class “Income =50k”
  - Growth Rate: 35
- Frequent itemset mining is a popular data mining task and is supported by commercial data-mining software

Proposed Problem - Why Generalization?
- Other methods have been studied in PPDP
  - For example: adding unknowns, removing tuples, adding fake tuples randomly
  - They produce either incomplete information or fake information
- In some applications, completeness and truthfulness of data are important
- By using generalization, we can preserve the completeness and truthfulness of the data

Proposed Problem - Problem Illustration
[Figure: dataset D is transformed by local recoding into D’; Emerging Patterns and Frequent Itemsets are shown for both D and D’.]

Intuition of Local Recoding
- Support of FIS = 40%, Growth Rate of EP = 3
- Frequent Itemset = {Exec, Married}
- Emerging Pattern = {MSE, Exec}
- The two tables are the classes Income >= 50k and Income < 50k
  Edu   Occup    Marital
  MSE   Exec     Married
  MSE   Exec     Married
  BA    Exec     Married
  BA    Manager  Married
  BA    Repair   Never
  Edu   Occup    Marital
  BA    Exec     Married
  BA    Exec     Married
  BA    Exec     Married
  BA    Worker   Married
  MSE   Manager  Never

Intuition of Local Recoding
- The two tables in each pair are the classes Income >= 50k and Income < 50k
Before local recoding:
  Edu   Occup    Marital
  MSE   Exec     Married
  MSE   Exec     Married
  BA    Exec     Married
  BA    Manager  Married
  BA    Repair   Never
  Edu   Occup    Marital
  BA    Exec     Married
  BA    Exec     Married
  BA    Exec     Married
  BA    Worker   Married
  MSE   Manager  Never
After local recoding:
  Edu   Occup      Marital
  MSE   White col  Married
  MSE   White col  Married
  BA    Exec       Married
  BA    Manager    Married
  BA    Repair     Never
  Edu   Occup      Marital
  BA    Exec       Married
  BA    Exec       Married
  BA    Exec       Married
  BA    Worker     Married
  MSE   White col  Never
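The effect of the recoding on the emerging pattern can be checked with a short Python sketch. It assumes the usual ratio-of-supports definition of growth rate and assumes that both MSE/Exec tuples are Married. The names `with_ep` and `other` are placeholders, because which table belongs to which income class is not assumed here.

```python
import math


def support(rows, itemset):
    """Fraction of rows that contain every item of `itemset`."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def growth_rate(source, target, itemset):
    """Growth rate of `itemset` from `source` to `target` (assumed standard definition)."""
    s, t = support(source, itemset), support(target, itemset)
    if s == 0:
        return 0.0 if t == 0 else math.inf
    return t / s


def table(*rows):
    return [frozenset(r) for r in rows]


# Before local recoding (Marital of the MSE/Exec tuples is assumed to be Married).
with_ep = table(("MSE", "Exec", "Married"), ("MSE", "Exec", "Married"),
                ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
                ("BA", "Repair", "Never"))
other = table(("BA", "Exec", "Married"), ("BA", "Exec", "Married"),
              ("BA", "Exec", "Married"), ("BA", "Worker", "Married"),
              ("MSE", "Manager", "Never"))

# After local recoding: Exec and Manager in the MSE tuples become "White col".
with_ep_recoded = table(("MSE", "White col", "Married"), ("MSE", "White col", "Married"),
                        ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
                        ("BA", "Repair", "Never"))
other_recoded = table(("BA", "Exec", "Married"), ("BA", "Exec", "Married"),
                      ("BA", "Exec", "Married"), ("BA", "Worker", "Married"),
                      ("MSE", "White col", "Never"))

gr_before = growth_rate(other, with_ep, {"MSE", "Exec"})
gr_after = growth_rate(other_recoded, with_ep_recoded, {"MSE", "White col"})
print(gr_before, gr_after)
```

Before recoding, {MSE, Exec} occurs in only one of the two tables, so its growth rate is unbounded. After recoding, the generalized itemset {MSE, White col} occurs in both tables and its growth rate falls below the threshold of 3.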

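Heuristic for the Problem - Greedy Approach
- Repeat until all Emerging Patterns are removed:
  1. Mine the Emerging Patterns (EP 1, EP 2, EP 3, EP 4) from D
  2. Compute the utility gain of each candidate equivalence class
  3. Apply the generalization to the equivalence class with the highest utility gain
  Equivalence Class   Utility Gain
  Class 1             40
  Class 2             90
  Class 3             60
  Class 4             20
  Class 5             15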
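The selection step of the greedy approach amounts to taking the equivalence class with the largest utility gain. A minimal sketch using the numbers from the slide's table follows; the function name `greedy_choice` is made up for this example.

```python
# Utility gains from the slide's example table.
GAINS = {"Class 1": 40, "Class 2": 90, "Class 3": 60, "Class 4": 20, "Class 5": 15}


def greedy_choice(gains):
    """Return the equivalence class with the highest utility gain."""
    return max(gains, key=gains.get)


chosen = greedy_choice(GAINS)
print(chosen)
```

In the full heuristic this choice would be repeated after every recoding, with the emerging patterns and utility gains recomputed each time.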

Heuristic for the Problem - Greedy Approach
- Drawback:
  - Can become trapped in a local minimum
- Solution:
  - Simulated Annealing Style Approach for choosing the equivalence class

Heuristic for the Problem - Simulated Annealing Style Approach
- Choose the equivalence class probabilistically
- Two parameters:
  - Initial temperature ($T_0$)
  - Cooling rate ($\alpha$)
- Acceptance probability: $\exp(\text{Utility Gain} / \text{Temperature})$
- Temperature updating: $T_n = \alpha T_{n-1}$
[Table: acceptance probability of different utility gains and temperatures; columns: Utility Gain, T=1000, T=100, T=]
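One way to turn the exponential weight and the cooling rule into a probabilistic choice is sketched below in Python. Normalizing the weights so that they sum to 1 is an assumption made here, since the slide does not state how the weights become probabilities. The function names and example gains are also illustrative.

```python
import math
import random


def selection_probabilities(gains, temperature):
    """Weights exp(gain / T), normalized to sum to 1 (normalization is an assumption)."""
    top = max(gains.values())
    # Subtracting the largest gain avoids overflow and leaves the ratios unchanged.
    weights = {c: math.exp((g - top) / temperature) for c, g in gains.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def choose_class(gains, temperature, rng=random):
    """Draw one equivalence class according to the selection probabilities."""
    probs = selection_probabilities(gains, temperature)
    classes = list(probs)
    return rng.choices(classes, weights=[probs[c] for c in classes], k=1)[0]


def cool(temperature, alpha):
    """Temperature update T_n = alpha * T_(n-1)."""
    return alpha * temperature


GAINS = {"Class 1": 40, "Class 2": 90, "Class 3": 60, "Class 4": 20, "Class 5": 15}
hot = selection_probabilities(GAINS, 1000)
cold = selection_probabilities(GAINS, 1)
print(hot)
print(cold)
print(choose_class(GAINS, 10, random.Random(0)))
```

At a high temperature the classes are chosen almost uniformly. As the temperature cools, the choice concentrates on the class with the highest utility gain and the behaviour approaches the greedy approach.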

Heuristic for the Problem - Simulated Annealing Style Approach
- Repeat until all Emerging Patterns are removed:
  1. Mine the Emerging Patterns (EP 1, EP 2, EP 3, EP 4) from D
  2. Compute the selection probability of each candidate equivalence class
  3. Apply the generalization and decrease the temperature
  Equivalence Class   Probability
  Class 1             0.2
  Class 2             0.4
  Class 3             0.1
  Class 4
  Class 5             0.05

Two questions
- How to choose an EP for generalization?
- How to calculate the utility gain?

How to choose an EP for generalization?
- Choose the EP that overlaps the most with the remaining EPs
- It is more likely to hide other EPs simultaneously
Emerging Patterns:
  {MSE, Never Married}
  {BA, Divorced}
  {BA, Divorced, Worker}
  {BA, Divorced, Repairman}
  {BA, Divorced, Own-Child}
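A possible reading of "overlaps the most" is sketched below in Python. Measuring overlap as the total number of items shared with the other EPs is an assumption made for this example, and so is the function name `most_overlapping`. The slides do not define the measure.

```python
# Example EPs from the slide, written as sets of items.
EPS = [
    frozenset({"MSE", "Never Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]


def most_overlapping(eps):
    """Return the EP sharing the most items with the other EPs (assumed measure)."""
    def overlap(ep):
        return sum(len(ep & other) for other in eps if other is not ep)
    # max() keeps the first EP among ties, so list order breaks ties.
    return max(eps, key=overlap)


chosen = most_overlapping(EPS)
print(sorted(chosen))
```

With this measure, any EP containing BA and Divorced scores higher than {MSE, Never Married}, so generalizing it has a chance of hiding several EPs at once.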

How to calculate utility gain?
- Utility gain is a function of:
  - Recoding Distance (RD)
  - Reduction of Growth Rate (RG)

How to calculate utility gain? - Recoding Distance (RD)
- The detailed derivation is given in the paper
- Intuitively, it measures:
  - How many FIS have been generalized, and by how much?
  - How many FIS have disappeared?
- High-level definition of RD: $\theta_q \times (\text{generalized FIS}) + (1 - \theta_q) \times (\text{disappeared FIS})$, where $\theta_q$ is a user-defined parameter
- The larger the value of RD, the greater the distortion of the Frequent Itemsets

How to calculate utility gain? - Reduction of Growth Rate (RG)
- After a local recoding is applied, RG is defined as the reduction in the total growth rate of all EPs
Before local recoding:
  Emerging Pattern        Growth Rate
  {Executive, Married}    10
  {BA, Divorced}          20
  {Executive}             30
  Sum of Growth Rate      60
After local recoding:
  Emerging Pattern        Growth Rate
  {White col, Married}    5
  {BA, Divorced}          20
  Sum of Growth Rate      25
- RG = 60 – 25 = 35

How to calculate utility gain?
- Putting all these together, utility gain is defined as: $\theta_p \times RG - (1 - \theta_p) \times RD$, where $\theta_p$ is a user-defined parameter
- It favors:
  - Local recodings that greatly reduce the growth rate
- It penalizes:
  - Local recodings that generate a large distortion of the FIS
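The three quantities can be combined in a small Python sketch. The growth rates are the ones from the RG example. The RD inputs and both theta values are hypothetical numbers chosen only to show the arithmetic, and the high-level RD formula stands in for the detailed derivation in the paper.

```python
def reduction_of_growth_rate(before, after):
    """RG: total growth rate of all EPs before recoding minus the total after."""
    return sum(before.values()) - sum(after.values())


def recoding_distance(generalized_fis, disappeared_fis, theta_q):
    """High-level RD: weighted sum of the generalized-FIS and disappeared-FIS terms."""
    return theta_q * generalized_fis + (1 - theta_q) * disappeared_fis


def utility_gain(rg, rd, theta_p):
    """Utility gain: rewards growth-rate reduction, penalizes FIS distortion."""
    return theta_p * rg - (1 - theta_p) * rd


# Growth rates from the RG example.
before = {("Executive", "Married"): 10, ("BA", "Divorced"): 20, ("Executive",): 30}
after = {("White col", "Married"): 5, ("BA", "Divorced"): 20}

rg = reduction_of_growth_rate(before, after)
# Hypothetical distortion terms and weights, not taken from the slides.
rd = recoding_distance(generalized_fis=4.0, disappeared_fis=2.0, theta_q=0.5)
gain = utility_gain(rg, rd, theta_p=0.5)
print(rg, rd, gain)
```

Raising `theta_p` puts more weight on hiding EPs, and lowering it puts more weight on keeping the frequent itemsets intact.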

Experimental Setup
- Dataset: Adult dataset from UCI Repository
  - Popular benchmark dataset used for generalization
  - Total number of records:
    - Income > 50k: 7508
    - Income <= 50k:
  - Use only 8 categorical attributes for the experiment
  - A well-accepted hierarchy is defined
- Parameters:
  - Support of FIS: 40%
  - Growth rate of EP: 5
  - Initial Temperature: 10
  - Cooling Rate: 0.4

Performance
[Figure: RD / No. of FIS disappeared for the Greedy Approach and for the Simulated Annealing Style Approach (Best of 5)]
- Maximum RD: 623.1

Runtime (in minutes)
[Figure: runtime of the Greedy Approach and of the Simulated Annealing Style Approach (Best of 5)]

Future Research Plan
- Hide EPs in temporal datasets
- Consider multi-level FIS
- Hide a group of emerging patterns at a time

Q & A Any Questions?