An Integer Programming Approach for Frequent Itemset Hiding. Aris Gkoulalas-Divanis, Vassilios S. Verykios. CIKM’06
Outline: Introduction, Basic definitions, Methodology, Experimental results, Conclusions
Introduction: The approach is based on the notion of distance between the original database and the sanitized database. Goal: minimize this distance using integer programming, hiding the sensitive itemsets while minimally affecting the non-sensitive itemsets.
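Basic definitions: the support count of an itemset is computed from the bitmap representation of the database (one row per transaction, one 0/1 column per item, e.g. a, b, c). Objective: maximize the number of 1s left in D’. Non-sensitive itemsets must remain frequent in D’ (support at least msup); sensitive itemsets must become infrequent in D’ (support below msup).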
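The support count in a bitmap representation can be sketched as follows. This is a minimal illustration, not the authors' code: the database D, the column order (a, b, c) and the function name support_count are assumptions made for the example.

```python
from math import prod

def support_count(bitmap, itemset_columns):
    """Count the rows (transactions) that have a 1 in every given column."""
    # The product of the selected bits is 1 only if the row contains the whole itemset.
    return sum(prod(row[m] for m in itemset_columns) for row in bitmap)

# Hypothetical database over items a, b, c (columns 0, 1, 2).
D = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
]

sup_ab = support_count(D, [0, 1])       # transactions containing both a and b
sup_abc = support_count(D, [0, 1, 2])   # transactions containing a, b and c
ones_in_D = sum(map(sum, D))            # the count of 1s that sanitization tries to preserve
```

Sanitization may only turn 1s into 0s, so the number of 1s left in D’ can never exceed ones_in_D.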
(cont.) Solving this problem is NP-hard: there are 2^m - 1 inequalities, one per non-empty itemset (m: the number of items)
(cont.) SI={e,ae,bc} (sensitive itemsets); S={e,bc} (minimal sensitive itemsets); SS={e,ae,bc,ce,abc,……} (the set of all sensitive itemsets and their supersets). Ideal case: F’=F-SS, i.e. the sanitized database D’ contains all the frequent itemsets of D except the sensitive ones (SS).
(cont.) Negative border, ex: acd is infrequent while ac, cd, ad are all frequent. Positive border, ex: ac is frequent while every ac# is infrequent (#: any item).
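Border revision: B-(F)={CD,ABD}, B+(F)={AD,BD,ABC}. [Figure: itemset lattice over A, B, C, D (null; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD), divided into frequent and infrequent itemsets by the original border and the revised border]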
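The two borders listed on this slide can be checked with a short sketch. It assumes the frequent family F is the downward closure of B+(F)={AD,BD,ABC}; the function name borders is hypothetical.

```python
from itertools import combinations

def borders(items, frequent):
    """Return (positive border, negative border) of a downward-closed family."""
    F = {frozenset(x) for x in frequent}
    # Positive border: frequent itemsets with no frequent proper superset.
    pos = {x for x in F if not any(x < y for y in F)}
    # Negative border: infrequent itemsets whose proper subsets are all frequent.
    # Because F is downward closed, checking the (k-1)-subsets is enough.
    neg = set()
    for k in range(1, len(items) + 1):
        for c in combinations(sorted(items), k):
            if frozenset(c) in F:
                continue
            if all(frozenset(s) in F for s in combinations(c, k - 1) if s):
                neg.add(frozenset(c))
    return pos, neg

# Assumed frequent family: every non-empty subset of AD, BD and ABC.
F = ["A", "B", "C", "D", "AB", "AC", "BC", "AD", "BD", "ABC"]
pos, neg = borders("ABCD", F)
```

Under that assumption, pos contains AD, BD and ABC, and neg contains CD and ABD, matching the borders given on the slide.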
Problem size minimization. C: the total set of affected itemsets. Lc: the set of solutions of the corresponding inequalities. For C1, C2 in C: if every solution for C1 is also a solution for C2, the inequality of C2 can be removed without affecting the global solution of the system; C1 then covers C2.
(cont.) Corollary: any itemset belonging to the positive border of F-SS covers all its subsets => B+(F’) covers all itemsets of F’; B-(F’) covers all itemsets outside F’. Ideal solution Lc:
(cont.)
Example: F={A,B,C,D,AB,AC,AD,CD,ACD}, SI={AB}, S={AB}, F’={A,B,C,D,AC,AD,CD,ACD}, B+(F’)={B,ACD}. Required in D’: B frequent, ACD frequent, AB infrequent; msup=0.2
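The sets in this example can be reproduced with a small sketch. The function name ideal_family is hypothetical; the code only derives S, F’ and B+(F’) from F and SI and does not touch any database.

```python
def ideal_family(frequent, sensitive):
    """Derive S (minimal sensitive itemsets), F' = F - SS, and B+(F')."""
    F = {frozenset(x) for x in frequent}
    SI = {frozenset(x) for x in sensitive}
    # Minimal sensitive itemsets: those with no sensitive proper subset.
    S = {x for x in SI if not any(y < x for y in SI)}
    # F' drops every frequent itemset that contains a sensitive itemset.
    F_ideal = {x for x in F if not any(s <= x for s in S)}
    # Positive border of F': its maximal itemsets.
    pos = {x for x in F_ideal if not any(x < y for y in F_ideal)}
    return S, F_ideal, pos

F = ["A", "B", "C", "D", "AB", "AC", "AD", "CD", "ACD"]
S, F_ideal, pos = ideal_family(F, ["AB"])
```

Only the itemsets in pos and in S need inequalities: keeping B and ACD frequent keeps every itemset of F’ frequent, and making AB infrequent hides all of its supersets.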
Constraint satisfaction problem (CSP): a solution of a CSP is a complete assignment of values to the variables that satisfies all the constraints. In a CSP we usually wish to maximize or minimize an objective function subject to a number of constraints. To solve this problem we use “binary integer programming (BIP)”, which transforms the CSP into an optimization problem.
Binary integer problem
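To make the optimization concrete, the sketch below solves a tiny instance by exhaustive search over the binary variables. This is a stand-in for a real BIP solver and not the authors' implementation; the four-transaction database, the absolute threshold min_count=2 and the function names are assumptions made for the example.

```python
from itertools import product

def support(db, itemset):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in db if itemset <= t)

def sanitize(db, keep_frequent, hide, min_count):
    """Maximize the number of 1s kept while satisfying all constraints."""
    # One binary variable per 1 in the original database (1 = keep, 0 = remove).
    cells = [(n, i) for n, t in enumerate(db) for i in sorted(t)]
    best, best_db = -1, None
    for bits in product((0, 1), repeat=len(cells)):
        if sum(bits) <= best:
            continue  # cannot improve the objective
        cand = [set() for _ in db]
        for (n, i), u in zip(cells, bits):
            if u:
                cand[n].add(i)
        ok = (all(support(cand, x) >= min_count for x in keep_frequent)
              and all(support(cand, x) < min_count for x in hide))
        if ok:
            best, best_db = sum(bits), cand
    return best, best_db

# Hypothetical database: F = {a, b, c, ab} at min_count = 2; ab is sensitive.
D = [set("ab"), set("ab"), set("ac"), set("bc")]
keep = [frozenset("a"), frozenset("b"), frozenset("c")]  # B+(F') for this toy case
best, D_prime = sanitize(D, keep, [frozenset("ab")], 2)
```

Exhaustive search is exponential in the number of 1s, so it only serves to illustrate the objective and constraints on toy inputs.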
Experimental results: 10,000 transactions, 10 items, msup=0.1
Conclusions: Defined a new metric to quantify the distance between the initial database D and its sanitized version D’. The approach has the benefit of being exact when an ideal solution can be identified.
Exact Knowledge Hiding through Database Extension. Aris Gkoulalas-Divanis, Vassilios S. Verykios. TKDE’08
Introduction: The goal of the hiding algorithm is to create a minimal extension Dx to the original database Do; the sanitized database D consists of Do together with Dx.
(cont.) S={e,ae,bc}
Methodology: P=|D|, N=|Do|, Q=|Dx|. Ex: e:4, ae:3, bc:4
(cont.) The distance between Do and D is measured on the extension Dx and is to be minimized.
(cont.) Optimal solution set C: S={e,ae,bc}, mfreq=0.3, Q=4, C={e,f,bc,bd,ab,acd}, 0.3*(10+4)-4 = 0.2
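The value Q=4 can be reproduced with a short derivation, sketched below under stated assumptions: the supports e:4, ae:3, bc:4 are taken to be counts in Do, N=10 is read off the expression 0.3*(10+4)-4, and a sensitive itemset is considered hidden when its support in D is strictly below mfreq*(N+Q). The function name min_extension_size is hypothetical.

```python
from fractions import Fraction
from math import floor

def min_extension_size(sensitive_supports, n_original, mfreq):
    """Smallest Q such that every sensitive itemset can be infrequent in D.

    Best case: the itemset never appears in Dx, so it is hidden when
    sup < mfreq * (N + Q), i.e. Q > sup / mfreq - N.
    """
    def needed(sup):
        return max(0, floor(Fraction(sup) / mfreq - n_original) + 1)
    return max(needed(s) for s in sensitive_supports.values())

supports = {"e": 4, "ae": 3, "bc": 4}   # assumed support counts in Do
mfreq = Fraction(3, 10)                 # exact arithmetic avoids float rounding
Q = min_extension_size(supports, 10, mfreq)

# Occurrences of e that Dx would have to add for e to reach the threshold in D.
slack_e = mfreq * (10 + Q) - supports["e"]
```

With these inputs Q evaluates to 4 and slack_e to 1/5 (0.2), which agrees with the figures on the slide: e must not appear in any transaction of Dx.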
Safety margin: the lower bound Q may, under certain circumstances, be insufficient to allow the identification of an exact solution. Safety margin (SM): expands the size Q of Dx; it can be predefined or computed dynamically. Ex: S={abc}: only 1 transaction is insufficient to provide an exact solution.
(cont.) Null transactions (transactions with no items) in Dx are due to either (i) an unnecessarily large safety margin, in which case they should be removed from Dx, or (ii) a large value of Q that is essential for proper hiding, in which case they need to be validated, since Q denotes the lower bound on the number of transactions needed to ensure proper hiding.
(cont.) To ensure the minimum size of Dx, the hiding algorithm keeps only k null transactions. Qinv: the number of null transactions; V=Q+SM-Qinv. Ex: S={abc}, Q=1, SM=3: k=max(1-3,0)=0
Experimental results
(cont.)
Conclusions: Uses a minimal extension to the original database. The approach has the benefit of being exact when an ideal solution can be identified.