Privacy Preserving Data Mining
Ping Chen, U of Houston–Downtown, One Main Street, Houston, Texas 77002
Lila Ghemri, Texas Southern University, 3100 Cleburne St, Houston, Texas 77004
Topics
– Overview
– Basic methods in PPDM
– Association rule mining
– Classification
Individual Privacy: Protect the “record”
– An individual item in the database must not be disclosed
– Not necessarily a person: information about a corporation, a transaction record
– Disclosure of parts of a record may be allowed, provided they are not individually identifiable information
Individually Identifiable Information
– Data that can’t be traced to an individual is not viewed as private: remove the “identifiers”
– But can we ensure it can’t be traced?
  – A candidate key may exist in the non-identifier information
  – Unique values may single out some individuals
PPDM Methods
– Data obfuscation: nobody sees the real data
– Summarization: only the needed facts are exposed
– Data separation: data remains with trusted parties
Data Obfuscation
Goal: hide the protected information
Approaches (the first two are sketched below):
– Randomly modify the data
– Swap values between records
– Controlled modification of the data to hide secrets
Problems:
– Does it really protect the data?
– Can we learn from the results?
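A minimal sketch of the first two approaches, additive noise and value swapping; the function names and the noise scale are illustrative assumptions, not part of any standard library.

    import random

    def perturb(values, scale=1.0):
        # Additive-noise obfuscation: mask each numeric value with
        # zero-mean Gaussian noise; aggregates such as the mean are
        # approximately preserved while individual values are hidden.
        return [v + random.gauss(0, scale) for v in values]

    def swap_column(records, column):
        # Value swapping: shuffle one attribute across records so its
        # distribution is preserved but record-to-value links are broken.
        shuffled = [r[column] for r in records]
        random.shuffle(shuffled)
        return [dict(r, **{column: v}) for r, v in zip(records, shuffled)]

    print(perturb([40000, 52000, 61000], scale=1000))
    print(swap_column([{"age": 34}, {"age": 61}, {"age": 29}], "age"))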
Summarization
Goal: make only innocuous summaries of the data available
Approaches:
– Overall collection statistics
– Limited query functionality (sketched below)
Problems:
– Can we deduce data from the statistics?
– Is the information sufficient?
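A sketch of the limited-query idea: refuse any aggregate whose query set is too small. The threshold of 5 is an arbitrary assumption, and such thresholds alone do not stop tracker attacks that combine overlapping queries, which is exactly the first problem listed above.

    def restricted_sum(db, predicate, field, min_query_set=5):
        # Answer an aggregate query only if it covers enough records;
        # a small query set would let a user infer individual values.
        matching = [row[field] for row in db if predicate(row)]
        if len(matching) < min_query_set:
            raise PermissionError("query set too small")
        return sum(matching)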
Data Separation
Goal: only trusted parties see the data
Approaches:
– Data held by its owner/creator
– Limited release to a trusted third party
– Operations/analysis performed by the trusted party
Problems:
– Will the trusted party be willing to do the analysis?
– Do the analysis results disclose private information?
Association Rules
– Association rule mining is a common data mining task: find A, B, C such that AB → C holds frequently
– Fast algorithms exist for centralized and distributed computation
  – Basic idea: for AB → C to be frequent, AB, AC, and BC must all be frequent (the Apriori property, sketched below)
  – But these algorithms require sharing data
– Generic secure multiparty computation is too expensive: given a function f and n inputs distributed over n sites, compute f(x1, x2, …, xn) without revealing extra information
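A sketch of the Apriori pruning step behind these fast algorithms: a size-k candidate survives only if all of its (k−1)-subsets are frequent. The itemset letters are illustrative.

    from itertools import combinations

    def prune(candidates, frequent_subsets):
        # Apriori property: every (k-1)-subset of a frequent k-itemset
        # must itself be frequent, so other candidates can be dropped
        # before any counting pass over the data.
        return [c for c in candidates
                if all(frozenset(s) in frequent_subsets
                       for s in combinations(c, len(c) - 1))]

    frequent_2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
    print(prune([frozenset("ABC"), frozenset("ABD")], frequent_2))
    # ABC is kept; ABD is pruned because AD and BD are not frequent.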
Association Rule Mining: Horizontal Partitioning
– Distributed association rule mining is easy without sharing the individual data: exchanging support counts is enough (see the sketch below)
– What if we do not want to reveal which rule is supported at which site, the support count of each rule, or the database sizes?
  – Hospitals want to participate in a medical study
  – But rules that occur at only one hospital may be the result of bad practices
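Without the privacy constraint the distributed check is trivial: an itemset is globally frequent at threshold s iff the summed local support counts reach s times the total database size. A non-private sketch with made-up counts; note that it reveals exactly the values the secure protocol is designed to hide.

    def globally_frequent(local_counts, local_db_sizes, s):
        # Each site reveals its raw support count and database size.
        return sum(local_counts) >= s * sum(local_db_sizes)

    # Three hospitals, support counts for one rule, database sizes:
    print(globally_frequent([18, 5, 20], [200, 100, 300], s=0.05))  # True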
Association Rules in Horizontally Partitioned Data
Overview of the Method
1. Securely find the union of the locally large candidate itemsets
2. After local pruning, securely compute which itemsets are globally supported
3. Finally, securely check the confidence of the potential rules
Securely Computing Candidates
Key: commutative encryption, E1(E2(x)) = E2(E1(x))
– Each site computes its local candidate set
– Encrypts it and sends it to the next site
– Continue until every site has encrypted every rule
– Eliminate duplicates: commutative encryption guarantees that identical rules yield identical ciphertexts, regardless of the order of encryption
– Each site decrypts in turn; after all sites have decrypted, the union of candidate rules remains
Care is needed to avoid leaking information through ordering, etc.; redundancy may be added to increase security. A toy example of commutative encryption follows.
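A toy model of commutative encryption via modular exponentiation (Pohlig–Hellman style): E_k(x) = x^k mod p, so encrypting under k1 then k2 equals encrypting under k2 then k1. The prime, the keys, and the integer standing in for an encoded itemset are demo assumptions; a real deployment needs proper encoding, padding, and much larger parameters.

    import math, random

    P = 2**127 - 1   # a Mersenne prime; demo-sized, not a recommendation

    def keygen():
        # The exponent must be coprime to P-1 so an inverse exponent
        # (the decryption key) exists.
        while True:
            k = random.randrange(3, P - 1)
            if math.gcd(k, P - 1) == 1:
                return k

    def enc(k, x):
        return pow(x, k, P)                    # E_k(x) = x^k mod P

    def dec(k, y):
        return pow(y, pow(k, -1, P - 1), P)    # apply the inverse exponent

    k1, k2 = keygen(), keygen()
    x = 123456789                              # stand-in for an encoded rule
    assert enc(k1, enc(k2, x)) == enc(k2, enc(k1, x))   # commutativity
    assert dec(k2, dec(k1, enc(k2, enc(k1, x)))) == x   # decrypt in any order

Because identical plaintexts yield identical double encryptions, the sites can take the union and drop duplicates without ever seeing the rules themselves.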
Computing Candidate Sets
Which Candidates Are Globally Supported?
Which Candidates Are Globally Supported? (Continued)
Now securely compute whether the sum Σ_k (count_k − frequency × dbsize_k) ≥ 0:
– Site 0 generates a random number R and sends R + count_0 − frequency × dbsize_0 to Site 1
– Site k adds count_k − frequency × dbsize_k and sends the result to Site k+1
– Final step: is (the sum held at Site n) − R ≥ 0? Use secure two-party computation (simulated below)
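A simulation of this masked ring sum, where s stands for the frequency threshold. All arithmetic is modulo a large M so intermediate values look uniformly random to each site; the plain unmasking at the end stands in for the secure two-party comparison the protocol actually uses.

    import random

    M = 2**32   # assumed large enough that the true sum lies in (-M/2, M/2]

    def globally_supported(counts, dbsizes, s):
        R = random.randrange(M)            # Site 0's random mask
        running = R
        for c, n in zip(counts, dbsizes):
            # Site k adds count_k - s * dbsize_k before forwarding.
            running = (running + c - round(s * n)) % M
        diff = (running - R) % M           # stand-in for secure comparison
        if diff > M // 2:                  # map back to a signed value
            diff -= M
        return diff >= 0

    print(globally_supported([18, 5, 20], [200, 100, 300], s=0.05))  # True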
Computing Frequent Itemsets: Is support(ABC) ≥ 5%?
Computing Confidence
Checking confidence(X ⇒ Y) = support(X ∪ Y) / support(X) ≥ c reduces to the same masked-sum test: is Σ_k (count_k(X ∪ Y) − c × count_k(X)) ≥ 0? (See the sketch below.)
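A sketch of that reduction with made-up counts; in the protocol the sum below would be masked with R and passed around the ring exactly like the support test.

    def confident(xy_counts, x_counts, c):
        # confidence(X => Y) >= c
        #   <=>  sum(count(XY)) / sum(count(X)) >= c
        #   <=>  sum_k( count_k(XY) - c * count_k(X) ) >= 0
        return sum(xy - c * x for xy, x in zip(xy_counts, x_counts)) >= 0

    print(confident([9, 4, 12], [15, 10, 20], c=0.5))  # True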
Classification by Decision Tree Learning
– A classic machine learning / data mining problem
– Develop rules for when a transaction belongs to a class, based on its attribute values
– Smaller decision trees are better
– ID3 is one particular algorithm
A Database…

Outlook   Temp  Humidity  Wind    Play Tennis
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Mild  High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No
… and its Decision Tree

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
The ID3 Algorithm: Definitions
– R: the set of attributes (Outlook, Temperature, Humidity, Wind)
– C: the class attribute (Play Tennis)
– T: the set of transactions (the 14 database entries)
The ID3 Algorithm
ID3(R, C, T):
– If R is empty, return a leaf node with the most common class value in T
– If all transactions in T have the same class value c, return the leaf node c
– Otherwise:
  – Determine the attribute A that best classifies T
  – Create a tree node labeled A; recur to compute the child trees
  – Edge a_i goes to the tree ID3(R − {A}, C, T(a_i))
The Best Predicting Attribute
– Entropy-based information gain: Gain(A) =def H_C(T) − H_C(T|A)
– Choose the attribute A with maximum gain (sketched below)
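A compact ID3 sketch computing H_C(T) and Gain(A) straight from these definitions; transactions are assumed to be dictionaries keyed by attribute name. Run on the 14 transactions above, it selects Outlook at the root and reproduces the tree shown earlier.

    from collections import Counter
    from math import log2

    def entropy(rows, target):
        # H_C(T): entropy of the class attribute over the transactions.
        n = len(rows)
        counts = Counter(r[target] for r in rows)
        return -sum(c / n * log2(c / n) for c in counts.values())

    def gain(rows, attr, target):
        # Gain(A) = H_C(T) - H_C(T|A), where H_C(T|A) is the
        # size-weighted entropy of the partition induced by A.
        n = len(rows)
        conditional = sum(
            len(part) / n * entropy(part, target)
            for v in {r[attr] for r in rows}
            for part in [[r for r in rows if r[attr] == v]])
        return entropy(rows, target) - conditional

    def id3(rows, attrs, target):
        classes = {r[target] for r in rows}
        if len(classes) == 1:                  # pure node: return leaf c
            return classes.pop()
        if not attrs:                          # R empty: majority class
            return Counter(r[target] for r in rows).most_common(1)[0][0]
        best = max(attrs, key=lambda a: gain(rows, a, target))
        return {best: {v: id3([r for r in rows if r[best] == v],
                              attrs - {best}, target)
                       for v in {r[best] for r in rows}}}

    rows = [{"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
            {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
            {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"}]
    print(id3(rows, {"Outlook", "Wind"}, "Play"))
    # {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes'}}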