Privacy Preserving Data Mining
Ping Chen, University of Houston-Downtown, One Main Street, Houston, Texas
Lila Ghemri, Texas Southern University, Cleburne St, Houston, Texas
Topics
- Overview
- Basic methods in PPDM
- Association rule mining
- Classification
Individual Privacy: Protect the “record”
- Individual items in the database must not be disclosed
- Not necessarily a person
  - Information about a corporation
  - A transaction record
- Disclosure of parts of a record may be allowed
  - As long as it is not individually identifiable information
Individually Identifiable Information
- Data that can't be traced to an individual is not viewed as private
  - So remove the "identifiers"
- But can we ensure it can't be traced?
  - A candidate key may hide in the non-identifier information
  - Some individuals have unique values for certain attributes
PPDM Methods
- Data Obfuscation: nobody sees the real data
- Summarization: only the needed facts are exposed
- Data Separation: data remains with trusted parties
Data Obfuscation
Goal: Hide the protected information
Approaches:
- Randomly modify data
- Swap values between records
- Controlled modification of data to hide secrets
Problems:
- Does it really protect the data?
- Can we learn from the results?
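As a toy illustration of the first two approaches, here is a minimal Python sketch; the record schema, values, and noise bound are invented for illustration:

```python
import random

# Hypothetical records: (age, zip_code, diagnosis).
records = [(34, "77002", "flu"), (51, "77004", "diabetes"), (29, "77002", "flu")]

def randomize_ages(rows, noise=5):
    """Randomly modify data: perturb the age field with bounded uniform noise."""
    return [(age + random.randint(-noise, noise), zip_code, diag)
            for age, zip_code, diag in rows]

def swap_diagnoses(rows):
    """Swap values between records: permute the sensitive column, preserving
    its overall distribution while breaking row-level linkage."""
    diags = [diag for _, _, diag in rows]
    random.shuffle(diags)
    return [(age, zip_code, diag)
            for (age, zip_code, _), diag in zip(rows, diags)]

print(randomize_ages(records))
print(swap_diagnoses(records))
```

Both transformations exhibit the slide's problems directly: small noise may not prevent re-identification, and mining swapped data requires estimators that account for the perturbation.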
Summarization
Goal: Make only innocuous summaries of the data available
Approaches:
- Overall collection statistics
- Limited query functionality
Problems:
- Can we deduce data from the statistics?
- Is the information sufficient?
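A minimal sketch of the "limited query functionality" approach; the salary table and the minimum query-set size K are illustrative assumptions:

```python
# Hypothetical salary table; only aggregates over at least K records are released.
salaries = {"alice": 70000, "bob": 82000, "carol": 91000, "dave": 65000}
K = 3  # minimum number of records a query must cover

def average_salary(names):
    """Release only innocuous summaries: refuse queries over small sets."""
    values = [salaries[n] for n in names if n in salaries]
    if len(values) < K:
        raise ValueError("query refused: too few records")
    return sum(values) / len(values)

print(average_salary(["alice", "bob", "carol", "dave"]))  # allowed
# average_salary(["alice"]) would be refused
```

Even with such a threshold, differencing two permitted queries (everyone versus everyone except one person) can recover an individual value, which is exactly the "can we deduce data from statistics?" problem.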
Data Separation
Goal: Only trusted parties see the data
Approaches:
- Data held by its owner/creator
- Limited release to a trusted third party
- Operations/analysis performed by the trusted party
Problems:
- Will the trusted party be willing to do the analysis?
- Do the analysis results disclose private information?
Association Rules
- Association rule mining is a common data mining task
  - Find A, B, C such that AB → C holds frequently
- Fast algorithms exist for centralized and distributed computation
  - Basic idea: for AB → C to be frequent, AB, AC, and BC must all be frequent (every subset of a frequent itemset must itself be frequent)
  - These algorithms require sharing data
- Secure Multiparty Computation is too expensive
  - Given a function f and n inputs distributed at n sites, compute f(x1, x2, …, xn) without revealing any extra information
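A minimal sketch of the level-wise idea behind those fast algorithms (the Apriori principle); the transactions and support threshold are made up:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
min_support = 3  # absolute support-count threshold

def frequent_itemsets(txns, minsup):
    """Level-wise search: a (k+1)-itemset can be frequent only if
    every one of its k-subsets is frequent."""
    level = list({frozenset([item]) for t in txns for item in t})
    frequent = {}
    while level:
        counts = {c: sum(c <= t for t in txns) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        next_level = set()
        for a, b in combinations(survivors, 2):
            cand = a | b
            # Keep a candidate only if all of its k-subsets survived.
            if len(cand) == len(a) + 1 and all(
                frozenset(s) in survivors
                for s in combinations(cand, len(cand) - 1)
            ):
                next_level.add(cand)
        level = list(next_level)
    return frequent

print(frequent_itemsets(transactions, min_support))
```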
Association Rule Mining: Horizontal Partitioning
- Distributed association rule mining is easy without sharing the individual data: exchanging support counts is enough (see the sketch below)
- What if we do not want to reveal which rule is supported at which site, the support count of each rule, or the database sizes?
  - Hospitals want to participate in a medical study
  - But rules occurring at only one hospital may be a result of bad practices
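For contrast, the easy non-private baseline in which sites simply exchange their support counts (all numbers are fabricated):

```python
# Each site reports (local support count for some rule, local database size).
site_reports = [(18, 300), (45, 500), (12, 200)]  # e.g., three hospitals
min_frequency = 0.05  # global support threshold

total_count = sum(count for count, _ in site_reports)
total_size = sum(size for _, size in site_reports)
print("global support:", total_count / total_size)               # 0.075
print("supported:", total_count >= min_frequency * total_size)   # True
```

This reveals exactly what we may want to hide: each site's support count and database size.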
Association Rules in Horizontally Partitioned Data
Overview of the Method
1. Securely find the union of the locally large candidate itemsets
2. After local pruning, securely compute which itemsets are globally supported
3. Finally, securely check the confidence of the potential rules
Securely Computing Candidates
Key: commutative encryption, where E1(E2(x)) = E2(E1(x))
- Each site computes its local candidate set
- Each site encrypts its candidates and sends them to the next site
- Continue until all sites have encrypted all rules
- Eliminate duplicates: commutative encryption ensures that identical rules yield identical ciphertexts, regardless of the order in which sites encrypted them
- Each site decrypts
- After all sites have decrypted, the remaining rules form the global candidate set
Care is needed to avoid giving away information through ordering, etc.; redundancy may be added to increase security.
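The slides do not name a specific cipher; modular exponentiation under a shared prime (a Pohlig-Hellman/SRA-style scheme) is one well-known construction with the commutative property. A toy sketch, assuming candidate itemsets are pre-encoded as integers in [2, P − 1]:

```python
import secrets
from math import gcd

P = 2**127 - 1  # shared public prime (toy choice; real use needs a vetted large prime)

def keygen():
    """Pick a secret exponent that is invertible mod P-1, so decryption exists."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if gcd(e, P - 1) == 1:
            return e

def encrypt(e, x):
    return pow(x, e, P)

def decrypt(e, y):
    return pow(y, pow(e, -1, P - 1), P)

# Commutativity: E1(E2(x)) == E2(E1(x)). Identical itemsets therefore
# collide to identical ciphertexts no matter which site encrypted first,
# which is what makes the duplicate-elimination step work.
e1, e2 = keygen(), keygen()
x = 123456  # assumed integer encoding of a candidate itemset
assert encrypt(e1, encrypt(e2, x)) == encrypt(e2, encrypt(e1, x))
assert decrypt(e1, decrypt(e2, encrypt(e2, encrypt(e1, x)))) == x
```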
Computing Candidate Sets
Which Candidates Are Globally Supported?
A candidate itemset X is globally supported iff Σ_k count_k(X) ≥ frequency · Σ_k dbsize_k, i.e., iff Sum = Σ_k (count_k(X) − frequency · dbsize_k) ≥ 0.
Which Candidates Are Globally Supported? (Continued)
Now securely compute whether Sum ≥ 0:
- Site 0 generates a random number R and sends R + count_0 − frequency · dbsize_0 to Site 1
- Each Site k adds count_k − frequency · dbsize_k and sends the running total to Site k+1
- Final test: is the total at Site n, minus R, ≥ 0? Site n and Site 0 decide this with secure two-party computation, so neither learns the other's value.
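A single-process simulation of this secure-sum round (counts, sizes, and the frequency threshold are fabricated, and the final comparison is done in the clear rather than by secure two-party computation):

```python
import random

frequency = 0.05
sites = [(18, 300), (45, 500), (12, 200)]  # (count_k, dbsize_k) for each site

# Site 0 masks its contribution with a random offset R.
R = random.randrange(10**9)
running = R + sites[0][0] - frequency * sites[0][1]

# Each subsequent site adds its excess support and forwards the total;
# the mask R hides the true partial sums from every intermediate site.
for count, dbsize in sites[1:]:
    running += count - frequency * dbsize

# In the protocol this last test is a secure two-party comparison between
# the final site (holding `running`) and Site 0 (holding R).
print("Sum >= 0:", running - R >= 0)
```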
Computing Frequent: Is ABC ≥ 5%?
Computing Confidence
The confidence of AB → C is sup(ABC) / sup(AB), so testing confidence ≥ c reduces to the same secure summation: Σ_k (count_k(ABC) − c · count_k(AB)) ≥ 0.
Classification by Decision Tree Learning
- A classic machine learning / data mining problem
- Develop rules for when a transaction belongs to a class, based on its attribute values
- Smaller decision trees are better
- ID3 is one particular algorithm
A Database…

Outlook    Temp   Humidity   Wind     Play Tennis
Sunny      Hot    High       Weak     No
Sunny      Hot    High       Strong   No
Overcast   Mild   High       Weak     Yes
Rain       Mild   High       Weak     Yes
Rain       Cool   Normal     Weak     Yes
Rain       Cool   Normal     Strong   No
Overcast   Cool   Normal     Strong   Yes
Sunny      Mild   High       Weak     No
Sunny      Cool   Normal     Weak     Yes
Rain       Mild   Normal     Weak     Yes
Sunny      Mild   Normal     Strong   Yes
Overcast   Mild   High       Strong   Yes
Overcast   Hot    Normal     Weak     Yes
Rain       Mild   High       Strong   No
… and its Decision Tree

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
The ID3 Algorithm: Definitions
- R: the set of attributes (Outlook, Temperature, Humidity, Wind)
- C: the class attribute (Play Tennis)
- T: the set of transactions (the 14 database entries)
The ID3 Algorithm: ID3(R, C, T)
- If R is empty, return a leaf node with the most common class value in T
- If all transactions in T have the same class value c, return the leaf node c
- Otherwise:
  - Determine the attribute A that best classifies T
  - Create a tree node labeled A and recur to compute the child trees
  - The edge for attribute value a_i leads to the subtree ID3(R − {A}, C, T(a_i)), where T(a_i) is the subset of T with A = a_i
The Best Predicting Attribute
- Information gain: Gain(A) = H_C(T) − H_C(T | A), where H_C(T) is the entropy of the class attribute over T and H_C(T | A) is its conditional entropy given A
- Choose the attribute A with maximum gain
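Putting the last few slides together, a minimal runnable sketch of ID3 with information gain on the PlayTennis table above; attribute and class names follow the slides, while the representation choices are illustrative:

```python
from collections import Counter
from math import log2

# The 14 transactions from the database slide.
COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "Play")
DATA = [dict(zip(COLUMNS, row)) for row in [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]]

def entropy(rows, C):
    """H_C(T): entropy of the class attribute over the transactions."""
    total = len(rows)
    counts = Counter(r[C] for r in rows)
    return -sum(n / total * log2(n / total) for n in counts.values())

def gain(rows, A, C):
    """Gain(A) = H_C(T) - H_C(T | A)."""
    total = len(rows)
    remainder = sum(
        len(part) / total * entropy(part, C)
        for v in {r[A] for r in rows}
        for part in [[r for r in rows if r[A] == v]]
    )
    return entropy(rows, C) - remainder

def id3(R, C, rows):
    classes = [r[C] for r in rows]
    if classes.count(classes[0]) == len(classes):  # all one class: leaf
        return classes[0]
    if not R:                                      # no attributes left: majority leaf
        return Counter(classes).most_common(1)[0][0]
    A = max(R, key=lambda a: gain(rows, a, C))     # best predicting attribute
    return {A: {v: id3(R - {A}, C, [r for r in rows if r[A] == v])
                for v in {r[A] for r in rows}}}

print(round(gain(DATA, "Outlook", "Play"), 3))  # ~0.247, the largest gain
print(id3(set(COLUMNS[:-1]), "Play", DATA))     # Outlook at the root
```

Running it reproduces the tree shown earlier: Humidity under Sunny, Wind under Rain, and a Yes leaf for Overcast.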