Privacy Preserving Data Mining


1 Privacy Preserving Data Mining
Ping Chen, U of Houston–Downtown, One Main Street, Houston, Texas
Lila Ghemri, Texas Southern University, Cleburne St, Houston, Texas

2 Topics
– Overview
– Basic methods in PPDM
– Association rule mining
– Classification

3 Individual Privacy: Protect the “Record”
Individual items in the database must not be disclosed
Not necessarily a person
– Information about a corporation
– A transaction record
Disclosure of parts of a record may be allowed
– But not individually identifiable information

4 Individually Identifiable Information
Data that cannot be traced to an individual is not viewed as private
– Remove “identifiers”
But can we ensure it cannot be traced?
– A candidate key may exist in the non-identifier information
– Some individuals have unique values for certain attributes

5 PPDM Methods
Data obfuscation – nobody sees the real data
Summarization – only the needed facts are exposed
Data separation – data remains with trusted parties

6 Data Obfuscation
Goal: Hide the protected information
Approaches:
– Randomly modify the data
– Swap values between records
– Controlled modification of data to hide secrets
Problems:
– Does it really protect the data?
– Can we learn from the results?
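The first two approaches can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: random noise hides individual numeric values while keeping aggregates estimable, and value swapping preserves a column's distribution while breaking the link to the rest of the record. The record fields used here are invented for the example.

```python
import random

def randomize(values, noise_range=5.0):
    """Perturb each numeric value with uniform noise so the true value
    is hidden, while sums and means remain approximately estimable."""
    return [v + random.uniform(-noise_range, noise_range) for v in values]

def swap_values(records, column):
    """Shuffle one column's values across records: the column's overall
    distribution is unchanged, but row-level linkage is destroyed."""
    col = [r[column] for r in records]
    random.shuffle(col)
    return [{**r, column: c} for r, c in zip(records, col)]
```

Both illustrate the core tension on this slide: the more noise or swapping, the better the protection, but the less one can learn from the result.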

7 Summarization
Goal: Make only innocuous summaries of the data available
Approaches:
– Overall collection statistics
– Limited query functionality
Problems:
– Can we deduce data from the statistics?
– Is the information sufficient?

8 Data Separation
Goal: Only trusted parties see the data
Approaches:
– Data held by its owner/creator
– Limited release to a trusted third party
– Operations/analysis performed by the trusted party
Problems:
– Will the trusted party be willing to do the analysis?
– Do the analysis results disclose private information?

9 Association Rules
Association rules are a common data mining task
– Find A, B, C such that AB → C holds frequently
Fast algorithms exist for centralized and distributed computation
– Basic idea: for AB → C to be frequent, AB, AC, and BC must all be frequent
– These algorithms require sharing data
Secure multiparty computation is too expensive
– Given a function f and n inputs distributed at n sites, compute f(x1, x2, …, xn) without revealing extra information
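The "basic idea" above is the Apriori pruning principle: an itemset can only be frequent if every subset one item smaller is already frequent. A minimal sketch of that candidate check (function name is my own, not from the slides):

```python
from itertools import combinations

def candidate_survives(itemset, frequent_smaller):
    """Apriori pruning: itemset can be frequent only if every subset
    with one fewer item is already known to be frequent."""
    return all(frozenset(s) in frequent_smaller
               for s in combinations(itemset, len(itemset) - 1))
```

For example, {A, B, C} survives pruning only when {A, B}, {A, C}, and {B, C} were all found frequent in the previous pass; this is what lets the fast algorithms avoid counting most candidate itemsets.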

10 Association Rule Mining: Horizontal Partitioning
Distributed Association Rule Mining: Easy without sharing the individual data (Exchanging support counts is enough) What if we do not want to reveal which rule is supported at which site, the support count of each rule, or database sizes? • Hospitals want to participate in a medical study • But rules only occurring at one hospital may be a result of bad practices

11 Association Rules in Horizontally Partitioned Data

12 Overview of the Method
– Securely find the union of the locally large candidate itemsets
– After local pruning, securely compute the globally supported large itemsets
– Finally, securely check the confidence of the potential rules

13 Securely Computing Candidates
Key: Commutative Encryption (E1(E2(x)) = E2(E1(x)) • Compute local candidate set • Encrypt and send to next site • Continue until all sites have encrypted all rules • Eliminate duplicates • Commutative encryption ensures if rules the same, encrypted rules the same, regardless of order • Each site decrypts • After all sites have decrypted, rules left Care needed to avoid giving away information through ordering/etc. Redundancy maybe added in order to increase the security.
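One classic way to get a commutative cipher is exponentiation modulo a prime (a Pohlig–Hellman-style scheme): E_k(x) = x^k mod p. Since (x^a)^b = (x^b)^a mod p, the order in which sites layer their encryptions does not matter, which is exactly the property the protocol needs for duplicate elimination. The prime and keys below are toy illustrations, not a recommendation; a real protocol needs large random keys, each coprime to p - 1 so layers can be removed.

```python
# Toy commutative cipher: E_k(x) = x^k mod p.
P = 2**127 - 1  # a Mersenne prime, small enough for a demo

def encrypt(x, k, p=P):
    """Apply one site's encryption layer."""
    return pow(x, k, p)

def decrypt(y, k, p=P):
    """Remove one site's layer via the inverse exponent mod p - 1
    (requires gcd(k, p - 1) = 1)."""
    return pow(y, pow(k, -1, p - 1), p)
```

With keys k1 and k2, encrypt(encrypt(m, k1), k2) equals encrypt(encrypt(m, k2), k1), so two sites' doubly encrypted itemsets can be compared for equality without either site seeing the other's plaintext candidates.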

14 Computing Candidate Sets

15 Which Candidates Are Globally Supported?

16 Which Candidates Are Globally Supported? (Continued)
Now securely compute whether Sum ≥ 0:
– Site 0 generates a random number R and sends R + count0 − frequency × dbsize0 to Site 1
– Site k adds countk − frequency × dbsizek and sends the running total to Site k+1
– Final result: is the sum at Site n, minus R, ≥ 0? Use secure two-party computation
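The steps above can be simulated in one place to check the arithmetic. This sketch only models the passing of the masked running total; the final comparison would in practice be a secure two-party computation between the last site and Site 0 rather than a plain subtraction.

```python
import random

def secure_sum_excess(counts, dbsizes, frequency):
    """Simulate the secure-sum round: Site 0 masks its excess support
    with a random R; each site k adds count_k - frequency * dbsize_k
    and forwards the running total. Returning (total - R) >= 0 stands
    in for the final secure two-party comparison."""
    R = random.randrange(10**6)
    running = R
    for count_k, dbsize_k in zip(counts, dbsizes):
        running += count_k - frequency * dbsize_k
    return (running - R) >= 0
```

An itemset is globally supported exactly when the total excess over the frequency threshold, summed across all sites, is nonnegative; no single site's excess is ever visible in the clear because R masks the running total.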

17 Computing Frequent Itemsets: Is the Support of ABC ≥ 5%?

18 Computing Confidence

19 Classification by Decision Tree Learning
A classic machine learning / data mining problem
– Develop rules for when a transaction belongs to a class, based on its attribute values
– Smaller decision trees are better
– ID3 is one particular algorithm

20 A Database…
Outlook   Temp  Humidity  Wind    Play Tennis
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Mild  High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

21 … and its Decision Tree
Outlook?
– Sunny → Humidity? (High → No, Normal → Yes)
– Overcast → Yes
– Rain → Wind? (Strong → No, Weak → Yes)

22 The ID3 Algorithm: Definitions
R: the set of attributes (Outlook, Temperature, Humidity, Wind)
C: the class attribute (Play Tennis)
T: the set of transactions (the 14 database entries)

23 The ID3 Algorithm
ID3(R, C, T):
– If R is empty, return a leaf node with the most common class value in T
– If all transactions in T have the same class value c, return the leaf node c
– Otherwise:
  – Determine the attribute A that best classifies T
  – Create a tree node labeled A; recur to compute the child trees
  – Edge ai goes to the tree ID3(R − {A}, C, T(ai))
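The recursion above maps directly onto a short sketch. This is a minimal ID3 skeleton under my own conventions: records are dicts with a "class" key, and the attribute-selection criterion is passed in as `get_gain` (a helper assumed here, e.g. the information gain defined on the next slide).

```python
from collections import Counter

def id3(attributes, records, get_gain):
    """Minimal ID3 following the three cases above."""
    classes = [r["class"] for r in records]
    if not attributes:                       # R empty: majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:               # all same class: leaf c
        return classes[0]
    # Otherwise: pick the attribute that best classifies T ...
    best = max(attributes, key=lambda a: get_gain(a, records))
    tree = {best: {}}
    # ... and recur on each partition T(a_i) with R - {A}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        tree[best][value] = id3([a for a in attributes if a != best],
                                subset, get_gain)
    return tree
```

The returned structure is a nested dict: an internal node is {attribute: {value: subtree}}, and a leaf is a plain class label.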

24 The Best Predicting Attribute
Entropy Gain(A) =def HC(T) - HC(T|A) Find A with maximum gain
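The two terms of the gain formula can be computed directly. A sketch, assuming records are dicts whose class label is stored under a "Play" key (my own convention for the Play Tennis column):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H_C(T): entropy of the class-label distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(records, attribute):
    """Gain(A) = H_C(T) - H_C(T|A): reduction in class entropy after
    partitioning the records on attribute A."""
    labels = [r["Play"] for r in records]
    n = len(records)
    conditional = 0.0
    for value in {r[attribute] for r in records}:
        part = [r["Play"] for r in records if r[attribute] == value]
        conditional += len(part) / n * entropy(part)
    return entropy(labels) - conditional
```

An attribute that splits the records into pure partitions drives H_C(T|A) to zero and so achieves the maximum possible gain, which is why Outlook (whose Overcast branch is all Yes) ends up at the root of the tree on slide 21.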

