Privacy Preserving Data Mining


Privacy Preserving Data Mining
Ping Chen, U of Houston – Downtown, One Main Street, Houston, Texas 77002
Lila Ghemri, Texas Southern University, 3100 Cleburne St, Houston, Texas 77004

Topics
– Overview
– Basic methods in PPDM
– Association rule mining
– Classification

Individual Privacy: Protect the “record”
An individual item in the database must not be disclosed
Not necessarily a person
– Information about a corporation
– A transaction record
Disclosure of parts of a record may be allowed
– Individually identifiable information

Individually Identifiable Information
Data that can’t be traced to an individual is not viewed as private
– Remove “identifiers”
But can we ensure it can’t be traced?
– A candidate key in the non-identifier information
– Unique values for some individuals

PPDM Methods
Data Obfuscation – nobody sees the real data
Summarization – only the needed facts are exposed
Data Separation – the data remains with trusted parties

Data Obfuscation
Goal: hide the protected information
Approaches:
– Randomly modify the data
– Swap values between records
– Controlled modification of the data to hide secrets
Problems:
– Does it really protect the data?
– Can we still learn from the results?
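
One of the simplest obfuscations listed above is value swapping: permute a sensitive column across records, so column-level aggregates survive while no individual row can be trusted. A minimal sketch (the records and field names are made up for illustration):

```python
import random

def swap_column(records, field, rng=random):
    """Return copies of the records with `field` randomly permuted.

    The multiset of values in the column is preserved, so counts and
    frequencies computed over it are unchanged, but the link between
    each record and its true value is broken.
    """
    values = [r[field] for r in records]
    rng.shuffle(values)                      # random permutation of the column
    return [{**r, field: v} for r, v in zip(records, values)]

# Hypothetical patient records for illustration.
patients = [{"zip": "77002", "diagnosis": "flu"},
            {"zip": "77004", "diagnosis": "asthma"},
            {"zip": "77030", "diagnosis": "flu"}]
obfuscated = swap_column(patients, "diagnosis")
```

This is exactly the trade-off the slide raises: a frequency-based miner still works on the swapped column, but any analysis that depends on the association between `zip` and `diagnosis` is destroyed.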

Summarization
Goal: make only innocuous summaries of the data available
Approaches:
– Overall collection statistics
– Limited query functionality
Problems:
– Can we deduce data from the statistics?
– Is the information sufficient?

Data Separation
Goal: only trusted parties see the data
Approaches:
– Data held by the owner/creator
– Limited release to a trusted third party
– Operations/analysis performed by the trusted party
Problems:
– Will the trusted party be willing to do the analysis?
– Do the analysis results disclose private information?

Association Rules
Association rules are a common data mining task
– Find A, B, C such that the rule AB → C holds frequently
Fast algorithms exist for centralized and distributed computation
– Basic idea: for the itemset ABC to be frequent, its subsets AB, AC, and BC must all be frequent
– These algorithms require sharing data
Secure Multiparty Computation is too expensive
– Given a function f and n inputs distributed at n sites, compute f(x1, x2, …, xn) without revealing extra information
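
The downward-closure observation (every subset of a frequent itemset must itself be frequent) is the basis of Apriori-style mining. A minimal, non-private sketch, with made-up transactions:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Level-wise Apriori: grow size-k candidates only from frequent
    size-(k-1) itemsets, pruning any candidate with an infrequent subset."""
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    level = {c for c in items
             if sum(c <= t for t in transactions) >= min_count}
    k = 1
    while level:
        freq.update({c: sum(c <= t for t in transactions) for c in level})
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count}
    return freq

tx = [frozenset(t) for t in ({"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                             {"B", "C"}, {"A", "B", "C"})]
freq = frequent_itemsets(tx, min_count=3)
```

On these five transactions all singletons and pairs are frequent at a count threshold of 3, but {A, B, C} (count 2) is not, so no rule with three items survives.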

Association Rule Mining: Horizontal Partitioning
Distributed association rule mining is easy without sharing the individual data (exchanging support counts is enough)
What if we do not want to reveal which rule is supported at which site, the support count of each rule, or the database sizes?
• Hospitals want to participate in a medical study
• But rules occurring at only one hospital may be a result of bad practices
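
The non-private version of "exchanging support counts is enough" is just a sum: an itemset is globally frequent iff the pooled support fraction reaches the threshold. A sketch, with invented hospital counts and database sizes; note that it reveals exactly the per-site counts the secure protocol is designed to hide:

```python
def globally_frequent(local_counts, local_sizes, min_support):
    """Global support fraction = sum(counts) / sum(sizes); compare to threshold."""
    return sum(local_counts) >= min_support * sum(local_sizes)

# Three hypothetical hospitals: an itemset supported 45, 10, and 5 times
# in databases of 500, 300, and 200 records -> 60/1000 = 6% >= 5%.
supported = globally_frequent([45, 10, 5], [500, 300, 200], 0.05)
```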

Association Rules in Horizontally Partitioned Data

Overview of the Method
– Securely find the union of the locally large candidate itemsets
– After local pruning, securely compute the globally supported large itemsets
– Finally, securely check the confidence of the potential rules

Securely Computing Candidates
Key: commutative encryption, E1(E2(x)) = E2(E1(x))
• Compute the local candidate set
• Encrypt it and send it to the next site
• Continue until all sites have encrypted all rules
• Eliminate duplicates
• Commutative encryption ensures that identical rules encrypt to identical values, regardless of the order of encryption
• Each site decrypts
• After all sites have decrypted, the candidate rules remain
Care is needed to avoid giving away information through ordering, etc.; redundancy may be added to increase security.
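
The commutative property can be illustrated with the Pohlig–Hellman exponentiation cipher, E_k(x) = x^k mod p, one standard way to realize commutative encryption (the slide does not fix a cipher, so this is an assumption; the prime and keys below are toy values, far too small for real security):

```python
p = 1019  # toy safe prime: p = 2q + 1 with q = 509 prime

def encrypt(key, x):
    # Commutative: encrypt(a, encrypt(b, x)) == encrypt(b, encrypt(a, x)),
    # because (x**b)**a == (x**a)**b (mod p).
    return pow(x, key, p)

def decrypt(key, y):
    # The decryption exponent is key's inverse modulo p - 1,
    # so keys must be chosen coprime to p - 1.
    return pow(y, pow(key, -1, p - 1), p)

item = 123      # an itemset encoded as an integer in [1, p - 1]
a, b = 3, 7     # two sites' secret keys (both coprime to p - 1)
both = encrypt(a, encrypt(b, item))   # encrypted by both sites, either order
```

Because both sites' encryptions of equal itemsets collide, duplicates can be eliminated from the doubly encrypted union without anyone seeing which site contributed which rule.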

Computing Candidate Sets

Compute Which Candidates Are Globally Supported?

Which Candidates Are Globally Supported? (Continued)
Now securely compute whether Sum ≥ 0:
– Site 0 generates a random number R and sends R + count0 − frequency × dbsize0 to Site 1
– Site k adds countk − frequency × dbsizek and sends the result to Site k+1
– Final result: is the sum at Site n, minus R, ≥ 0? Use secure two-party computation
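
The masked-sum round can be sketched directly. Each site holds an "excess" count_k − frequency × dbsize_k, and the itemset is globally supported iff the excesses sum to ≥ 0. The modulus is an illustrative choice, and the final unmasking below stands in for the secure two-party comparison with Site 0:

```python
import random

M = 2 ** 32   # illustrative modulus, larger than any possible |sum|

def secure_sum(excesses):
    """One ring round: Site 0 masks with R, each site adds its excess."""
    R = random.randrange(M)              # Site 0's random mask
    running = (R + excesses[0]) % M      # Site 0 sends this to Site 1
    for e in excesses[1:]:               # Site k adds its excess, passes it on
        running = (running + e) % M      # partial sums look uniform mod M
    # Stand-in for the secure two-party comparison between Site n and Site 0:
    s = (running - R) % M
    return s - M if s > M // 2 else s    # interpret the mod-M value as signed
```

With excesses [5, −3, 1] the unmasked total is 3 ≥ 0, so that candidate is globally supported; no intermediate site ever sees an unmasked partial sum.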

Computing Frequent Itemsets: Is the Support of ABC ≥ 5%?

Computing Confidence

Classification by Decision Tree Learning
A classic machine learning / data mining problem
Develop rules for when a transaction belongs to a class, based on its attribute values
Smaller decision trees are better
ID3 is one particular algorithm

A Database…

Outlook   Temp  Humidity  Wind    Play Tennis
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Mild  High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

… and its Decision Tree

Outlook
– Sunny → Humidity
  – High → No
  – Normal → Yes
– Overcast → Yes
– Rain → Wind
  – Strong → No
  – Weak → Yes

The ID3 Algorithm: Definitions
R: the set of attributes (Outlook, Temperature, Humidity, Wind)
C: the class attribute (Play Tennis)
T: the set of transactions (the 14 database entries)

The ID3 Algorithm
ID3(R, C, T):
– If R is empty, return a leaf node with the most common class value in T
– If all transactions in T have the same class value c, return the leaf node c
– Otherwise:
  – Determine the attribute A that best classifies T
  – Create a tree node labeled A and recur to compute the child trees
  – Edge ai goes to the subtree ID3(R − {A}, C, T(ai)), where T(ai) holds the transactions with A = ai

The Best Predicting Attribute
Entropy-based information gain: Gain(A) =def H_C(T) − H_C(T | A), where H_C(T) is the entropy of the class attribute C over T, and H_C(T | A) is that entropy after partitioning T by the values of A
Choose the attribute A with maximum gain
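
Putting the last few slides together, here is a compact, non-private ID3 sketch: entropy-based gain picks each split, and running it on the 14 rows of the database yields the tree shown earlier, rooted at Outlook. Column and variable names are my own choices:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """H_C(T): entropy of the class attribute over the given rows."""
    n = len(rows)
    return -sum(c / n * log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def gain(rows, attr, target):
    """Gain(A) = H_C(T) - H_C(T|A): entropy drop after splitting on attr."""
    n = len(rows)
    split = sum(len(part) / n * entropy(part, target)
                for v in {r[attr] for r in rows}
                for part in [[r for r in rows if r[attr] == v]])
    return entropy(rows, target) - split

def id3(rows, attrs, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:
        return classes.pop()                              # pure node -> leaf
    if not attrs:                                         # R empty -> majority
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          attrs - {best}, target)
                   for v in {r[best] for r in rows}}}

COLS = ["Outlook", "Temp", "Humidity", "Wind", "Play"]
DATA = [dict(zip(COLS, row.split())) for row in [
    "Sunny Hot High Weak No",          "Sunny Hot High Strong No",
    "Overcast Mild High Weak Yes",     "Rain Mild High Weak Yes",
    "Rain Cool Normal Weak Yes",       "Rain Cool Normal Strong No",
    "Overcast Cool Normal Strong Yes", "Sunny Mild High Weak No",
    "Sunny Cool Normal Weak Yes",      "Rain Mild Normal Weak Yes",
    "Sunny Mild Normal Strong Yes",    "Overcast Mild High Strong Yes",
    "Overcast Hot Normal Weak Yes",    "Rain Mild High Strong No"]]

tree = id3(DATA, set(COLS[:-1]), "Play")
```

Outlook wins the first split because the Overcast branch becomes pure (all Yes), and the Sunny and Rain branches are then settled perfectly by Humidity and Wind respectively, matching the decision tree slide.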