Slide 1: Data Security against Knowledge Loss *) by Zbigniew W. Ras, University of North Carolina, Charlotte, USA
Slide 2: Data Security against Knowledge Discovery

Possible challenge problems:
- The Centers for Disease Control (CDC) use data mining to identify trends and patterns in disease outbreaks, such as understanding and predicting the progression of a flu epidemic.
- Insurance companies have considerable data that would be useful for this, but they are unwilling to disclose it due to patient privacy concerns.
- An alternative approach is to have insurance companies provide knowledge extracted from their data that cannot be traced to individual people, but can still be used to identify the trends and patterns of interest to the CDC.
Slide 3: Data Security against Knowledge Discovery

Collaborative corporations:
- Ford and Firestone shared a problem with a jointly produced product, the Ford Explorer with Firestone tires. They may have been able to use association rule techniques to detect problems earlier, but this would have required extensive data sharing.
- Factors such as trade secrets and agreements with other manufacturers stand in the way of the needed data sharing.
- Could we obtain the same results by sharing the knowledge, while still preserving the secrecy of each side's data?
Slide 4: Data Security against Knowledge Discovery

Possible approach (developing a joint classifier):
- Lindell and Pinkas (CRYPTO'00) proposed a method that enables two parties to build a decision tree without either party learning anything about the other party's data, except what might be revealed through the final decision tree.
- Clifton and Du (SIGMOD'02) proposed a method that enables two parties to build association rules without either party learning anything about the other party's data.

Alternative approach:
- Each site develops a classifier independently, and these are used jointly to produce the global classifier. This protects individual entities, but it has to be shown that the individual classifiers do not release private information.
Slide 5: Data Security against Knowledge Discovery

- Local data security: knowledge extracted from remote data cannot be traced to objects stored locally and used to reveal secure information about them.
- Secure multiparty computation: a computation is secure if, at the end of the computation, no party knows anything except its own input and the results [Yao, 1986].
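Yao's notion is often illustrated with a toy secure-sum protocol. The sketch below is my illustration, not part of the slides: each party only ever sees a running total masked by the initiator's random offset, so no party learns another party's input, only the final sum. All names are assumptions, and the semi-honest (non-colluding) setting is assumed.

```python
import random

def secure_sum(private_inputs, modulus=10**9):
    """Toy secure-sum sketch (semi-honest model): the initiator adds a
    random mask, each party adds its private value to the running total,
    and the initiator removes the mask at the end. Only the sum is revealed."""
    mask = random.randrange(modulus)          # initiator's secret mask
    running = mask
    for value in private_inputs:              # total travels party to party
        running = (running + value) % modulus
    return (running - mask) % modulus         # initiator unmasks the result

# Three parties jointly compute 5 + 11 + 7 = 23 without disclosing inputs.
print(secure_sum([5, 11, 7]))
```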
Slide 6: [Figure: an ontology connects three incomplete information systems: S1 with attributes (b, a, d, e), S2 with attributes (g, a, b, c), and the queried system S with attributes (a, b, c, d). A knowledge base KB at S holds rules r1, r2 extracted from the remote systems, each listed with its support and source system, and is used to answer the query q_S = [a : c, d, b].]
Slide 7: [Figure: the same distributed setting, now answering the query q_S = [a, c, d : b]. The knowledge base KB at S is populated with the rules below.]

KB:
rule          | support | system
b1 -> a2      | 1       | S1
b2 * d2 -> a2 | 1       | S1
b2 -> a2      | 1       | S1
c1 * b1 -> a1 | 1       | S2
Slide 8: Data Security against Knowledge Discovery

Problem: give a strategy for identifying the minimal number of cells which additionally have to be hidden at site S (part of a distributed information system, DIS) in order to guarantee that the hidden attribute in S cannot be reconstructed by distributed knowledge discovery.

Site S:
a  | b  | c  | d
a1 | b2 | c2 | d1
a2 | b1 | c2 |
a2 | b2 | c1 | d2
a2 | b1 | c1 | d2

KB (rule, confidence, system): not yet populated.
Slide 9: Data Security against Knowledge Discovery

Problem: give a strategy for identifying the minimal number of cells in S which additionally have to be hidden in order to guarantee that attribute a cannot be reconstructed by distributed knowledge discovery.

Site S (attribute a hidden):
a | b  | c  | d
  | b2 | c2 | d1
  | b1 | c2 |
  | b2 | c1 | d2
  | b1 | c1 | d2

KB (rule, confidence, system): not yet populated.
Slide 10: Data Security against Knowledge Discovery

Problem: give a strategy for identifying the minimal number of cells in S which additionally have to be hidden in order to guarantee that attribute a cannot be reconstructed by distributed knowledge discovery.

Site S (attribute a hidden):
a | b  | c  | d
  | b2 | c2 | d1
  | b1 | c2 |
  | b2 | c1 | d2
  | b1 | c1 | d2

KB:
rule          | confidence | system
b1 -> a2      | 3/4        | S2
b2 * d1 -> a1 | 1          | S2
b2 * d2 -> a2 | 1          | S1
c1 * b1 -> a1 | 1          | S2
Slide 11:

Original site S:
a  | b  | c  | d
a1 | b2 | c2 | d1
a2 | b1 | c2 |
a2 | b2 | c1 | d2
a2 | b1 | c1 | d2

Reconstructed site S (using the KB below):
a  | b  | c  | d
a1 | b2 | c2 | d1
a2 | b1 | c2 |
a2 | b2 | c1 | d2
a1 | b1 | c1 | d2

KB:
rule          | confidence | system
b1 -> a2      | 3/4        | S2
b2 * d1 -> a1 | 1          | S2
b2 * d2 -> a2 | 1          | S1
c1 * b1 -> a1 | 1          | S2

Note that the last row is reconstructed incorrectly (a1 instead of a2): both b1 -> a2 and c1 * b1 -> a1 apply, and the rule with the higher confidence wins.
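To make the reconstruction concrete, here is a minimal Python sketch (my illustration; resolving conflicts by highest confidence is an assumption the slides do not state explicitly) that reproduces the reconstructed table above from the KB rules:

```python
def reconstruct(row, rules):
    """Predict the hidden attribute of one row: among all rules whose
    premises are satisfied by the row, take the conclusion of the rule
    with the highest confidence."""
    fired = [(conf, concl) for premises, concl, conf in rules
             if premises.issubset(row)]
    return max(fired)[1] if fired else None

# Knowledge base from the slide: (premises, predicted a-value, confidence).
KB = [({"b1"}, "a2", 3/4),
      ({"b2", "d1"}, "a1", 1.0),
      ({"b2", "d2"}, "a2", 1.0),
      ({"c1", "b1"}, "a1", 1.0)]

# Site S with attribute a hidden (rows as sets of remaining values).
S = [{"b2", "c2", "d1"}, {"b1", "c2"}, {"b2", "c1", "d2"}, {"b1", "c1", "d2"}]

for row in S:
    print(sorted(row), "->", reconstruct(row, KB))
# Rows 1-3 are restored correctly (a1, a2, a2); row 4 is predicted a1,
# which is why the reconstructed site differs from the original (a2) there.
```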
Slide 12: KDD Lab Research Problem: Disclosure Risk of Confidential Data

Give a strategy that identifies the minimum number of attribute values that need to be additionally hidden from an information system S to guarantee that a hidden attribute cannot be reconstructed by local and distributed Chase.
Slide 13: Chain of predictions by global and local rules (KDD Lab)

- Object x6: (a2, b2, c3, sal=$50,000).
- The confidential value sal=$50,000 is hidden, leaving (a2, b2, c3).
- The global rule r1 = [a2 * b2 -> sal=$50,000] restores the confidential value, so we additionally hide b2, leaving (a2, c3).
- But the local rule r2 = [c3 -> b2] restores b2, and then r1 restores sal=$50,000 again.
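A minimal sketch of this chase-style chain (helper names are mine; the two rules are exactly those on the slide), showing that hiding the salary and b2 is not enough:

```python
# Global rule r1: a2 & b2 => sal=$50,000 ; local rule r2: c3 => b2.
RULES = [({"a2", "b2"}, "sal=$50,000"),   # r1, discovered remotely
         ({"c3"}, "b2")]                  # r2, discovered locally

def chase(values, rules):
    """Repeatedly apply rules until nothing new can be inferred."""
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises.issubset(values) and conclusion not in values:
                values.add(conclusion)
                changed = True
    return values

x6 = {"a2", "c3"}        # sal=$50,000 and b2 hidden, as on the slide
print(chase(x6, RULES))  # r2 restores b2, then r1 restores sal=$50,000
```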
Slide 14: Algorithm SCIKD (bottom-up strategy)

KB, the knowledge base (D is the decision attribute):

rule | A  | B  | C  | D  | E  | F  | G
r1   | a1 | b1 | c1 |    |    |    |
r2   | a1 |    | c1 |    |    | f1 |
r3   |    | b1 | c1 |    |    |    |
r4   |    | b1 |    |    | e1 |    |
r5   | a1 |    | c1 |    |    | f1 |
r6   | a1 |    | c1 |    | e1 |    |
r7   |    |    | c1 |    | e1 |    | g1
r8   | a1 |    | c1 | d1 |    |    |
r9   |    | b1 | c1 | d1 |    |    |
r10  |    |    |    | d1 |    | f1 |

For example, r1 = [b1 * c1 -> a1] and r2 = [c1 * f1 -> a1].
Slide 15: Algorithm SCIKD (bottom-up strategy)

{a1}* = {a1}: unmarked
{b1}* = {b1}: unmarked
{c1}* = {a1, b1, c1, d1, e1} ⊇ {d1}: marked
{e1}* = {b1, e1}: unmarked
{f1}* = {d1, f1} ⊇ {d1}: marked
{g1}* = {g1}: unmarked
{a1, b1}* = {a1, b1}: unmarked
{a1, e1}* = {a1, b1, e1}: unmarked
{a1, g1}* = {a1, g1}: unmarked
{b1, e1}* = {b1, e1}: unmarked
{b1, g1}* = {b1, g1, e1}: unmarked
{e1, g1}* = {a1, b1, c1, d1, e1, g1} ⊇ {d1}: marked
{a1, b1, e1}* = {a1, b1, e1}: unmarked /maximal subset/
{a1, b1, g1}* = {a1, b1, g1}: unmarked /maximal subset/
{b1, e1, g1}* ⊇ {e1, g1}*: marked
{a1, b1, e1, g1}* ⊇ {e1, g1}*: marked
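A sketch of this closure-and-marking step (the closure operator {X}* and the marking criterion follow the slide; only rules r1 and r2 are spelled out in the deck, so the KB below is truncated and serves only to show the mechanics):

```python
def closure(values, rules):
    """{X}*: close a set of attribute values under the KB rules."""
    closed = set(values)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= closed and conclusion not in closed:
                closed.add(conclusion)
                changed = True
    return closed

def marked(values, rules, protected="d1"):
    """A candidate set is marked (rejected) if its closure reveals the
    value that must stay hidden."""
    return protected in closure(values, rules)

# Only r1 and r2 are given explicitly on the slides; the remaining eight
# rules of the KB would be listed the same way.
KB = [({"b1", "c1"}, "a1"),   # r1 = [b1 * c1 -> a1]
      ({"c1", "f1"}, "a1")]   # r2 = [c1 * f1 -> a1]

print(marked({"a1", "b1", "e1"}, KB))  # False: this subset is safe to disclose
```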
Slide 16: Data security versus knowledge loss (KDD Lab)

{a1, b1, e1}* = {a1, b1, e1}: unmarked /maximal subset/
{a1, b1, g1}* = {a1, b1, g1}: unmarked /maximal subset/

Database (d1 has to be hidden):
X  | A  | B  | C  | D  | E  | F  | G
x1 | a1 | b1 | c1 | d1 | e1 | f1 | g1
x2 |

Resulting databases after hiding:
X  | A  | B  | C | D | E  | F | G
x1 | a1 | b1 |   |   |    |   | g1
x2 |

X  | A  | B  | C | D | E  | F | G
x1 | a1 | b1 |   |   | e1 |   |
x2 |
Slide 17: Data security versus knowledge loss (KDD Lab)

Database D1:
X  | A  | B  | C | D | E | F | G
x1 | a1 | b1 |   |   |   |   | g1
x2 |

Database D2:
X  | A  | B  | C | D | E  | F | G
x1 | a1 | b1 |   |   | e1 |   |
x2 |

Let R_k = {r_{k,i} : i ∈ I_k} be the set of rules extracted from D_k, and let [c_{k,i}, s_{k,i}] denote the confidence and support of rule r_{k,i}, for all i ∈ I_k, k = 1, 2.

With R_k we associate the number K(R_k) = Σ{ c_{k,i} · s_{k,i} : i ∈ I_k }. D1 contains more hidden knowledge than D2 if K(R1) ≥ K(R2).
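A minimal sketch of the comparison (the rule lists below are hypothetical; the slides do not give concrete R1 and R2):

```python
def K(rules):
    """K(R_k): sum of confidence * support over the rules extractable
    from the sanitized database D_k. rules: (confidence, support) pairs."""
    return sum(c * s for c, s in rules)

R1 = [(1.0, 3), (0.75, 4)]   # hypothetical rules extractable from D1
R2 = [(1.0, 2)]              # hypothetical rules extractable from D2
print(K(R1) >= K(R2))        # True: D1 contains more hidden knowledge
```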
Slide 18: Objective Interestingness

Basic measures for a rule A -> B:
- Domain: card[A]
- Support (strength): card[A ∩ B]
- Confidence (certainty factor): card[A ∩ B] / card[A]
- Coverage factor: card[A ∩ B] / card[B]
- Leverage: card[A ∩ B] - card[A] · card[B]
- Lift: card[A ∩ B] / (card[A] · card[B])
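These measures are straightforward to compute. The sketch below (my illustration; the data is hypothetical) uses the slide's card[.] counts directly, so leverage and lift appear in their unnormalized card-based forms; the probability-based variants divide each count by the table size N:

```python
def measures(rows, A, B):
    """Card-based interestingness measures for a rule A -> B, where rows,
    A, and B are sets of attribute values."""
    nA  = sum(A <= r for r in rows)             # card[A]
    nB  = sum(B <= r for r in rows)             # card[B]
    nAB = sum(A <= r and B <= r for r in rows)  # card[A ∩ B]
    return {"domain": nA,
            "support": nAB,
            "confidence": nAB / nA,
            "coverage": nAB / nB,
            "leverage": nAB - nA * nB,
            "lift": nAB / (nA * nB)}

rows = [{"a1", "b1"}, {"a1", "b2"}, {"a1", "b1"}, {"a2", "b1"}]
print(measures(rows, A={"a1"}, B={"b1"}))  # confidence = coverage = 2/3
```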
Slide 19: Data security versus knowledge loss (KDD Lab)

Database D1:
X  | A  | B  | C | D | E | F | G
x1 | a1 | b1 |   |   |   |   | g1
x2 |

Database D2:
X  | A  | B  | C | D | E  | F | G
x1 | a1 | b1 |   |   | e1 |   |
x2 |

Let R_k = {r_{k,i} : i ∈ I_k} be the set of rules extracted from D_k, and let [c_{k,i}, s_{k,i}] denote the coverage factor and support of rule r_{k,i}, for all i ∈ I_k, k = 1, 2.

With R_k we associate the number K(R_k) = Σ{ c_{k,i} · s_{k,i} : i ∈ I_k }. D1 contains more hidden knowledge than D2 if K(R1) ≥ K(R2).
Slide 20: Data security versus knowledge loss (KDD Lab)

Database D1:
X  | A  | B  | C | D | E | F | G
x1 | a1 | b1 |   |   |   |   | g1

Database D2:
X  | A  | B  | C | D | E  | F | G
x1 | a1 | b1 |   |   | e1 |   |

Action rule r: [(b1, v1 -> w1) ∧ (b2, v2 -> w2) ∧ … ∧ (bp, vp -> wp)](x) ⇒ (d, k1 -> k2)(x).
The cost of r in D: cost_D(r) = Σ{ ρ_D(v_i, w_i) : 1 ≤ i ≤ p }.
Action rule r is feasible in D if cost_D(r) < ρ_D(k1, k2).

Let R_k = {r_{k,i} : i ∈ I_k} be the set of feasible action rules extracted from D_k, and let [c_{k,i}, s_{k,i}] denote the confidence and support of action rule r_{k,i}, for all i ∈ I_k, k = 1, 2.

With R_k we associate the number K(R_k) = Σ{ c_{k,i} · s_{k,i} · [1 / cost_{D_k}(r_{k,i})] : i ∈ I_k }. D1 contains more hidden knowledge than D2 if K(R1) ≥ K(R2).
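A sketch of feasibility checking and the cost-weighted measure (the cost table ρ_D and the example rule below are hypothetical, introduced only to exercise the definitions):

```python
# Hypothetical cost table rho_D: cost of changing value v to value w.
rho = {("v1", "w1"): 2.0, ("v2", "w2"): 1.5, ("k1", "k2"): 5.0}

def cost(changes):
    """cost_D(r) = sum of rho_D(v_i, w_i) over the premise changes of r."""
    return sum(rho[ch] for ch in changes)

def feasible(changes, decision_change=("k1", "k2")):
    """r is feasible in D if cost_D(r) < rho_D(k1, k2)."""
    return cost(changes) < rho[decision_change]

def K(action_rules):
    """K(R_k) over feasible action rules; action_rules holds
    (confidence, support, premise-changes) triples."""
    return sum(c * s / cost(ch) for c, s, ch in action_rules if feasible(ch))

r = [("v1", "w1"), ("v2", "w2")]      # premise part of one action rule
print(feasible(r), K([(1.0, 3, r)]))  # True 0.857... (cost 3.5 < 5.0)
```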
Slide 21: Questions? Thank you. (KDD Lab)