Sovereign Information Sharing, Searching and Mining Rakesh Agrawal IBM Almaden Research Center
Thesis
Organizational boundaries are blurring in the emerging networked economy
 –Compete and co-operate simultaneously
 –Int’l value chain
Need to rethink information sharing, searching, and mining in the brave new world of virtual organizations
Sovereign Information Sharing
Separate databases due to statutory, competitive, or security reasons.
Selective, minimal sharing on a need-to-know basis.
Example: Among those who took a particular drug, how many had an adverse reaction and have a specific sequence in their DNA?
 –Researchers must not learn anything beyond counts.
Commutative Encryption: E1(E2(T)) = E2(E1(T))
Minimal Necessary Sharing
 –R = {a, u, v, x}, S = {b, u, v, y}, R ∩ S = {u, v}
 –R must not know that S has b & y
 –S must not know that R has a & x
 –Count (R ∩ S): R & S do not learn anything except that the result is 2.
SIGMOD 00
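The commutative encryption property and the intersection count example above can be sketched as follows. This is a toy illustration, not the protocol from the talk: the cipher (hash the value into a group, then exponentiate modulo a prime), the modulus `P`, and the names `hash_to_group`, `new_key`, `encrypt`, and `intersection_size` are all assumptions, and both parties run in one process here purely for readability.

```python
import hashlib
import math
import secrets

# Assumption: a toy-sized prime modulus (the Mersenne prime 2^127 - 1).
# A real deployment would need a carefully chosen, much larger group.
P = 2**127 - 1


def hash_to_group(value: str) -> int:
    """Map a value to an integer in [1, P - 1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return 1 + int.from_bytes(digest, "big") % (P - 1)


def new_key() -> int:
    """Pick a secret exponent coprime to P - 1, so encryption is one-to-one."""
    while True:
        key = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
        if math.gcd(key, P - 1) == 1:
            return key


def encrypt(element: int, key: int) -> int:
    """E_key(x) = x^key mod P; commutative since (x^a)^b = (x^b)^a mod P."""
    return pow(element, key, P)


def intersection_size(r_values, s_values) -> int:
    """Count the common values by comparing doubly encrypted values only."""
    key_r, key_s = new_key(), new_key()

    # Each party encrypts its own hashed values with its own key.
    # Sorting the ciphertexts hides the original order before the exchange.
    r_once = sorted(encrypt(hash_to_group(v), key_r) for v in r_values)
    s_once = sorted(encrypt(hash_to_group(v), key_s) for v in s_values)

    # After the exchange, each party adds its key to the other's ciphertexts.
    r_twice = {encrypt(c, key_s) for c in r_once}
    s_twice = {encrypt(c, key_r) for c in s_once}

    # Equal plaintexts give equal double encryptions, whatever the key order.
    return len(r_twice & s_twice)
```

With R = {a, u, v, x} and S = {b, u, v, y}, `intersection_size(["a", "u", "v", "x"], ["b", "u", "v", "y"])` returns 2, and neither side ever sees the other's values in the clear.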
Privacy Preserving Data Mining
[Figure: original records (50 | 40K | ..., 30 | 70K | ...) pass through a Randomizer to give randomized records (65 | 20K | ..., 25 | 60K | ...); the distributions of Age and Salary are reconstructed and fed to Data Mining Algorithms, which produce a Data Mining Model. Labels: Alice’s age, Alice’s salary, Bob’s age.]
Insight: Preserve privacy at the individual level, while still building accurate data mining models at the aggregate level.
Add random noise to individual values to protect privacy.
EM algorithm to estimate original distribution of values given randomized values + randomization function.
Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
SIGMOD 00
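The randomize-then-reconstruct idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the talk: the noise is assumed uniform on [-spread, spread], the original values are assumed to lie on a known set of bin centers, and the names `randomize` and `reconstruct` are hypothetical. The loop is a standard EM update for mixture weights when the noise density is known.

```python
import random


def randomize(values, spread, rng):
    """Add independent uniform noise in [-spread, spread] to each value."""
    return [v + rng.uniform(-spread, spread) for v in values]


def reconstruct(randomized, bin_centers, spread, iterations=50):
    """Estimate the distribution of original values over bin_centers.

    E-step: for each randomized value, compute the posterior probability
    that it came from each bin, given the current estimate.
    M-step: set each bin's probability to its average posterior.
    """
    k = len(bin_centers)
    probs = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        totals = [0.0] * k
        for w in randomized:
            # The uniform noise density is constant wherever it is nonzero,
            # so that constant cancels out of the posterior.
            weights = [
                p if abs(w - c) <= spread else 0.0
                for p, c in zip(probs, bin_centers)
            ]
            norm = sum(weights)
            if norm == 0.0:
                continue  # value is not explainable by any bin; skip it
            for j in range(k):
                totals[j] += weights[j] / norm
        total = sum(totals)
        probs = [t / total for t in totals]
    return probs
```

Only the randomized values and the noise model enter `reconstruct`; no individual original value is needed to recover the aggregate distribution.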
Finessing Schema Chaos
Use a simple regular expression extractor to get numbers
Do simple data extraction to get hints
 –Hint for unit: the word following the number.
 –Hint for attribute name: k following numbers.
Use only numbers in the queries
Treat any attribute name in the query also as hint
Reflectivity estimates accuracy
WWW 03
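The number extraction and the unit hint above can be sketched as follows. The pattern and the name `extract_numbers` are assumptions for illustration; the extractor behind the talk may differ. The attribute-name hint is left out.

```python
import re

# A number not glued to a preceding word character or dot, optionally
# followed by one alphabetic word that serves as the unit hint.
NUMBER_WITH_HINT = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)(?:\s*([A-Za-z]+))?")


def extract_numbers(text):
    """Return (number, unit_hint) pairs; unit_hint is None if no word follows."""
    return [
        (float(match.group(1)), match.group(2))
        for match in NUMBER_WITH_HINT.finditer(text)
    ]
```

The lookbehind skips digits embedded in identifiers such as "DDR3", so only free-standing numbers are extracted.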
Privacy Preserving Indexing
A public mapping function that maps a query to a set of providers P that may contain the desired document
P contains false positives
Providers return a document only if the searcher is authorized to access the document
VLDB 03
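A public mapping of this kind can be sketched as follows. This is a toy under stated assumptions: the index lists a whole group of providers whenever any member holds a term, so the public answer does not reveal which member actually has the document. The grouping scheme and the names `Provider`, `search`, and `build_index` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    # doc_id -> (set of terms in the document, set of authorized users)
    documents: dict

    def search(self, term, user):
        """Return matching documents only if the user may access them."""
        return sorted(
            doc_id
            for doc_id, (terms, allowed) in self.documents.items()
            if term in terms and user in allowed
        )


def build_index(groups):
    """Map each term to every provider in any group where the term occurs."""
    index = {}
    for group in groups:
        group_terms = set()
        for provider in group:
            for terms, _allowed in provider.documents.values():
                group_terms |= terms
        for term in group_terms:
            index.setdefault(term, set()).update(p.name for p in group)
    return index
```

A searcher looks up a term in the public index, then contacts each listed provider; the access check happens at the provider, not in the index.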
Some Interesting Topics
Current integration approaches do not scale
 –Information integration per se is not interesting
 –Static vs. dynamic plumbing
Incentive compatibility
Auditing interactions