1 Agenda
1. What is (Web) data mining? And what does it have to do with privacy? – a simple view –
2. Examples of data mining and "privacy-preserving data mining":
   - Association-rule mining (& privacy-preserving AR mining)
   - Collaborative filtering (& privacy-preserving collaborative filtering)
3. A second look at ... privacy
4. A second look at ... Web / data mining
5. The goal: More than modelling and hiding – Towards a comprehensive view of Web mining and privacy. Threats, opportunities and solution approaches.
6. An outlook: Data mining for privacy
2 Privacy Problems: Example 1
Technical background of the problem:
- The dataset allows for Web mining (e.g., which search queries lead to which site choices).
- It violates k-anonymity (e.g., "Lilburn" gives a likely k = number of inhabitants of Lilburn).
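To make the k-anonymity remark concrete, here is a minimal sketch of how k could be measured for a released table. The function name `k_anonymity`, the field names, and the toy records are all hypothetical and are not taken from the dataset discussed on the slide.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifier attributes (0 for an empty table)."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0

# Hypothetical toy records: a rare place name isolates a single record.
toy_records = [
    {"town": "Bigcity", "age_band": "20-29"},
    {"town": "Bigcity", "age_band": "20-29"},
    {"town": "Bigcity", "age_band": "20-29"},
    {"town": "Smalltown", "age_band": "60-69"},
]
k = k_anonymity(toy_records, ["town", "age_band"])
```

A table is k-anonymous only if every combination of quasi-identifier values is shared by at least k records, so a single rare value is enough to pull k down to 1.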
3 Privacy Problems: Example 2
Where do people live who will buy the Koran soon?
Technical background of the problem: A mashup of different data sources
- Amazon wishlists
- Yahoo! People (addresses)
- Google Maps
each with insufficient k-anonymity, allows for attribute matching and thereby inferences
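The attribute matching behind such a mashup can be sketched as a plain join on shared attributes. This is only an illustration under stated assumptions: the function `link_records`, the field names, and the records are hypothetical, and none of the services named on the slide is accessed.

```python
def link_records(left, right, keys):
    """Join two lists of dicts on the given key attributes.

    Each matching pair is merged into one record, which is how two
    separately harmless sources can combine into a revealing one."""
    index = {}
    for record in right:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)
    linked = []
    for record in left:
        for match in index.get(tuple(record[k] for k in keys), []):
            linked.append({**record, **match})
    return linked

# Hypothetical toy data standing in for a wishlist source and a directory.
wishlists = [{"name": "A. Example", "city": "Springfield", "item": "book X"}]
directory = [
    {"name": "A. Example", "city": "Springfield", "address": "1 Hypothetical St"},
    {"name": "B. Sample", "city": "Springfield", "address": "2 Hypothetical St"},
]
linked = link_records(wishlists, directory, ["name", "city"])
```

If the key attributes identify a person uniquely in both sources, the join attaches an address to a wishlist entry, which is the inference the slide warns about.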
4 Privacy Problems: Example 3
Predicting political affiliation from Facebook profile and link data (1): Most Conservative Traits
Trait Name | Trait Value | Weight Conservative
Group | george w bush is my homeboy |
Group | college republicans |
Group | texas conservatives |
Group | bears for bush |
Group | kerry is a fairy |
Group | aggie republicans |
Group | keep facebook clean |
Group | i voted for bush |
Group | protect marriage one man one woman |
Lindamood et al. 09 & Heatherly et al. 09
5 Predicting political affiliation from Facebook profile and link data (2): Most Liberal Traits per Trait Name
Trait Name | Trait Value | Weight Liberal
activities | amnesty international |
Employer | hot topic |
favorite tv shows | queer as folk |
grad school | computer science |
hometown | mumbai |
Relationship Status | in an open relationship |
religious views | agnostic |
looking for | whatever i can get |
Lindamood et al. 09 & Heatherly et al. 09
6 "Privacy-preserving Web mining" example: find patterns, unlink personal data
Volvo S40 website targets people in their 20s
- Are visitors in their 20s or 40s?
- Which demographic groups like/dislike the website?
- An example of the "Randomization Approach" to PPDM: R. Agrawal and R. Srikant, "Privacy Preserving Data Mining", SIGMOD 2000.
7 Randomization Approach Overview
[Figure: original records (50 | 40K | ..., 30 | 70K | ...) -> Randomizer -> randomized records (65 | 20K | ..., 25 | 60K | ...) -> Reconstruct distribution of Age, Reconstruct distribution of Salary -> Data Mining Algorithms -> Model]
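The pipeline on this slide can be sketched in a few lines: each value is perturbed before it leaves the user, and the miner recovers only the overall distribution. This is a simplified sketch in the spirit of the randomization approach, not the authors' implementation. Uniform noise, the two decade bins, the function names `randomize` and `reconstruct`, and the synthetic ages are all assumptions made for illustration.

```python
import random

def randomize(values, spread, rng):
    """Perturb each value with independent uniform noise in [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]

def reconstruct(randomized, bin_centers, spread, iterations=50):
    """Estimate the share of original values in each bin from perturbed data.

    Iterative Bayesian update: start from a uniform guess, compute for each
    perturbed value the posterior over bins given the known noise density,
    and average the posteriors to obtain the next estimate."""
    def noise_density(delta):
        return 1.0 / (2 * spread) if abs(delta) <= spread else 0.0

    k = len(bin_centers)
    estimate = [1.0 / k] * k
    for _ in range(iterations):
        sums = [0.0] * k
        used = 0
        for w in randomized:
            post = [noise_density(w - c) * p for c, p in zip(bin_centers, estimate)]
            total = sum(post)
            if total == 0.0:
                continue  # value not explainable by any bin centre; skip it
            used += 1
            for j in range(k):
                sums[j] += post[j] / total
        if used == 0:
            break
        estimate = [s / used for s in sums]
    return estimate

# Synthetic visitors (assumption): 70% in their 20s, 30% in their 40s.
rng = random.Random(0)
ages = [rng.randint(20, 29) for _ in range(1400)] + [rng.randint(40, 49) for _ in range(600)]
noisy_ages = randomize(ages, 20.0, rng)
shares = reconstruct(noisy_ages, [25.0, 45.0], 20.0)  # [share of 20s, share of 40s]
```

No individual perturbed age is trustworthy, yet the estimated shares should land close to the true mix. That is the point of the approach: the pattern survives, the link to the person does not.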
8 Seems to work well!
9 What is collaborative filtering? "People like what people like them like" – regardless of support and confidence
10 User-based Collaborative Filtering
- Idea: People who agreed in the past are likely to agree again
- To predict a user's opinion for an item, use the opinion of similar users
- Similarity between users is decided by looking at their overlap in opinions for other items
- Next step: build a model of user types: a "global model" rather than "local patterns" as mining result
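The bullets above can be sketched as follows. This is a minimal illustration, not a reference implementation: cosine similarity over co-rated items and a similarity-weighted average are just one common choice, and the function names and toy ratings are hypothetical.

```python
from math import sqrt

def similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(ratings, user, item):
    """Predict a rating as the similarity-weighted average of the ratings
    that other users gave the item; None if no similar user rated it."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        s = similarity(ratings[user], other_ratings)
        if s <= 0.0:
            continue
        weighted_sum += s * other_ratings[item]
        weight_total += s
    return weighted_sum / weight_total if weight_total else None

# Hypothetical toy ratings on a 1-5 scale.
toy_ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
prediction = predict(toy_ratings, "ann", "c")
```

Here "ann" agrees with "bob" on every co-rated item, so the prediction for item "c" is pulled towards bob's rating rather than cat's.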
11 1. Privacy as confidentiality: "the right to be let alone" – and to hide data
Is this all there is to privacy?
12 2. Privacy as control: informational self-determination
[Figure: data, labelled "Don't do THIS!"]
- e.g. data privacy: "the right of the individual to decide what information about himself should be communicated to others and under what circumstances" (Westin, 1970)
- behind much of data-protection legislation (see Eleni Kosta's talk)
13 Discussion item: What is this an example of? Tracing anonymous edits in Wikipedia
14 [Method: Attribute matching]
15 Results (an example)