You Are What You Say: Privacy Risks of Public Mentions
Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl
University of Minnesota

Story: Finding "Subversives"
"... few things tell you as much about a person as the books he chooses to read." – Tom Owad, applefritter.com

The Whole Talk in One Slide
A private dataset (with YOU in it) + a public dataset (with YOU in it) + IR algorithms = your private data linked!
Seems bad. How can privacy be preserved?

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

I'm in IR. Why Do I Care?
- Identifying a user across two datasets is an Information Retrieval (IR) problem. The query: "given a user from one dataset, which is the corresponding user in the other dataset?"
- This query becomes more likely as more of our data is available electronically
- The IR community should lead the discussion of how to preserve user privacy given IR technologies

movielens.org
- Started ~1995
- Users rate movies from ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see a user's ratings

Anonymized Dataset
- Released 2003
- Ratings and some demographic data, but no identifiers
- Intended for research
- Public: anyone can download it

movielens.org Forums
- Started June 2005
- Users talk about movies
- Public: on the web, no login needed to read
- Can forum users be identified in our anonymized dataset?

Research Questions
- RQ1, RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
- RQ2, ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3, SELF DEFENSE: How can users protect their own privacy?

Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal their ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk?
- What kinds of datasets? What kinds of risks?

Vulnerable Datasets
We talk about datasets from a sparse relation space, which:
- Relates people to items
- Is sparse (each person relates to few of the many possible items)
- Has a large space of items
[Table sketch: people p1, p2, p3, ... in rows, items i1, i2, i3, ... in columns; a scattered X marks each relation.]
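
To make the structure concrete, here is a minimal sketch (the people, items, and variable names are illustrative, not from the paper) of a sparse relation space stored as a person-to-items mapping, which records only the relations that exist:

```python
# A sparse relation space: a huge space of possible (person, item) pairs,
# of which only a few are realized. Storing person -> set-of-items keeps
# only the relations that exist.
sparse_relations = {
    "p1": {"i2"},
    "p2": {"i1", "i3"},
    "p3": {"i3"},
}

num_people = len(sparse_relations)
num_relations = sum(len(items) for items in sparse_relations.values())
print(f"{num_people} people, {num_relations} relations")  # far fewer than people x items
```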

Example Sparse Relation Spaces
- Customer purchase data from Target
- Songs played on iTunes
- Articles edited in Wikipedia
- Books/albums/beers... mentioned by bloggers or on forums
- Research papers cited in a paper (or review)
- Groceries bought at Safeway
- ...
We look at movie ratings and forum mentions, but there are many sparse relation spaces.

Risks of Re-identification
- Re-identification is matching a user across two datasets using some linking information (e.g., name and address, or movie mentions)
- Re-identifying a user to an identified dataset (e.g., one with names and addresses, or social security numbers) can result in severe privacy loss

Story: Finding Medical Records (Sweeney 2002)
[Photo: the former Governor of Massachusetts]

The Rebus Form
[Rebus: anonymized medical data + voter registration list = the Governor's medical records!]

Associated with Movies: Who Cares?
- 1987: Bork's video rental history leaked to the press
- 1988: Video Privacy Protection Act
- 1991: What if Clarence Thomas had rented porn? Uh oh.
- People are judged by their preferences

Related Work
- Anonymizing datasets: k-anonymity (Sweeney 2002)
- Privacy-preserving data mining (Verykios et al. 2004; Agrawal et al. 2000; ...)
- Privacy-preserving recommender systems (Polat et al. 2003; Berkovsky et al. 2005; Ramakrishnan et al. 2001)
- Text mining of user comments and opinions (Drenner et al. 2006; Dave et al. 2003; Pang et al. 2002)

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

RQ1: Risks of Dataset Release
RQ1: What are the risks to user privacy when releasing a dataset? We present algorithms to re-identify users and show how they worked on our datasets:
- Our datasets
- Set Intersection algorithm
- TF-IDF algorithm
- Scoring algorithm

Our Datasets: Ratings and Mentions
Ratings (large; skewed, especially ratings per item):
- 140K users: max 6K ratings, average 90, median 33
- 9K movies: max 49K ratings, average 1,403, median 207
- 12.6M ratings total
Forum mentions (small; skewed):
- 133 forum posters
- 1,685 different movies
- 3,828 movie mentions
Skew is important for re-identification (compare "Star Wars" to "Gory Gory Hallelujah").

Re-identification Algorithms
- What is a re-identification algorithm?
- What assumptions did we use to create and improve them?
- How well did they re-identify people?

Re-identification Algorithm
[Diagram: from the forum data, a target user t with mentions m1, m2, m3, ...; together with the ratings data, the algorithm produces a "likely list" of scored ratings users (u1, s1), (u2, s2), (u3, s3), ...]

Re-identification Algorithm (continued)
- We know target user t in the ratings data, too
- t is k-identified if it appears at position k or higher on the likely list (with some fiddling for ties)
- The k-identification rate for an algorithm is the fraction of users that are k-identified (out of the 133 forum posters)
- The paper reports k = 1, 5, 10, 100; we'll talk about 1-identification, because it's the scariest
Example: in the likely list (u1, s1), (u2, s2), (u3, s3) = t, (u4, s4), ..., t is 3-identified (and 4-identified, 5-identified, etc.), but NOT 2-identified.
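
A minimal sketch of k-identification as defined above, assuming each attack run yields a likely list of (user, score) pairs sorted by score; the paper's tie handling is simplified away here:

```python
def is_k_identified(likely_list, target, k):
    """True if the target appears at position k or higher (1-indexed)."""
    return target in [user for user, _score in likely_list[:k]]

def k_identification_rate(likely_lists, k):
    """Fraction of target users who are k-identified.
    likely_lists maps each target user to that user's likely list."""
    hits = sum(is_k_identified(ll, t, k) for t, ll in likely_lists.items())
    return hits / len(likely_lists)

# The example above: t sits at position 3 on the likely list.
ll = [("u1", 0.9), ("u2", 0.8), ("t", 0.7), ("u4", 0.6)]
print(is_k_identified(ll, "t", 2))  # False: not 2-identified
print(is_k_identified(ll, "t", 3))  # True: 3-identified (and 4-, 5-, ...)
```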

Glorious Linking Assumption
People mostly talk about things they know, so people tend to have rated what they mentioned. We measured P(u rated m | u mentioned m), averaged over all forum users: 0.82.
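
A sketch of how that probability could be measured, assuming per-user sets of mentioned and rated movies (the dictionary names are illustrative):

```python
def linking_probability(mentioned_by, rated_by):
    """Estimate P(u rated m | u mentioned m), averaged over forum users."""
    fractions = []
    for user, mentioned in mentioned_by.items():
        if mentioned:
            rated = rated_by.get(user, set())
            fractions.append(len(mentioned & rated) / len(mentioned))
    return sum(fractions) / len(fractions)  # ~0.82 on our datasets
```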

Algorithm Idea
[Venn diagram: among all users, those who rated a popular item form a large set, those who rated a rarely rated item form a small set, and the users who rated both form a smaller set still.]

Set Intersection Algorithm
- Find the users who rated EVERY movie the target user mentioned
- They all get the same likeliness score
- Rating values are ignored entirely
- RESULT: 1-identification rate of 7%
- MEANING: 7% of the time there was a single user at the top of the likely list, and it was the target user
- Room for improvement: for a target user with many mentions, often no candidate user remains
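
A minimal sketch of Set Intersection, assuming an inverted index raters_of that maps each movie to the set of users who rated it (the names are illustrative):

```python
def set_intersection_candidates(mentions, raters_of):
    """Users who rated EVERY movie the target mentioned.
    All of them share the same likeliness score; rating values are ignored."""
    candidates = None
    for movie in mentions:
        raters = raters_of.get(movie, set())
        candidates = set(raters) if candidates is None else candidates & raters
        if not candidates:  # many mentions often leave no candidate at all
            return set()
    return candidates or set()
```

The empty-result case is exactly the weakness noted above: each additional mention can only shrink the candidate set.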

Improving Re-identification
- Loosen the requirement that a user rate every movie mentioned
- Score each user by similarity to the target user. Score a user more highly if:
  - the user has rated more of the target's mentions
  - the user has rated mentions of rarely rated movies
- Intuition: rare movies give more information (e.g., "Star Wars" vs. "Gory Gory Hallelujah")

TF-IDF Algorithm
- The Term Frequency / Inverse Document Frequency (TF-IDF) algorithm is a standard way to search a sparse vector space; it emphasizes rarely rated (or rarely mentioned) movies
- We are NOT using TF-IDF on text: for us, a "word" is a movie and a "document" (bag of words) is a user
- The score is cosine similarity to the target user
- RESULT: 1-identification rate of 20% (compared to 7% for Set Intersection)
- Room for improvement: it over-weights any matched mention for a ratings user with few ratings; a high-scoring user may have only 4 ratings, 1 of them a mention
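
A sketch of the movies-as-words variant described above: each user is a binary vector over movies weighted by IDF, and candidates are ranked by cosine similarity to the target's mentions. The exact TF weighting and smoothing in the paper may differ:

```python
import math

def idf(movie, raters_of, num_users):
    # Rarely rated movies get high weight; the +1 avoids division by zero.
    return math.log(num_users / (1 + len(raters_of.get(movie, set()))))

def tfidf_cosine(mentions, rated, raters_of, num_users):
    """Cosine similarity between a target's mention vector and a
    candidate's rating vector, both binary and IDF-weighted."""
    dot = sum(idf(m, raters_of, num_users) ** 2 for m in mentions & rated)
    norm_m = math.sqrt(sum(idf(m, raters_of, num_users) ** 2 for m in mentions))
    norm_r = math.sqrt(sum(idf(m, raters_of, num_users) ** 2 for m in rated))
    return dot / (norm_m * norm_r) if norm_m and norm_r else 0.0
```

The norm_r term shows the weakness noted on the slide: a candidate with very few ratings has a tiny norm, so a single matched mention can dominate their score.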

Scoring Algorithm
- Emphasizes mentions of rarely-rated movies and de-emphasizes the number of ratings a user has
- Given the mentions of a target user t, score ratings users by which mentions they rated
- A user who has rated a mentioned movie is 10-20 times more likely to be the target user than one who has not
- Plus a couple of tweaks (see the paper)

Scoring Algorithm (2): Example
- Target user t mentioned movies A, B, and C, rated 20, 50, and 1,000 times respectively (out of 10,000 users)
- User u1 rated A; user u2 rated B and C
- u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
- u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
- So u2 is more likely to be target t
- Rating a mention is good; rating a rare mention is even better
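
The worked example can be checked directly. The per-movie factors below are copied from the slide; the paper derives them from the measured linking probability, which this sketch does not reproduce:

```python
# Factor applied when a candidate HAS rated the mentioned movie (rarer => higher).
RATED_FACTOR = {"A": 0.9981, "B": 0.9501, "C": 0.9001}
UNRATED_FACTOR = 0.05  # the slide's factor for an unrated mention

def score(candidate_rated):
    s = 1.0
    for movie, factor in RATED_FACTOR.items():
        s *= factor if movie in candidate_rated else UNRATED_FACTOR
    return s

print(round(score({"A"}), 4))       # u1: 0.0025
print(round(score({"B", "C"}), 3))  # u2: 0.043, more likely to be t
```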

Scoring Algorithm (3)
- RESULT: 1-identification rate of 31% (compared to 20% for TF-IDF), while ignoring rating values entirely!
- In the paper, we also look at algorithms that use rating values, assuming a "magic" forum-post text analyzer; we'll skip that here
- Knowing the rating value helps, even if it is off by ±1 star (out of 5)

[Chart: 1-identification rates by algorithm. Scoring reaches 31%; using rating values does better, but requires the magic forum text analyzer.]
We'll use Scoring for the rest of the talk.

[Chart: 1-identification rate vs. number of mentions. With >= 16 mentions, we often 1-identify.]
More mentions => better re-identification.

Privacy Risks: What We Learned
- Re-identification is a privacy risk: finding subversives from books, the Governor's medical records, Supreme Court nominees
- With simple assumptions, we can re-identify users
- The Scoring algorithm is good even without rating values; knowing the rating value helps
- Rare items are more identifying
- More data per user => better re-identification
Let's try to preserve privacy by defeating Scoring.

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

RQ2: Altering the Dataset
RQ2: How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values. Oops: Scoring doesn't need values
- Generalization: group items (e.g., by genre). The dataset becomes less useful
- Suppression: hide data. Let's try that

Suppressing Data
- We won't modify the forum data; users wouldn't like it. Focus on the ratings data
- We don't know which movies a user will rate, but rarely-rated items are identifying
- IDEA: release a ratings dataset that suppresses all "rarely-rated" items, i.e., items rated fewer than N times
- Investigate different values of N
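
A minimal sketch of the owner-side suppression, assuming the ratings dataset is a list of (user, item, value) triples (the representation is illustrative):

```python
from collections import Counter

def suppress_rare_items(ratings, n):
    """Release only the ratings of items rated at least n times."""
    counts = Counter(item for _user, item, _value in ratings)
    return [(u, i, v) for (u, i, v) in ratings if counts[i] >= n]
```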

[Chart: 1-identification rate vs. suppression threshold N.]
We must drop 88% of items to protect current users against 1-identification; dropping 88% of items removes 28% of the ratings.

RQ3: Self Defense
RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per-user
- A user can change ratings or mentions; we focus on mentions
- A user can perturb, generalize, or suppress; as before, we study suppression

Suppressing Data (User-Level)
- From the previous result, if users chose not to mention any rarely-rated movies, they would be severely restricted (to the 22% most popular movies)
- What if a user chooses to drop certain mentions? (Perhaps via a "Forum Advisor" interface.)
- IDEA: each user suppresses some of their own mentions, starting with the most rarely rated movies (see the sketch below)
- Users are probably unwilling to suppress many mentions; they want to talk about movies!
- Maybe if they knew how much privacy they were losing, they would suppress more
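
A hedged sketch of that idea: drop the rarest mentions first, stopping once the attack fails. The is_1_identified predicate stands in for re-running the Scoring attack against the released ratings; it and the stopping rule are assumptions of this sketch, not an API from the paper:

```python
def suppress_own_mentions(mentions, times_rated, is_1_identified):
    """Drop a user's most rarely rated mentions until they are no
    longer 1-identified (or nothing is left to drop)."""
    kept = sorted(mentions, key=lambda m: times_rated[m], reverse=True)
    while kept and is_1_identified(set(kept)):
        kept.pop()  # discard the most rarely rated remaining mention
    return set(kept)
```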

[Chart: 1-identification rate vs. fraction of mentions suppressed.]
Suppressing 20% of mentions lowered 1-identification somewhat, but did not eliminate it; suppressing more than 20% is not reasonable for a user.

Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm
- Create a misdirection list of items. Each user takes an unrated item from the list and mentions it, repeating until they are not identified
- What makes a good misdirection list? Remember: rarely-rated items are identifying
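
A sketch of the misdirection loop, under the same assumed is_1_identified predicate as before; the misdirection list might be, say, the most popular movies:

```python
def misdirect(mentions, rated, misdirection_list, is_1_identified):
    """Add mentions of items the user did NOT rate, drawn in order from
    the misdirection list, until the attack no longer 1-identifies them."""
    padded = set(mentions)
    for item in misdirection_list:
        if not is_1_identified(padded):
            break  # no longer identified; stop padding
        if item not in rated and item not in padded:
            padded.add(item)
    return padded
```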

[Chart: 1-identification rate vs. number of misdirecting mentions, by misdirection list.]
Rarely-rated items don't misdirect! Popular items do better, though 1-identification doesn't reach zero. It is better to misdirect into a large crowd: rarely-rated items are identifying, popular items are misdirecting.

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

Conclusion: What Have We Learned?
REAL RISK
- Re-identification can lead to loss of privacy
- We found substantial risk of re-identification in our sparse relation space
- There are many sparse relation spaces, and we are probably in more and more of them as data becomes electronically available
HARD TO PRESERVE PRIVACY
- The dataset owner had to suppress much of the dataset to protect privacy
- Users had to suppress many mentions to protect their own privacy
- Users could misdirect somewhat with popular items

AOL
- Data wants to be free: government subpoenas, research, commerce
- People do not know the risks
- AOL's release was search text; ours is items
- User #4417749 searched for "dog that urinates on everything."

Future Work
- We looked at one pair of datasets. Look at others!
- Model re-identification in sparse relation spaces with mathematical rigor
- Investigate more algorithms, both for re-identification and for privacy protection
- Expect an arms race between re-identifiers and privacy protectors

Thanks for listening! Questions?
This work is supported by NSF grants IIS 03-24851 and IIS 05-34420.