You Are What You Say: Privacy Risks of Public Mentions
Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl
University of Minnesota
SIGIR Story: Finding “Subversives” “.. few things tell you as much about a person as the books he chooses to read.” – Tom Owad, applefritter.com
The Whole Talk in One Slide
Private dataset (containing YOU) + public dataset (containing YOU) + IR algorithms = your private data linked!
Seems bad. How can privacy be preserved?
Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion
I’m in IR: Why Do I Care?
- Identifying a user in two datasets is Information Retrieval (IR)
- The query: “given a user from one dataset, which is the corresponding user in another dataset?”
- This query becomes more likely as more and more of our data is electronically available
- The IR community should lead the discussion of how to preserve user privacy given IR technologies
movielens.org -Started ~1995 -Users rate movies ½ to 5 stars -Users get recommendations -Private: no one outside GroupLens can see user’s ratings
Anonymized Dataset -Released Ratings, some demographic data, but no identifiers -Intended for research -Public: anyone can download
movielens.org Forums -Started June Users talk about movies -Public: on the web, no login to read -Can forum users be identified in our anonymized dataset?
Research Questions
- RQ1, RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
- RQ2, ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3, SELF DEFENSE: How can users protect their own privacy?
Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal their ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk?
  - What kinds of datasets?
  - What kinds of risks?
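Vulnerable Datasets
- We talk about datasets from a sparse relation space, which:
  - Relates people to items
  - Is sparse (few relations per person out of the possible relations)
  - Has a large space of items
[Figure: person-by-item matrix with people p1, p2, p3, … as rows and items i1, i2, i3, … as columns; an X marks each relation, one per row shown]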
Example Sparse Relation Spaces
- Customer purchase data from Target
- Songs played from iTunes
- Articles edited in Wikipedia
- Books/Albums/Beers… mentioned by bloggers or on forums
- Research papers cited in a paper (or review)
- Groceries bought at Safeway
- …
We look at movie ratings and forum mentions, but there are many sparse relation spaces
Risks of re-identification
- Re-identification is matching a user in two datasets by using some linking information (e.g., name and address, or movie mentions)
- Re-identifying to an identified dataset (e.g., one with name and address, or social security number) can result in severe privacy loss
Story: Finding Medical Records (Sweeney 2002)
[Image: former Governor of Massachusetts]
The Rebus Form
[image] + [image] = Governor’s medical records!
Associated with movies: who cares?
- 1987: Bork’s video rental history leaked to the press
- 1988: Video Privacy Protection Act
- 1991: If Clarence Thomas rented porn? Uh oh.
- People are judged by their preferences
Related Work
- Anonymizing datasets: k-anonymity (Sweeney 2002)
- Privacy-preserving data mining (Verykios et al. 2004, Agrawal et al. 2000, …)
- Privacy-preserving recommender systems (Polat et al. 2003, Berkovsky et al. 2005, Ramakrishnan et al. 2001)
- Text mining of user comments and opinions (Drenner et al. 2006, Dave et al. 2003, Pang et al. 2002)
RQ1: Risks of Dataset Release
- RQ1: What are the risks to user privacy when releasing a dataset?
- Algorithms to re-identify users, and how they worked on our datasets
  - Our Datasets
  - Set Intersection Algorithm
  - TF-IDF Algorithm
  - Scoring Algorithm
Our Datasets: Ratings and Mentions
- Ratings: large; skewed, especially ratings per item
  - 140K users: max 6K ratings, average 90, median 33
  - 9K movies: max 49K ratings, average 1,403, median 207
  - 12.6M ratings
- Forum mentions: small; skewed
  - 133 forum posters
  - 1,685 different movies
  - 3,828 movie mentions
- Skew is important for re-identification (e.g., “Star Wars” vs. “Gory Gory Hallelujah”)
Re-identification Algorithms
- What is a re-identification algorithm?
- What assumptions did we use to create and improve them?
- How well did they re-identify people?
Re-identification Algorithm
[Diagram: the algorithm takes the forum mentions m1, m2, m3, … of target user t, together with the ratings dataset, and outputs a likely list of (user, score) pairs: (u1, s1), (u2, s2), (u3, s3), …]
Re-identification Algorithm
- We know target user t in the ratings data, too
- t is k-identified if t is at position k or higher on the likely list (some fiddling for ties)
- k-identification rate for an algorithm: fraction of users (of the 133 from the forums) that are k-identified
- In the paper, k = 1, 5, 10, 100. We’ll talk about 1-identification, because it’s the scariest
- Example likely list: (u1, s1), (u2, s2), (u3, s3) = t, (u4, s4), … Here t is 3-identified, and so also 4-identified, 5-identified, etc., but NOT 2-identified
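The k-identification definition can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and since the slide only says "some fiddling for ties", the pessimistic tie rule here (every user tied with the target counts as ranked ahead of it) is an assumption.

```python
from typing import Dict, List, Tuple

LikelyList = List[Tuple[str, float]]  # (user id, score) pairs

def rank_of_target(likely_list: LikelyList, target: str) -> int:
    """1-based position of the target on the likely list.

    Assumption: ties are resolved against the attacker, so every user
    whose score is at least the target's score counts toward the rank.
    """
    scores = dict(likely_list)
    if target not in scores:
        return len(likely_list) + 1  # target not listed at all
    target_score = scores[target]
    return sum(1 for _, s in likely_list if s >= target_score)

def k_identification_rate(likely_lists: Dict[str, LikelyList], k: int) -> float:
    """Fraction of target users found at position k or higher."""
    hits = sum(1 for t, ll in likely_lists.items() if rank_of_target(ll, t) <= k)
    return hits / len(likely_lists)

# The slide's example: t sits in third place.
example = [("u1", 0.9), ("u2", 0.8), ("t", 0.7), ("u4", 0.1)]
```

With this list, t is 3-identified but not 2-identified, matching the slide's example.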
Glorious Linking Assumption
- People mostly talk about things they know => people tend to have rated what they mentioned
- Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82
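One way the averaged probability could be measured, assuming each user's mentions and ratings are available as sets of movie ids (the data layout and the name `linking_rate` are illustrative, not from the paper):

```python
from typing import Dict, Set

def linking_rate(mentions: Dict[str, Set[str]], ratings: Dict[str, Set[str]]) -> float:
    """Average over forum users of P(u rated m | u mentioned m)."""
    per_user = []
    for user, mentioned in mentions.items():
        if not mentioned:
            continue  # users with no mentions carry no information
        rated = ratings.get(user, set())
        per_user.append(len(mentioned & rated) / len(mentioned))
    return sum(per_user) / len(per_user) if per_user else 0.0

# Toy data: user a rated 1 of 2 mentions, user b rated 1 of 1.
toy_mentions = {"a": {"x", "y"}, "b": {"z"}}
toy_ratings = {"a": {"x"}, "b": {"z", "w"}}
```

Averaging per user first, as the slide describes, keeps a few prolific posters from dominating the estimate.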
Algorithm Idea
[Venn diagram: all users; users who rated a popular item; users who rated a rarely rated item; users who rated both (the intersection)]
Set Intersection Algorithm
- Find users who rated EVERY movie the target user mentioned
- They all have the same likeliness score
- Ignores rating values entirely
- RESULT: 1-identification rate of 7%
- MEANING: 7% of the time there was exactly one user at the top of the likely list, and it was the target user
- Room for improvement: for a target user with many mentions, there may be no user who rated them all
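A minimal sketch of the set-intersection idea, assuming ratings are stored as a mapping from user id to the set of movies that user rated (rating values are ignored, as on the slide; the function name is hypothetical):

```python
from typing import Dict, Set

def set_intersection_candidates(target_mentions: Set[str],
                                ratings: Dict[str, Set[str]]) -> Set[str]:
    """Users who rated every movie the target mentioned (all equally likely)."""
    return {u for u, rated in ratings.items() if target_mentions <= rated}

toy_ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "B", "D"},
}
```

The target is 1-identified only when the returned set contains exactly one user and that user is the target. A single mention the target never rated empties the set, which illustrates why the strict requirement leaves room for improvement.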
Improving re-identification
- Loosen the requirement that a user rate every movie mentioned
- Score each user by similarity to the target user. Score more highly if:
  - User has rated more of the target’s mentions
  - User has rated mentions of rarely rated movies
- Intuition: rare movies give more information, e.g., “Star Wars” vs. “Gory Gory Hallelujah”
TF-IDF Algorithm
- Term Frequency / Inverse Document Frequency (TF-IDF) is a standard way to search in a sparse vector space
- Emphasizes rarely rated (or mentioned) movies
- We are NOT using TF-IDF on text. For us, a “word” is a movie and a “document” (bag of words) is a user
- Score is cosine similarity to the target user
- RESULT: 1-identification rate of 20% (compared to 7% for Set Intersection)
- Room for improvement: over-weights any mention for a ratings user who rated few movies; high-scoring users have 4 ratings and 1 mention
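A rough sketch of TF-IDF matching with movies as "words" and users as "documents". The slide does not give the exact weighting, so two choices here are assumptions: term frequency is binary (rated or not), and IDF is log(N / number of users who rated the movie).

```python
import math
from typing import Dict, List, Set, Tuple

def tfidf_likely_list(target_mentions: Set[str],
                      ratings: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank ratings users by cosine similarity to the target's mentions."""
    n_users = len(ratings)
    doc_freq: Dict[str, int] = {}
    for rated in ratings.values():
        for movie in rated:
            doc_freq[movie] = doc_freq.get(movie, 0) + 1
    # Rarely rated movies get a large weight; a movie everyone rated gets 0.
    idf = {m: math.log(n_users / df) for m, df in doc_freq.items()}

    def norm(movies: Set[str]) -> float:
        return math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in movies))

    target_norm = norm(target_mentions)
    scores = {}
    for user, rated in ratings.items():
        dot = sum(idf.get(m, 0.0) ** 2 for m in target_mentions & rated)
        denom = target_norm * norm(rated)
        scores[user] = dot / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

toy_ratings = {"u1": {"pop", "rare"}, "u2": {"pop"}, "u3": {"pop", "other"}}
ranked = tfidf_likely_list({"pop", "rare"}, toy_ratings)
```

Because cosine similarity divides by the norm of each user's rating vector, a user with very few ratings gets a small denominator, which is one way to see the over-weighting problem noted on the slide.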
Scoring Algorithm
- Emphasizes mentions of rarely-rated movies, de-emphasizes the number of ratings a user has
- Given the mentions of a target user t, score ratings users by the mentions they rated
- A user who has rated a mention is … times more likely to be the target user than one who has not
- Couple of tweaks (see paper)
Scoring Algorithm (2): Example
- Target user t mentioned A, B, C, which were rated 20, 50, and 1,000 times (out of 10,000 users)
- User u1 rated A; user u2 rated B and C
- u1 score: … * 0.05 * 0.05 = …
- u2 score: 0.05 * … * … = …
- u2 is more likely to be target t
- Rating a mention is good; rating a rare one is even better
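A sketch of a product-of-sub-scores scheme in the spirit of this example. Only the 0.05 factor for an unrated mention comes from the slide. The sub-score for a rated mention, 1 - (count - 1) / n_users, is an assumed form chosen so that rarer movies score closer to 1; it is not taken from the slide, and the paper's tweaks are omitted.

```python
from typing import Dict, Set

def scoring(target_mentions: Set[str],
            user_rated: Set[str],
            rating_counts: Dict[str, int],
            n_users: int,
            unrated_subscore: float = 0.05) -> float:
    """Multiply one sub-score per mention of the target user."""
    score = 1.0
    for movie in target_mentions:
        if movie in user_rated:
            # Assumed form: close to 1 for rare movies, lower for popular ones.
            score *= 1.0 - (rating_counts[movie] - 1) / n_users
        else:
            score *= unrated_subscore  # 0.05, as on the slide
    return score

counts = {"A": 20, "B": 50, "C": 1000}
mentions = {"A", "B", "C"}
u1 = scoring(mentions, {"A"}, counts, 10_000)
u2 = scoring(mentions, {"B", "C"}, counts, 10_000)
```

Under these assumptions u2 outscores u1, in line with the slide's conclusion, and a user who rated only the rare A outscores one who rated only the popular C.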
Scoring Algorithm (3)
- RESULT: 1-identification rate of 31% (compared to 20% for TF-IDF)
- Ignores rating values entirely!
- In the paper, we look at algorithms that use the rating value, assuming a “magic” forum post text analyzer. We’ll skip that here
- Knowing the rating helps, even if off by ±1 star (of 5)
[Chart annotations]
- Scoring: 1-identification rate of 31%
- Using ratings does better (but requires the magic forum text analyzer)
- We’ll use Scoring for the rest of the talk
[Chart annotations]
- With >= 16 mentions, we often 1-identify the user
- More mentions => better re-identification
Privacy Risks: What We Learned
- Re-identification is a privacy risk
  - Finding subversives from books, a governor’s medical records, Supreme Court nominees
- With simple assumptions, we can re-identify users
  - The Scoring algorithm is good even without rating values
  - Knowing the rating value helps
- Rare items are more identifying
- More data per user => better re-identification
- Let’s try to preserve privacy by defeating Scoring
RQ2: ALTERING THE DATASET
- How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values
  - Oops, Scoring doesn’t need values
- Generalization: group items (e.g., by genre)
  - Dataset becomes less useful
- Suppression: hide data
  - Let’s try that
Suppressing data
- We won’t modify forum data; users wouldn’t like it. Focus on ratings data
- We don’t know which movies a user will rate
- Rarely-rated items are identifying
- IDEA: Release a ratings dataset suppressing all “rarely-rated” items
  - Rarely-rated: items rated fewer than N times
  - Investigate for different values of N
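The suppression idea can be sketched as a filter over (user, item, rating) triples. The tuple layout and the name `suppress_rare_items` are assumptions for illustration; `min_count` plays the role of N.

```python
from collections import Counter
from typing import List, Tuple

Rating = Tuple[str, str, float]  # (user id, item id, rating value)

def suppress_rare_items(ratings: List[Rating], min_count: int) -> List[Rating]:
    """Drop every rating of an item that was rated fewer than min_count times."""
    counts = Counter(item for _, item, _ in ratings)
    return [r for r in ratings if counts[r[1]] >= min_count]

toy = [
    ("u1", "popular", 4.0), ("u2", "popular", 3.5), ("u3", "popular", 5.0),
    ("u1", "rare", 2.0),
]
released = suppress_rare_items(toy, min_count=2)
```

Sweeping `min_count` and re-running a re-identification algorithm on each released dataset would show how the identification rate trades off against the amount of data kept.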
[Chart annotations]
- Must drop 88% of items to protect current users against 1-identification
- 88% of items => 28% of ratings
RQ3: SELF DEFENSE
- RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per-user
- User can change ratings or mentions; we focus on mentions
- User can perturb, generalize, or suppress; as before, we study suppression
Suppressing data (user-level)
- From the previous result, if users chose not to mention any rarely-rated movies, they would be severely restricted (to the 22% most popular movies)
- What if a user chooses to drop certain mentions? (Perhaps via a Forum Advisor interface.)
- IDEA: Each user suppresses some of their own mentions, starting with rarely rated movies
- Users are probably unwilling to suppress many mentions; they want to talk about movies!
- Maybe if they knew how much privacy they were losing, they would suppress more
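A sketch of per-user suppression: drop a given fraction of a user's mentions, rarest movies first. The function name and the rounding rule (round the number of dropped mentions down) are assumptions, not details from the talk.

```python
from typing import Dict, Set

def suppress_rarest_mentions(mentions: Set[str],
                             rating_counts: Dict[str, int],
                             fraction: float) -> Set[str]:
    """Return the mentions left after dropping the rarest `fraction` of them."""
    # Small epsilon guards against floating-point error in len * fraction.
    n_drop = int(len(mentions) * fraction + 1e-9)
    # Rarest first; ties are broken by movie id so the result is deterministic.
    by_rarity = sorted(mentions, key=lambda m: (rating_counts.get(m, 0), m))
    return set(by_rarity[n_drop:])

counts = {"A": 20, "B": 50, "C": 1000, "D": 5, "E": 300}
kept = suppress_rarest_mentions({"A", "B", "C", "D", "E"}, counts, 0.2)
```

A hypothetical Forum Advisor could run something like this before a post is published and show the user which mentions are the most identifying.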
[Chart annotations]
- Suppressing 20% of mentions reduced 1-identification somewhat, but did not eliminate it
- Suppressing more than 20% is not reasonable for a user
Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm
- Create a misdirection list of items. Each user takes an unrated item from the list and mentions it. Repeat until not identified
- What are good misdirection lists?
  - Remember: rarely-rated items are identifying
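The misdirection loop can be sketched as follows. The `is_identified` callback stands in for running a re-identification algorithm (such as Scoring) against the user's current mentions; it and the function name are hypothetical placeholders.

```python
from typing import Callable, List, Set, Tuple

def misdirect(mentions: Set[str],
              rated: Set[str],
              misdirection_list: List[str],
              is_identified: Callable[[Set[str]], bool]) -> Tuple[Set[str], List[str]]:
    """Add unrated items from the list, in order, until no longer identified."""
    current = set(mentions)
    added: List[str] = []
    for item in misdirection_list:
        if not is_identified(current):
            break  # privacy goal reached; stop adding decoys
        if item in rated or item in current:
            continue  # only items the user did NOT rate can misdirect
        current.add(item)
        added.append(item)
    return current, added

# Toy stand-in for an attacker: "identified" while fewer than 4 mentions.
toy_attacker = lambda ms: len(ms) < 4
final, decoys = misdirect({"A", "B"}, {"A", "B", "P1"},
                          ["P1", "P2", "P3", "P4"], toy_attacker)
```

The order of `misdirection_list` encodes the strategy being compared: rarely rated items first, or popular items first.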
[Chart annotations]
- Rarely-rated items don’t misdirect!
- Popular items do better, though 1-identification isn’t zero
- Better to misdirect to a large crowd
- Rarely-rated items are identifying, popular items are misdirecting
Conclusion: What Have We Learned?
- REAL RISK
  - Re-identification can lead to loss of privacy
  - We found substantial risk of re-identification in our sparse relation space
  - There are a lot of sparse relation spaces
  - We’re probably in more and more of them that are available electronically
- HARD TO PRESERVE PRIVACY
  - Dataset owner had to suppress a lot of their dataset to protect privacy
  - Users had to suppress a lot to protect privacy
  - Users could misdirect somewhat with popular items
AOL
- Data wants to be free: government subpoena, research, commerce
- People do not know the risks
- AOL was text, this is items
- # searched for “dog that urinates on everything.”
Future Work
- We looked at one pair of datasets. Look at others!
- Model re-identification in sparse relation spaces with mathematical rigor
- Investigate more algorithms (re-identification and privacy protection)
- Arms race between re-identifiers and privacy protectors
Thanks for listening! Questions?
This work is supported by NSF grants IIS and IIS