You Are What You Say: Privacy Risks of Public Mentions Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl University of Minnesota.

Story: Finding "Subversives" ".. few things tell you as much about a person as the books he chooses to read." – Tom Owad,

The Whole Talk in One Slide
Private dataset (YOU) + public dataset (YOU) + IR algorithms = your private data linked!
Seems bad. How can privacy be preserved?

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

I'm in IR. Why Do I Care?
- Identifying a user in two datasets is Information Retrieval (IR). The query: "given a user from one dataset, which is the corresponding user in another dataset?"
- This query becomes more likely as more and more of our data is available electronically
- The IR community should lead the discussion of how to preserve user privacy given IR technologies

MovieLens
- Started ~1995
- Users rate movies ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see a user's ratings

Anonymized Dataset
- Released 2003
- Ratings, some demographic data, but no identifiers
- Intended for research
- Public: anyone can download

Forums
- Started June 2005
- Users talk about movies
- Public: on the web, no login to read
- Can forum users be identified in our anonymized dataset?

Research Questions
- RQ1: RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
- RQ2: ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3: SELF DEFENSE: How can users protect their own privacy?

Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk?
  - What kinds of datasets?
  - What kinds of risks?

Vulnerable Datasets
- We talk about datasets from a sparse relation space
  - Relates people to items
  - Is sparse (few relations per person out of the possible relations)
  - Has a large space of items
[Figure: person × item matrix with rows p1, p2, p3, … and columns i1, i2, i3, …; an X marks each relation]
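A sparse relation space like the one described above can be held as a mapping from each person to the set of items they relate to. The following is a minimal sketch; the toy data, the item-space size of 1,000, and the `density` helper are all hypothetical and not from the talk.

```python
from typing import Dict, Set


def density(relation: Dict[str, Set[str]], n_items: int) -> float:
    """Fraction of all possible (person, item) pairs that are present."""
    n_pairs = sum(len(items) for items in relation.values())
    return n_pairs / (len(relation) * n_items)


# Hypothetical toy relation: three people, one item each,
# drawn from an assumed space of 1,000 possible items.
toy = {"p1": {"i1"}, "p2": {"i2"}, "p3": {"i3"}}
toy_density = density(toy, n_items=1000)
```

A relation is sparse when this fraction is small and the item space is large. For the ratings dataset described later in the talk (12.6M ratings, 140K users, 9K movies), the same ratio works out to about 1%.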

Example Sparse Relation Spaces
- Examples
  - Customer purchase data from Target
  - Songs played from iTunes
  - Articles edited in Wikipedia
  - Books/Albums/Beers… mentioned by bloggers or on forums
  - Research papers cited in a paper (or review)
  - Groceries bought at Safeway
  - …
- We look at movie ratings and forum mentions, but there are many sparse relation spaces

Risks of re-identification
- Re-identification is matching a user in two datasets by using some linking information (e.g., name and address, or movie mentions)
- Re-identifying to an identified dataset (e.g., one with name and address, or social security number) can result in severe privacy loss

Story: Finding Medical Records (Sweeney 2002)
Former Governor of Massachusetts

The Rebus Form
[image] + [image] = Governor's medical records!

Associated with movies: who cares?
- 1987: Bork's video rental history leaked to the press
- 1988: Video Privacy Protection Act
- 1991: If Clarence Thomas rented porn? Uh oh.
- People are judged by their preferences

Related Work
- Anonymizing datasets: k-anonymity
  - Sweeney 2002
- Privacy-preserving data mining
  - Verykios et al. 2004, Agrawal et al. 2000, …
- Privacy-preserving recommender systems
  - Polat et al. 2003, Berkovsky et al. 2005, Ramakrishnan et al. 2001
- Text mining of user comments and opinions
  - Drenner et al. 2006, Dave et al. 2003, Pang et al. 2002

RQ1: Risks of Dataset Release
- RQ1: What are the risks to user privacy when releasing a dataset?
- Algorithms to re-identify users and how they worked on our datasets
  - Our Datasets
  - Set Intersection Algorithm
  - TF-IDF Algorithm
  - Scoring Algorithm

Our Datasets: Ratings and Mentions
- Ratings
  - Large
  - Skewed, especially ratings per item
  - 140K users: max 6K ratings, average 90, median 33
  - 9K movies: max 49K ratings, average 1,403, median 207
  - 12.6M ratings
- Forum mentions
  - Small
  - Skewed
  - 133 forum posters
  - 1,685 different movies
  - 3,828 movie mentions
- Skew is important for re-identification
[Figure labels: "Star Wars", "Gory Gory Hallelujah"]

Re-identification Algorithms
- What is a re-identification algorithm?
- What assumptions did we use to create and improve them?
- How well did they re-identify people?

Re-identification Algorithm
[Diagram: the algorithm takes the forum mentions m1, m2, m3, … of target user t and the ratings dataset as input, and outputs a likely list of (user, score) pairs: (u1, s1), (u2, s2), (u3, s3), …]

Re-identification Algorithm
- We know target user t in the ratings data, too
- t is k-identified if t is at position k or higher (that is, within the top k) on the likely list. (Some fiddling for ties.)
- k-identification rate for an algorithm: the fraction of users that are k-identified (out of the 133 from the forums)
- In the paper, k = 1, 5, 10, 100. We'll talk about 1-identification, because it's the scariest.
- Example likely list:
  - u1, s1
  - u2, s2
  - u3, s3 (t)
  - u4, s4
  - …
- Above, t is 3-identified, and also 4-identified, 5-identified, etc., but NOT 2-identified
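The k-identification definition above can be sketched in a few lines. The function names are hypothetical, and the tie rule is an assumption: the talk only says there is "some fiddling for ties", so this sketch pessimistically counts every user tied with the target as ranking at or ahead of the target.

```python
from typing import Dict, List, Optional, Tuple

LikelyList = List[Tuple[str, float]]  # (user, score) pairs


def target_rank(likely_list: LikelyList, target: str) -> Optional[int]:
    """1-based rank of the target; ties count against the target (assumption)."""
    scores = dict(likely_list)
    if target not in scores:
        return None
    t_score = scores[target]
    return sum(1 for s in scores.values() if s >= t_score)


def is_k_identified(likely_list: LikelyList, target: str, k: int) -> bool:
    rank = target_rank(likely_list, target)
    return rank is not None and rank <= k


def k_identification_rate(results: Dict[str, LikelyList], k: int) -> float:
    """Fraction of targets that are k-identified by their own likely list."""
    hits = sum(is_k_identified(ll, t, k) for t, ll in results.items())
    return hits / len(results)


# Mirrors the example list: t sits in third position.
likely = [("u1", 0.9), ("u2", 0.7), ("t", 0.5), ("u4", 0.2)]
```

With this list, `is_k_identified(likely, "t", 3)` holds and `is_k_identified(likely, "t", 2)` does not, matching the example.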

Glorious Linking Assumption
- People mostly talk about things they know => people tend to have rated what they mentioned
- Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82
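One way to measure the quantity above is to compute, for each forum user, the fraction of their mentioned movies that they also rated, and then average over users. This is a sketch under that reading; the function name and toy data are hypothetical.

```python
from typing import Dict, Set


def linking_probability(mentions: Dict[str, Set[str]],
                        ratings: Dict[str, Set[str]]) -> float:
    """Average over users of P(u rated m | u mentioned m)."""
    per_user = []
    for user, mentioned in mentions.items():
        if not mentioned:
            continue  # users with no mentions contribute nothing
        rated = ratings.get(user, set())
        per_user.append(len(mentioned & rated) / len(mentioned))
    return sum(per_user) / len(per_user)


# Hypothetical toy data: "a" rated 1 of 2 mentions, "b" rated 1 of 1.
toy_mentions = {"a": {"m1", "m2"}, "b": {"m3"}}
toy_ratings = {"a": {"m1"}, "b": {"m3", "m4"}}
toy_p = linking_probability(toy_mentions, toy_ratings)
```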

Algorithm Idea
[Venn diagram: within all users, the set of users who rated a popular item overlaps the set of users who rated a rarely rated item; the overlap is the users who rated both]

Set Intersection Algorithm
- Find users who rated EVERY movie the target user mentioned
- They all have the same likeliness score
- Ignore rating value entirely
- RESULT: 1-identification rate: 7%
- MEANING: 7% of the time there was one user at the top of the likely list, and it was the target user
- Room for improvement
  - For a target user with many mentions, no candidate may remain
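The set intersection step can be sketched as a subset test per ratings user. The names and toy data below are hypothetical; rating values are ignored, as on the slide.

```python
from typing import Dict, List, Set


def set_intersection_candidates(target_mentions: Set[str],
                                ratings: Dict[str, Set[str]]) -> List[str]:
    """Users who rated every movie the target mentioned (all equally likely)."""
    return sorted(u for u, rated in ratings.items() if target_mentions <= rated)


def is_1_identified(target: str, candidates: List[str]) -> bool:
    """1-identified only if the target is the single remaining candidate."""
    return candidates == [target]


# Hypothetical toy ratings data.
toy_ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "B", "D"},
}
```

Here mentioning only A and B leaves three tied candidates, while mentioning A, B, and C leaves only u1. A mention that no one rated empties the candidate list, which is the weakness noted above.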

Improving re-identification
- Loosen the requirement that a user rate every movie mentioned
- Score each user by similarity to the target user. Score more highly if
  - User has rated more mentions of the target
  - User has rated mentions of rarely rated movies
- Intuition: rare movies give more information, e.g., "Star Wars" vs. "Gory Gory Hallelujah"

TF-IDF Algorithm
- The Term Frequency (TF) Inverse Document Frequency (IDF) algorithm is a standard way to search in a sparse vector space
- Emphasizes rarely rated (or mentioned) movies
- We are NOT using TF-IDF for text
- For us: a "word" is a movie, a "document" (bag of words) is a user
- Score is cosine similarity to the target user
- RESULT: 1-identification rate of 20% (compared to 7% from Set Intersection)
- Room for improvement
  - Over-weights any mention for a ratings user who rated few movies: high-scoring users have 4 ratings and 1 mention
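A rough sketch of the movie-as-word, user-as-document idea follows. The talk does not give the exact weighting, so this assumes binary term frequency (a user either rated a movie or not) and IDF = log(N / number of raters); the function names and toy data are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def idf_weights(ratings: Dict[str, Set[str]]) -> Dict[str, float]:
    """IDF per movie: log(number of users / number of users who rated it)."""
    n_users = len(ratings)
    df: Dict[str, int] = {}
    for rated in ratings.values():
        for movie in rated:
            df[movie] = df.get(movie, 0) + 1
    return {movie: math.log(n_users / count) for movie, count in df.items()}


def tfidf_scores(target_mentions: Set[str],
                 ratings: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Likely list: ratings users sorted by cosine similarity to the mentions."""
    idf = idf_weights(ratings)
    t_vec = {m: idf[m] for m in target_mentions if m in idf}
    t_norm = math.sqrt(sum(w * w for w in t_vec.values()))
    scores = []
    for user, rated in ratings.items():
        u_norm = math.sqrt(sum(idf[m] ** 2 for m in rated))
        dot = sum(w * idf[m] for m, w in t_vec.items() if m in rated)
        denom = t_norm * u_norm
        scores.append((user, dot / denom if denom else 0.0))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: everyone rated "pop", only u1 rated "rare".
toy_ratings = {"u1": {"pop", "rare"}, "u2": {"pop"}, "u3": {"pop", "other"}}
toy_likely = tfidf_scores({"pop", "rare"}, toy_ratings)
```

Because every user rated "pop", its IDF weight is zero and the rare movie alone decides the ranking. The division by the user's vector norm is also what lets a user with very few ratings score highly from a single matched mention.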

Scoring Algorithm
- Emphasizes mentions of rarely-rated movies, de-emphasizes the number of ratings a user has
- Given the mentions of a target user t, score ratings users by the mentions they rated
- A user who has rated a mention is 10-20 times more likely to be the target user than one who has not
- Couple of tweaks (see paper)

Scoring Algorithm (2)
- Example
  - Target user t mentioned A, B, C, rated 20, 50, 1000 times (from 10,000 users)
  - User u1 rated A, user u2 rated B, C
  - u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
  - u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
  - u2 is more likely to be target t
- Rating a mention is good, rating a rare one is even better
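The arithmetic in this example can be reproduced with a short sketch. The per-mention factor used here, 1 - (raters - 1) / users for a rated mention and 0.05 otherwise, is an assumption inferred from the factors 0.9981 and 0.9001 in the example; the function names are hypothetical. Under that assumed formula the factor for B comes out as 0.9951 rather than the 0.9501 shown, so u2 scores about 0.045 instead of 0.043; the ranking is unchanged.

```python
from typing import Dict, Set


def subscore(rated: bool, n_raters: int, n_users: int, miss: float = 0.05) -> float:
    """Per-mention factor (assumed form): near 1 for a rare rated mention."""
    return 1 - (n_raters - 1) / n_users if rated else miss


def score(user_rated: Set[str], mention_counts: Dict[str, int], n_users: int) -> float:
    """Product of per-mention factors over the target's mentions."""
    result = 1.0
    for movie, n_raters in mention_counts.items():
        result *= subscore(movie in user_rated, n_raters, n_users)
    return result


# Numbers from the example: A, B, C rated 20, 50, 1000 times by 10,000 users.
counts = {"A": 20, "B": 50, "C": 1000}
u1_score = score({"A"}, counts, 10_000)       # u1 rated only A
u2_score = score({"B", "C"}, counts, 10_000)  # u2 rated B and C
```

With a miss factor of 0.05, a rated mention is worth up to 20 times an unrated one, which lines up with the "10-20 times more likely" figure given earlier.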

Scoring Algorithm (3)
- RESULT: 1-identification rate of 31% (compared to 20% for TF-IDF)
- Ignores rating values entirely!
- In the paper, we look at algorithms that use rating value, assuming a "magic" forum post text analyzer. We'll skip that here.
- Knowing the rating helps, even if off by ±1 star (of 5)

[Chart: 1-identification rate by algorithm]
- Scoring: 1-identification rate of 31%
- Using ratings does better (but requires the magic forum text analyzer)
- We'll use Scoring for the rest of the talk

[Chart: 1-identification rate by number of mentions]
- With >=16 mentions, we often 1-identify
- More mentions => better re-identification

Privacy Risks: What We Learned
- Re-identification is a privacy risk
  - Finding subversives from books, governor's medical records, Supreme Court nominees
- With simple assumptions, we can re-identify users
  - Scoring algorithm is good even without rating values
  - Knowing rating value helps
  - Rare items are more identifying
  - More data per user => better re-identification
- Let's try to preserve privacy by defeating Scoring

RQ2: ALTERING THE DATASET
- How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values
  - Oops, Scoring doesn't need values
- Generalization: group items (e.g., genre)
  - Dataset becomes less useful
- Suppression: hide data
  - Let's try that

Suppressing data
- We won't modify forum data: users wouldn't like it. Focus on ratings data
- We don't know which movies a user will rate
- Rarely-rated items are identifying
- IDEA: Release a ratings dataset suppressing all "rarely-rated" items
  - Rarely-rated: items rated fewer than N times
  - Investigate for different values of N
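Item-level suppression as described above can be sketched as a filter on rating counts. The function name, the (user, item) pair representation, and the toy data are hypothetical; the sketch also reports how much of the dataset the threshold removes.

```python
from collections import Counter
from typing import List, Tuple

Rating = Tuple[str, str]  # (user, item); rating values are not needed here


def suppress_rare_items(ratings: List[Rating],
                        min_count: int) -> Tuple[List[Rating], float, float]:
    """Drop items rated fewer than min_count times.

    Returns the kept ratings, the fraction of items dropped, and the
    fraction of ratings dropped.
    """
    counts = Counter(item for _, item in ratings)
    kept = [(user, item) for user, item in ratings if counts[item] >= min_count]
    items_dropped = sum(1 for c in counts.values() if c < min_count) / len(counts)
    ratings_dropped = 1 - len(kept) / len(ratings)
    return kept, items_dropped, ratings_dropped


# Hypothetical toy data: A is rated 3 times, B and C once each.
toy = [("u1", "A"), ("u2", "A"), ("u3", "A"), ("u1", "B"), ("u2", "C")]
kept, items_dropped, ratings_dropped = suppress_rare_items(toy, min_count=2)
```

Because ratings are skewed toward popular items, the fraction of items dropped is typically much larger than the fraction of ratings dropped.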

[Chart: 1-identification rate vs. fraction of items suppressed]
- Must drop 88% of items to protect current users against 1-identification
- 88% of items => 28% of ratings

RQ3: SELF DEFENSE
- RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per-user
- User can change ratings or mentions. We focus on mentions
- User can perturb, generalize, or suppress. As before, we study suppression

Suppressing data (user-level)
- From the previous result, if users chose not to mention any rarely-rated movies, they would be severely restricted (to the 22% most popular movies)
- What if a user chooses to drop certain mentions? (Perhaps via a Forum Advisor interface.)
- IDEA: Each user suppresses some of their own mentions, starting with rarely rated movies
- Users are probably unwilling to suppress many mentions: they want to talk about movies!
- Maybe if they knew how much privacy they were losing, they would suppress more
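The per-user idea above, dropping a user's own mentions starting with the most rarely rated, can be sketched as follows. The function name, the rounding-down of the number of mentions to drop, and the toy counts are all assumptions for illustration.

```python
import math
from typing import Dict, Set


def suppress_mentions(mentions: Set[str],
                      rating_counts: Dict[str, int],
                      fraction: float) -> Set[str]:
    """Drop the given fraction of mentions, rarest-rated movies first."""
    n_drop = math.floor(len(mentions) * fraction)
    # Sort by how often each movie was rated; break ties by title.
    by_rarity = sorted(mentions, key=lambda m: (rating_counts.get(m, 0), m))
    return set(by_rarity[n_drop:])


# Hypothetical toy data: A is the most rarely rated mention.
toy_counts = {"A": 5, "B": 5000, "C": 50, "D": 900, "E": 20000}
remaining = suppress_mentions({"A", "B", "C", "D", "E"}, toy_counts, 0.2)
```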

[Chart: 1-identification rate vs. fraction of mentions suppressed]
- Suppressing 20% of mentions reduced 1-identification somewhat, but did not eliminate it
- Suppressing >20% is not reasonable for a user

Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm
- Create a misdirection list of items. Each user takes an unrated item from the list and mentions it. Repeat until not identified.
- What are good misdirection lists?
  - Remember: rarely-rated items are identifying
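The misdirection loop above can be sketched as follows. The `is_identified` callable is a hypothetical stand-in for running a re-identification algorithm on the user's current mentions; the toy version used here is purely for illustration and is not a real attack.

```python
from typing import Callable, List, Set


def misdirect(mentions: Set[str],
              rated: Set[str],
              misdirection_list: List[str],
              is_identified: Callable[[Set[str]], bool]) -> Set[str]:
    """Add unrated items from the list until the user is no longer identified."""
    current = set(mentions)
    for item in misdirection_list:
        if not is_identified(current):
            break  # already safe, stop adding misleading mentions
        if item not in rated and item not in current:
            current.add(item)
    return current


# Toy stand-in (assumption): "identified" while fewer than 4 mentions exist.
toy_is_identified = lambda m: len(m) < 4
result = misdirect({"A", "B"}, {"A", "B", "P1"},
                   ["P1", "P2", "P3", "P4"], toy_is_identified)
```

If the list runs out before `is_identified` returns false, the user stays identified, so the choice of list matters.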

[Chart: 1-identification rate under different misdirection lists]
- Rarely-rated items don't misdirect!
- Popular items do better, though 1-identification isn't zero
- Better to misdirect to a large crowd
- Rarely-rated items are identifying, popular items are misdirecting

Conclusion: What Have We Learned?
- REAL RISK
  - Re-identification can lead to loss of privacy
  - We found substantial risk of re-identification in our sparse relation space
  - There are a lot of sparse relation spaces
  - We're probably in more and more of them as they become available electronically
- HARD TO PRESERVE PRIVACY
  - Dataset owner had to suppress a lot of their dataset to protect privacy
  - Users had to suppress a lot to protect privacy
  - Users could misdirect somewhat with popular items

AOL
- Data wants to be free
- Government subpoena, research, commerce
- People do not know the risks
- AOL was text, this is items
- #4417749 searched for "dog that urinates on everything."

Future Work
- We looked at one pair of datasets. Look at others!
- Model re-identification in sparse relation spaces with mathematical rigor
- Investigate more algorithms (re-identification and privacy protection)
- Arms race between re-identifiers and privacy protectors

Thanks for listening!
- Questions?
- This work is supported by NSF grants IIS 03-24851 and IIS 05-34420

