
1 You Are What You Say: Privacy Risks of Public Mentions
Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl (University of Minnesota)

2 Story: Finding "Subversives"
"... few things tell you as much about a person as the books he chooses to read." – Tom Owad, applefritter.com

3 The Whole Talk in One Slide
Private dataset (you) + public dataset (you) + IR algorithms = your private data linked! Seems bad. How can privacy be preserved?

4 Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

5 I'm in IR: Why Do I Care?
- Identifying a user in two datasets is an information retrieval (IR) task. The query: "given a user from one dataset, which is the corresponding user in another dataset?"
- This query becomes increasingly likely as more of our data is electronically available
- The IR community should lead the discussion of how to preserve user privacy given IR technologies

6 movielens.org
- Started ~1995
- Users rate movies from ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see a user's ratings

7 Anonymized Dataset
- Released 2003
- Ratings and some demographic data, but no identifiers
- Intended for research
- Public: anyone can download it

8 movielens.org Forums
- Started June 2005
- Users talk about movies
- Public: on the web, no login required to read
- Can forum users be identified in our anonymized dataset?

9 Research Questions
- RQ1: RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
- RQ2: ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3: SELF DEFENSE: How can users protect their own privacy?

10 Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal their ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk?
- What kinds of datasets? What kinds of risks?

11 Vulnerable Datasets
- We talk about datasets from a sparse relation space
- A sparse relation space relates people to items
- It is sparse: each person has few relations out of all possible relations
- It has a large space of items
- (Slide figure: a sparse person-by-item matrix with a few X marks)

12 Example Sparse Relation Spaces
- Customer purchase data from Target
- Songs played in iTunes
- Articles edited in Wikipedia
- Books/albums/beers... mentioned by bloggers or on forums
- Research papers cited in a paper (or review)
- Groceries bought at Safeway
- ...
- We look at movie ratings and forum mentions, but there are many sparse relation spaces

13 Risks of Re-identification
- Re-identification is matching a user in two datasets using some linking information (e.g., name and address, or movie mentions)
- Re-identifying a user to an identified dataset (e.g., one with names and addresses, or social security numbers) can result in severe privacy loss

14 Story: Finding Medical Records (Sweeney 2002)
(Slide photo: the former Governor of Massachusetts)

15 The Rebus Form
(Slide rebus) Anonymized medical data + public voter registration list = the Governor's medical records!

16 Associated with Movies: Who Cares?
- 1987: Bork's video rental history is leaked to the press
- 1988: Video Privacy Protection Act
- 1991: What if Clarence Thomas had rented porn? Uh oh.
- People are judged by their preferences

17 Related Work
- Anonymizing datasets: k-anonymity (Sweeney 2002)
- Privacy-preserving data mining (Verykios et al. 2004, Agrawal et al. 2000, ...)
- Privacy-preserving recommender systems (Polat et al. 2003, Berkovsky et al. 2005, Ramakrishnan et al. 2001)
- Text mining of user comments and opinions (Drenner et al. 2006, Dave et al. 2003, Pang et al. 2002)

18 Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

19 RQ1: Risks of Dataset Release
- RQ1: What are the risks to user privacy when releasing a dataset?
- Algorithms to re-identify users, and how they performed on our datasets:
- Our datasets
- Set Intersection algorithm
- TF-IDF algorithm
- Scoring algorithm

20 Our Datasets: Ratings and Mentions
- Ratings data: large and skewed, especially the per-item rating counts
  - 140K users: max 6K ratings, average 90, median 33
  - 9K movies: max 49K ratings, average 1,403, median 207
  - 12.6M ratings total
- Forum mentions: small and skewed
  - 133 forum posters
  - 1,685 different movies
  - 3,828 movie mentions
- Skew is important for re-identification
- (Slide images: posters for "Star Wars" and "Gory Gory Hallelujah")

21 Re-identification Algorithms
- What is a re-identification algorithm?
- What assumptions did we use to create and improve them?
- How well did they re-identify people?

22 Re-identification Algorithm
(Slide diagram) A target user t from the forums mentions movies m1, m2, m3, ... The algorithm compares these mentions against the ratings data and outputs a "likely list" of ratings users with scores: (u1, s1), (u2, s2), (u3, s3), ...

23 Re-identification Algorithm (2)
- We also know the target user t in the ratings data
- t is k-identified if it appears at position k or higher on the likely list (with some fiddling for ties)
- k-identification rate for an algorithm: the fraction of users (of the 133 forum users) who are k-identified (see the sketch below)
- In the paper, k = 1, 5, 10, 100. Here we talk about 1-identification, because it is the scariest
- Example likely list: (u1, s1), (u2, s2), (u3, s3) where u3 = t, (u4, s4), ... Here t is 3-identified, and also 4-identified, 5-identified, etc., but NOT 2-identified
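
To make the k-identification rate concrete, here is a minimal Python sketch. It assumes a re-identification `algorithm` callable that maps a list of mentions to a scored likely list, and hypothetical `mentions`/`ratings_id` fields for each forum user; these names are illustrative, not from the paper, and the paper's tie handling is omitted.

```python
def k_identified(likely_list, target, k):
    """likely_list: (user_id, score) pairs sorted by descending score.
    The target is k-identified if it appears at rank k or better.
    (The paper's fiddling for ties is not reproduced here.)"""
    return target in [user for user, _ in likely_list[:k]]

def k_identification_rate(algorithm, forum_users, k=1):
    """Fraction of forum users the algorithm k-identifies.
    forum_users: list of dicts with hypothetical keys 'mentions' and
    'ratings_id' (the user's known identity in the ratings data)."""
    hits = sum(
        k_identified(algorithm(u["mentions"]), u["ratings_id"], k)
        for u in forum_users
    )
    return hits / len(forum_users)
```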

24 Glorious Linking Assumption
- People mostly talk about things they know, so people tend to have rated what they mentioned
- Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82 (one way to compute this is sketched below)
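
One way this average might be computed is sketched below; whether the paper averages per-user fractions (as here) or pools all mentions is an assumption. It also assumes the forum-to-ratings account mapping is known, as it was for the study's own users.

```python
def linking_probability(mentions_by_user, ratings_by_user):
    """Average over forum users of the fraction of their mentioned
    movies that they also rated."""
    fractions = []
    for user, mentioned in mentions_by_user.items():
        mentioned = set(mentioned)
        if not mentioned:
            continue  # skip users with no mentions
        rated = ratings_by_user.get(user, set())
        fractions.append(len(mentioned & rated) / len(mentioned))
    return sum(fractions) / len(fractions)
```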

25 Algorithm Idea
(Slide diagram) Within the set of all users, the users who rated a popular item form a large subset, the users who rated a rarely rated item form a small subset, and the users who rated both form a still smaller intersection.

26 Set Intersection Algorithm
- Find the users who rated EVERY movie the target user mentioned (see the sketch below)
- They all get the same likeliness score
- Ignores rating values entirely
- RESULT: 1-identification rate of 7%
- MEANING: 7% of the time there was exactly one user at the top of the likely list, and it was the target user
- Room for improvement: for a target user with many mentions, often no candidate user is possible
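
A minimal sketch of set intersection, assuming `ratings_by_user` maps each ratings user to the set of movies they rated:

```python
def set_intersection_likely_list(mentions, ratings_by_user):
    """Return the users who rated every mentioned movie, all tied with
    the same score; rating values are ignored."""
    mentioned = set(mentions)
    return [(user, 1.0) for user, rated in ratings_by_user.items()
            if mentioned <= rated]
```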

27 Improving Re-identification
- Loosen the requirement that a user rate every movie mentioned
- Score each user by similarity to the target user; score more highly if:
- The user has rated more of the target's mentions
- The user has rated mentions of rarely rated movies
- Intuition: rare movies give more information (e.g., "Star Wars" vs. "Gory Gory Hallelujah")

28 TF-IDF Algorithm
- Term Frequency / Inverse Document Frequency (TF-IDF) is a standard way to search a sparse vector space
- Emphasizes rarely rated (or mentioned) movies
- We are NOT using TF-IDF on text: for us, a "word" is a movie and a "document" (bag of words) is a user
- Score is cosine similarity to the target user (see the sketch below)
- RESULT: 1-identification rate of 20% (compared to 7% for Set Intersection)
- Room for improvement: it over-weights any matching mention for a ratings user who rated few movies; some high-scoring users have only 4 ratings and 1 matching mention
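
A minimal sketch of the movies-as-words variant of TF-IDF. The binary term weighting and the exact IDF form here are assumptions; the paper's weighting may differ in detail.

```python
import math

def tfidf_likely_list(mentions, ratings_by_user):
    """Treat each movie as a 'word' and each ratings user as a 'document'.
    Weight movies by IDF (rarely rated movies count more) and rank users
    by cosine similarity to the target's mention vector."""
    n_users = len(ratings_by_user)
    # document frequency of each movie = number of users who rated it
    df = {}
    for rated in ratings_by_user.values():
        for movie in rated:
            df[movie] = df.get(movie, 0) + 1
    idf = {m: math.log(n_users / c) for m, c in df.items()}

    target_vec = {m: idf.get(m, 0.0) for m in mentions}
    target_norm = math.sqrt(sum(w * w for w in target_vec.values())) or 1.0

    scores = []
    for user, rated in ratings_by_user.items():
        dot = sum(target_vec.get(m, 0.0) * idf.get(m, 0.0) for m in rated)
        norm = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in rated)) or 1.0
        scores.append((user, dot / (norm * target_norm)))
    return sorted(scores, key=lambda x: -x[1])
```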

29 Scoring Algorithm
- Emphasizes mentions of rarely rated movies, de-emphasizes the number of ratings a user has
- Given the mentions of a target user t, score ratings users by the mentions they rated
- A user who has rated a mention is 10-20 times more likely to be the target user than one who has not
- A couple of tweaks (see paper)

30 Scoring Algorithm (2)
- Example: target user t mentioned movies A, B, C, rated 20, 50, and 1,000 times respectively (out of 10,000 users)
- User u1 rated A; user u2 rated B and C
- u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
- u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
- u2 is more likely to be target t
- Rating a mention is good; rating a rarely rated mention is even better
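
A sketch of the multiplicative structure this example illustrates: one factor per mentioned movie, near 1 if the candidate rated it (larger for rarely rated movies) and a small flat penalty if not. The specific factors and the `unrated_factor` constant below are illustrative placeholders, not the paper's exact formula, which includes further tweaks.

```python
def scoring_likely_list(mentions, ratings_by_user, rating_counts, n_users,
                        unrated_factor=0.05):
    """Rank ratings users by a product of per-mention factors.
    rating_counts[m] = number of users who rated movie m."""
    scored = []
    for user, rated in ratings_by_user.items():
        score = 1.0
        for movie in mentions:
            if movie in rated:
                # rarer movie => factor closer to 1 (stronger evidence)
                score *= 1.0 - rating_counts[movie] / n_users
            else:
                score *= unrated_factor  # flat penalty for an unrated mention
        scored.append((user, score))
    return sorted(scored, key=lambda x: -x[1])
```

With the slide's counts (A rated 20, B rated 50, C rated 1,000 times out of 10,000 users), this sketch gives roughly 0.998 * 0.05 * 0.05 ≈ 0.0025 for u1 and 0.05 * 0.995 * 0.9 ≈ 0.045 for u2: close to, but not exactly, the slide's numbers, because the factors here are simplified.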

31 Scoring Algorithm (3)
- RESULT: 1-identification rate of 31% (compared to 20% for TF-IDF)
- Ignores rating values entirely!
- In the paper, we also look at algorithms that use rating values, assuming a "magic" forum post text analyzer; we skip that here
- Knowing the rating value helps, even if it is off by ±1 star (out of 5)

32 (Slide chart comparing algorithms)
- Scoring: 1-identification rate of 31%
- Using rating values does better, but requires the magic forum text analyzer
- We'll use Scoring for the rest of the talk

33 (Slide chart)
- With 16 or more mentions, we often 1-identify the user
- More mentions => better re-identification

34 Privacy Risks: What We Learned
- Re-identification is a privacy risk: finding subversives from books, the governor's medical records, Supreme Court nominees
- With simple assumptions, we can re-identify users
- The Scoring algorithm does well even without rating values
- Knowing the rating value helps
- Rare items are more identifying
- More data per user => better re-identification
- Let's try to preserve privacy by defeating Scoring

35 Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

36 RQ2: Altering the Dataset
- How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values. Oops, Scoring doesn't use rating values
- Generalization: group items (e.g., by genre). The dataset becomes less useful
- Suppression: hide data. Let's try that

37 Suppressing Data
- We won't modify the forum data (users wouldn't like it), so we focus on the ratings data
- We don't know in advance which movies a user will rate
- Rarely rated items are identifying
- IDEA: release a ratings dataset that suppresses all "rarely rated" items (see the sketch below)
- Rarely rated: items rated fewer than N times
- Investigate for different values of N
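
A minimal sketch of this suppression step, assuming the ratings are (user, item, value) tuples; the threshold parameter corresponds to the slide's N.

```python
def suppress_rare_items(ratings, min_count):
    """Drop every rating of an item that was rated fewer than
    min_count times in the dataset."""
    counts = {}
    for _, item, _ in ratings:
        counts[item] = counts.get(item, 0) + 1
    return [r for r in ratings if counts[r[1]] >= min_count]
```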

38 (Slide chart)
- We had to drop 88% of items to protect current users against 1-identification
- Dropping 88% of items removes 28% of the ratings

39 RQ3: Self Defense
- RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per-user
- A user can change ratings or mentions; we focus on mentions
- A user can perturb, generalize, or suppress; as before, we study suppression

40 Suppressing Data (User-Level)
- From the previous result, if users chose not to mention any rarely rated movies, they would be severely restricted (to the 22% most popular movies)
- What if a user chooses to drop certain mentions? (Perhaps via a "Forum Advisor" interface.)
- IDEA: each user suppresses some of their own mentions, starting with the most rarely rated movies (see the sketch below)
- Users are probably unwilling to suppress many mentions: they want to talk about movies!
- Maybe if they knew how much privacy they were losing, they would suppress more
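
A minimal sketch of per-user suppression under the assumption that mentions are dropped rarest-first; the `fraction` parameter and the exact ordering rule are illustrative, not taken from the paper.

```python
def suppress_own_mentions(mentions, rating_counts, fraction):
    """A user drops a fraction of their own mentions, starting with the
    most rarely rated (most identifying) movies."""
    ordered = sorted(mentions, key=lambda m: rating_counts.get(m, 0))
    n_drop = int(len(mentions) * fraction)
    dropped = set(ordered[:n_drop])
    return [m for m in mentions if m not in dropped]
```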

41 (Slide chart)
- Suppressing 20% of mentions reduced 1-identification somewhat, but did not eliminate it
- Suppressing more than 20% is not reasonable for a user

42 Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm
- Create a misdirection list of items; each user takes an unrated item from the list and mentions it, repeating until they are not identified (see the sketch below)
- What makes a good misdirection list?
- Remember: rarely rated items are identifying
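
A minimal sketch of the misdirection loop. The `is_identified` predicate stands in for a re-identification check (e.g., wrapping the Scoring algorithm); its name and the stopping rule are assumptions for illustration.

```python
def misdirect(user_rated, user_mentions, misdirection_list, is_identified):
    """Add unrated items from the misdirection list to the user's mentions
    until the re-identification check no longer identifies them, or the
    list runs out."""
    mentions = list(user_mentions)
    for item in misdirection_list:
        if not is_identified(mentions):
            break  # stop as soon as the user is no longer identified
        if item not in user_rated and item not in mentions:
            mentions.append(item)
    return mentions
```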

43 (Slide chart)
- Rarely rated items don't misdirect! Popular items do better, though 1-identification isn't zero
- It is better to misdirect toward a large crowd
- Rarely rated items are identifying; popular items are misdirecting

44 Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

45 Conclusion: What Have We Learned?
- REAL RISK
- Re-identification can lead to loss of privacy
- We found substantial risk of re-identification in our sparse relation space
- There are a lot of sparse relation spaces, and we are probably in more and more of them as data becomes electronically available
- HARD TO PRESERVE PRIVACY
- The dataset owner had to suppress a lot of the dataset to protect privacy
- Users had to suppress a lot to protect privacy
- Users could misdirect somewhat with popular items

46 AOL
- Data wants to be free: government subpoenas, research, commerce
- People do not know the risks
- AOL's release was text queries; ours is items
- AOL user #4417749 searched for "dog that urinates on everything"

47 Future Work
- We looked at one pair of datasets; look at others!
- Model re-identification in sparse relation spaces with mathematical rigor
- Investigate more algorithms (both re-identification and privacy protection)
- An arms race between re-identifiers and privacy protectors

48 Thanks for Listening!
- Questions?
- This work is supported by NSF grants IIS 03-24851 and IIS 05-34420

