1
Do You Trust Your Recommender? An Exploration of Privacy and Trust in Recommender Systems
Dan Frankowski, Dan Cosley, Shilad Sen, Tony Lam, Loren Terveen, John Riedl
University of Minnesota
2
Story: Finding "Subversives"
"… few things tell you as much about a person as the books he chooses to read." – Tom Owad, applefritter.com
3
Session Outline
- Exposure: undesired access to a person's information
  - Privacy risks
  - Preserving privacy
- Bias and Sabotage: manipulating a trusted system to manipulate users of that system
4
Why Do I Care?
As a businessperson:
- The nearest competitor is one click away
- Lose your customers' trust, and they will leave
- Lose your credibility, and they will ignore you
As a person:
- Let's not build Big Brother
5
Risk of Exposure in One Slide
Private dataset (with you in it) + public dataset (with you in it) + linking algorithms = your private data linked to you!
Seems bad. How can privacy be preserved?
6
movielens.org
- Started ~1995
- Users rate movies ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see users' ratings
7
Anonymized Dataset
- Released 2003
- Ratings, some demographic data, but no identifiers
- Intended for research
- Public: anyone can download
8
movielens.org Forums
- Started June 2005
- Users talk about movies
- Public: on the web, no login required to read
- Can forum users be identified in our anonymized dataset?
9
Research Questions
RQ1: RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
RQ2: ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
RQ3: SELF DEFENSE: How can users protect their own privacy?
10
Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal their ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk? What kinds of datasets? What kinds of risks?
11
Vulnerable Datasets
We talk about datasets from a sparse relation space, which:
- Relates people to items
- Is sparse (few relations per person out of the many possible)
- Has a large space of items

        i1  i2  i3  …
    p1  X
    p2      X
    p3          X
    …
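To make the idea concrete, here is a minimal Python sketch of a sparse relation space represented as a mapping from people to the few items each relates to. The people, items, and density figure are invented for illustration, not drawn from the MovieLens data.

```python
# Illustrative sparse relation space: each person relates to only a few items
# out of a (potentially huge) item space.
relations = {
    "p1": {"i1"},
    "p2": {"i2"},
    "p3": {"i3"},
}

all_items = {"i1", "i2", "i3"}  # in practice, thousands of movies
n_possible = len(relations) * len(all_items)
n_present = sum(len(items) for items in relations.values())
print(f"density = {n_present / n_possible:.2f}")  # 0.33 here; far lower in real data
```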
12
Example Sparse Relation Spaces
- Customer purchase data from Target
- Songs played from iTunes
- Articles edited in Wikipedia
- Books/Albums/Beers… mentioned by bloggers or on forums
- Research papers cited in a paper (or review)
- Groceries bought at Safeway
- …
We look at movie ratings and forum mentions, but there are many sparse relation spaces.
13
Risks of Re-identification
- Re-identification is matching a user in two datasets by using some linking information (e.g., name and address, or movie mentions)
- Re-identifying to an identified dataset (e.g., one with names and addresses, or social security numbers) can result in severe privacy loss
14
Story: Finding the Medical Records of a Former Governor of Massachusetts (Sweeney 2002)
87% of people in the 1990 U.S. census are identifiable by ZIP code, birthdate, and gender alone!
15
The Rebus Form
Anonymized medical data + a public voter registration list = the Governor's medical records!
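As an illustration of the linking step behind this rebus, here is a hedged Python sketch that joins an "anonymized" medical table with an identified voter list on the shared quasi-identifier (ZIP, birthdate, gender). All records, names, and field names below are invented.

```python
# Sweeney-style re-identification: join two datasets on a quasi-identifier.
anonymized_medical = [
    {"zip": "02138", "birthdate": "1945-07-31", "gender": "M", "diagnosis": "(sensitive)"},
]
voter_list = [
    {"name": "The Governor", "zip": "02138", "birthdate": "1945-07-31", "gender": "M"},
]

QUASI_ID = ("zip", "birthdate", "gender")

def link(identified, deidentified, keys=QUASI_ID):
    """Return (identified, deidentified) record pairs that agree on every linking key."""
    index = {tuple(r[k] for k in keys): r for r in deidentified}
    return [(r, index[tuple(r[k] for k in keys)])
            for r in identified if tuple(r[k] for k in keys) in index]

for voter, medical in link(voter_list, anonymized_medical):
    print(voter["name"], "->", medical["diagnosis"])
```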
16
Related Work
- Anonymizing datasets: k-anonymity (Sweeney 2002)
- Privacy-preserving data mining (Verykios et al. 2004, Agrawal et al. 2000, …)
- Privacy-preserving recommender systems (Polat et al. 2003, Berkovsky et al. 2005, Ramakrishnan et al. 2001)
- Text mining of user comments and opinions (Drenner et al. 2006, Dave et al. 2003, Pang et al. 2002)
17
RQ1: Risks of Dataset Release
RQ1: What are the risks to user privacy when releasing a dataset?
RESULT: 31% of forum users could be 1-identified (uniquely matched to their ratings in the anonymized dataset)
- Ignores rating values entirely!
- Could do even better if text analysis produces rating values
- Rarely-rated items were more identifying
18
Glorious Linking Assumption
People mostly talk about things they know => people tend to have rated what they mentioned.
Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82
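One way this measurement could be carried out is sketched below in Python on invented toy data; the function name and data are assumptions, and only the 0.82 figure comes from the slide.

```python
# For each forum user, the fraction of their mentioned movies that they also
# rated; then average over users.
mentions = {"u1": {"m1", "m2"}, "u2": {"m2", "m3", "m4"}}
ratings = {"u1": {"m1", "m2", "m5"}, "u2": {"m2", "m3"}}

def mean_p_rated_given_mentioned(mentions, ratings):
    per_user = [
        len(mentioned & ratings.get(user, set())) / len(mentioned)
        for user, mentioned in mentions.items() if mentioned
    ]
    return sum(per_user) / len(per_user)

print(round(mean_p_rated_given_mentioned(mentions, ratings), 2))
# u1: 2/2, u2: 2/3 -> mean ~0.83; the slide reports 0.82 on the real data
```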
19
Algorithm Idea
Start from all users. Users who rated a popular item form a large group; users who rated a rarely rated item form a small group; users who rated both form a still smaller intersection. Each mentioned item narrows the candidate set.
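Here is a minimal Python sketch of that narrowing intuition on invented data. Note that the published algorithm scores candidate users rather than taking strict intersections; this only illustrates the idea on the slide.

```python
# item -> set of (anonymized) user ids who rated it; data invented
raters = {
    "popular_movie":   {"a", "b", "c", "d", "e"},
    "rare_movie":      {"c", "d"},
    "very_rare_movie": {"c"},
}
all_users = set().union(*raters.values())

def candidates(mentioned_items, raters, all_users):
    """Shrink the candidate set to users who rated every mentioned item."""
    cands = set(all_users)
    for item in mentioned_items:
        cands &= raters.get(item, set())
    return cands

print(candidates(["popular_movie", "rare_movie"], raters, all_users))       # {'c', 'd'}
print(candidates(["popular_movie", "very_rare_movie"], raters, all_users))  # {'c'} -> 1-identified
```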
20
With >=16 mentions, we often 1-identify the user.
More mentions => better re-identification.
21
RQ2: ALTERING THE DATASET
How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values. Oops: the scoring algorithm doesn't need rating values.
- Generalization: group items (e.g., by genre). The dataset becomes less useful.
- Suppression: hide data.
IDEA: release a ratings dataset that suppresses all "rarely-rated" items (sketched below).
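A minimal sketch of owner-side suppression, assuming ratings are (user, item, value) triples and using an illustrative popularity threshold; the paper does not specify this particular interface.

```python
from collections import Counter

def suppress_rarely_rated(ratings, threshold):
    """Drop every item rated fewer than `threshold` times before release."""
    counts = Counter(item for _, item, _ in ratings)
    return [(u, i, v) for (u, i, v) in ratings if counts[i] >= threshold]

ratings = [("u1", "m1", 4.0), ("u2", "m1", 3.5), ("u3", "m2", 5.0)]
print(suppress_rarely_rated(ratings, threshold=2))
# [('u1', 'm1', 4.0), ('u2', 'm1', 3.5)] -- the rarely rated m2 is withheld
```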
22
Protecting current users against 1-identification requires dropping 88% of items; those 88% of items account for 28% of the ratings.
23
RQ3: SELF DEFENSE
RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per-user
- A user can change ratings or mentions; we focus on mentions
- A user can perturb, generalize, or suppress; as before, we study suppression (sketched below)
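A minimal sketch of user-side suppression, assuming the user withholds a fraction of their mentions and drops the least-rated (most identifying) items first; the ordering heuristic, names, and counts are assumptions, not the paper's procedure.

```python
def suppress_own_mentions(my_mentions, item_rating_counts, fraction):
    """Return the mentions a user still posts after suppressing `fraction` of them."""
    by_popularity = sorted(my_mentions,
                           key=lambda m: item_rating_counts.get(m, 0),
                           reverse=True)
    n_keep = int(len(by_popularity) * (1 - fraction))
    return by_popularity[:n_keep]

counts = {"rare_title": 3, "cult_title": 40, "hit_title": 800, "blockbuster": 5000}
print(suppress_own_mentions(
    ["rare_title", "cult_title", "hit_title", "blockbuster"], counts, 0.25))
# ['blockbuster', 'hit_title', 'cult_title'] -- the rarely rated title is withheld
```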
24
Suppressing 20% of mentions reduced 1-identification somewhat, but did not eliminate it.
Suppressing more than 20% is not reasonable to ask of a user.
25
Another Strategy: Misdirection
What if users mention items they did NOT rate? This might misdirect a re-identification algorithm.
Create a misdirection list of items. Each user takes an unrated item from the list and mentions it. Repeat until not identified.
What are good misdirection lists? Remember: rarely-rated items are identifying.
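A minimal sketch of the misdirection loop described above, with an invented stub identifier standing in for a real re-identification algorithm; the function names and data are assumptions.

```python
def misdirect(my_mentions, my_ratings, misdirection_list, identify):
    """Add mentions of unrated items until `identify` no longer pins down one user."""
    mentions = list(my_mentions)
    for item in misdirection_list:
        if len(identify(mentions)) > 1:   # hiding in a crowd already; stop
            break
        if item not in my_ratings:        # only mention items the user did NOT rate
            mentions.append(item)
    return mentions

def stub_identify(mentions):
    # Pretend the attacker gives up once a widely rated blockbuster is mentioned.
    return {"me", "someone_else"} if "blockbuster" in mentions else {"me"}

print(misdirect(["obscure_film"], {"obscure_film"}, ["blockbuster"], stub_identify))
# ['obscure_film', 'blockbuster']
```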
26
Rarely-rated items don't misdirect! Popular items do better, though 1-identification isn't zero.
It is better to misdirect toward a large crowd: rarely-rated items are identifying, popular items are misdirecting.
27
Exposure: What Have We Learned?
REAL RISK
- Re-identification can lead to loss of privacy
- We found substantial risk of re-identification in our sparse relation space
- There are a lot of sparse relation spaces; more and more of them are available electronically, and we're probably in many of them
HARD TO PRESERVE PRIVACY
- The dataset owner had to suppress a lot of their dataset to protect privacy
- Users had to suppress a lot to protect privacy
- Users could misdirect somewhat with popular items
28
Advice: Keep Your Customers' Trust
- Share data rarely. Remember the governor: (ZIP + birthdate + gender) is not anonymous.
- Reduce exposure. Example: Google will anonymize search data older than 24 months.
29
AOL: 650K users, 20M queries
- Data wants to be free: government subpoenas, research, commerce
- People do not know the risks
- AOL was text; this is items
- NY Times: AOL user No. 4417749 searched for "dog that urinates on everything."
30
Discussion #1: Exposure
- Examples of sparse relation spaces?
- Examples of re-identification risks?
- How to preserve privacy?