You Are What You Say: Privacy Risks of Public Mentions. Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl, University of Minnesota.

Similar presentations
Recommender Systems & Collaborative Filtering

Differentially Private Recommendation Systems Jeremiah Blocki Fall A: Foundations of Security and Privacy.
Web Intelligence Text Mining, and web-related Applications
The End of Anonymity Vitaly Shmatikov. Tastes and Purchases slide 2.
Electronic Visualization Laboratory University of Illinois at Chicago Giving Good Presentations Electronic Visualization Laboratory University of Illinois.
Students’ online profiles for employability and community Frances Chetwynd, Karen Kear, Helen Jefferis and John Woodthorpe The Open University.
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.
Recommender Systems Aalap Kohojkar Yang Liu Zhan Shi March 31, 2008.
SENG 531: Labs TA: Brad Cossette Office Hours: Monday, Wednesday.
Do You Trust Your Recommender? An Exploration of Privacy and Trust in Recommender Systems Dan Frankowski, Dan Cosley, Shilad Sen, Tony Lam, Loren Terveen,
CS345 Data Mining Recommendation Systems Netflix Challenge Anand Rajaraman, Jeffrey D. Ullman.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Mehran Sahami Timothy D. Heilman A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets.
Hinrich Schütze and Christina Lioma
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
The Vector Space Model …and applications in Information Retrieval.
Information Retrieval
Preserving Privacy in Clickstreams Isabelle Stanton.
1 Agenda 1. What is (Web) data mining? And what does it have to do with privacy? – a simple view – 2. Examples of data mining and "privacy-preserving data.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
A step-by-step tutorial by Henry Liu Auckland City Libraries Make a start Chinese Digital Community.
R 18 G 65 B 145 R 0 G 201 B 255 R 104 G 113 B 122 R 216 G 217 B 218 R 168 G 187 B 192 Core and background colors: 1© Nokia Solutions and Networks 2014.
Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 9.1 Chapter 9 : Social Networks What is a social.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
LSP 121 Week 1 Intro to Databases. Welcome to LSP 121 Quantitative Reasoning and Technological Literacy II Continuation of quantitative data concepts.
Overview of Privacy Preserving Techniques.  This is a high-level summary of the state-of-the-art privacy preserving techniques and research areas  Focus.
Citation Recommendation 1 Web Technology Laboratory Ferdowsi University of Mashhad.
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
1 Information Filtering & Recommender Systems (Lecture for CS410 Text Info Systems) ChengXiang Zhai Department of Computer Science University of Illinois,
Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter.
CC Procesamiento Masivo de Datos Otoño 2015 Lecture 8: Information Retrieval II Aidan Hogan
Broadcasting News Trivia "LESSON PLANS." BBC News. BBC, 30 Jan Web. 19 Nov
1 Building a Good Presentation Prof. Greg Steffan Electrical & Computer Engineering University of Toronto.
1.NET Web Forms Business Forms © 2002 by Jerry Post.
Digital Citizenship Lesson 3. Does it Matter who has your Data What kinds of information about yourself do you share online? What else do you do online.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
1 Computing Relevance, Similarity: The Vector Space Model.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Collaborative Information Retrieval - Collaborative Filtering systems - Recommender systems - Information Filtering Why do we need CIR? - IR system augmentation.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
RecBench: Benchmarks for Evaluating Performance of Recommender System Architectures Justin Levandoski Michael D. Ekstrand Michael J. Ludwig Ahmed Eldawy.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John T. Riedl
Create speaking avatars and use them as an effective learning tool.
1 Mining the Web to Determine Similarity Between Words, Objects, and Communities Author : Mehran Sahami Reporter : Tse Ho Lin 2007/9/10 FLAIRS, 2006.
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU.
Evaluation of Recommender Systems Joonseok Lee Georgia Institute of Technology 2011/04/12 1.
Vector Space Models.
Anonymity and Privacy Issues --- re-identification
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
Probabilistic km-anonymity (Efficient Anonymization of Large Set-valued Datasets) Gergely Acs (INRIA) Jagdish Achara (INRIA)
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
A code-centric cluster-based approach for searching online support forums for programmers Christopher Scaffidi, Christopher Chambers, Sheela Surisetty.
1 CS 430: Information Discovery Lecture 5 Ranking.
User Modeling and Recommender Systems: recommendation algorithms
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Privacy, anonymity and other confusing words Przemek Jaroszewski CERT Polska/NASK.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
Privacy Issues in Graph Data Publishing Summer intern: Qing Zhang (from NC State University) Mentors: Graham Cormode and Divesh Srivastava.
IR 6 Scoring, term weighting and the vector space model.
Recommender Systems & Collaborative Filtering
Warm up The mean salt content of a certain type of potato chips is supposed to be 2.0mg. The salt content of these chips varies normally with standard.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Advisor: Prof. Shou-de Lin (林守德) Student: Eric L. Lee (李揚)
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Location Recommendation — for Out-of-Town Users in Location-Based Social Network Yina Meng.
Presentation transcript:

You Are What You Say: Privacy Risks of Public Mentions. Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl, University of Minnesota

Story: Finding "Subversives". "…few things tell you as much about a person as the books he chooses to read." – Tom Owad, applefritter.com

The Whole Talk in One Slide: Private Dataset (YOU) + Public Dataset (YOU) + IR algorithms = your private data, linked! Seems bad. How can privacy be preserved?

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

I'm in IR. Why Do I Care?
- Identifying a user across two datasets is Information Retrieval (IR). The query: "given a user from one dataset, which is the corresponding user in the other dataset?"
- This query becomes easier to answer as more and more of our data is electronically available
- The IR community should lead the discussion of how to preserve user privacy given IR technologies

movielens.org
- Started ~1995
- Users rate movies ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see a user's ratings

Anonymized Dataset
- Released: ratings, some demographic data, but no identifiers
- Intended for research
- Public: anyone can download

movielens.org Forums
- Started June
- Users talk about movies
- Public: on the web, no login needed to read
- Can forum users be identified in our anonymized dataset?

Research Questions
- RQ1: RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
- RQ2: ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3: SELF DEFENSE: How can users protect their own privacy?

Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal their ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk?
- What kinds of datasets? What kinds of risks?

Vulnerable Datasets
- We talk about datasets from a sparse relation space, which:
  - Relates people to items
  - Is sparse (each person has few relations out of the possible ones)
  - Has a large space of items

        i1   i2   i3   ...
  p1    X
  p2         X
  p3              X
  ...
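To make the structure concrete, here is a minimal Python sketch of a sparse relation space as a mapping from people to small sets of items. The variable names and toy data are illustrative, not from the paper; the sketches after the algorithm slides below reuse this dict-of-sets shape.

```python
# Illustrative sparse relation spaces (toy data, not the real datasets):
# each person relates to only a few items out of a large item universe.
ratings = {            # private dataset: ratings user -> set of rated movies
    "u1": {"A", "D"},
    "u2": {"B", "C"},
    "u3": {"A", "B", "C", "E"},
}
mentions = {           # public dataset: forum user -> set of mentioned movies
    "t": {"A", "B", "C"},
}
```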

Example Sparse Relation Spaces
- Customer purchase data from Target
- Songs played on iTunes
- Articles edited in Wikipedia
- Books/albums/beers… mentioned by bloggers or on forums
- Research papers cited in a paper (or review)
- Groceries bought at Safeway
- …
- We look at movie ratings and forum mentions, but there are many sparse relation spaces

Risks of Re-identification
- Re-identification is matching a user in two datasets by using some linking information (e.g., name and address, or movie mentions)
- Re-identifying to an identified dataset (e.g., one with names and addresses, or social security numbers) can result in severe privacy loss

Story: Finding the Medical Records of a Former Governor of Massachusetts (Sweeney 2002)

The Rebus Form: anonymized medical data + public voter rolls = the Governor's medical records!

Associated with movies: who cares?
- 1987: Bork's video rental history leaked to the press
- 1988: Video Privacy Protection Act
- 1991: What if Clarence Thomas had rented porn? Uh oh.
- People are judged by their preferences

Related Work
- Anonymizing datasets: k-anonymity (Sweeney 2002)
- Privacy-preserving data mining (Verykios et al. 2004, Agrawal et al. 2000, …)
- Privacy-preserving recommender systems (Polat et al. 2003, Berkovsky et al. 2005, Ramakrishnan et al. 2001)
- Text mining of user comments and opinions (Drenner et al. 2006, Dave et al. 2003, Pang et al. 2002)

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

RQ1: Risks of Dataset Release
- RQ1: What are the risks to user privacy when releasing a dataset?
- Algorithms to re-identify users, and how they worked on our datasets:
  - Our datasets
  - Set Intersection algorithm
  - TF-IDF algorithm
  - Scoring algorithm

Our Datasets: Ratings and Mentions
- Ratings: large, skewed (especially per item)
  - 140K users: max 6K ratings, average 90, median 33
  - 9K movies: max 49K ratings, average 1,403, median 207
  - 12.6M ratings
- Forum mentions: small, skewed
  - 133 forum posters
  - 1,685 different movies
  - 3,828 movie mentions
- Skew is important for re-identification (e.g., Star Wars vs. Gory Gory Hallelujah)

Re-identification Algorithms
- What is a re-identification algorithm?
- What assumptions did we use to create and improve them?
- How well did they re-identify people?

Re-identification Algorithm: given a target user t from the forums who mentions m1, m2, m3, …, the algorithm searches the ratings data and returns a likely list of (user, score) pairs: (u1, s1), (u2, s2), (u3, s3), …

Re-identification Algorithm (evaluation)
- We also know the target user t in the ratings data
- t is k-identified if it appears at position k or higher on the likely list (with some fiddling for ties)
- k-identification rate for an algorithm: the fraction of the 133 forum users who are k-identified
- In the paper, k = 1, 5, 10, 100. We'll talk about 1-identification, because it's the scariest.
- Example likely list: (u1, s1), (u2, s2), (u3, s3) = t, (u4, s4), … Here t is 3-identified (and 4-identified, 5-identified, etc.), but NOT 2-identified
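As a minimal sketch (not the authors' code, and omitting their tie handling), the k-identification check and the k-identification rate can be computed directly from ranked likely lists:

```python
def is_k_identified(likely_list, target, k):
    # likely_list: (user, score) pairs sorted by descending score.
    # True if the target appears at rank k or better (ties ignored here).
    return target in [user for user, _ in likely_list[:k]]

def k_identification_rate(results, k):
    # results: one (likely_list, target) pair per forum user.
    hits = sum(is_k_identified(ll, t, k) for ll, t in results)
    return hits / len(results)
```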

Glorious Linking Assumption
- People mostly talk about things they know => people tend to have rated what they mentioned
- Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82
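This measurement is possible because the dataset owner knows the true forum-to-ratings mapping (an attacker does not). A hedged sketch, assuming the dict-of-sets structures above:

```python
def linking_probability(mentions, ratings):
    # Average over forum users of P(u rated m | u mentioned m).
    # Assumes the true forum-user-to-ratings-user mapping is known,
    # which the dataset owner has but an attacker does not.
    per_user = [
        len(ms & ratings.get(u, set())) / len(ms)
        for u, ms in mentions.items() if ms
    ]
    return sum(per_user) / len(per_user)
```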

Algorithm Idea (Venn diagram): among all users, many rated a popular item, few rated a rarely rated item, and fewer still rated both.

Set Intersection Algorithm
- Find the users who rated EVERY movie the target user mentioned
- They all get the same likeliness score
- Ignores rating values entirely
- RESULT: 1-identification rate of 7%
- MEANING: 7% of the time there was exactly one user at the top of the likely list, and it was the target user
- Room for improvement: for a target user with many mentions, there may be no user who rated them all
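A minimal sketch of set intersection over the dict-of-sets structures above (illustrative, not the paper's code):

```python
def set_intersection_candidates(target_mentions, ratings):
    # Users whose rated set contains every mentioned movie;
    # all candidates share the same likeliness score.
    return {user for user, rated in ratings.items()
            if target_mentions <= rated}   # subset test
```

For the toy data above, set_intersection_candidates(mentions["t"], ratings) returns {"u3"}, the only user who rated all of A, B, and C.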

Improving Re-identification
- Loosen the requirement that a user rate every movie mentioned
- Score each user by similarity to the target user. Score more highly if:
  - The user rated more of the target's mentions
  - The user rated mentions of rarely rated movies
- Intuition: rare movies give more information (e.g., "Star Wars" vs. "Gory Gory Hallelujah")

TF-IDF Algorithm
- Term Frequency (TF) Inverse Document Frequency (IDF) is a standard way to search a sparse vector space
- Emphasizes rarely rated (or rarely mentioned) movies
- NOT using TF-IDF on text: for us, a "word" is a movie and a "document" (bag of words) is a user
- Score is cosine similarity to the target user
- RESULT: 1-identification rate of 20% (compared to 7% for Set Intersection)
- Room for improvement: it over-weights any matching mention for a ratings user who rated few movies; high-scoring users can have just 4 ratings and 1 matching mention
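A hedged sketch of this adaptation, with binary term frequency and log IDF as assumptions (the paper's exact weighting may differ):

```python
import math

def tfidf_likely_list(target_mentions, ratings):
    # Movies play the role of words; users play the role of documents.
    n_users = len(ratings)
    df = {}                                   # movie -> number of raters
    for rated in ratings.values():
        for m in rated:
            df[m] = df.get(m, 0) + 1
    idf = {m: math.log(n_users / c) for m, c in df.items()}

    t_vec = {m: idf.get(m, 0.0) for m in target_mentions}
    t_norm = math.sqrt(sum(w * w for w in t_vec.values()))

    scores = {}
    for user, rated in ratings.items():
        dot = sum(t_vec[m] * idf[m] for m in target_mentions & rated)
        u_norm = math.sqrt(sum(idf[m] ** 2 for m in rated))
        scores[user] = dot / (t_norm * u_norm) if dot else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Note how a user with very few ratings has a small u_norm and so a large cosine score; that is exactly the over-weighting problem noted on the slide.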

Scoring Algorithm
- Emphasizes mentions of rarely rated movies; de-emphasizes the number of ratings a user has
- Given the mentions of a target user t, score ratings users by the mentions they rated
- A user who has rated a mention is taken to be many times more likely to be the target user than one who has not
- A couple of tweaks (see paper)

Scoring Algorithm (2)
- Example: target user t mentioned A, B, C, which were rated 20, 50, and 1,000 times respectively (out of 10,000 users)
- User u1 rated A; user u2 rated B and C
- u1 score: … × 0.05 × 0.05
- u2 score: 0.05 × … × …
- u2 is more likely to be target t
- Rating a mention is good; rating a rare one is even better
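The transcript elides the exact factors, so the constants below are assumptions: a hedged sketch in which rating a mention multiplies the score by a boost that grows with the movie's rarity, and missing a mention multiplies it by a small penalty:

```python
def scoring_likely_list(target_mentions, ratings, miss_factor=0.05):
    # miss_factor and the rarity boost are illustrative assumptions;
    # the talk's exact constants are elided in the transcript.
    n_users = len(ratings)
    n_raters = {}                          # movie -> number of raters
    for rated in ratings.values():
        for m in rated:
            n_raters[m] = n_raters.get(m, 0) + 1

    scores = {}
    for user, rated in ratings.items():
        score = 1.0
        for m in target_mentions:
            if m in rated:
                score *= n_users / n_raters[m]   # rarer => bigger boost
            else:
                score *= miss_factor
        scores[user] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Under these assumed factors and the slide's counts (10,000 users; A, B, C rated 20, 50, 1,000 times), u1 scores 500 × 0.05 × 0.05 = 1.25 while u2 scores 0.05 × 200 × 10 = 100, consistent with the slide's conclusion that u2 is more likely.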

Scoring Algorithm (3)
- RESULT: 1-identification rate of 31% (compared to 20% for TF-IDF)
- Ignores rating values entirely!
- In the paper, we look at algorithms that use rating values, assuming a "magic" forum-post text analyzer. We'll skip that here.
- Knowing the rating value helps, even if it's off by ±1 star (of 5)

(Chart: Scoring reaches a 1-identification rate of 31%. Using rating values does better, but requires the magic forum text analyzer. We'll use Scoring for the rest of the talk.)

(Chart: with >=16 mentions we often 1-identify. More mentions => better re-identification.)

Privacy Risks: What We Learned
- Re-identification is a privacy risk: finding subversives from books, the governor's medical records, Supreme Court nominees
- With simple assumptions, we can re-identify users
- The Scoring algorithm does well even without rating values
- Knowing the rating value helps
- Rare items are more identifying
- More data per user => better re-identification
- Let's try to preserve privacy by defeating Scoring

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

RQ2: Altering the Dataset
- How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values. Oops, Scoring doesn't need values.
- Generalization: group items (e.g., by genre). The dataset becomes less useful.
- Suppression: hide data. Let's try that.

Suppressing Data
- We won't modify forum data: users wouldn't like it. Focus on the ratings data.
- We don't know which movies a user will rate
- Rarely rated items are identifying
- IDEA: release a ratings dataset that suppresses all "rarely rated" items
- Rarely rated: items rated fewer than N times
- Investigate different values of N (see the sketch below)
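A minimal sketch of owner-side suppression over the dict-of-sets structures above (parameter names are illustrative):

```python
def suppress_rare_items(ratings, n_min):
    # Keep only items rated at least n_min times; drop the rest
    # from every user's released ratings.
    counts = {}
    for rated in ratings.values():
        for m in rated:
            counts[m] = counts.get(m, 0) + 1
    keep = {m for m, c in counts.items() if c >= n_min}
    return {user: rated & keep for user, rated in ratings.items()}
```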

(Chart: the owner must drop 88% of items to protect current users against 1-identification; dropping 88% of items removes 28% of the ratings.)

RQ3: Self Defense
- RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per user
- A user can change ratings or mentions. We focus on mentions.
- A user can perturb, generalize, or suppress. As before, we study suppression.

Suppressing Data (user level)
- From the previous result, if users chose not to mention any rarely rated movies, they would be severely restricted (to the 22% most popular movies)
- What if a user chooses to drop certain mentions? (Perhaps via a "Forum Advisor" interface.)
- IDEA: each user suppresses some of their own mentions, starting with the most rarely rated movies (sketch below)
- Users are probably unwilling to suppress many mentions: they want to talk about movies!
- Maybe if they knew how much privacy they were losing, they would suppress more
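A hedged sketch of per-user suppression, rarest mentions first; the fraction parameter is an illustrative knob, not from the paper:

```python
def suppress_own_mentions(user_mentions, item_counts, fraction):
    # Drop the given fraction of this user's mentions, starting with
    # the most rarely rated movies (the most identifying ones).
    by_rarity = sorted(user_mentions, key=lambda m: item_counts.get(m, 0))
    n_drop = int(len(by_rarity) * fraction)
    return set(by_rarity[n_drop:])
```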

(Chart: suppressing 20% of mentions lowered the 1-identification rate somewhat, but not to zero. Suppressing more than 20% is not reasonable for a user.)

Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm.
- Create a misdirection list of items. Each user takes an unrated item from the list and mentions it. Repeat until the user is not identified. (Sketch below.)
- What makes a good misdirection list?
- Remember: rarely rated items are identifying
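A minimal sketch of the misdirection loop; reidentify is an assumed callback that returns the top-ranked candidate (e.g., the first user from scoring_likely_list above):

```python
def misdirect(user, user_mentions, ratings, misdirection_list, reidentify):
    # Add unrated items from the misdirection list to the user's public
    # mentions until the re-identifier no longer puts the user on top.
    mentions = set(user_mentions)
    for item in misdirection_list:
        if reidentify(mentions, ratings) != user:
            return mentions              # no longer 1-identified
        if item not in ratings.get(user, set()):
            mentions.add(item)
    return mentions
```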

(Chart: rarely rated items don't misdirect! Popular items do better, though the 1-identification rate isn't zero. It is better to misdirect toward a large crowd: rarely rated items are identifying, popular items are misdirecting.)

Talk Outline
- Introduction
- Motivation
- Privacy Risks
- Preserving Privacy
- Conclusion

Conclusion: What Have We Learned?
- REAL RISK
  - Re-identification can lead to loss of privacy
  - We found substantial risk of re-identification in our sparse relation space
  - There are a lot of sparse relation spaces, and we're probably in more and more of them as data becomes electronically available
- HARD TO PRESERVE PRIVACY
  - The dataset owner had to suppress a lot of the dataset to protect privacy
  - Users had to suppress a lot to protect privacy
  - Users could misdirect somewhat with popular items

AOL
- Data wants to be free: government subpoenas, research, commerce
- People do not know the risks
- AOL was text; this is items
- # searched for "dog that urinates on everything."

Future Work
- We looked at one pair of datasets. Look at others!
- Model re-identification in sparse relation spaces with mathematical rigor
- Investigate more algorithms (both re-identification and privacy protection)
- An arms race between re-identifiers and privacy protectors

Thanks for listening!
- Questions?
- This work is supported by NSF grants IIS and IIS