Exploring Linkability of User Reviews Mishari Almishari and Gene Tsudik Computer Science Department University of California, Irvine
Increasing Popularity of Reviewing Sites Yelp, more than 39M visitors and 15M reviews in 2010
category Rating
Rising Awareness of Privacy
How Does Privacy Apply to Reviews? Traceability Linkability of Ad hoc Reviews Linkability of Several Accounts
Contribution Extensive Study to Measure Privacy/Linkability in User Reviews Propose Models that Adequately Identify Authors
Settings & Problem Formulation
IR: Identified Record AR: Anonymous Record
Anonymous Record Size (AR) Identified Record Size (IR) Matching Model TOP-X Linkability X: 1 and 10 1, 5, 10, 20,…60
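The Top-X linkability measure named on this slide can be read as the fraction of anonymous records whose true author appears among the first X candidates returned by the matching model. A minimal sketch under that reading follows; the function name `top_x_linkability` and the toy rankings are hypothetical and only illustrate the measure.

```python
from typing import Dict, List


def top_x_linkability(rankings: Dict[str, List[str]], x: int) -> float:
    """Fraction of ARs whose true author is among the first x ranked candidates.

    rankings maps the true author of each AR to the ranked list of candidate
    authors produced by a matching model (best match first).
    """
    if not rankings:
        return 0.0
    hits = sum(1 for author, ranked in rankings.items() if author in ranked[:x])
    return hits / len(rankings)


# Hypothetical ranked outputs for three anonymous records.
rankings = {
    "alice": ["alice", "bob", "carol"],
    "bob": ["carol", "bob", "alice"],
    "carol": ["alice", "bob", "carol"],
}
top1 = top_x_linkability(rankings, 1)  # only alice's AR is ranked first
top2 = top_x_linkability(rankings, 2)  # alice's and bob's ARs are in the top 2
```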
Dataset 1 Million Reviews 2,000 Users, each with more than 300 reviews
Methodology Naïve Bayesian Model Kullback-Leibler Model Symmetric Version
Methodology Anonymous Record (AR) -> Identified Record (IR) Naïve Bayesian Model (NB): return the IR_i that maximizes P(AR | IR_i) Kullback-Leibler Divergence (KLD): compute Distance(AR, IR_i) and return the IR_i with the minimum distance
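The NB rule on this slide, choosing the IR_i with the largest P(AR | IR_i), can be sketched as follows for unigram (letter) tokens. The helper names, the add-one smoothing, and the toy records are assumptions made for illustration; the slides do not say how zero counts are handled.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_counts(text: str) -> Counter:
    """Count the letters a..z in a text, ignoring case and other characters."""
    return Counter(ch for ch in text.lower() if ch in ALPHABET)


def nb_log_likelihood(ar: Counter, ir: Counter) -> float:
    """log P(AR | IR) under a unigram model with add-one smoothing (an assumption)."""
    total = sum(ir.values()) + len(ALPHABET)
    return sum(n * math.log((ir[tok] + 1) / total) for tok, n in ar.items())


def nb_rank(ar_text: str, irs: Dict[str, str]) -> List[str]:
    """Return user ids sorted by decreasing log P(AR | IR_i)."""
    ar = letter_counts(ar_text)
    scores = {user: nb_log_likelihood(ar, letter_counts(text))
              for user, text in irs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical identified records for two users.
irs = {"u1": "aaaaaaaabb", "u2": "zzzzzzzzyy"}
ranked = nb_rank("aaab", irs)  # u1 is ranked first for this toy input
```

Working in log space avoids numeric underflow when the anonymous record contains many tokens.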
Naïve Bayesian (NB) Identified Record (IR) Anonymous Record (AR) Decreasing Sorted List of IRs
Naïve Bayesian Identified Record Anonymous Record Sorted List of IRs
Kullback-Leibler Divergence (KLD) Identified Record (IR) Anonymous Record (AR) Increasing Sorted List of IRs
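The KLD model ranks IRs by increasing distance from the AR. A minimal sketch is given below, assuming both records are already represented as probability distributions over the same tokens with no zero entries, and assuming the "symmetric version" is the sum of the two directed divergences (the slides do not give the exact form). All function names are hypothetical.

```python
import math
from typing import Dict, List


def kld(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Directed divergence sum_t p(t) * log(p(t) / q(t)).

    Requires q[t] > 0 wherever p[t] > 0 (for example, smoothed distributions).
    """
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)


def symmetric_kld(p: Dict[str, float], q: Dict[str, float]) -> float:
    """One common symmetric form (an assumption): KLD(p||q) + KLD(q||p)."""
    return kld(p, q) + kld(q, p)


def kld_rank(ar: Dict[str, float], irs: Dict[str, Dict[str, float]]) -> List[str]:
    """Return user ids sorted by increasing symmetric distance to the AR."""
    return sorted(irs, key=lambda user: symmetric_kld(ar, irs[user]))


# Hypothetical token distributions over a two-token vocabulary.
ar = {"a": 0.7, "b": 0.3}
irs = {"u1": {"a": 0.6, "b": 0.4}, "u2": {"a": 0.1, "b": 0.9}}
ranked = kld_rank(ar, irs)  # u1 is closer to the AR than u2
```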
Maximum Likelihood Estimation
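Maximum likelihood estimation of a token distribution amounts to dividing each token count by the total count. The sketch below illustrates this; the function name `mle_distribution` and the optional `smoothing` parameter are assumptions added for illustration (with `smoothing=0.0` it is the plain MLE).

```python
from collections import Counter
from typing import Dict, Iterable


def mle_distribution(tokens: Iterable[str], vocabulary: Iterable[str],
                     smoothing: float = 0.0) -> Dict[str, float]:
    """Estimate P(token) as count(token) / total over a fixed vocabulary.

    smoothing adds a constant pseudo-count per token; it is a hypothetical
    extension, not something stated on the slide.
    """
    counts = Counter(tokens)
    vocab = list(vocabulary)
    total = sum(counts[t] for t in vocab) + smoothing * len(vocab)
    if total == 0:
        raise ValueError("no tokens from the vocabulary were observed")
    return {t: (counts[t] + smoothing) / total for t in vocab}


# Toy record containing the tokens a, a, b over the vocabulary {a, b}.
dist = mle_distribution("aab", "ab")
```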
Tokens Unigram: a, …, z Digram: aa, ab, …, zz Rating: 1, 2, 3, 4, 5 Category: Restaurant, Beauty and Spa, Education
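The lexical tokens can be extracted with a few lines of code. The sketch below assumes that text is lowercased, that non-letter characters are dropped, and that digrams are adjacent letter pairs taken within each run of letters; the slides do not state these details, so they are illustrative choices.

```python
import re
from typing import List


def unigrams(text: str) -> List[str]:
    """Single letters a..z, ignoring case, digits, and punctuation."""
    return re.findall(r"[a-z]", text.lower())


def digrams(text: str) -> List[str]:
    """Adjacent letter pairs within each run of letters (an assumption)."""
    pairs: List[str] = []
    for word in re.findall(r"[a-z]+", text.lower()):
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


letters = unigrams("Hi 5!")
pairs = digrams("Good food")
```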
Lexical Token Results
NB - Unigram Size 60: LR 83% (Top-1), LR 96% (Top-10)
KLD - Unigram Size 60: LR 83% (Top-1), LR 96% (Top-10)
NB - Digram Size 20: LR 97% (Top-1) Size 10: LR 88% (Top-1)
KLD - Digram Size 60: LR 99% (Top-1) Size 30: LR 75% (Top-1)
Improvement (1): Combining Lexical and Non-Lexical Tokens
Combining in NB model Straightforward P(Rating|IR), P(Category|IR) But for KLD? Weighted Average
First, Combine Rating and Category (Beta of 0.5) Second, Combine Non-Lexical and Lexical (Alpha of 0.997/0.97 for Unigram/Digram)
Rating and Category Beta Value of 0.5
Non-lexical and Unigram Alpha Value of 0.997
Non-Lexical and Digram Alpha Value of 0.97
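The two-step weighted average for KLD can be written out as below. This is a sketch under an assumption: the slides give the values (beta 0.5, alpha 0.997 for unigrams, 0.97 for digrams) but not which term each weight multiplies. Here alpha is assumed to weight the lexical distance and beta the rating distance; the function name is hypothetical.

```python
def combined_distance(d_rating: float, d_category: float, d_lexical: float,
                      alpha: float, beta: float = 0.5) -> float:
    """Two-step weighted average of per-token-type KLD distances.

    Step 1: combine rating and category distances with weight beta.
    Step 2: combine the lexical and non-lexical distances with weight alpha.
    Which side each weight applies to is an assumption.
    """
    d_non_lexical = beta * d_rating + (1 - beta) * d_category
    return alpha * d_lexical + (1 - alpha) * d_non_lexical


# Toy distances; alpha = 0.97 is the digram value from the slides.
d = combined_distance(d_rating=0.2, d_category=0.4, d_lexical=1.0, alpha=0.97)
```

With these toy inputs the non-lexical part is 0.5 * 0.2 + 0.5 * 0.4 = 0.3, and the combined distance is 0.97 * 1.0 + 0.03 * 0.3 = 0.979.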
Token Combining Results
Rating, Category, and Unigram - NB Gain: up to 20% Size 30: 60% to 80% Size 60: 83% to 96%
Rating, Category, and Unigram - KLD Gain: up to 12% Size 40: 68% to 80% Size 60: 83% to 92%
Rating, Category, and Digram - NB
Rating, Category, and Digram - KLD
What about Restricting Identified Record (IR) Size?
Anonymous Record Size (AR) Identified Record Size (IR) Matching Model TOP-X Linkability X: 1 and 10
Restricted IR - NB Affected by IR size
Restricted IR - KLD Performed better for smaller IR sizes Size 20 or less: improved The rest: comparable
What about Matching All ARs at once?
Anonymous Record Size (AR) Identified Record Size (IR) Matching Model TOP-X Linkability X: 1 and 10
Anonymous Records (ARs) Identified Records (IRs) Matching Model
Improvement (2): Matching All ARs At Once
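The slides do not say how the joint matching is computed. One possible approach, shown purely as an illustration, is a greedy one-to-one assignment: if each AR is assumed to belong to a different user, an IR that has already been claimed by a closer AR is no longer available to the others. The function name and the toy distances are hypothetical.

```python
from typing import Dict, Set


def match_all_greedy(distances: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Greedily assign each AR to a distinct IR, closest pairs first.

    distances[ar][ir] is a distance such as KLD (smaller means more similar).
    This is a sketch of one possible strategy, not the authors' stated method.
    """
    pairs = sorted((d, ar, ir)
                   for ar, row in distances.items()
                   for ir, d in row.items())
    assignment: Dict[str, str] = {}
    used: Set[str] = set()
    for _, ar, ir in pairs:
        if ar not in assignment and ir not in used:
            assignment[ar] = ir
            used.add(ir)
    return assignment


# Toy case: matched independently, both ARs would pick u1.
distances = {
    "ar1": {"u1": 0.1, "u2": 0.5},
    "ar2": {"u1": 0.3, "u2": 0.4},
}
assignment = match_all_greedy(distances)
```

In the toy case, ar1 claims u1 first because it is the closest pair overall, so ar2 falls back to u2 instead of colliding on u1.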
MatchAll - Restricted Gain, up to 16% Size 30, From 74% To 90%
MatchAll - Full Gain, up to 23% Size 20, From 35% To 55%
Improvement (3): For Small IR Size
Changing it to: Review Length
Results – Improvement (3) Size 10, 89% To 92% Size 7, 79% To 84% Gain up to 5%
Discussion Implications: Cross-Referencing, Review Spam Non-Prolific Users: gradually become prolific; with an IR of 20, linkability is around 70% Anonymous Record Size: linkability is high even for small ARs (92% for an AR of 10); 60 is only 20% of the minimum user contribution
Discussion (cont.) Unigram Tokens: very comparable for larger ARs; entail fewer resources in the attack (26 vs. 676 tokens)
Future Directions Improving more for Small ARs Other Probabilistic Models Using Stylometry Exploring Linkability in other Preference Databases More than one AR for different Users: Exploring it more
Conclusion Extensive Study to Assess Linkability of User Reviews For large set of users Using very simple features Users are very exposed even with simple features and large number of authors
Thank you all!