563.10: Bloom Cookies Web Search Personalization without User Tracking Presented by Ben Ujcich CS563/ECE524 Advanced Computer Security University of Illinois
Background A trade-off between privacy and personalization from what we give search engines when we perform searches If I search for UIUC-related websites often, would I want Google to show UIUC pages when I simply type “university”? What do I lose when I make myself more private in my searches (e.g., browsing through Tor)?
A Compromise Profile obfuscation masks the exact profile of a user’s previous searches and URLs visited Provides some degree of privacy while allowing personalization (How can this be quantified?) Implemented client-side or through a personalization proxy Downsides? Costly in bandwidth Need for a trusted third party
Profile Obfuscation Techniques Generalization: make specifics coarser. Example: user visited URLs (ece.illinois.edu, nytimes.com, facebook.com, youtube.com) become user visited URL categories (Higher education, News, Social media, Videos and media). Noise addition: add fake information. Example: user visited URLs (ece.illinois.edu, nytimes.com, facebook.com, youtube.com) become ece.illinois.edu, nytimes.com, facebook.com, youtube.com, wsj.com, umich.edu, instagram.com, reddit.com.
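The two techniques can be sketched in a few lines of Python. This is only an illustration of the slide's example, not the paper's implementation: the category mapping, the noise dictionary, and the function names generalize and add_noise are assumptions made for this sketch.

```python
import random

# Hypothetical URL -> category mapping, taken from the slide's example.
CATEGORY_OF = {
    "ece.illinois.edu": "Higher education",
    "nytimes.com": "News",
    "facebook.com": "Social media",
    "youtube.com": "Videos and media",
}

# Hypothetical noise dictionary (the extra URLs shown on the slide).
NOISE_DICTIONARY = ["wsj.com", "umich.edu", "instagram.com", "reddit.com"]


def generalize(urls):
    """Generalization: replace each URL with its coarser category."""
    return [CATEGORY_OF[u] for u in urls]


def add_noise(urls, n_fake, rng):
    """Noise addition: mix n_fake dictionary URLs into the real profile."""
    fakes = rng.sample(NOISE_DICTIONARY, n_fake)
    mixed = list(urls) + fakes
    rng.shuffle(mixed)  # hide which entries are real
    return mixed


profile = ["ece.illinois.edu", "nytimes.com", "facebook.com", "youtube.com"]
generalized = generalize(profile)
noisy = add_noise(profile, 4, random.Random(0))
```

Note that the noisy profile is twice as long as the real one, which is the bandwidth cost the earlier slide mentions.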
Research Challenges “What obfuscation technique is more suitable for privacy-preserving personalization of web search?” “How big a dictionary and how much noise are required to achieve reasonable unlinkability?” “Is it possible to receive the advantages of noisy profiles without incurring the aforementioned costs (i.e., noise dictionary and large communication overhead)?”
Citation Bloom Cookies: Web Search Personalization without User Tracking Nitesh Mor (UC Berkeley), Oriana Riva (Microsoft), Suman Nath (Microsoft), John Kubiatowicz (UC Berkeley) NDSS ’15
Overview Providing personalization while preserving privacy in web searches can be done through profile obfuscation, but it is often costly or impractical. The authors quantify and evaluate whether generalization or noise addition is better for the privacy-personalization trade-off. The authors propose the Bloom cookie, based on the properties of a Bloom filter’s false positives, as a cost-efficient mechanism for adding noise and preserving configurable amounts of privacy.
Threat Model Server not trusted by client (user) Techniques for hiding IP addresses are not assumed (“unlinkability” across IP addresses) IP addresses change frequently Browsers prevent online services from tracking (though browsers themselves keep track of previous activity) Large population size No collusion with other services
Evaluation Techniques Personalization (measured by average rank) URL-based: URLs users visit most often Interest-based: preferred interest based on prior searches Privacy (measured by unlinkability) RAND: add random noise from dictionary containing URLs and their associated interests HYBRID: add random noise only from dictionary entries that correspond to interests that the user has already looked at in the past
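A minimal sketch of how RAND and HYBRID differ in choosing noise, assuming a small hypothetical dictionary that maps URLs to interest categories. The dictionary contents, the placeholder .example URLs, and the function names are assumptions for illustration; the paper's dictionary and sampling details may differ.

```python
import random

# Hypothetical noise dictionary: URL -> interest category.
DICTIONARY = {
    "wsj.com": "News",
    "umich.edu": "Higher education",
    "instagram.com": "Social media",
    "reddit.com": "Social media",
    "sports.example": "Sports",
    "recipes.example": "Food",
}


def rand_noise(n_fake, rng):
    """RAND: draw fake URLs uniformly from the whole dictionary."""
    return rng.sample(sorted(DICTIONARY), n_fake)


def hybrid_noise(user_interests, n_fake, rng):
    """HYBRID: draw fake URLs only from categories the user already has."""
    candidates = sorted(u for u, c in DICTIONARY.items() if c in user_interests)
    return rng.sample(candidates, min(n_fake, len(candidates)))


rng = random.Random(1)
user_interests = {"News", "Higher education"}
rand_fakes = rand_noise(3, rng)
hybrid_fakes = hybrid_noise(user_interests, 3, rng)
```

In this toy setup HYBRID can only ever pick wsj.com and umich.edu, so its noise stays inside the user's existing interests, while RAND may pick anything in the dictionary.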
Results: Generalization Higher unlinkability (44.1% linkable users) than using exact URLs (98.7% linkable users) Is this reasonable?
Results: Noise Addition Better unlinkability (20% linkable) than generalization (44%), but large cost to send noise HYBRID makes personalization worse than with equivalent parameters in RAND
Review: Bloom Filters Space and time efficient probabilistic membership data structure May have false positives; no false negatives Stored as a bit array m = size of array k = # of hashes to use for inserting/querying elements n = # of inserted elements Figure: an uninitialized Bloom filter with m = 12
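For reference, the three parameters are tied together by the standard textbook approximation of the false positive probability. It is not stated on the slides and assumes the k hash functions behave independently and uniformly, so it is only a rough guide for a filter as small as the one in this review:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
\qquad\text{e.g. } m = 12,\ k = 3,\ n = 2:\quad
p \approx \left(1 - e^{-0.5}\right)^{3} \approx 0.06
```

Setting more bits to 1 raises this probability, which is the property Bloom cookies exploit.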
Review: Bloom Filters Adding an element (m = 12, k = 3, n = 1) Inserting element: “Hello” hashes to hash1(“Hello”) = 1 hash2(“Hello”) = 5 hash3(“Hello”) = 10; set corresponding bit locations to 1 Figure: 12-bit array with the bits at positions 1, 5, and 10 set to 1
Review: Bloom Filters Adding an element (m = 12, k = 3, n = 2) Inserting element: “World” hashes to hash1(“World”) = 3 hash2(“World”) = 5 hash3(“World”) = 9; set corresponding bit locations to 1 Figure: 12-bit array with the bits at positions 1, 3, 5, 9, and 10 set to 1
Review: Bloom Filters Querying for an element (m = 12, k = 3, n = 2) Membership query: Is “Hello” in the list? hashes to hash1(“Hello”) = 1 ✓ hash2(“Hello”) = 5 ✓ hash3(“Hello”) = 10 ✓; check that all corresponding bit locations are 1 Answer: Possibly (with some probability)
Review: Bloom Filters Querying for an element (m = 12, k = 3, n = 2) Membership query: Is “Goodbye” in the list? hashes to hash1(“Goodbye”) = 1 ✓ hash2(“Goodbye”) = 5 ✓ hash3(“Goodbye”) = 7 ✗; check that all corresponding bit locations are 1 Answer: No (guaranteed)
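The add and query steps from this review can be replayed in a short Python sketch. The class and method names are assumptions made here, and the salted SHA-256 scheme used by default is just one common way to derive k indexes, not something the slides specify. To match the slides exactly, the example swaps in a toy lookup table holding the hash values shown for inserting “Hello” and “World” and for querying “Goodbye”.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k, positions=None):
        self.m = m              # size of the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m
        # Optional override so hand-picked hash values can be replayed.
        self._positions = positions or self._hashed_positions

    def _hashed_positions(self, item):
        # Assumed scheme: salt SHA-256 with the hash number to get k indexes.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        """Set every bit the item hashes to."""
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[p] for p in self._positions(item))


# Hash values copied from the slides (a toy table, not real hash functions).
SLIDE_HASHES = {"Hello": [1, 5, 10], "World": [3, 5, 9], "Goodbye": [1, 5, 7]}

bf = BloomFilter(m=12, k=3, positions=SLIDE_HASHES.__getitem__)
bf.add("Hello")
bf.add("World")
hello_found = bf.might_contain("Hello")      # all three bits are set
goodbye_found = bf.might_contain("Goodbye")  # bit 7 is still 0
```

“Goodbye” shares bits 1 and 5 with the inserted elements, so a single additional set bit at position 7 would turn its query into a false positive.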
Bloom Cookies Add exact profile of user’s previously visited URLs as elements into Bloom filter: [“nytimes.com”, “wsj.com”, “google.com”] Then, add noise by setting random fake bits to 1 to achieve at least l proportion of 1 bits. In effect, the false positive rate increases.
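The noise step can be sketched as follows, going only by the description on this slide: random 0 bits are flipped to 1 until at least a fraction l of the bits are 1. The function name add_noise_bits, the 12-bit example array, and the choice l = 0.75 are assumptions for illustration; the paper's actual procedure may differ in detail.

```python
import math
import random


def add_noise_bits(bits, l, rng):
    """Set random 0 bits to 1 until at least fraction l of all bits are 1."""
    bits = list(bits)  # work on a copy; the exact profile bits stay set
    zero_positions = [i for i, b in enumerate(bits) if b == 0]
    rng.shuffle(zero_positions)
    target = math.ceil(l * len(bits))
    while sum(bits) < target and zero_positions:
        bits[zero_positions.pop()] = 1
    return bits


# Hypothetical 12-bit filter holding a user's exact profile (5 bits set).
original = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0]
noisy = add_noise_bits(original, 0.75, random.Random(42))
```

Every bit set by the real profile is still set afterwards, so real URLs never become false negatives; the server simply cannot tell which 1 bits came from real URLs and which from noise.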
Bloom Cookie Properties Efficiency More compact since filter size is fixed Noise by design False positives are advantages Non-deterministic noise Noise changes as filter changes Dictionary-free No noise dictionary required Expensive dictionary attacks Adversary would need to query for membership from the Bloom filter rather than already having the membership list
Bloom Cookie System Design
Results: Bloom Cookies Cost to send is constant (2000 bits) Linkability decreases with higher l value No dependency on a noise dictionary
Pros and Cons Pros: Use of real search logs Bloom cookie design described well Using a “negative” of Bloom filters as a positive No need for a third party Limitations section Clear and well written Useful diagrams Cons: Assumption that user has browser not sending tracking info to services No collusion assumption Authors don’t justify 1,000 users to smooth outliers Single data set Design is described late into the paper Study period too short