Non-tracking Web Analytics
  Istemi Ekin Akkus (1), Ruichuan Chen (1), Michaela Hardt (2), Paul Francis (1), Johannes Gehrke (3)
  (1) Max Planck Institute for Software Systems, (2) Twitter Inc., (3) Cornell University
Web Analytics
  Statistics about users visiting a publisher website
  (Akkus et al., Non-tracking Web Analytics)
Analytics by Data Aggregators
  Collect analytics for many publishers from many clients
  Infer extended analytics
    – Age, gender, education level, other sites visited, …
  Provide aggregate information to publishers & advertisers
  [Diagram labels: Data Aggregator, Publisher, Aggregate Extended Analytics]
Analytics Today
  [Diagram labels: Publisher, Client, Data Aggregator]
Tracking
  Data aggregators criticized
    – Collection of individual information
  Criticisms led to reactions
    – Do-not-Track proposal, EU cookie law
    – Voluntary opt-out mechanisms by aggregators
    – Client-side tools to blacklist aggregators
  Fewer tracked users → less data for inference → worse extended analytics for publishers
Goal
  Replicate the functionality of today’s systems without tracking
Specific Goals
  Privacy
    – No individual information collected by publishers & aggregators
  Functionality
    – Aggregate information for publishers & aggregators
    – No new organizational components
    – Practical and efficient
Outline
  Motivation & Goals
  Components & Assumptions
  Non-tracking Analytics
  Implementation & Evaluation
  Conclusion
Components
  Client locally stores information about the user
  Publisher serves webpages to clients
  Aggregator provides aggregation service
Assumptions
  Potentially malicious client
    – May try to distort results
  Potentially malicious publisher
    – May try to violate individual user privacy
  Honest-but-curious data aggregator
    – Follows the protocol
    – Doesn’t collude with publishers
Outline
  Motivation & Goals
  Components & Assumptions
  Non-tracking Analytics
    – Publisher as Proxy
    – Noise
    – Yes-No Queries
    – Auditing
  Implementation & Evaluation
  Conclusion
Today
  Not anonymous; need a proxy…
  …, but don’t want a new component
  Publisher already interacts with clients!
Publisher as Anonymizing Proxy
  1. Publisher distributes queries to be executed
  2. Publisher collects encrypted answers
  3. Publisher forwards answers to the aggregator
  4. Aggregator counts anonymous answers and returns results
  Clients never exposed to the data aggregator
  [Diagram arrows: 1. Queries, 2. Encrypted Answers, 3. Encrypted Answers, 4. Results]
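The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses a toy, insecure textbook RSA key built from two Mersenne primes, and the random padding, the answer encoding, and all function names are assumptions introduced here. The point it shows is that the publisher relays ciphertexts it cannot read, while the aggregator decrypts answers without knowing which client sent them.

```python
import random
from collections import Counter

# Toy RSA key pair for the aggregator (two Mersenne primes; NOT secure).
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, held only by the aggregator

ANSWERS = ["No", "Yes"]  # hypothetical encoding: an answer is its index in this list


def client_encrypt(answer: str) -> int:
    """Client: encrypt an answer under the aggregator's public key (N, E)."""
    pad = random.getrandbits(64)  # assumed padding so equal answers differ as ciphertexts
    message = (pad << 1) | ANSWERS.index(answer)
    return pow(message, E, N)


def publisher_forward(ciphertexts: list[int]) -> list[int]:
    """Publisher: collects the encrypted answers and forwards them unread."""
    return list(ciphertexts)


def aggregator_count(ciphertexts: list[int]) -> Counter:
    """Aggregator: decrypt the anonymous answers and count them."""
    return Counter(ANSWERS[pow(c, D, N) & 1] for c in ciphertexts)


user_answers = ["Yes", "No", "Yes", "Yes"]
collected = [client_encrypt(a) for a in user_answers]      # step 2
results = aggregator_count(publisher_forward(collected))   # steps 3 and 4
print(results["Yes"], results["No"])  # prints: 3 1
```

A real deployment would use a vetted cryptographic library with a proper padding scheme rather than raw modular exponentiation.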
Identifiers in Responses
  Rare attributes
    – Job: CEO of ACME
  [Diagram labels: Enc(CEO of ACME); example.com; “CEO of ACME visits my site!”; “CEO of ACME visits example.com”]
Noise
  2. Encrypted Answers
  3. Add Noise_Publisher
  4. Noisy Encrypted Answers
  5. Add Noise_Aggregator
  6. Double-noisy Result
  7. Remove Noise_Publisher
  Both entities obtain noisy results
    – Publisher: result with Noise_Aggregator
    – Aggregator: result with Noise_Publisher
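The arithmetic behind steps 2 to 7 can be traced with plain integers. The sketch below is illustrative only: it ignores encryption, treats each noise term as a single signed number, and the function and key names are hypothetical.

```python
def noisy_round(true_count: int, noise_publisher: int, noise_aggregator: int) -> dict[str, int]:
    """Trace one count through steps 2-7 of the noise exchange."""
    # Steps 2-4: the publisher mixes its own noise answers into the clients'
    # answers and forwards the batch, so the aggregator counts both.
    aggregator_view = true_count + noise_publisher
    # Steps 5-6: the aggregator adds its own noise before returning the result.
    double_noisy = aggregator_view + noise_aggregator
    # Step 7: the publisher knows its own noise and subtracts it.
    publisher_view = double_noisy - noise_publisher
    return {"aggregator": aggregator_view, "returned": double_noisy, "publisher": publisher_view}


views = noisy_round(true_count=40, noise_publisher=3, noise_aggregator=-2)
print(views)  # {'aggregator': 43, 'returned': 41, 'publisher': 38}
```

Neither party ends up with the true count of 40: the aggregator's view still carries the publisher's noise, and the publisher's view still carries the aggregator's noise.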
Differentially-private Noise
  Hides the existence of an individual answer
    – CEO: real or noise?
  Requires numerical values
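One standard way to generate such noise for a numerical count is the Laplace mechanism. The slide does not name the distribution, so the choice of Laplace noise, the `epsilon` parameter, and the function names below are assumptions made for illustration.

```python
import random


def laplace_noise(epsilon: float, rng: random.Random, sensitivity: float = 1.0) -> float:
    """Sample Laplace(0, sensitivity / epsilon) noise.

    The difference of two independent exponential variables with rate
    epsilon / sensitivity follows exactly this Laplace distribution.
    """
    rate = epsilon / sensitivity
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count; one user changes it by at most 1, so sensitivity is 1."""
    return true_count + laplace_noise(epsilon, rng)


rng = random.Random(0)
print(noisy_count(true_count=1, epsilon=0.5, rng=rng))
```

With a true count of 1 and noise of scale 2, the released value alone does not say whether the single "CEO" answer was real, which is the effect the slide describes.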
Yes-No Questions
  Convert queries to binary & count answers
    – “What is your job?” → “Is your job ‘CEO’?”
  Noise as additional answers
    – Enc(‘Yes’), Enc(‘No’)
  Bonus: limits a malicious client
    – Either +1 or 0
  Many possible values → many questions
    – Job: ‘CEO’, ‘Student’, ‘Gardener’, ...
Buckets
  Multiple yes-no questions with one query
  1. Enumerate possible answer values
    – Job: {‘CEO’, ‘Student’, ‘Gardener’, ‘Teacher’, ...}
  2. A fixed number of ‘Yes’ answers
    – Job: 1
  3. Clients choose ‘Yes’ for the matching bucket
    – Enc(‘CEO = Yes’)
  4. Publisher generates additional answers
    – Enc(‘CEO = Yes’), Enc(‘Student = Yes’), ...
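The bucket scheme can be sketched as follows, with encryption left out so the counting is visible. The bucket list, the function names, and the behaviour for a client that matches no bucket (it sends nothing) are assumptions for illustration, not details taken from the slides.

```python
from collections import Counter

BUCKETS = ["CEO", "Student", "Gardener", "Teacher"]  # step 1: enumerated answer values


def client_answer(job: str) -> list[str]:
    """Steps 2-3: say 'Yes' only for the matching bucket, so at most one 'Yes'."""
    return [f"{bucket} = Yes" for bucket in BUCKETS if bucket == job]


def count_buckets(answers: list[str]) -> Counter:
    """Aggregator side: each answer adds +1 to exactly one bucket."""
    return Counter(answer.removesuffix(" = Yes") for answer in answers)


client_jobs = ["CEO", "Student", "Student", "Pilot"]  # 'Pilot' matches no bucket
answers = [a for job in client_jobs for a in client_answer(job)]
answers += ["CEO = Yes", "Student = Yes"]  # step 4: publisher's additional answers
counts = count_buckets(answers)
print(counts["CEO"], counts["Student"], counts["Gardener"])  # prints: 2 3 0
```

Because every answer contributes at most +1 to one bucket, a malicious client cannot shift a count by more than one per query.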
Impracticalities of Differential Privacy
  Requires a privacy budget
    – Stop answering when budget expires
    – No answers from clients → low-utility results
  Assumes a static database; our setting is dynamic
    – User population of a publisher changes
    – Certain user data may change
  Clients keep answering queries
Malicious Publishers
  Isolation attacks
    – Isolate a user’s response
    – Repeat the same query
    – Cancel out noise
  1. Specific query conditions or buckets
    – Monitoring and approval by the data aggregator
  2. Selectively dropping client responses
Isolation via Dropping Responses
  [Diagram labels: Enc(CEO), Enc(Student), Enc(Gardener); Enc(Driver), Enc(Mechanic), Enc(Driver); example.com]
  Result: Mechanic: 1 + noise, Driver: 2 + noise, CEO: 1 + noise
  User in the middle is a CEO!
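One reading of this slide is that the publisher keeps only the targeted client's encrypted answer, replaces the others with answers it fabricated itself, and repeats the query to average out the noise. The simulation below follows that reading. The Laplace noise, its scale, the number of repeats, and all names are assumptions, and the target's answer is passed in as plain text only so the simulated aggregator can count it.

```python
import random
from collections import Counter

BUCKETS = ["CEO", "Student", "Gardener", "Driver", "Mechanic"]


def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def aggregator_result(answers: list[str], scale: float, rng: random.Random) -> dict[str, float]:
    """Aggregator: per-bucket counts plus fresh noise on every query."""
    counts = Counter(answers)
    return {b: counts[b] + laplace(scale, rng) for b in BUCKETS}


def isolate(target_answer: str, fabricated: list[str], repeats: int,
            scale: float, rng: random.Random) -> str:
    """Malicious publisher: forward one real answer plus known fabricated ones."""
    totals = dict.fromkeys(BUCKETS, 0.0)
    for _ in range(repeats):  # repeating the same query cancels out the noise
        result = aggregator_result([target_answer] + fabricated, scale, rng)
        for bucket, value in result.items():
            totals[bucket] += value
    known = Counter(fabricated)  # the publisher knows what it fabricated
    residual = {b: totals[b] / repeats - known[b] for b in BUCKETS}
    return max(residual, key=residual.get)  # the bucket with about one unexplained answer


guess = isolate("CEO", ["Driver", "Mechanic", "Driver"], repeats=2000, scale=2.0,
                rng=random.Random(1))
print(guess)
```

After subtracting its own fabricated answers, the publisher is left with roughly one unexplained answer in a single bucket, which identifies the targeted user's attribute.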
Auditing
  [Diagram labels: Enc(CEO), Enc(Student), Enc(nonce), Enc(Driver), Enc(Mechanic), Enc(Driver); Enc(example.com, nonce); “nonce?”; example.com]
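The audit check suggested by the diagram can be sketched as follows. This assumes that a client occasionally sends a nonce in place of an answer and separately reports the pair (publisher, nonce) to the aggregator over some channel the publisher cannot drop; the slide does not spell out that channel. Encryption is omitted and all names are hypothetical.

```python
import secrets


def publisher_forward(answers: list[str], drop: set[int]) -> list[str]:
    """Publisher forwards answers; a malicious one drops the positions in `drop`."""
    return [a for i, a in enumerate(answers) if i not in drop]


def audit(forwarded: list[str], reports: list[tuple[str, str]], publisher: str) -> list[str]:
    """Aggregator: nonces reported for `publisher` that never arrived through it."""
    seen = set(forwarded)
    return [nonce for name, nonce in reports if name == publisher and nonce not in seen]


nonce = secrets.token_hex(8)
answers = ["CEO", nonce, "Student"]   # the second client audits instead of answering
reports = [("example.com", nonce)]    # that client's separate report to the aggregator

honest = audit(publisher_forward(answers, drop=set()), reports, "example.com")
cheating = audit(publisher_forward(answers, drop={1}), reports, "example.com")
print(honest == [], cheating == [nonce])  # prints: True True
```

Since the publisher cannot tell an encrypted nonce from an encrypted answer, any response it drops risks being an audit that the aggregator will notice is missing.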
Outline
  Motivation & Goals
  Components & Assumptions
  Non-tracking Analytics
    – Publisher as Proxy
    – Noise
    – Yes-No Queries
    – Auditing
  Implementation & Evaluation
  Conclusion
Implementation
  2000 lines of code in total
    – Client: Firefox extension
    – Publisher software: Piwik plugin
    – Aggregator software: simple server
  Deployed and tested with over 200 users
  RSA public key cryptosystem
Evaluation – Decryption Overhead
  Aggregator: 2.4 GHz CPU, 2048-bit key
  Publisher: 50K users, 2 sets of queries/week
  1. Information currently provided
    – Demographics, other sites
    – 3.6 CPU hours/week
  2. Information available through our system
    – # pages browsed, search engines, visit frequency to other sites
    – 3 CPU hours/week
Evaluation – Client Overhead
  Bandwidth overhead
    – <100 KB/week to download 11 queries
    – 8 KB/week for all query responses
  CPU overhead for encryption
    – Google Chrome: 380 enc/sec
    – Firefox: 20 enc/sec
Summary
  Extended analytics without tracking
    – Differential privacy guarantees for users
    – Aggregate information for publishers & aggregators
  No new organizational component
  Practical & feasible to deploy