Local, Private, Efficient Protocols for Succinct Histograms
Raef Bassily (Penn State)
Based on joint work with Adam Smith (Penn State), to appear in STOC 2015.
ITA 2015
A conundrum
How many users like Google.com?
How would the server compute aggregate statistics about users without storing user-specific information?
[Figure: users who visit sites such as Finance.com, Fashion.com, and WeirdStuff.com report to a Google server.]
Succinct histograms
A set of items (e.g., websites) = [d] = {1, …, d}
Set of users = [n]
Frequency of an item a is f(a) = (♯ users holding a)/n
Goal is to produce a succinct histogram: a list of frequent items ("heavy hitters") and estimates of their frequencies, while providing rigorous privacy guarantees to the users.
[Figure: n users, holding items such as Finance.com, Fashion.com, and WeirdStuff.com, report to an untrusted server; a table lists items 1, …, d with their frequencies f(1), …, f(d).]
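As a point of reference, the non-private version of this definition can be sketched in a few lines. This is only a baseline with no privacy guarantee; the function names and the `threshold` parameter are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def frequencies(items):
    """Return {item: fraction of users holding it}, i.e. f(a) for every held item a."""
    n = len(items)
    return {a: count / n for a, count in Counter(items).items()}

def succinct_histogram(items, threshold):
    """Non-private baseline: (item, frequency) pairs with frequency >= threshold,
    most frequent first. The threshold is a hypothetical parameter."""
    heavy = [(a, f) for a, f in frequencies(items).items() if f >= threshold]
    return sorted(heavy, key=lambda pair: -pair[1])

# One item per user.
users = ["Finance.com", "Fashion.com", "Finance.com", "WeirdStuff.com", "Finance.com"]
hist = succinct_histogram(users, threshold=0.3)
```

Here only Finance.com, held by 3 of the 5 users, clears the threshold; the other items are left off the list.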
Local model of Differential Privacy
Algorithm Q is ε-local differentially private (LDP) if for any pair v, v' ∈ [d] and for all events S,
Pr[Q(v) ∈ S] ≤ e^ε · Pr[Q(v') ∈ S].
v_i is the item of user i; z_i is the differentially private report of user i.
[Figure: each user i applies a local randomizer Q_i to v_i and sends z_i to the server, which outputs the succinct histogram.]
LDP protocols for frequency estimation are used in the Chrome web browser (RAPPOR) [Erlingsson-Korolova-Pihur'14] and as a basis for other estimation tasks [Dwork-Nissim'04].
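To make the definition concrete, here is a minimal sketch of a textbook local randomizer, d-ary randomized response, together with an exact check of the LDP inequality. This is a standard mechanism chosen for illustration, not the protocol from the talk; items are indexed 0..d-1 here for convenience.

```python
import math
import random

def rr_distribution(v, d, eps):
    """Exact output distribution of d-ary randomized response on input v in {0..d-1}:
    report v with probability e^eps / (e^eps + d - 1), any other item otherwise."""
    denom = math.exp(eps) + d - 1
    return [math.exp(eps) / denom if z == v else 1.0 / denom for z in range(d)]

def rr_sample(v, d, eps, rng=random):
    """Draw one report z = Q(v)."""
    return rng.choices(range(d), weights=rr_distribution(v, d, eps))[0]

def max_privacy_ratio(d, eps):
    """Largest ratio Pr[Q(v) = z] / Pr[Q(v') = z] over all v, v', z.
    Bounding single outputs z bounds every event S, since S is a union of outputs."""
    worst = 0.0
    for v in range(d):
        pv = rr_distribution(v, d, eps)
        for w in range(d):
            pw = rr_distribution(w, d, eps)
            worst = max(worst, max(a / b for a, b in zip(pv, pw)))
    return worst

ratio = max_privacy_ratio(d=4, eps=1.0)  # equals e^eps up to rounding
```

The worst-case ratio meets the bound e^ε exactly, so this randomizer is ε-LDP and no better.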
Performance measures
Error is measured by the worst-case estimation error: the maximum over items v ∈ [d] of |f̂(v) − f(v)|, where f̂(v) is the estimated frequency of v.
The succinct histogram lists estimates f̂(v) for some items v; implicitly, f̂(v) = 0 for every item not on the list.
A protocol is efficient if it runs in time poly(log(d), n).
Communication complexity is measured by the number of bits transmitted per user.
[Figure: each user i sends its report z_i to the server, which outputs the succinct histogram.]
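The error measure can be spelled out as a short function. This is a sketch under the assumption that items missing from the list are estimated as 0; the names and the toy numbers are illustrative.

```python
def worst_case_error(true_freq, histogram, universe):
    """Maximum over all items of |estimate - true frequency|.
    Items absent from the histogram are treated as having estimate 0 (assumption)."""
    estimates = dict(histogram)
    return max(abs(estimates.get(v, 0.0) - true_freq.get(v, 0.0)) for v in universe)

true_freq = {1: 0.5, 2: 0.3, 3: 0.2}
histogram = [(1, 0.45), (2, 0.35)]  # item 3 was not reported
err = worst_case_error(true_freq, histogram, universe=range(1, 5))
```

In this toy case the largest error comes from the unreported item 3, whose true frequency 0.2 is estimated as 0.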
Contributions
1. Efficient ε-LDP protocol with optimal error: runs in time poly(log(d), n) and estimates all frequencies up to error O(√(log(d)/(ε² n))).
2. Matching lower bound on the error.
3. Generic transformation reducing the communication complexity to 1 bit/user.
Previous protocols either ran in time Ω(d) [Mishra-Sandler'06, Hsu-Khanna-Roth'12, EKP'14] or had larger error [HKR'12].
Best previous lower bound was Ω(1/√n).
Design paradigm
Reduction from a simpler problem with a unique heavy hitter (UHH problem).
UHH: at least a given fraction of users have the same item, while the rest have a special symbol (i.e., "no item").
Efficient protocol with optimal error for UHH ⇒ efficient protocol with optimal error for the general problem.
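Construction for the UHH problem
Each user holding v* passes it through an encoder (error-correcting code) and then a noising operator to produce the report z_i.
The server aggregates z_1, …, z_n, rounds the result, and runs the decoder to recover v*.
Key idea: the frequency of v* plays the role of the signal-to-noise ratio; decoding succeeds when it is large enough.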
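The encode, noise, round, decode pipeline can be illustrated with a toy simulation. This is only a sketch under strong assumptions: a simple repetition code stands in for the error-correcting code, each user reports one randomly chosen coordinate via one-bit randomized response, idle users send a uniform bit, and all parameters (`REPS`, `bits`, `n`, `eps`) are hypothetical. The construction in the talk may differ in every one of these choices.

```python
import math
import random

REPS = 5  # hypothetical repetition factor standing in for a real error-correcting code

def encode(v, bits):
    """Map item v to a +/-1 codeword: each bit of v repeated REPS times."""
    word = []
    for k in range(bits):
        sign = 1 if (v >> k) & 1 else -1
        word.extend([sign] * REPS)
    return word

def local_report(v, bits, eps, rng):
    """One user's report: a random coordinate j and a noisy +/-1 value.
    v is None for an idle ("no item") user, who sends a uniform bit."""
    j = rng.randrange(bits * REPS)
    if v is None:
        return j, rng.choice((-1, 1))
    x = encode(v, bits)[j]
    keep = math.exp(eps) / (math.exp(eps) + 1)  # one-bit randomized response
    z = x if rng.random() < keep else -x
    return j, z

def decode(reports, bits):
    """Sum reports per coordinate, round to signs, then majority-decode each bit."""
    sums = [0] * (bits * REPS)
    for j, z in reports:
        sums[j] += z
    v = 0
    for k in range(bits):
        votes = sum(1 if s > 0 else -1 for s in sums[k * REPS:(k + 1) * REPS])
        if votes > 0:
            v |= 1 << k
    return v

rng = random.Random(0)
bits, eps, n, v_star = 4, 1.0, 20000, 11
items = [v_star if i % 2 == 0 else None for i in range(n)]  # half the users hold v*
reports = [local_report(v, bits, eps, rng) for v in items]
recovered = decode(reports, bits)
```

With half of the users holding v*, the per-coordinate sums sit many standard deviations away from zero, so rounding recovers the codeword. Shrinking the fraction of holders, n, or ε weakens the signal until decoding fails.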
Construction for the general setting
Key insight: Decompose the general scenario into multiple instances of UHH via hashing. Run parallel copies of the UHH protocol on these instances.
Hashing paradigm: Given pairwise-independent HASH: [d] → [K] for some fixed K = poly(n):
FOR j = 1 to K
  FOR each user i with item v_i
    IF j = HASH(v_i) THEN user i simulates a HH user in copy j of the UHH protocol
    ELSE user i simulates an idle user in copy j of the UHH protocol
Guarantees that w.h.p., every heavy hitter (an item whose frequency is above the detection threshold) is allocated a "collision-free" copy of the UHH protocol.
Protocol worst-case error = O(√(log(d)/(ε² n)))
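The role assignment in the hashing paradigm can be sketched as follows. The hash family ((a·v + b) mod P) mod K is a standard, approximately pairwise-independent choice assumed here for illustration; the prime `P`, the function names, and the use of `None` to mark an idle user are all hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1, assumed to exceed the universe size d

def make_hash(K, rng):
    """Sample HASH: [d] -> [K] from the family ((a*v + b) mod P) mod K."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda v: ((a * v + b) % P) % K

def assign_roles(items, K, rng):
    """copies[j][i] is user i's input to copy j of the UHH protocol:
    the user's own item if HASH(item) == j, otherwise None (idle user)."""
    h = make_hash(K, rng)
    copies = [[None] * len(items) for _ in range(K)]
    for i, v in enumerate(items):
        copies[h(v)][i] = v
    return copies

items = [5, 5, 9, 12, 9]  # one item per user
copies = assign_roles(items, K=8, rng=random.Random(0))
```

Each user is active in exactly one copy and idle in the other K − 1, and users holding the same item always land in the same copy. A heavy hitter gets a collision-free copy whenever no other held item hashes to its bucket.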
Transforming to a protocol with 1-bit reports
Key idea: What matters is the distribution of the output of each local randomizer.
Generate a public random string s_i, one for each user. The public string does not depend on private data, so it can be generated by the untrusted server.
User i sends a biased bit B_i = Gen(Q_i, v_i, s_i), where Q_i is the local randomizer of user i.
Conditioned on B_i = 1, the public string s_i has the same distribution as the output of the local randomizer Q_i.
IF B_i = 1 THEN report of user i = s_i
ELSE ignore user i
This transformation works for any local protocol, not only heavy hitters.
For our HH protocol, this transformation gives essentially the same error and computational efficiency (Gen can be computed in O(log(d))).
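One way to realize the distributional claim is a rejection-sampling style rule, sketched below for a one-bit randomized-response randomizer. Everything specific here is an assumption made for illustration: the public string is drawn as Q applied to a fixed reference input, the acceptance probability is half the likelihood ratio, and ε ≤ ln 2 so that this probability never exceeds 1. The sketch checks only the conditional distribution, not the privacy parameter of the resulting bit, and the Gen used in the talk may be defined differently.

```python
import math

def rr_prob(v, s, eps):
    """Pr[Q(v) = s] for binary randomized response Q, with v and s in {0, 1}."""
    e = math.exp(eps)
    return e / (e + 1) if s == v else 1 / (e + 1)

def gen_accept_prob(v, s, eps, ref=0):
    """Probability that the user sends B = 1 given the public string s.
    Half the likelihood ratio between Q(v) and the reference Q(ref);
    assumes eps <= ln 2 so the value is at most 1."""
    return 0.5 * rr_prob(v, s, eps) / rr_prob(ref, s, eps)

def conditional_report_dist(v, eps, ref=0):
    """Exact distribution of s conditioned on B = 1, when s is drawn from Q(ref)."""
    joint = [rr_prob(ref, s, eps) * gen_accept_prob(v, s, eps, ref) for s in (0, 1)]
    total = sum(joint)  # Pr[B = 1], which equals 1/2 under this rule
    return [p / total for p in joint]

eps = 0.5
dist_for_1 = conditional_report_dist(1, eps)  # matches the distribution of Q(1)
```

Under this rule each user is kept with probability 1/2, so roughly half of the reports survive, which is consistent with the claim that the error changes only by a constant factor.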
Summary
1. Efficient ε-LDP protocol for succinct histograms with optimal error: runs in time poly(log(d), n) and estimates all frequencies up to error O(√(log(d)/(ε² n))).
2. Matching lower bound on the error.
3. Generic transformation, in a model with public randomness, reducing the communication complexity to 1 bit/user.