Batch Codes and Their Applications. Y. Ishai, E. Kushilevitz, R. Ostrovsky, A. Sahai. Preliminary version in STOC 2004.
Talk Outline Batch codes Amortized PIR –via hashing –via batch codes Constructing batch codes Concluding remarks
A Load-Balancing Scenario [figure: an n-bit string x to be distributed across several storage devices]
What’s wrong with a random partition? Good on average for “oblivious” queries. However: –Can’t balance adversarial queries –Can’t balance few random queries –Can’t relieve “hot spots” in multi-user setting
Example: 3 devices, 50% storage overhead. By how much can the maximal load be reduced? –Replicating bits is no good: there is a device such that 1/6 of the bits can only be found at this device. –Factor-2 load reduction is possible: store L, R, and L⊕R (the two halves of x and their bitwise XOR).
Batch Codes. An (n,N,m,k) batch code encodes an n-bit string x into m buckets y_1,…,y_m of total length N, such that any k bits x_{i_1},…,x_{i_k} can be recovered by reading at most one bit from each bucket. Notes: –Rate = n/N. –By default, insist on load at most 1 per bucket; hence m ≥ k. –Load is measured by the number of probes. Generalizations: –Allow t probes per bucket. –Larger alphabet.
Multiset Batch Codes. An (n,N,m,k) multiset batch code is defined as above, except that the requested indices i_1,…,i_k may form a multiset (repetitions allowed). Motivation: –Models multiple users (with off-line coordination). –Useful as a building block for standard batch codes. Nontrivial even for multisets of the form (i,i,…,i).
Examples. Trivial codes: –Replication: N=kn, m=k. Optimal m, bad rate. –One bit per bucket: N=m=n. Optimal rate, bad m. The (L, R, L⊕R) code: rate = 2/3, m=3, k=2 (a multiset batch code). Goal: simultaneously obtain –High rate (close to 1) –Small m (close to k).
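To make the (L, R, L⊕R) example concrete, here is a minimal Python sketch (my illustration, not part of the talk) of this m=3, k=2 multiset batch code: each bucket holds n/2 bits, and any two requested bits, even the same bit twice, are served with at most one probe per bucket.

# Sketch of the (L, R, L xor R) batch code: n data bits, 3 buckets of n/2 bits each.
# Any two (possibly equal) indices can be answered with at most one probe per bucket.

def encode(x):
    """Split x into halves L, R and add their bitwise XOR as a third bucket."""
    assert len(x) % 2 == 0
    half = len(x) // 2
    L, R = x[:half], x[half:]
    P = [a ^ b for a, b in zip(L, R)]  # parity bucket L xor R
    return [L, R, P]

def decode_two(buckets, i, j):
    """Recover x_i and x_j, probing each bucket at most once."""
    L, R, P = buckets
    half = len(L)
    side = lambda t: (0, t) if t < half else (1, t - half)  # (bucket, offset)
    (bi, oi), (bj, oj) = side(i), side(j)
    if bi != bj:                       # different halves: one direct probe each
        return (buckets[bi][oi], buckets[bj][oj])
    # Same half (or the same index twice): read x_i directly, and reconstruct
    # x_j from the *other* half and the parity bucket.
    xi = buckets[bi][oi]
    xj = buckets[1 - bj][oj] ^ P[oj]   # (L xor R) xor other-half = requested half
    return (xi, xj)

x = [1, 0, 1, 1, 0, 0]
y = encode(x)
assert decode_two(y, 2, 2) == (x[2], x[2])   # multiset query: same bit twice
assert decode_two(y, 0, 4) == (x[0], x[4])

The same-half case is where the parity bucket earns its keep: the second bit is rebuilt from the opposite half and L⊕R, so no bucket is probed twice.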
Private Information Retrieval (PIR) Goal: allow user to query database while hiding the identity of the data-items she is after. Motivation: patent databases, web searches,... Paradox(?): imagine buying in a store without the seller knowing what you buy. Note: Encrypting requests is useful against third parties; not against server holding the data.
Modeling. Database: n-bit string x. User: wishes to –retrieve x_i and –keep i private.
[figure: the User queries the Server and learns x_i; the Server should learn nothing about i]
Some “Solutions”. 1. User downloads the entire database. Drawback: n communication bits (vs. log n + 1 without privacy). Main research goal: minimize communication complexity. 2. User masks i with additional random indices. Drawback: gives a lot of information about i. 3. Enable anonymous access to the database. Note: addresses a different security concern, hiding the user’s identity, not the fact that x_i is retrieved. Fact: PIR as described so far requires Ω(n) communication bits.
Two Approaches Computational PIR [KO97, CMS99,...] –Computational privacy –Based on cryptographic assumptions Information-Theoretic PIR [CGKS95,Amb97,...] –Replicate database among s servers –Unconditional privacy against t servers –Default: t=1
Communication Upper Bounds. Computational PIR: –O(n^ε), polylog(n), and subsequent improvements [KO97, CMS99, …]. Information-theoretic PIR: –2 servers, O(n^{1/3}) [CGKS95]. –s servers, O(n^{1/c(s)}) where c(s) = Ω(s·log s / log log s) [CGKS95, Amb97, BIKR02]. –O(log n / log log n) servers, polylog(n).
Time Complexity of PIR Given low-communication protocols, efficiency bottleneck shifts to servers’ time complexity. –Protocols require (at least) linear time per query. –This is an inherent limitation! Possible workarounds: –Preprocessing –Amortize cost over multiple queries
Previous Results. [BIM00] PIR with preprocessing: –s-server protocols with O(n^ε) communication and O(n^{1/s+ε}) work per query, requiring poly(n) storage. –Disadvantages: only works for multi-server PIR; storage is typically huge. Amortized PIR: –Slight savings are possible using fast matrix multiplication. –Requires a large batch of queries and high communication. –Applies also to queries originating from different users. This work: –Assume a batch of k queries originates from a single user. –Allow preprocessing (not always needed). –Nearly optimal amortization.
Model. [figure: the User wishes to retrieve x_{i_1}, x_{i_2}, …, x_{i_k} from the Server(s) while hiding i_1,…,i_k]
Amortized PIR via Hashing. Let P be a PIR protocol. Hashing-based amortized PIR: –User picks a random hash function h from a family H, defining a random partition of x into k buckets of size n/k, and sends h to the Server(s). Except with small failure probability, at most t = O(log k) queries fall in each bucket. –P is applied t times for each bucket. Complexity: –Time: k·t·T(n/k) ≈ t·T(n). –Communication: k·t·C(n/k). –Asymptotically optimal up to “polylog factors”.
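A toy Python sketch (my own illustration; the hash family and parameters are placeholders) of the bucketing step: map the n positions into k buckets and measure how many of the user's k queries land in the most loaded bucket; the underlying protocol P must then be run that many times for every bucket.

import random
from collections import Counter

def hash_partition(n, k, seed):
    """A stand-in for the hash h: map each index in [n] to one of k buckets."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n)]

def max_bucket_load(h, queries):
    """How many of the user's queries fall into the most loaded bucket."""
    return max(Counter(h[i] for i in queries).values())

n, k = 1 << 16, 64
h = hash_partition(n, k, seed=2004)             # the User would send h (its seed) to the servers
queries = random.Random(7).sample(range(n), k)  # k distinct indices to retrieve
t = max_bucket_load(h, queries)                 # P must be run t times per bucket
print(f"max load t = {t} (O(log k) except with small failure probability)")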
So what’s wrong? Not much… Still: –Not perfect: introduces either error or privacy loss. –Useless for small k: the t = O(log k) overhead dominates. –Cannot hash “once and for all”: every h has a bad k-tuple of queries. Sounds familiar?
Amortized PIR via Batch Codes. Idea: use batch-encoding instead of hashing. Protocol: –Preprocessing: Server(s) encode x as y = (y_1, y_2, …, y_m). –Based on i_1,…,i_k, the User computes the index of the bit it needs from each bucket. –P is applied once for each bucket. Complexity: –Time: Σ_{1≤j≤m} T(N_j) ≈ T(N). –Communication: Σ_{1≤j≤m} C(N_j) ≤ m·C(n). Trivial batch codes imply trivial protocols. The (L, R, L⊕R) code: 2 queries, 1.5× time, 3× communication.
Constructing Batch Codes
Overview. Recall the notion. Main qualitative questions: 1. Can we get arbitrarily high constant rate (n/N = 1−ε) while keeping m feasible in terms of k (say m = poly(k))? 2. Can we insist on nearly optimal m (say m = O(k)) and still get close to a constant rate? Several incomparable constructions answer both questions affirmatively.
Batch Codes from Unbalanced Expanders. By Hall’s theorem, a bipartite graph with n left vertices (data bits) and m right vertices (buckets) represents an (n, N=|E|, m, k) batch code iff every set S of at most k vertices on the left has at least |S| neighbors on the right. Fully captures replication-based batch codes.
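As an illustration (mine, not the talk's), decoding such a replication-based batch code is a bipartite matching problem: each queried bit must be read from a distinct bucket among those holding a copy of it, and Hall's condition for sets of size at most k is exactly what guarantees the matching exists. A brute-force augmenting-path sketch:

def match_queries_to_buckets(adj, queries):
    """adj[i] = buckets holding a copy of bit i. Assign each queried bit its own
    bucket via augmenting paths (Kuhn's bipartite matching algorithm)."""
    owner = {}                       # bucket -> queried index currently assigned to it

    def augment(i, seen):
        for b in adj[i]:
            if b in seen:
                continue
            seen.add(b)
            if b not in owner or augment(owner[b], seen):
                owner[b] = i
                return True
        return False

    for i in queries:
        if not augment(i, set()):
            return None              # Hall's condition is violated for some subset
    return {i: b for b, i in owner.items()}   # queried index -> bucket to probe

# 4 data bits, 3 buckets, each bit replicated on 2 buckets (degree d = 2).
adj = {0: [0, 1], 1: [1, 2], 2: [0, 2], 3: [1, 2]}
print(match_queries_to_buckets(adj, [0, 1, 2]))   # each queried bit gets its own bucket
print(match_queries_to_buckets(adj, [1, 2, 3]))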
Parameters. Non-explicit: N = dn, m = O(k·(nk)^{1/(d−1)}). –d=3: rate = 1/3, m = O(k^{3/2}·n^{1/2}). –d = log n: rate = 1/log n, m = O(k). Settles Q2. Explicit (using [TUZ01], [CRVW02]): –Nontrivial, but quite far from optimal. Limitations: –Rate < 1/2 (unless m = Ω(n)). –For constant rate, m must also depend on n. –Cannot handle multisets.
The Subcube Code. Generalize the (L, R, L⊕R) example in two ways: –Trade better rate for larger m: (Y_1, Y_2, …, Y_s, Y_1⊕…⊕Y_s), still k=2 (sketch below). –Handle larger k via composition.
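Here is a short Python sketch (my illustration, following the slide's description) of the first generalization: split x into s pieces Y_1,…,Y_s plus one parity bucket Y_1⊕…⊕Y_s, giving m = s+1 buckets, rate s/(s+1), and k = 2; the composition step for larger k is not shown.

# Sketch of the basic subcube code: s pieces plus one parity bucket (m = s+1, k = 2).
def encode(x, s):
    assert len(x) % s == 0
    w = len(x) // s
    pieces = [x[j * w:(j + 1) * w] for j in range(s)]
    parity = [0] * w
    for p in pieces:
        parity = [a ^ b for a, b in zip(parity, p)]   # Y_1 xor ... xor Y_s
    return pieces + [parity]

def decode_two(buckets, i, j, s):
    """Recover x_i, x_j with at most one probe per bucket."""
    w = len(buckets[0])
    (bi, oi), (bj, oj) = divmod(i, w), divmod(j, w)
    if bi != bj:
        return buckets[bi][oi], buckets[bj][oj]
    xi = buckets[bi][oi]
    # Same piece: rebuild x_j by XOR-ing the parity bucket with all *other* pieces.
    xj = buckets[s][oj]
    for b in range(s):
        if b != bj:
            xj ^= buckets[b][oj]
    return xi, xj

x = [1, 0, 1, 1, 0, 1, 0, 0, 1]                 # n = 9, s = 3, rate 3/4
y = encode(x, 3)
assert decode_two(y, 4, 5, 3) == (x[4], x[5])   # both indices inside piece Y_2
assert decode_two(y, 0, 8, 3) == (x[0], x[8])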
Geometric Interpretation. [figure: the data arranged in a 2×2 grid of blocks A, B, C, D; buckets store A, B, C, D together with the parities A⊕B, C⊕D, A⊕C, B⊕D, and A⊕B⊕C⊕D]
Parameters. N ≤ k^{log(1+1/s)}·n, m ≤ k^{log(s+1)}. –s = O(log k) gives an arbitrary constant rate with m = k^{O(log log k)}. “Almost” resolves Q1. Advantages: –Arbitrary constant rate. –Handles multisets. –Very easy decoding. Asymptotically dominated by a subsequent construction.
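For intuition, a back-of-the-envelope derivation of these bounds (my reconstruction, under the assumption that each level of composition uses the (s+1)-bucket gadget above and doubles the number of supported queries): after ℓ levels of composition, k = 2^ℓ, m = (s+1)^ℓ, and N = n·((s+1)/s)^ℓ. Substituting ℓ = log_2 k gives m = k^{log_2(s+1)} and N = n·k^{log_2(1+1/s)}. Taking s = Θ(log k), the rate is (s/(s+1))^{log_2 k} ≥ 1 − (log_2 k)/(s+1), an arbitrary constant, while m = k^{O(log log k)}.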
The Gadget Lemma. From now on, we can choose a “convenient” n and get the same rate and m(k) for arbitrarily larger n (by applying a primitive multiset batch code blockwise).
Batch Codes vs. Smooth Codes. Def.: A code C: Σ^n → Σ^m is q-smooth if there exists a (randomized) decoder D such that –D(i) decodes x_i by probing q symbols of C(x), and –each symbol of C(x) is probed with probability at most q/m. Smooth codes are closely related to locally decodable codes [KT00]. Two-way relation with batch codes: –A q-smooth code gives a primitive multiset batch code with k = m/q² (ideally we would like k = m/q). –A primitive multiset batch code gives an (expected) q-smooth code for q = m/k. Batch codes and smooth codes are nevertheless very different objects: –The relation breaks when relaxing “multiset” or “primitive”. –The gap between m/q and m/q² is very significant in the high-rate case: the best known smooth codes with rate > 1/2 require q > n^{1/2}, and these codes are provably useless as batch codes.
Batch Codes from RM Codes. The (s,d) Reed-Muller code over F: –Message viewed as an s-variate polynomial p over F of total degree (at most) d. –Encoded by the sequence of its evaluations at all points of F^s. –The case |F| > d is useful due to a “smooth decoding” feature: p(z) can be extrapolated from the values of p at any d+1 points on a line passing through z.
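A small Python sketch (my illustration; the prime field and the example polynomial are arbitrary choices, not from the talk) of the smooth-decoding feature for s = 2: to learn p(z), pick a random line through z, probe the encoding at d+1 other points of the line, and Lagrange-interpolate the degree-≤d univariate restriction back at t = 0.

import random

P = 101  # a small prime field F_P (assumption for the example); need |F| > d

def eval_poly(coeffs, point):
    """Evaluate a bivariate polynomial given as {(i, j): c} at point = (u, v) over F_P."""
    u, v = point
    return sum(c * pow(u, i, P) * pow(v, j, P) for (i, j), c in coeffs.items()) % P

def smooth_decode(oracle, z, d):
    """Recover p(z) from d+1 evaluations on a random line through z."""
    direction = (random.randrange(1, P), random.randrange(1, P))
    ts = random.sample(range(1, P), d + 1)                     # nonzero parameters on the line
    pts = [((z[0] + t * direction[0]) % P, (z[1] + t * direction[1]) % P) for t in ts]
    vals = [oracle(pt) for pt in pts]                          # probes into the encoding
    acc = 0
    for a, (ta, va) in enumerate(zip(ts, vals)):               # Lagrange coefficients at t = 0
        num, den = 1, 1
        for b, tb in enumerate(ts):
            if b != a:
                num = num * (0 - tb) % P
                den = den * (ta - tb) % P
        acc = (acc + va * num * pow(den, P - 2, P)) % P
    return acc

# p(u, v) = 3 + 5u + 7v + 2uv: total degree d = 2.
p = {(0, 0): 3, (1, 0): 5, (0, 1): 7, (1, 1): 2}
z = (12, 34)
assert smooth_decode(lambda pt: eval_poly(p, pt), z, d=2) == eval_poly(p, z)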
s = 2, d ≈ (2n)^{1/2}. [figure: the data bits x_1, x_2, …, x_n placed as points in the plane F²; each query is decoded along a line through its point] Two approaches for handling conflicts (lines probing the same point): 1. Replicate each point t times. 2. Use redundancy to “delete” intersections. Slightly increases the field size, but still allows constant rate.
Parameters. Rate = 1/s! − ε, m = k^{1+1/(s−1)+o(1)}: –Multiset codes with constant rate (< 1/2). Rate = 1/k^ε, m = O(k): –Resolves Q2 for multiset codes as well. Main remaining challenge: resolve Q1.
The Subset Code. Choose s, d such that n ≤ C(s,d). Each data bit i ∈ [n] is associated with a d-subset T ⊆ [s]. Each bucket j ∈ [m] is associated with a subset S ⊆ [s]. Primitive code: y_S = ⊕_{T⊆S} x_T.
Batch Decoding the Subset Code. Lemma: For each T' ⊆ T, x_T can be decoded from all the y_S such that S ∩ T = T'. –Let L_{T,T'} denote the set of such S. –Note: {L_{T,T'} : T' ⊆ T} defines a partition of the buckets.
Batch Decoding the Subset Code (contd.). Goal: Given T_1,…,T_k, find subsets T'_1,…,T'_k such that the L_{T_i,T'_i} are pairwise disjoint. –Easy if all the T_i are distinct or if all the T_i are the same. Attempt 1: T'_i is a random subset of T_i. –Problem: if T_i, T_j are disjoint, L_{T_i,T'_i} and L_{T_j,T'_j} intersect w.h.p. Attempt 2: greedily assign to T_i the largest T'_i such that L_{T_i,T'_i} does not intersect any previous L_{T_j,T'_j}. –Problem: adjacent sets may “block” each other. Solution: pick a random T'_i with a bias towards large sets.
Parameters. Allows arbitrary constant rate with m = poly(k). Settles Q1. Both the subcube code and the subset code can be viewed as sub-codes of the binary RM code. –The full binary RM code cannot be batch decoded when the rate > 1/2.
Concluding Remarks: Batch Codes. A common relaxation of very different combinatorial objects: –Expanders. –Locally-decodable codes. The problem makes sense even for small values of m, k. –For multiset codes with m=3, k=2, rate 2/3 is optimal. –Open for m ≥ k+2. Useful building block for “distributed data structures”.
Concluding Remarks: PIR. Single-user amortization is useful in practice only if PIR is significantly more efficient than download. –Certainly true for multi-server PIR. –Most likely true also for single-server PIR. Killer app for lattice-based cryptosystems? [table: open problems for single vs. multiple users and adaptive vs. non-adaptive queries]