Data in Motion
  Michael Hoffman (Leicester)
  S Muthukrishnan (Google)
  Rajeev Raman (Leicester)
Models for moving data
  Reset model
  Delta model
  Geometric and database motivations
  Given vector A[1..n]
    A[i] is a point in R^d, d ≥ 1
    A is updated in a streaming manner
  Probabilistic approximate computation of some function on A:
    ε: error parameter
    δ: confidence parameter
  Space and time: poly(log n, 1/ε, 1/δ)
Reset model
  Given vector A[1..n]
    A[i] is a point in R^d, d ≥ 1
  Updates: reset(i, x)
    A[i] := x
  Motivation:
    Location data streams (tracking passive/dumb objects)
    Query self-tuning in databases
Reset model
  “Dynamic” geometric information
  Different from standard “dynamic” streams:
    insert(p), p in R^d
    delete(p)
  In the reset model, points have identity
    delete(p) + insert(p') gives more information than reset
Delta model
  Given vector A[1..n]
    A[i] is a point in R^d, d ≥ 2
  Process updates (i, x_1, x_2, …, x_d)
    A[i] := A[i] + (x_1, x_2, …, x_d)
  Motivation:
    Data is often multi-dimensional
    Direct generalization of turnstile model
Delta model
  Problems involving several dimensions:
    “extent” of points (sum of distances of points from a given center)
    k-median, diameter, minimum enclosing ball etc.?
    regression: correlation of packet size with delay
Problems
  Reset model
    L_p norm*
    L_p sampling*
    1-median
  Delta model
    “Extent” of points
    1-median
  * monotone, d = 1
L_p norm: reset model
  Assume wlog p = 1
    required to estimate ||A||_1 = Σ_i |A[i]|
  Assume monotone updates
    A[i] initially zero
    reset(i, x) implies A[i] ≤ x
    A[i] := max(A[i], x) [GC]
  Estimation impossible if non-monotone
    reduction to estimating |X| - |X ∩ Y|
L_1 norm (reset model)
  Reduction to counting distinct items
  [Figure: A, Buckets]
    n_i = number of distinct items in ith bucket
    w_i = width of ith bucket
  Σ_i (w_i * n_i) ≤ ||A||_1 ≤ (1+ε) Σ_i (w_i * n_i)
L_1 norm (reset model)
  Counting the number of distinct items in a stream ≡ L_0 norm
    poly-log space and time [FM, CIM]
  Need to keep only O((log n)/ε) buckets
  Can we detect if the input is non-monotone?
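The bucket reduction can be sketched as follows. This is one plausible reading of the slides, not the authors' implementation: thresholds grow geometrically by a factor (1+ε), n_j counts the distinct indices whose value has reached threshold j, and w_j is the gap between consecutive thresholds. The class name `BucketedL1`, the `max_value` bound, the restriction to entries that are 0 or at least 1, and the exact Python sets (standing in for small-space distinct-count sketches) are all assumptions made for illustration.

```python
class BucketedL1:
    """Monotone reset model: estimate ||A||_1 from per-bucket distinct counts."""

    def __init__(self, eps, max_value):
        # Thresholds 1, (1+eps), (1+eps)^2, ... up to max_value
        # (max_value is an assumed upper bound on any entry of A).
        self.thresholds = []
        t = 1.0
        while t <= max_value:
            self.thresholds.append(t)
            t *= 1.0 + eps
        # Assumption: exact sets stand in for distinct-count (L_0) sketches.
        self.seen = [set() for _ in self.thresholds]

    def reset(self, i, x):
        # Monotone reset: x is never below the current A[i], so it is enough
        # to record i in every bucket whose threshold is at most x.
        for t, s in zip(self.thresholds, self.seen):
            if t > x:
                break
            s.add(i)

    def estimate(self):
        # Sum of w_j * n_j, where w_j is the gap to the previous threshold.
        total, prev = 0.0, 0.0
        for t, s in zip(self.thresholds, self.seen):
            total += (t - prev) * len(s)
            prev = t
        return total
```

Each index contributes the largest threshold not exceeding its value, which is within a factor (1+ε) of the value itself; that is where the two-sided bound on ||A||_1 comes from. The number of thresholds is about log(max_value)/ε.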
L_p sampling
  Query: sample()
    Choose i from {1, …, n} with probability proportional to |A[i]|^p
    Successive calls may return the same index if no updates happen
  Not known how to do this in the turnstile model
  Can be used to detect if ||A||_1 ≤ (1 - ε) ||A*||_1
L_p sampling (reset model)
  Reduction to sampling distinct items
  [Figure: A, Buckets]
    n_i = number of distinct items in ith bucket
    w_i = width of ith bucket
  Sample a random (distinct) index from each bucket
  Return sample from bucket i with probability proportional to w_i * n_i
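A minimal sketch of the sampling step for p = 1, under the same reading of the buckets as geometric thresholds. The helper names `build_buckets` and `sample_index` are hypothetical, and the buckets are built from an explicit vector with exact sets purely for illustration; a streaming version would instead keep one distinct-item sampler per bucket.

```python
import random

def build_buckets(A, eps):
    """Illustration only: bucket an explicit dict index -> value (values >= 1)."""
    thresholds, t = [], 1.0
    top = max(A.values())
    while t <= top:
        thresholds.append(t)
        t *= 1.0 + eps
    # Bucket j holds every index whose value has reached threshold j.
    buckets = [{i for i, x in A.items() if x >= t} for t in thresholds]
    return thresholds, buckets

def sample_index(thresholds, buckets, rng=random):
    """Pick bucket j with probability proportional to w_j * n_j, then a uniform index in it."""
    weights, prev = [], 0.0
    for t, b in zip(thresholds, buckets):
        weights.append((t - prev) * len(b))
        prev = t
    j = rng.choices(range(len(buckets)), weights=weights)[0]
    return rng.choice(sorted(buckets[j]))
```

Under these assumptions, index i is returned with probability equal to the largest threshold below A[i] divided by Σ_j w_j * n_j, which is proportional to A[i] up to a factor (1+ε). The vector must be nonzero, since `choices` rejects all-zero weights.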
1-median
  Assume A[i] contains coordinates of a set S of 2-D points
  Problem: find c in R^2 s.t. Σ_{p in S} d(c, p) (Euclidean distance) is approximately minimized
  Monotonicity not required; cannot report Σ_{p in S} d(c, p)
  Return (4/π + ε) ~ (ε) estimator
    boosting: see later
1-median (reset model)
  L_1 1-median: find c in R^2 such that Σ_{p in S} d(c, p) is minimized
    d(p, q) = L_1 distance
    d_1(p, q) = |p_x - q_x| + |p_y - q_y|
  L_1 1-median c = (c_x, c_y)
    c_x = median of x-coordinates
    c_y = median of y-coordinates
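Because the L_1 distance separates by coordinate, the L_1 1-median is just the coordinate-wise median. A small illustration on an explicit point list (the function names `l1_one_median` and `l1_cost` are chosen here, not taken from the slides):

```python
from statistics import median

def l1_one_median(points):
    """Coordinate-wise median of a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (median(xs), median(ys))

def l1_cost(c, points):
    """Sum of L_1 distances from c to every point."""
    return sum(abs(c[0] - x) + abs(c[1] - y) for x, y in points)

points = [(0, 0), (4, 0), (1, 5), (2, 2), (10, 1)]
c = l1_one_median(points)  # (2, 1), which is not one of the input points
```

The result need not belong to the point set, as in this example.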
1-median (reset model)
  1-D median
    sample O((1/ε) log(1/δ)) random indices; maintain positions of the sample
    median m_x of x-coordinates of the sample is a (1+ε)-approximation to the median of the x-coordinates of S
    a (1+ε)-approximate median is a (1+ε')-approximate 1-median in 1-D
  Approximate L_1 1-median: return (m_x, m_y)
    may not be in S
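The sampling scheme above might look like this in code. It is a sketch under assumptions: the class name `SampledMedian` is hypothetical, the sample size is passed in by the caller rather than derived from ε and δ, indices are drawn with replacement, and sampled indices that have not yet received a reset are ignored at query time.

```python
import random
from statistics import median

class SampledMedian:
    """Track positions of a random sample of indices under reset(i, point)."""

    def __init__(self, n, sample_size, rng=random):
        # Indices are 1..n, matching A[1..n]; drawn with replacement.
        self.sample = [rng.randint(1, n) for _ in range(sample_size)]
        self.tracked = set(self.sample)
        self.pos = {}  # current position of each sampled index

    def reset(self, i, point):
        # Updates to indices outside the sample are discarded.
        if i in self.tracked:
            self.pos[i] = point

    def query(self):
        # Coordinate-wise median of the sampled positions: (m_x, m_y).
        pts = [self.pos[i] for i in self.sample if i in self.pos]
        return (median(p[0] for p in pts), median(p[1] for p in pts))
```

Only the sampled positions are stored, so space depends on the sample size and not on n. `query` raises `statistics.StatisticsError` if no sampled index has been set yet.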
Projections of points
  L_1 1-median is a √2-approximation to the L_2 (Euclidean) 1-median
  Consider projections of S to do better
Let l be a line segment of length x, and let s be the sum of the lengths of the projections of l onto k equally spaced lines passing through the origin. Then πs/(2k) = x(1 ± Θ(1/k)).
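The statement reflects the fact that the average of |cos θ| over a half-turn is 2/π, so the k projections sum to roughly (2k/π)·x. A quick numerical check, assuming the lines sit at angles jπ/k for j = 0, …, k-1 (the function name `projection_estimate` is made up for this illustration):

```python
import math

def projection_estimate(x, theta, k):
    """Segment of length x at angle theta, projected onto k lines at angles j*pi/k."""
    s = sum(x * abs(math.cos(theta - j * math.pi / k)) for j in range(k))
    return math.pi * s / (2 * k)  # should be close to x

# Try a few orientations of a segment of length 3 with k = 64 lines.
estimates = [projection_estimate(3.0, theta, 64) for theta in (0.0, 0.3, 1.0, 2.5)]
```

Each estimate stays close to 3 regardless of the segment's orientation.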
1-median (reset model)
  Consider L_1 1-medians c_1, …, c_k
  Σ_i d(c_i, S) ≤ (4k/π + O(1/k)) d(c*, S)
  One of the c_i is a (4/π + ε)-approximation. Which one?
    λ d(p, S) + (1 - λ) d(q, S) ≥ d(λp + (1 - λ)q, S)
    return average of c_1, …, c_k
  Boosting confidence: take several independent samples, take mean
  Q: how good is the 1-median of the sample?
  Similar to “projection median” [DK]
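One way to read this slide is that c_j is the L_1 1-median computed in a coordinate frame rotated by jπ/(2k), so that the 2k axes of the k frames are equally spaced lines. The sketch below follows that reading on an explicit point list; the frame angles and the function names `averaged_median` and `cost` are assumptions, not details given in the slides.

```python
import math
from statistics import median

def averaged_median(points, k):
    """Average of the L_1 1-medians taken in k rotated coordinate frames."""
    cands = []
    for j in range(k):
        a = j * math.pi / (2 * k)
        ca, sa = math.cos(a), math.sin(a)
        # Coordinate-wise median in the frame rotated by angle a.
        u = median(x * ca + y * sa for x, y in points)
        v = median(-x * sa + y * ca for x, y in points)
        # Rotate the candidate back to the original frame.
        cands.append((u * ca - v * sa, u * sa + v * ca))
    cx = sum(c[0] for c in cands) / k
    cy = sum(c[1] for c in cands) / k
    return (cx, cy), cands

def cost(c, points):
    """Sum of Euclidean distances from c to every point."""
    return sum(math.hypot(c[0] - x, c[1] - y) for x, y in points)
```

By the convexity inequality on the slide, the cost of the averaged point is at most the average cost of the candidates, so returning the average avoids having to decide which c_j is best.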
Reset model (conclusions)
  Computed extent and approximate 1-median
  Many problems seem hard without some monotonicity assumptions
    CH, k-center, k-median, k > 1
  What assumptions?
    strict: points moving away from a known origin (min enclosing ball, [GC])
    points moving away from an unknown origin
    points moving monotonically along trajectories from a known class (e.g. lines)
Delta model
  A[1..n]; A[i] is a point in R^d, d ≥ 2
    S is the set of points
  Updates (i, x_1, x_2, …, x_d)
    A[i] := A[i] + (x_1, x_2, …, x_d)
  “Extent” query:
    Given c, estimate Σ_{p in S} d(c, p) (Euclidean distances)
Delta model
  Extent query:
    Use projections and 1-D L_1 norm sketches
    (1+ε)-approximation to extent(c)
  1-median:
    Use L_1 1-median to find a suitable search area
    Using the above, search for the 1-median
    (1+ε)-approximation
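A rough sketch of how the extent query could combine projections with 1-D L_1 norms, restricted to d = 2. It is illustrative only: the class name `ExtentSketch` is hypothetical, indices run from 0 to n-1, and each projected vector is stored explicitly where a real solution would keep a small linear L_1-norm sketch of it.

```python
import math

class ExtentSketch:
    """2-D delta model: estimate the sum of |p - c| from k 1-D projections."""

    def __init__(self, n, k):
        # Unit vectors along k equally spaced lines through the origin.
        self.dirs = [(math.cos(j * math.pi / k), math.sin(j * math.pi / k))
                     for j in range(k)]
        # proj[j][i] = <A[i], u_j>. Assumption: an explicit vector stands in
        # for a linear 1-D L_1-norm sketch of this projected vector.
        self.proj = [[0.0] * n for _ in range(k)]

    def update(self, i, dx, dy):
        # A[i] := A[i] + (dx, dy); every projection changes linearly.
        for (ux, uy), row in zip(self.dirs, self.proj):
            row[i] += dx * ux + dy * uy

    def extent(self, c):
        total = 0.0
        for (ux, uy), row in zip(self.dirs, self.proj):
            shift = c[0] * ux + c[1] * uy
            # L_1 norm of the projected vector after subtracting c's projection.
            total += sum(abs(v - shift) for v in row)
        return math.pi * total / (2 * len(self.dirs))
```

Both the updates and the shift by c are linear in the projected vectors, which is what lets a linear L_1 sketch replace the explicit rows; the factor π/(2k) comes from the projection lemma.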
Conclusions
  Introduced (1+ε) new models for “geometric” computation
  Gave solutions to some basic problems
  Many open questions:
    appropriate monotonicity assumptions for the reset model
    statistical analysis of low-dimensional point sets for the delta model