Data in Motion Michael Hoffman (Leicester) S Muthukrishnan (Google) Rajeev Raman (Leicester)

Models for moving data
- Reset model
- Delta model
- Geometric and database motivations
- Given vector A[1..n], where A[i] is a point in R^d, d ≥ 1
- A is updated in a streaming manner
- Probabilistic approximate computation of some function on A
  - ε: error parameter
  - δ: confidence parameter
- Space and time: poly(log n, 1/ε, 1/δ)

Reset model
- Given vector A[1..n], where A[i] is a point in R^d, d ≥ 1
- Updates: reset(i, x) sets A[i] := x
- Motivation:
  - Location data streams (tracking passive/dumb objects)
  - Query self-tuning in databases

Reset Model
- “Dynamic” geometric information
- Different from standard “dynamic” streams, which use insert(p), p in R^d, and delete(p)
- In the reset model, points have identity
- delete(p) + insert(p’) gives more information than reset, because the old position p is stated explicitly

Delta Model
- Given vector A[1..n], where A[i] is a point in R^d, d ≥ 2
- Process updates (i, x_1, x_2, …, x_d): A[i] := A[i] + (x_1, x_2, …, x_d)
- Motivation: data is often multi-dimensional
- Direct generalization of the turnstile model

Delta Model
- Problems involving several dimensions
  - “extent” of points (sum of distances of points from a given center)
  - k-median, diameter, minimum enclosing ball, etc.?
  - regression: correlation of packet size with delay

Problems
- Reset model
  - L_p norm*
  - L_p sampling*
  - 1-median
- Delta model
  - “Extent” of points
  - 1-median
* monotone, d = 1

L_p norm: Reset Model
- Assume wlog p = 1: required to estimate ||A||_1 = Σ_i |A[i]|
- Assume monotone updates
  - A[i] initially zero
  - reset(i, x) implies A[i] ≤ x
  - A[i] := max(A[i], x) [GC]
- Estimation is impossible if updates are non-monotone: reduction to estimating |X| - |X ∩ Y|

L_1 norm (reset model)
- Reduction to counting distinct items
- [Figure: A, buckets]
  - n_i = number of distinct items in ith bucket
  - w_i = width of ith bucket
- Σ_i (w_i * n_i) ≤ ||A||_1 ≤ (1+ε) Σ_i (w_i * n_i)

L_1 norm (reset model)
- Counting the number of distinct items in a stream ≡ L_0 norm
  - poly-log space and time [FM, CIM]
- Need to keep only O((log n)/ε) buckets
- Can we detect if the input is non-monotone?
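The bucket reduction on the two slides above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: values are assumed to be 0 or at least 1, bucket boundaries are taken to be powers of (1+ε), and exact Python sets stand in for the small-space distinct counters of [FM, CIM]. The class name `MonotoneL1Estimator` and its parameters are hypothetical.

```python
import math


class MonotoneL1Estimator:
    """Estimate ||A||_1 under monotone reset updates (values are 0 or >= 1)."""

    def __init__(self, eps, max_value):
        # Thresholds 1, (1+eps), (1+eps)^2, ... covering [1, max_value].
        levels = int(math.log(max_value) / math.log(1 + eps)) + 2
        self.thresholds = [(1 + eps) ** j for j in range(levels)]
        # Bucket j spans (thresholds[j-1], thresholds[j]]; bucket 0 spans (0, 1].
        self.widths = [self.thresholds[0]] + [
            self.thresholds[j] - self.thresholds[j - 1] for j in range(1, levels)
        ]
        # Assumption: exact sets stand in for small-space distinct counters.
        self.buckets = [set() for _ in range(levels)]

    def reset(self, i, x):
        # Index i counts as a distinct item in every bucket whose upper
        # threshold the value x has reached.
        for j, t in enumerate(self.thresholds):
            if t > x:
                break
            self.buckets[j].add(i)

    def estimate(self):
        # sum_j w_j * n_j, where n_j = number of distinct indices in bucket j.
        return sum(w * len(b) for w, b in zip(self.widths, self.buckets))


est = MonotoneL1Estimator(eps=0.1, max_value=1000)
for i, x in [(1, 5), (2, 40), (1, 9), (3, 1)]:
    est.reset(i, x)
true_norm = 9 + 40 + 1  # final values of A[1], A[2], A[3]
print(est.estimate(), true_norm)
```

Because updates are monotone, an index never has to leave a bucket, which is what lets an insert-only distinct counter replace each set.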

L_p sampling
- Query: sample()
  - Choose i from {1, …, n} with probability proportional to |A[i]|^p
  - Successive calls may return the same index if no updates happen
- Not known how to do this in the turnstile model
- Can be used to detect if ||A||_1 ≤ (1 - ε) ||A*||_1

L_p sampling (reset model)
- Reduction to sampling distinct items
- [Figure: A, buckets]
  - n_i = number of distinct items in ith bucket
  - w_i = width of ith bucket
- Sample a random (distinct) index from each bucket
- Return the sample from bucket i with probability proportional to w_i * n_i
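A small sketch of this reduction for p = 1 follows. It assumes the same geometric buckets as in the L_1 norm estimate and uses explicit membership lists in place of per-bucket distinct samplers; the function names are hypothetical. `index_probabilities` computes the exact output distribution so that the claim "probability proportional to |A[i]|, up to a (1+ε) factor" can be checked directly.

```python
import random


def build_buckets(values, eps):
    """Bucket indices by geometric thresholds 1, (1+eps), (1+eps)^2, ...

    values: dict index -> current value (0 or >= 1). Assumption: explicit
    lists stand in for the distinct samplers a streaming algorithm would keep.
    """
    widths, buckets = [], []
    lower, upper = 0.0, 1.0
    while True:
        members = [i for i, v in values.items() if v >= upper]
        if not members:
            break
        widths.append(upper - lower)
        buckets.append(members)
        lower, upper = upper, upper * (1 + eps)
    return widths, buckets


def sample_index(widths, buckets, rng):
    # Pick bucket j with probability proportional to w_j * n_j ...
    weights = [w * len(b) for w, b in zip(widths, buckets)]
    j = rng.choices(range(len(buckets)), weights=weights)[0]
    # ... then return a uniformly random distinct index from that bucket.
    return rng.choice(buckets[j])


def index_probabilities(widths, buckets):
    # Exact output distribution of sample_index: index i is returned with
    # probability (sum of widths of the buckets containing i) / total.
    total = sum(w * len(b) for w, b in zip(widths, buckets))
    probs = {}
    for w, b in zip(widths, buckets):
        for i in b:
            probs[i] = probs.get(i, 0.0) + w / total
    return probs


values = {1: 9, 2: 40, 3: 1}
widths, buckets = build_buckets(values, eps=0.1)
probs = index_probabilities(widths, buckets)
picked = sample_index(widths, buckets, random.Random(0))
```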

1-median
- Assume A[i] contains the coordinates of a set S of 2-D points
- Problem: find c in R^2 s.t. Σ_{p in S} d(c, p) (Euclidean distance) is approximately minimized
- Monotonicity not required; cannot report Σ_{p in S} d(c, p)
- Return a (4/π + ε) ≈ (1.27 + ε) estimator; boosting: see later

1-median (reset model)
- L_1 1-median: find c in R^2 such that Σ_{p in S} d(c, p) is minimized
  - d(p, q) = L_1 distance d_1(p, q) = |p_x - q_x| + |p_y - q_y|
- L_1 1-median c = (c_x, c_y)
  - c_x = median of x-coordinates
  - c_y = median of y-coordinates
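As a concrete instance of the coordinate-wise rule above, the following sketch computes the L_1 1-median of a small, made-up point set and its L_1 cost. The helper names are hypothetical.

```python
from statistics import median


def l1_one_median(points):
    """Coordinate-wise median, which minimises the summed L_1 distance."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (median(xs), median(ys))


def l1_cost(c, points):
    # Sum of L_1 distances from c to every point.
    return sum(abs(c[0] - x) + abs(c[1] - y) for x, y in points)


points = [(0, 0), (2, 1), (10, 3), (4, 7), (1, 5)]
c = l1_one_median(points)
print(c, l1_cost(c, points))  # (2, 3) 24
```

Here the result (2, 3) is not one of the input points, since the two medians are taken independently per coordinate.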

1-median (reset model)
- 1-D median
  - Sample O((1/ε) log(1/δ)) random indices; maintain the positions of the sample
  - The median m_x of the x-coordinates of the sample is a (1+ε)-approximation to the median of the x-coordinates of S
  - A (1+ε)-approximate median is a (1+ε’)-approximate 1-median in 1-D
- Approximate L_1 1-median: return (m_x, m_y)
  - may not be in S
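The sampling step can be sketched as below. This is an illustration under assumptions that are not in the slides: indices are 0-based, every point starts at the origin, and the caller picks the sample size (the slides suggest O((1/ε) log(1/δ))). The class name `SampledMedian` is hypothetical.

```python
import random
from statistics import median


class SampledMedian:
    """Track a fixed random sample of indices under reset(i, (x, y)) updates."""

    def __init__(self, n, sample_size, rng):
        # Sample indices once, up front; only these positions are stored.
        # Assumption: all points start at the origin.
        self.positions = {
            i: (0.0, 0.0) for i in rng.sample(range(n), sample_size)
        }

    def reset(self, i, point):
        # Resets of unsampled indices are ignored, so space stays
        # proportional to the sample size.
        if i in self.positions:
            self.positions[i] = point

    def approx_l1_median(self):
        # Coordinate-wise median of the sampled positions only.
        xs = [p[0] for p in self.positions.values()]
        ys = [p[1] for p in self.positions.values()]
        return (median(xs), median(ys))


tracker = SampledMedian(n=5, sample_size=5, rng=random.Random(1))
for i, p in enumerate([(0, 0), (2, 1), (10, 3), (4, 7), (1, 5)]):
    tracker.reset(i, p)
print(tracker.approx_l1_median())
```

With the sample size equal to n, as in this toy run, every index is tracked and the result coincides with the exact coordinate-wise median.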

Projections of points
- The L_1 1-median is a √2-approximation to the L_2 (Euclidean) 1-median
- Consider projections of S to do better:

Let l be a line segment of length x, and let s be the sum of the lengths of the projections of l onto k equally spaced lines passing through the origin. Then πs/(2k) = x(1 ± Θ(1/k)).
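The statement can be checked numerically. The intuition is that the projection of l onto a line at angle θ to it has length x|cos θ|, and the average of |cos θ| over a half-turn is 2/π, so s is close to 2kx/π. The sketch below assumes the k lines are at angles jπ/k for j = 0, …, k-1; the function name is hypothetical.

```python
import math


def scaled_projection_sum(x, phi, k):
    """Return pi*s/(2k) for a segment of length x at angle phi.

    s is the total length of the projections of the segment onto k lines
    through the origin at angles j*pi/k (an assumed parametrisation).
    """
    s = sum(x * abs(math.cos(phi - j * math.pi / k)) for j in range(k))
    return math.pi * s / (2 * k)


for phi in (0.0, 0.3, 1.0, 2.5):
    print(phi, scaled_projection_sum(x=5.0, phi=phi, k=64))
```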

1-median (reset model)
- Consider the L_1 1-medians c_1, …, c_k
- Σ_i d(c_i, S) ≤ (4k/π + O(1/k)) d(c*, S)
- One of the c_i is a (4/π + ε)-approximation. Which one?
  - λ d(p, S) + (1 - λ) d(q, S) ≥ d(λp + (1 - λ)q, S)
  - return the average of c_1, …, c_k
- Boosting confidence: take several independent samples, take the mean
- Q: how good is the 1-median of the sample?
- Similar to “projection median” [DK]
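One way to read this slide is sketched below, under the assumption (not stated explicitly in the transcript) that c_1, …, c_k are the L_1 1-medians taken in k coordinate systems rotated by jπ/(2k). Averaging them is justified by the convexity inequality on the slide. The code works on the full point set rather than a sample, and all names are hypothetical.

```python
import math
from statistics import median


def rotate(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def averaged_rotated_median(points, k):
    """Average the L_1 1-medians taken in k rotated coordinate systems."""
    centers = []
    for j in range(k):
        theta = j * math.pi / (2 * k)  # k frames cover a quarter turn
        rotated = [rotate(p, -theta) for p in points]
        m = (median(x for x, _ in rotated), median(y for _, y in rotated))
        centers.append(rotate(m, theta))  # map the median back
    return (sum(c[0] for c in centers) / k, sum(c[1] for c in centers) / k)


def extent(c, points):
    # Sum of Euclidean distances from c to the points.
    return sum(math.dist(c, p) for p in points)


points = [(0, 0), (2, 1), (10, 3), (4, 7), (1, 5)]
center = averaged_rotated_median(points, k=16)
print(center, extent(center, points))
```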

Reset Model (conclusions)
- Computed extent and approximate 1-median
- Many problems seem hard without some monotonicity assumptions
  - CH, k-center, k-median, k > 1
- What assumptions?
  - strict: points moving away from a known origin (min enclosing ball, [GC])
  - points moving away from an unknown origin
  - points moving monotonically along trajectories from a known class (e.g. lines)

Delta Model
- A[1..n]; A[i] is a point in R^d, d ≥ 2
- S is the set of points
- Updates (i, x_1, x_2, …, x_d): A[i] := A[i] + (x_1, x_2, …, x_d)
- “Extent” query: given c, estimate Σ_{p in S} d(c, p) (Euclidean distances)

Delta Model
- Extent query:
  - Use projections and 1-D L_1 norm sketches
  - (1+ε)-approximation to extent(c)
- 1-median:
  - Use the L_1 1-median to find a suitable search area
  - Using the above, search for the 1-median
  - (1+ε)-approximation
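The extent query can be illustrated as follows for d = 2. This is a sketch under loud assumptions: it stores the exact projected coordinates for each of k directions, where a small-space solution would keep a 1-D L_1 norm sketch per direction instead. What it does show is that delta updates act linearly on each projection, and that the scaled sum of the k one-dimensional L_1 norms approximates the Euclidean extent. The class name `DeltaExtent` and the 0-based indices are hypothetical.

```python
import math


class DeltaExtent:
    """Extent queries under delta updates, via k 1-D projections (d = 2)."""

    def __init__(self, n, k):
        self.k = k
        self.dirs = [
            (math.cos(j * math.pi / k), math.sin(j * math.pi / k))
            for j in range(k)
        ]
        # proj[j][i] = projection of A[i] onto direction j. Assumption: these
        # exact arrays stand in for one 1-D L_1 sketch per direction.
        self.proj = [[0.0] * n for _ in range(k)]

    def update(self, i, dx, dy):
        # A[i] := A[i] + (dx, dy); every projection changes linearly.
        for j, (ux, uy) in enumerate(self.dirs):
            self.proj[j][i] += dx * ux + dy * uy

    def extent(self, c):
        total = 0.0
        for j, (ux, uy) in enumerate(self.dirs):
            cj = c[0] * ux + c[1] * uy
            # 1-D L_1 norm of the projected points, shifted by the center.
            total += sum(abs(v - cj) for v in self.proj[j])
        return math.pi * total / (2 * self.k)


model = DeltaExtent(n=3, k=64)
for i, dx, dy in [(0, 3, 4), (1, 1, 0), (1, -2, 5), (2, 0, -6)]:
    model.update(i, dx, dy)
final_points = [(3, 4), (-1, 5), (0, -6)]
exact = sum(math.dist((1, 1), p) for p in final_points)
print(model.extent((1, 1)), exact)
```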

Conclusions
- Introduced (1+ε) new models for “geometric” computation
- Gave solutions to some basic problems
- Many open questions:
  - appropriate monotonicity assumptions for the reset model
  - statistical analysis of low-dimensional point sets for the delta model