On ‘Selection and Sorting with Limited Storage’ Graham Cormode Joint work with S. Muthukrishnan, Andrew McGregor, Amit Chakrabarti.

2 Mike Paterson

4 Munro-Paterson 78 [MP78]
One of the first papers to consider computing with limited storage
  – Storage sublinear in the size of the input
Considered what could be accomplished in one or few passes over input treated as a one-way tape
  – Effectively the now-popular ‘streaming model’
Focused on the problem of selection (median and generalized median)
  – ‘Selection’: Find the K’th ranked item (integer) out of N
  – Dozens of papers on variations of these problems in the streaming world in the last decade

5 Results in MP78
P-pass deterministic algorithm for selection
  – In each pass, narrow down the range of interest
  – Compute the exact ranks of a small range in the final pass
  – Recursively merge and thin out pairs of buffers, tracking bounds on the ranks of each retained item
  – Gives O(N^{1/P} polylog(N)) space for P passes
Implies P = log N passes in polylog(N) space
Revisited by Manku, Rajagopalan and Lindsay [1998]:
  – Obtain εN error in ranks in O(ε^{-1} log^2(εN)) space, one pass
  – Improved to O(ε^{-1} log(εN)) by Greenwald and Khanna
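The merge-and-thin step listed above can be sketched as follows. This is a minimal illustration, assuming two sorted buffers of equal length and a rule that keeps every second merged item; the function name `merge_and_thin` and the `offset` parameter are hypothetical, not taken from MP78.

```python
import heapq


def merge_and_thin(buf_a, buf_b, offset=0):
    """Merge two sorted buffers of equal length k and keep every other item.

    The result again holds k items, each standing for twice as many
    original inputs as an item of buf_a or buf_b did (assumed thinning rule).
    """
    if len(buf_a) != len(buf_b):
        raise ValueError("buffers must have equal length")
    merged = list(heapq.merge(buf_a, buf_b))  # linear-time merge of sorted inputs
    return merged[offset::2]                  # thin: keep alternate items


thinned = merge_and_thin([1, 4, 7, 10], [2, 3, 8, 9])
```

Because thinning discards half of the merged items, the rank of each retained item is known only within a range that widens with every merge, which is why the slide mentions tracking bounds on the ranks.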

6 Results in MP78
Deterministic lower bound of Ω(N^{1/P}) space for P passes
  – Based on an adversary who ensures that there are many elements not stored whose relative ordering is unknown
Later, Ω(1/ε) bound for one-pass approximate selection allowing randomness, which implies an Ω(N) bound for exact selection
  – Shown by Henzinger, Raghavan, Rajagopalan [1998]

7 Results in MP78
Bounds “assuming that all input orderings are equally likely”
  – Now known as the “random order streams assumption”
Shows a problem is hard even under favourable order
  – O(N^{1/(2P)}) upper bound, and Ω(N^{1/2}) one-pass lower bound
Guha and McGregor [2006] give a P = O(log log N) pass algorithm for exact selection in O(polylog(N)) space
  – An exponential gap in the number of passes compared with the adversarial-order case
  – Resolves a question posed in MP78
  – Is this optimal?

8 Outline
Selection and Sorting with Limited Storage
One pass approximate selection with deletions
Lower bounds for P pass selection on random order input

9 Approximate Selection with Deletions
ε-approximate selection:
  – Find any item with rank between (Φ-ε)N and (Φ+ε)N
Streams with deletions:
  – Stream contains both “insertion” and “deletion” of items
  – Assume no deletions without preceding matching insertion
  – Captures e.g. database transactions, network connections
Assumption: items drawn from bounded universe of size U
  – Model as integers 1…U
Approach: solve a different streaming problem, then reduce
  – Estimate frequency of some item j with additive error εN

10 Count-Min Sketch
Simple sketch idea, can be used as the basis of many different stream analyses.
Model input stream as a vector x of dimension U
Creates a small summary as an array of size w × d
Use d hash functions to map vector entries to [1..w]
Works on arrivals-only streams and on streams with arrivals & departures
[Figure: array CM[i,j] with d rows and w columns]

11 CM Sketch Structure
Each entry in vector x is mapped to one bucket per row.
Merge two sketches by entry-wise summation
Estimate x[j] by taking min_k CM[k, h_k(j)]
  – Guarantees error less than ε||x||_1 in size O(1/ε log 1/δ)
  – Probability of more error is less than δ
[Figure: an update (j, +c) adds c to bucket h_k(j) in each of the d = log 1/δ rows; each row has w = 2/ε buckets]
[C, Muthukrishnan ’04]
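The structure described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: items are non-negative integers, each row uses a hash of the form ((a·j + b) mod p) mod w with p = 2^61 - 1 as one standard pairwise-independent choice, buckets are indexed from 0 rather than 1, and the class and method names are hypothetical.

```python
import math
import random


class CountMinSketch:
    PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than the universe size U

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(2 / eps)                        # buckets per row
        self.d = max(1, math.ceil(math.log2(1 / delta)))   # number of rows
        rng = random.Random(seed)
        # One (a, b) pair per row defines that row's hash function h_k.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, k, j):
        a, b = self.hashes[k]
        return ((a * j + b) % self.PRIME) % self.w

    def update(self, j, c=1):
        """Process (j, +c); a negative c models a deletion."""
        for k in range(self.d):
            self.table[k][self._bucket(k, j)] += c

    def estimate(self, j):
        """Return min_k CM[k, h_k(j)]; never below x[j] when x is non-negative."""
        return min(self.table[k][self._bucket(k, j)] for k in range(self.d))

    def merge(self, other):
        """Entry-wise sum; both sketches must share dimensions and hashes."""
        if (self.w, self.d, self.hashes) != (other.w, other.d, other.hashes):
            raise ValueError("sketches must share dimensions and hash functions")
        for k in range(self.d):
            for b in range(self.w):
                self.table[k][b] += other.table[k][b]
```

Sketches built with the same `seed`, `eps` and `delta` share hash functions, so they can be merged; sketches built with different seeds cannot.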

12 Approximation
Approximate x’[j] = min_k CM[k, h_k(j)]
Analysis: In k'th row, CM[k, h_k(j)] = x[j] + X_{k,j}
  – X_{k,j} = Σ_{i≠j} x[i] over those i with h_k(i) = h_k(j)
  – E(X_{k,j}) = Σ_{i≠j} x[i]·Pr[h_k(i) = h_k(j)] ≤ Pr[h_k(i) = h_k(j)]·Σ_i x[i] = ε||x||_1/2 by pairwise independence of h (the collision probability is 1/w = ε/2)
  – Pr[X_{k,j} ≥ ε||x||_1] ≤ Pr[X_{k,j} ≥ 2E(X_{k,j})] ≤ 1/2 by Markov inequality
So, Pr[x’[j] ≥ x[j] + ε||x||_1] = Pr[∀k. X_{k,j} ≥ ε||x||_1] ≤ (1/2)^{log 1/δ} = δ
Final result: with certainty x[j] ≤ x’[j] and with probability at least 1-δ, x’[j] < x[j] + ε||x||_1

13 Application To Selection
Impose a binary tree over the domain of input items
  – Each node corresponds to the union of its leaves
Keep a CM sketch to summarize each level of the tree
Estimate the rank of any item from O(log U) dyadic ranges and estimate each from the relevant sketch
For selection, binary search over the domain of items to find one with the desired estimated rank
Result: solve one-pass ε-approximate selection with probability at least 1-δ using O(1/ε log^2 U log 1/δ) space
  – Deterministic solution requires Ω(1/ε^2) space
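The dyadic-range idea above can be sketched as follows. To keep the example short and exact, each level of the tree is summarized by an exact counter rather than a CM sketch; that substitution is an assumption made for illustration only, and it gives up the small-space guarantee. The universe is taken to be 0 … 2^log_u - 1 instead of 1…U, and all names are hypothetical.

```python
from collections import Counter


class DyadicRankSummary:
    def __init__(self, log_u):
        self.log_u = log_u
        # levels[l] counts items per dyadic range of width 2**l
        # (exact counters here; the slides keep one CM sketch per level).
        self.levels = [Counter() for _ in range(log_u + 1)]

    def update(self, item, c=1):
        """Insert (c > 0) or delete (c < 0) copies of item at every level."""
        for l in range(self.log_u + 1):
            self.levels[l][item >> l] += c

    def rank(self, item):
        """Count stored items strictly below item using O(log U) dyadic ranges."""
        total = 0
        for l in range(self.log_u + 1):
            if (item >> l) & 1:
                # The left sibling range at level l lies entirely below item.
                total += self.levels[l][(item >> l) - 1]
        return total

    def select(self, k):
        """Binary search for the smallest value v with at least k items <= v."""
        lo, hi = 0, (1 << self.log_u) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo


summary = DyadicRankSummary(log_u=4)
for value in [3, 7, 7, 12, 1]:
    summary.update(value)
third_smallest = summary.select(3)
```

Replacing each `Counter` lookup with a sketch estimate turns `rank` into an approximate rank, and the binary search then returns an item whose rank is close to the target rather than exact.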

14 Outline
Selection and Sorting with Limited Storage
One pass approximate selection with deletions
Lower bounds for P pass selection on random order input

15 Bounds Via Communication Complexity
Viewing contents of memory as a message being passed, communication complexity techniques give space lower bounds
  – Sending the contents of memory gives a communication protocol
  – Similar style of argument used in [MP78] to bound space of a P-pass sorting algorithm
Proving lower bounds for streams in random order led us to consider communication bounds for random partitions of the input between players [Chakrabarti, C, McGregor 08]

16 The Model
The P players (Alice, Bob, Charlie…) each receive one part of a random partition of the input (could be non-uniform)
Each communicates a message in order to the next, in up to r rounds
Lower bounds on communication imply streaming space lower bounds

17 Tree Pointer Jumping (TPJ)
Instance: Function f on nodes of a P-level, t-ary tree
  – if v is an internal node: f maps v to a child of v
  – if v is a leaf: f maps v to {0,1}
Goal: Compute f(f(... f(v_root) ...)).
For P players, if the i-th player knows f(v) when level(v) = i: any (P-1)-round protocol requires Ω(t/P^2) communication.
  – Even when input is picked uniformly at random
[Figure: example tree with Levels 1 to 3 and an f value labelled at each node]
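The TPJ goal is easy to state in code when one party holds all of f; the difficulty in the lower bound comes only from f being split across players. The sketch below is a minimal illustration with a hypothetical encoding: `child_of` holds f on internal nodes, `leaf_bit` holds f on leaves, and the node names are invented for the example.

```python
def tree_pointer_jump(child_of, leaf_bit, root):
    """Evaluate f(f(...f(root)...)) by following pointers down to a leaf."""
    v = root
    while v in child_of:   # v is internal: f(v) is one of its children
        v = child_of[v]
    return leaf_bit[v]     # v is a leaf: f(v) is a bit


# Hypothetical instance: a three-level ternary tree rooted at "r".
child_of = {"r": "r1", "r0": "r02", "r1": "r10", "r2": "r21"}
leaf_bit = {f"r{a}{b}": (a + b) % 2 for a in range(3) for b in range(3)}
answer = tree_pointer_jump(child_of, leaf_bit, "r")
```

Here the path is r → r1 → r10, so only one pointer per level matters, yet which one is unknown until the level above has been resolved.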

18 Reduction from TPJ to Median
With each node v associate two values α(v) < β(v) such that α(v) < α(u) < β(u) < β(v) for any descendant u of v.
For each node: generate more copies of α(v) and β(v) such that the median of the values corresponds to the TPJ solution.
Relationship between t and the number of copies determines the bound.
  – Need more copies higher up in tree

19 Simulating Random-Partition Protocol
Consider tree node v where f(v) is known to Bob.
Create instance of Random-Partition Tree-Pointer Jumping:
1) Using public coin, players determine partition of tokens and set half to α and half to β.
2) Bob “fixes” balance of tokens under his control.
The resulting distribution is “close”, so an algorithm expecting a random partition should succeed with only slightly lower probability
[Figure: row of tokens labelled α and β]

20 Implications for Selection
Implies a communication lower bound of Ω(N^{1/2^P}) (suppressing lesser factors)
Means any P-pass algorithm for median finding (more generally, selection) requires Ω(N^{1/2^P}/2^P) space
  – poly(log N) space requires P = Ω(log log N) passes
  – 3 pass algorithm requires Ω(N^{1/10}) space
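The step from the space bound to the pass bound can be worked out in a few lines. The calculation below is a sketch under simplifying assumptions: the space budget is (log N)^c for some constant c (a symbol introduced only for this illustration), logarithms are base 2, and the 2^P divisor and constant factors are ignored.

```latex
\begin{align*}
N^{1/2^P} \le (\log N)^c
  &\iff \frac{\log N}{2^P} \le c \log\log N \\
  &\iff 2^P \ge \frac{\log N}{c \log\log N} \\
  &\iff P \ge \log\log N - \log\log\log N - \log c = \Omega(\log\log N).
\end{align*}
```

So a polylogarithmic space budget is only compatible with the lower bound once the number of passes reaches order log log N, which matches the pass count of the Guha and McGregor algorithm.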

21 Conclusions
‘Selection and Sorting with Limited Storage’ continues to be an influential paper, three decades later.
Several related papers accepted to SODA 2009:
  – Comparison-Based, Time-Space Lower Bounds for Selection (Timothy M. Chan)
  – Sorting and Selection in Posets (Constantinos Daskalakis, Richard M. Karp, Elchanan Mossel, Samantha Riesenfeld and Elad Verbin)