Published by Thomas Baldwin. Modified over 11 years ago.
1
An Optimal Algorithm for the Distinct Elements Problem
Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010
2
Problem Description
Given a long stream of values from a universe of size n:
- each value can occur any number of times
- count the number F0 of distinct values
See values one at a time
One pass over the stream
Too expensive to store the set of distinct values
Algorithms should:
- Use a small amount of memory
- Have fast update time (per-value processing time)
- Have fast recovery time (time to report the answer)
3
Randomized Approximation Algorithms
3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, , 338, 32, 4, …
Consider algorithms that store a subset S of distinct values, e.g., S = {3, 9, 32, 265}
Main drawback is that S needs to be large to know if the next value is a new distinct value
Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes F0 exactly must use ≈ F0 memory
Hence, algorithms must be randomized and settle for an approximate solution: output F ∈ [(1-ε)F0, (1+ε)F0] with good probability
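To make the trade-off concrete, here is a minimal Python sketch of the exact baseline and of the (1 ± ε) acceptance test. The function names are illustrative and not from the talk, and the sample stream is the slide's stream with its unreadable entry left out.

```python
def exact_distinct(stream):
    """Exact F0 in one pass. The set grows with the number of distinct values."""
    seen = set()
    for value in stream:
        seen.add(value)
    return len(seen)


def within_tolerance(estimate, f0, eps):
    """True if estimate lies in [(1 - eps) * f0, (1 + eps) * f0]."""
    return (1 - eps) * f0 <= estimate <= (1 + eps) * f0


# Slide stream, with the unreadable entry omitted (an assumption of this sketch).
stream = [3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 338, 32, 4]
f0 = exact_distinct(stream)
```

The set `seen` holds one entry per distinct value, which is exactly the memory cost the slide rules out for long streams.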
4
Problem History
Long sequence of work on the problem
Flajolet and Martin introduced the problem, FOCS 1983
Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet, Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram, Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar, Szegedy, Tirthapura, Trevisan, Varghese, W
Previous best algorithm: O(ε^-2 log log n + log n) bits of memory and O(ε^-2) update and reporting time
Known lower bound on the memory: Ω(ε^-2 + log n)
Our result: Optimal O(ε^-2 + log n) bits of memory and O(1) update and reporting time
5
Previous Approaches
Suppose we randomly hash F0 values into a hash table of 1/ε^2 buckets and keep track of the number C of non-empty buckets
If F0 < 1/ε^2, there is a way to estimate F0 up to (1 ± ε) from C
Problem: if F0 ≫ 1/ε^2, with high probability, every bucket contains a value, so there is no information
Solution: randomly choose S_{log n} ⊆ S_{log n - 1} ⊆ S_{log n - 2} ⊆ … ⊆ S_1 ⊆ {1, 2, …, n}, where |S_i| ≈ n/2^i
stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, , 338, 32, 4, …
S_i = {1, 3, 7, 9, 265}
i-th substream: 3, 265, 3, 9, 7, 9, 3, …
Run hashing procedure on each substream
There is an i for which the # of distinct values in the i-th substream ≈ 1/ε^2
Hashing procedure on i-th substream works
Problem: It takes (1/ε^2) log n bits of memory to keep track of this information
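The subsample-then-hash approach on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The slide only says "there is a way" to estimate F0 from C, so the sketch assumes the standard balls-and-bins inversion. Membership in S_i is decided by the trailing zero bits of a hash. The rule "use the first level with at most half the buckets occupied" is a hypothetical choice. All function names are made up for the example.

```python
import hashlib
import math


def h64(value, salt):
    """Deterministic 64-bit hash; a stand-in for the random hash functions."""
    data = f"{salt}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def level(value, max_level):
    """Largest i with value in S_i; each level keeps about half of the previous one."""
    h = h64(value, "level")
    i = 0
    while i < max_level and (h >> i) & 1 == 0:
        i += 1
    return i


def estimate_from_buckets(nonempty, k):
    """Invert E[C] = k * (1 - (1 - 1/k)**F0) to get F0 from C non-empty buckets."""
    if nonempty == 0:
        return 0.0
    if nonempty >= k:
        return math.inf  # every bucket is occupied, so C carries no information
    return math.log(1 - nonempty / k) / math.log(1 - 1 / k)


def distinct_estimate(stream, eps, n):
    """One pass over the stream; n is the universe size."""
    k = max(2, math.ceil(1 / eps ** 2))
    max_level = max(1, math.ceil(math.log2(n)))
    # One table of occupied buckets per level: about (1/eps^2) * log n bits overall.
    tables = [set() for _ in range(max_level + 1)]
    for value in stream:
        bucket = h64(value, "bucket") % k
        for i in range(level(value, max_level) + 1):
            tables[i].add(bucket)
    # Assumed selection rule: first level whose table is at most half full.
    for i, table in enumerate(tables):
        if len(table) <= k // 2:
            return (2 ** i) * estimate_from_buckets(len(table), k)
    return (2 ** max_level) * estimate_from_buckets(len(tables[-1]), k)
```

Level 0 here plays the role of the whole universe, so small streams are handled by the plain hashing procedure. Keeping one table per level is what costs the (1/ε^2) log n bits noted on the slide.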
6
Our Techniques
Observation:
- Have 1/ε^2 global buckets
- In each bucket we keep track of the index i of the set S_i for the largest i for which S_i contains a value hashed to the bucket
- This gives O((1/ε^2) log log n) bits of memory
New Ideas:
- Can show that, with high probability, at every point in the stream, most buckets contain roughly the same index
- We can just keep track of the offsets from this common index
- We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to efficiently update offsets
- Occasionally we need to decrement all offsets. Can spread the work across multiple updates
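The bucket layout described above can be illustrated with a small Python sketch. It is a simplified illustration, not the paper's data structure: offsets sit in a plain list rather than packed machine words, all offsets are decremented in a single step rather than spread over several updates, and the class name, method names, and the -1 sentinel are assumptions made for the example.

```python
class OffsetBuckets:
    """Per-bucket largest level i, stored as an offset from a shared base index."""

    def __init__(self, k):
        self.k = k
        self.base = 0
        # -1 means: no value at level >= base has been hashed to this bucket.
        self.offsets = [-1] * k

    def update(self, bucket, lvl):
        """Record that a value belonging to S_lvl was hashed to this bucket."""
        if lvl >= self.base:
            self.offsets[bucket] = max(self.offsets[bucket], lvl - self.base)

    def raise_base(self):
        """Move the common index up by one and decrement every offset."""
        self.base += 1
        self.offsets = [max(o - 1, -1) for o in self.offsets]

    def nonempty_at(self, i):
        """Number of buckets holding a value from S_i (only valid for i >= base)."""
        if i < self.base:
            raise ValueError("levels below the base are no longer tracked")
        return sum(1 for o in self.offsets if o >= 0 and o + self.base >= i)


buckets = OffsetBuckets(4)
for bucket, lvl in [(0, 3), (1, 0), (2, 5), (0, 1), (3, 2)]:
    buckets.update(bucket, lvl)
```

A single array of largest levels yields the non-empty bucket count for every tracked level, which is why per-level tables are not needed. Once most buckets sit near the same index, each offset is a small number.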