Algorithms for data streams
Foundations of Data Science 2014
Indian Institute of Science
Navin Goyal
Introduction

Data streams: very large input data arriving sequentially, too large to fit in memory.

Examples:
– networks (traffic passing through a router)
– databases (transaction logs)
– scientific data (satellites, sensors, LHC, …)
– financial data

What can we compute about the data in such situations?

Today’s lecture: start with an illustrative example problem, and then some generalities about the streaming model and problems.
Example: Counting
Counting
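A minimal Python sketch of the approximate counter (the Morris counter) that this example leads to; the class and variable names are illustrative, not taken from the lecture. The idea: store only X ≈ log2 n and increase X with probability 2^(-X), so in principle O(log log n) bits suffice to count up to n.

    import random

    class MorrisCounter:
        """Approximate counter: keeps X ~ log2(n) instead of the count n itself."""

        def __init__(self):
            self.x = 0  # O(log log n) bits in principle; a plain int here

        def increment(self):
            # Increase X with probability 2^(-X); then 2^X - 1 tracks n in expectation.
            if random.random() < 2.0 ** (-self.x):
                self.x += 1

        def estimate(self):
            # Unbiased estimate of the true count n, since E[2^X] = n + 1.
            return 2 ** self.x - 1

    # Illustrative usage: a single counter is unbiased but has high variance.
    c = MorrisCounter()
    for _ in range(100_000):
        c.increment()
    print(c.estimate())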
Performance of Morris counter
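For reference, the standard analysis of the Morris counter (a summary of well-known facts, not a transcription of the slide): if X_n denotes the value of the counter after n increments, then

    \[ \mathbb{E}\bigl[2^{X_n}\bigr] = n + 1, \qquad \operatorname{Var}\bigl(2^{X_n}\bigr) = \frac{n(n-1)}{2}. \]

So 2^{X_n} - 1 is an unbiased estimate of n, but its standard deviation is about n/\sqrt{2}; by Chebyshev a single counter only gives a constant-factor approximation with constant probability, which is what motivates the boosting steps below.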
Boosting the success probability I
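A sketch of the first boosting step under the standard interpretation (averaging): run k independent copies of the counter in parallel over the same stream and return the mean of their estimates, which divides the variance by a factor of k. It reuses the illustrative MorrisCounter class sketched above.

    import statistics

    def averaged_morris(stream, k=100):
        """Boosting I: mean of k independent Morris counters; the variance of the
        average is a factor k smaller than that of a single counter."""
        counters = [MorrisCounter() for _ in range(k)]  # MorrisCounter: sketch above
        for _ in stream:                 # one pass over the stream
            for c in counters:
                c.increment()
        return statistics.mean(c.estimate() for c in counters)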
Performance of Morris counter
Boosting the success probability II
Boosting the success probability II
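For completeness, the standard argument behind the median step (a well-known fact, not taken from the slide): if each averaged estimate is within (1 ± ε)n with probability at least 3/4, and we take the median of t independent such estimates, the median is wrong only if at least t/2 of the copies fail, so by a Chernoff bound

    \[ \Pr\bigl[\,|\mathrm{median} - n| > \varepsilon n\,\bigr] \le e^{-ct} \]

for some constant c > 0; hence t = O(\log(1/\delta)) copies bring the failure probability down to \delta.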
Test your understanding: Why don’t we just always use the median for boosting the probability of success, instead of the mean?
Recap
Questions to ponder
Streaming data: models and problems
Models for streaming data
Restrictions on the algorithm
Some streaming problems: frequency moments
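For reference, the standard definition of frequency moments (the slide’s own notation is not reproduced here): for a stream over a universe of m items in which item i occurs f_i times,

    \[ F_k = \sum_{i=1}^{m} f_i^{\,k}, \]

so F_0 is the number of distinct elements, F_1 the length of the stream, F_2 the “repeat rate” (self-join size), and F_\infty = \max_i f_i the frequency of the most frequent item.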
A general template for many streaming algorithms

– Come up with a basic random estimator for the quantity of interest (usually the non-trivial part).
– Give an efficient algorithm to compute the estimator (this may need hashing or some other way of reducing the randomness requirements).
– Improve the probability of success by a trick such as the median-of-means estimator (a sketch follows below).
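A hypothetical end-to-end sketch of this template in Python. The helper names make_estimator, update, and estimate are assumptions made for illustration (not an API from the lecture): any basic single-pass estimator exposing them can be plugged into the median-of-means wrapper below.

    import statistics

    def median_of_means(stream, make_estimator, k, t):
        """Run t groups of k independent copies of a basic estimator in one pass:
        average within each group (controls the variance), then take the median
        of the t group averages (drives the failure probability down)."""
        groups = [[make_estimator() for _ in range(k)] for _ in range(t)]
        for item in stream:                        # single pass over the data
            for group in groups:
                for est in group:
                    est.update(item)
        means = [statistics.mean(est.estimate() for est in group) for group in groups]
        return statistics.median(means)

With the Morris counter above, update(item) would simply ignore its argument and call increment(); estimators for the frequency moments would typically hash the item inside update.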
Plan for next few lectures