Sampling in Space Restricted Settings
Anup Bhattacharya, IIT Delhi
Joint work with Davis Issac (MPI), Ragesh Jaiswal (IITD) and Amit Kumar (IITD)

Presentation transcript:

Introduction: Sampling
Select a subset of data
Computations on “representative” subset would approximate computations on whole data
Sampling variants:
  – Uniform sampling
  – Weighted sampling
Study sampling algorithms with limited space
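To make the two sampling variants concrete, here is a minimal Python sketch. It is not taken from the slides: the function names `uniform_sample` and `weighted_sample`, the toy data, and the sample size are all assumptions chosen for illustration, and it ignores the space restriction entirely (the whole data set sits in memory).

```python
import random

def uniform_sample(data, k, rng):
    """Pick k distinct items, every item equally likely (hypothetical helper)."""
    return rng.sample(data, k)

def weighted_sample(data, weights, k, rng):
    """Pick k items with replacement, each with probability proportional
    to its weight (hypothetical helper)."""
    return rng.choices(data, weights=weights, k=k)

rng = random.Random(0)            # fixed seed only so the run is repeatable
data = list(range(1, 10001))      # toy data set, assumed for illustration

# A computation on the sample approximates the same computation on all data.
sample = uniform_sample(data, 500, rng)
true_mean = sum(data) / len(data)
estimated_mean = sum(sample) / len(sample)

# Weighted variant: here larger values are proportionally more likely.
heavy = weighted_sample(data, weights=data, k=500, rng=rng)
```

The estimate from 500 items is typically within a few percent of the true mean, which is the sense in which the sample is "representative".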

Outline

Sampling in Streaming Settings

Streaming Settings: The Model
  – Items/objects arrive in an online fashion
  – Total number of items is not known in advance
  – Typically poly(log(n)) space is allowed
  – One/multi-pass, space usage, time/item, overall time complexity, randomness, accuracy of output

Sampling in Streaming Settings

Reservoir Sampling [Figure: each arriving item is either stored or thrown away]

Reservoir Sampling
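The slide body is not preserved in this transcript, so the following is only a sketch of the classical textbook form of reservoir sampling, not necessarily the exact variant presented in the talk. The function name `reservoir_sample` and its parameters are assumptions. It makes one pass over the stream, does not need the stream length in advance, and stores only `k` items plus a counter.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of
    unknown length, using a single pass (classical reservoir sampling)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k items are always stored.
            reservoir.append(item)
        else:
            # Item i is stored with probability k/i; otherwise it is
            # thrown away. If stored, it evicts a uniformly chosen slot.
            j = rng.randrange(i)      # uniform integer in [0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 items from a stream whose length is not known up front.
picked = reservoir_sample(iter(range(1000)), 5, random.Random(42))
```

By induction on the stream position, after `i` items each one seen so far is in the reservoir with probability `k/i`. Note that this version draws fresh random bits for every arriving item.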

Uniform Sampling with ϵ-error
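The definition on this slide did not survive in the transcript. The sketch below assumes one common reading of "ϵ-error": every one of the n items is returned with probability between (1 − ϵ)/n and (1 + ϵ)/n. That reading, and the helper names `is_eps_uniform` and `min_eps`, are assumptions rather than the authors' stated definition. Exact fractions are used so the boundary cases are not blurred by floating-point rounding.

```python
from fractions import Fraction

def is_eps_uniform(probs, eps):
    """Check the ASSUMED criterion: each sampling probability lies in
    [(1 - eps)/n, (1 + eps)/n], where n = len(probs)."""
    n = len(probs)
    low = (1 - eps) / n
    high = (1 + eps) / n
    return all(low <= p <= high for p in probs)

def min_eps(probs):
    """Smallest eps for which the assumed criterion holds."""
    n = len(probs)
    return max(abs(n * p - 1) for p in probs)

exact = [Fraction(1, 4)] * 4
skewed = [Fraction(3, 10), Fraction(1, 5), Fraction(1, 4), Fraction(1, 4)]

ok_exact = is_eps_uniform(exact, Fraction(0))          # exactly uniform
ok_skewed = is_eps_uniform(skewed, Fraction(1, 16))    # 3/10 exceeds 17/64
needed = min_eps(skewed)                               # equals 1/5
```

Under this reading, exact uniform sampling is the special case ϵ = 0, and the skewed distribution above only qualifies once ϵ is at least 1/5.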

Lower Bound on Sampling with ϵ-error

Outline

Algorithm for Uniform Sampling ϵ-error

Doubling-Chopping Algorithm

Doubling-Chopping algorithm, ϵ=1/16

Doubling-Chopping algorithm, ϵ=1/

Doubling-Chopping algorithm, ϵ=1/16 Chop(): Move strings from blocks to new block

Doubling-Chopping algorithm, ϵ=1/

Algorithm Analysis

Algorithm Analysis (contd.)

Sampling in Query Model

Space Restricted Setting: Query Model

Sampling in Query Model

Thank You Questions?