On Random Sampling over Joins. Surajit Chaudhuri (Microsoft Research), Rajeev Motwani (Stanford University), Vivek Narasayya (Microsoft Research).



Outline:
1. The difficulty of join sampling: an example.
2. Semantics and algorithms of sampling.
3. Two previous sampling strategies.
4. New strategies for join sampling.
5. Experimental results.

The Difficulty of Join Sampling - Example: Suppose that we have the relations

Black-Box U2: Given relation R with n tuples, generate an unweighted with-replacement (WR) sample of size r.
1. N ← 0.
2. Initialize reservoir array A[1..r] with r dummy values.
3. While tuples are streaming by do begin
   (a) get next tuple t;
   (b) N ← N + 1;
   (c) for j = 1 to r do set A[j] to t with probability 1/N
   end.
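A minimal Python sketch of Black-Box U2, assuming N is a running count of the tuples seen so far (starting at 0 and incremented once per tuple). The function name `black_box_u2` and the `rng` parameter are illustrative choices, not part of the slides.

```python
import random


def black_box_u2(stream, r, rng=random):
    """One-pass unweighted with-replacement sample of size r.

    Assumption: N counts the tuples seen so far, so the i-th tuple
    overwrites each reservoir slot independently with probability 1/i.
    """
    n_seen = 0                 # N in the pseudocode
    reservoir = [None] * r     # A[1..r], filled with dummy values
    for t in stream:
        n_seen += 1
        for j in range(r):
            # rng.random() is in [0, 1), so the first tuple (1/1) always wins.
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = t
    return reservoir
```

Because every slot is updated independently, the same tuple can occupy several slots, which is what makes the result a with-replacement sample. On an empty stream the reservoir keeps its dummy values.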

Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
1. W ← 0.
2. Initialize reservoir array A[1..r] with r dummy values.
3. While tuples are streaming by do begin
   (a) get next tuple t with weight w(t);
   (b) W ← W + w(t);
   (c) for j = 1 to r do set A[j] to t with probability w(t)/W
   end.
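A minimal Python sketch of Black-Box WR2, assuming W is the running total of the weights seen so far. The name `black_box_wr2`, the `weight` callback, and the choice to skip tuples of non-positive weight (which avoids dividing by zero before any weight has accumulated) are assumptions made for illustration.

```python
import random


def black_box_wr2(stream, r, weight, rng=random):
    """One-pass weighted with-replacement sample of size r.

    Assumption: W is the sum of the weights seen so far, so a tuple t
    overwrites each slot independently with probability w(t)/W.
    """
    total = 0.0                # W in the pseudocode
    reservoir = [None] * r     # A[1..r], filled with dummy values
    for t in stream:
        w = weight(t)
        if w <= 0:
            continue           # assumption: such tuples can never be sampled
        total += w
        for j in range(r):
            if rng.random() < w / total:
                reservoir[j] = t
    return reservoir
```

With all weights equal to 1 this reduces to the unweighted reservoir, since w(t)/W is then 1 over the number of tuples seen.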

The Classification of the Problem:
Case A: No information is available for either R1 or R2.
Case B: No information is available for R1, but indexes and/or statistics are available for R2.
Case C: Indexes/statistics are available for both R1 and R2.

Previous Sampling Strategies: Strategy Naive-Sample
1. Compute the join J = R1 ⋈ R2.
2. As the tuples of J stream by, use Black-Box U1 or U2 to produce a WR sample of J of size r.
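A rough Python sketch of the naive strategy, assuming an equi-join of two relations (called `r1` and `r2` here) on their first attribute, with tuples of the form (A, B) and (A, C). The helper `join_stream` and the in-memory hash join are illustrative assumptions; the sampling loop is the one-pass reservoir of Black-Box U2.

```python
import random
from collections import defaultdict


def join_stream(r1, r2):
    """Yield every tuple (a, b, c) of the equi-join on the first attribute."""
    by_key = defaultdict(list)
    for a, c in r2:
        by_key[a].append(c)
    for a, b in r1:
        for c in by_key.get(a, []):
            yield (a, b, c)


def naive_sample(r1, r2, r, rng=random):
    """Compute the full join, then reservoir-sample r tuples with replacement."""
    seen = 0
    reservoir = [None] * r
    for t in join_stream(r1, r2):
        seen += 1
        for j in range(r):
            if rng.random() < 1.0 / seen:
                reservoir[j] = t
    return reservoir
```

The cost is dominated by materialising every join tuple, even though only r of them are kept.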

Previous Sampling Strategies: Strategy Olken-Sample
Here A is the join attribute and m2(v) is the number of tuples in R2 with A = v.
1. Let M be an upper bound on m2(v) for all values v.
2. repeat
   (a) Sample a tuple t1 ∈ R1 uniformly at random.
   (b) Sample a random tuple t2 ∈ R2 from among all tuples t ∈ R2 that have t.A = t1.A.
   (c) Output t1 ⋈ t2 with probability m2(t2.A)/M, and with the remaining probability reject the sample.
   until r tuples have been produced.
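A minimal Python sketch of the Olken-style rejection loop, under several assumptions that the slide does not spell out: the relations are called `r1` and `r2`, they join on their first attribute, `r1` is held in a list so that uniform random access is possible, a dictionary stands in for the index on `r2`, and M is taken to be the largest number of `r2` tuples sharing one join value. The early return for an empty join is an added safeguard so the loop always terminates.

```python
import random
from collections import defaultdict


def olken_sample(r1, r2, r, rng=random):
    """Rejection-sample r join tuples (a, b, c), with replacement."""
    index = defaultdict(list)          # stand-in for an index on r2
    for a, c in r2:
        index[a].append(c)
    if not any(a in index for a, _ in r1):
        return []                      # empty join: nothing can be accepted
    m_max = max(len(cs) for cs in index.values())   # upper bound M
    out = []
    while len(out) < r:
        a, b = rng.choice(r1)          # (a) uniform tuple from r1
        matches = index.get(a)
        if not matches:
            continue                   # no partner in r2: reject
        c = rng.choice(matches)        # (b) uniform matching tuple from r2
        # (c) accept with probability (number of matches) / M
        if rng.random() < len(matches) / m_max:
            out.append((a, b, c))
    return out
```

The acceptance test is what cancels the bias of step (b): values with few matches in `r2` are accepted less often, so every join tuple ends up equally likely, at the price of wasted iterations.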

New Strategies for Join Sampling: Strategy Stream-Sample is more efficient than Olken-Sample:
1. No information is required for R1 (Case B).
2. No tuple is rejected after computing the join.
3. Only one iteration is needed for each output tuple.

New Strategies for Join Sampling: Strategy Stream-Sample
1. Use Black-Box WR1 or WR2 to produce a WR sample S1 ⊆ R1 of size r, where the weight w(t) for a tuple t ∈ R1 is set to m2(t.A).
2. While tuples of S1 are streaming by do begin
   (a) get next tuple t1 and let v = t1.A;
   (b) sample a random tuple t2 ∈ R2 from among all tuples t ∈ R2 that have t.A = v;
   (c) output t1 ⋈ t2.
   end.
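A minimal Python sketch of Stream-Sample, assuming the relations are called `r1` and `r2`, join on their first attribute, and that a dictionary built from `r2` can stand in for both its index and its frequency statistics. The weight of an `r1` tuple is taken to be the number of `r2` tuples sharing its join value; all names are illustrative.

```python
import random
from collections import defaultdict


def stream_sample(r1_stream, r2, r, rng=random):
    """Sample r join tuples (a, b, c) with replacement in one pass over r1."""
    index = defaultdict(list)          # stand-in for index + statistics on r2
    for a, c in r2:
        index[a].append(c)

    # Step 1: weighted WR reservoir over r1, weight = matches in r2.
    total = 0
    s1 = [None] * r
    for t1 in r1_stream:
        w = len(index.get(t1[0], ()))
        if w == 0:
            continue                   # tuple joins with nothing
        total += w
        for j in range(r):
            if rng.random() < w / total:
                s1[j] = t1

    # Step 2: pair each sampled r1 tuple with one uniform partner from r2.
    out = []
    for t1 in s1:
        if t1 is None:
            continue                   # only happens when the join is empty
        a, b = t1
        out.append((a, b, rng.choice(index[a])))
    return out
```

Writing m for the number of `r2` matches of a sampled tuple and |J| for the join size, a given join tuple is produced with probability (m/|J|)·(1/m) = 1/|J|, so no rejection step is needed.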

New Strategies for Join Sampling: Strategy Group-Sample
1. Use Black-Box WR1 or WR2 to produce a WR sample S1 ⊆ R1 of size r, where the weight w(t) for a tuple t ∈ R1 is set to m2(t.A).
2. Let S1 consist of the tuples s1, ..., sr. Produce S2 = S1 ⋈ R2, whose tuples are grouped by the tuples of S1 that generated them.
3. Use r invocations of Black-Box U1 or U2 to sample r tuples, one from each group.
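A minimal Python sketch of Group-Sample, assuming the relations are called `r1` and `r2`, join on their first attribute, and that only frequency statistics (not an index) are available for `r2`. The helper `m2_statistics` is hypothetical and simply counts join values; the grouping is done implicitly by keeping one size-1 reservoir per sampled `r1` tuple while `r2` streams by.

```python
import random
from collections import Counter, defaultdict


def m2_statistics(r2):
    """Hypothetical helper: frequency of each join value in r2."""
    return Counter(a for a, _ in r2)


def group_sample(r1_stream, r2_stream, m2, r, rng=random):
    """Sample r join tuples (a, b, c) with one pass over each relation."""
    # Step 1: weighted WR reservoir over r1, weight = frequency in r2.
    total = 0
    s1 = [None] * r
    for t1 in r1_stream:
        w = m2.get(t1[0], 0)
        if w == 0:
            continue
        total += w
        for j in range(r):
            if rng.random() < w / total:
                s1[j] = t1

    # Step 2: each slot of s1 defines one group of the join s1 x r2.
    slots_by_key = defaultdict(list)
    for i, t1 in enumerate(s1):
        if t1 is not None:
            slots_by_key[t1[0]].append(i)

    # Step 3: one unweighted size-1 reservoir per group, fed by r2.
    picked = [None] * r
    seen = [0] * r
    for a, c in r2_stream:
        for i in slots_by_key.get(a, ()):
            seen[i] += 1
            if rng.random() < 1.0 / seen[i]:
                picked[i] = c
    return [(s1[i][0], s1[i][1], picked[i]) for i in range(r) if seen[i] > 0]
```

For example, `group_sample(iter(r1), iter(r2), m2_statistics(r2), 5)` returns five join tuples when the join is non-empty. Compared with the index-based variant, `r2` is only ever read sequentially.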

New Strategy for Join Sampling Strategy Frequency-Partition-Sample

Experimental Results:

Summary
The difficulty of join sampling: an example.
The classification of the problem: 3 cases.
Previous strategies: Naive-Sample, Olken-Sample.
New strategies: Stream-Sample, Group-Sample, Frequency-Partition-Sample.
Conclusion: The new strategies are better than the earlier techniques.