
Join Size The join size is the number of tuples produced by joining two relations. Let $t$ be the number of possible values the join attribute can take on. Let $a_i$ be the number of items in relation $A$ with value $i$, and let $b_i$ be defined similarly for $B$. Then the join size is $\sum_{i=1}^{t} a_i b_i$. Note that for a self-join, $a_i = b_i$, and thus this reduces to $\sum_i a_i a_i = \sum_i a_i^2$.
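For concreteness, here is a small Python sketch of these two quantities (the helper names `join_size` and `self_join_size` are ours, not from the original slides):

```python
from collections import Counter

def join_size(A, B):
    """Exact join size: sum over join-attribute values v of a_v * b_v,
    where a_v and b_v count occurrences of v in each relation."""
    a, b = Counter(A), Counter(B)
    return sum(count * b[v] for v, count in a.items())  # Counter returns 0 for missing v

def self_join_size(A):
    """Self-join size: sum of squared value counts, i.e. ||A||^2."""
    return sum(c * c for c in Counter(A).values())

print(join_size([1, 1, 2], [1, 2, 2]))  # 2*1 + 1*2 = 4
print(self_join_size([1, 1, 2]))        # 2^2 + 1^2 = 5
```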

Motivation The standard method for estimating the join size uses random samples as “signatures”: each tuple of the two relations is sampled independently with probability $p$. The join size of the two signatures is then computed and scaled by $p^{-2}$. In order to get a good estimate with high probability, we need $cn^2/B$ samples (with $n$ the number of tuples per relation), where $B$ is a lower bound on the join size and $c > 3$ is a constant.
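A minimal sketch of this sampling baseline, assuming independent Bernoulli sampling of each tuple (`sampled_join_estimate` is a hypothetical name; `join_size` is from the sketch above):

```python
import random

def sampled_join_estimate(A, B, p, seed=0):
    """Baseline: Bernoulli-sample each relation with probability p,
    join the two samples ("signatures"), and scale by p**-2."""
    rng = random.Random(seed)
    sig_a = [v for v in A if rng.random() < p]
    sig_b = [v for v in B if rng.random() < p]
    return join_size(sig_a, sig_b) / (p * p)
```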

Using Self Join Estimates Instead, we use a generalization of the “tug-of-war” self-join size estimator that we saw earlier. We simply compute a tug-of-war sketch for each of the two relations and multiply the two sketch values together to obtain the join size estimate. Recall that in the self-join algorithm we squared the sketch value, because we were joining a relation to itself.

Algorithm Let i be four-wise independent {-1,1}-valued random variables. Let S(A) = i i ai and S(B) = i i bi . Our estimator is then S(A) S(B). Note that we use the same i for both sums.

Expectation and Variance Expected value of the estimate: $E[S(A) \cdot S(B)]$ is exactly the join size of $A$ and $B$. Recall that if we view $A$ as the vector of the $a_i$'s, then the self-join size of the relation is $\|A\|^2$ (the squared magnitude). Variance of the estimate: $\mathrm{Var}(S(A) \cdot S(B)) \le 2\,\|A\|^2\,\|B\|^2$.
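Both claims are easy to check empirically with the sketch above (a quick Monte Carlo sanity check, not part of the original slides):

```python
import random
import statistics

# Reuses tug_of_war_estimate from the sketch above.
A, B = [1, 1, 2, 3], [1, 2, 2, 3]   # true join size = 5; ||A||^2 = ||B||^2 = 6
domain = range(1, 4)
est = [tug_of_war_estimate(A, B, domain, random.Random(k)) for k in range(20000)]
print(statistics.mean(est))      # close to 5 (the estimator is unbiased)
print(statistics.variance(est))  # at most 2 * ||A||^2 * ||B||^2 = 72
```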

Proof of Expected Value $E[S(A) \cdot S(B)] = E[\sum_i \varepsilon_i^2 a_i b_i + \sum_{i \ne j} \varepsilon_i \varepsilon_j a_i b_j]$. Note that if $i \ne j$, then $E[\varepsilon_i \varepsilon_j] = 0$, and if $i = j$, then $E[\varepsilon_i \varepsilon_j] = E[\varepsilon_i^2] = 1$. Thus we get $E[S(A) \cdot S(B)] = E[\sum_i a_i b_i]$, and this is just $\sum_i a_i b_i$, which is the join size.

Proof of Variance Let $X = S(A)S(B) - E[S(A)S(B)] = \sum_{i \ne j} \varepsilon_i \varepsilon_j a_i b_j$. Then $E[X^2] = \mathrm{Var}(S(A)S(B))$. Note that when we square the above summation, any term which contains an $\varepsilon$ raised to an odd power has expectation 0 (this is where four-wise independence is needed), so these terms are eliminated. In the remaining terms, the product of the $\varepsilon$'s is always 1.

Proof of Variance II Every term looks like $\varepsilon_i \varepsilon_j \varepsilon_{i'} \varepsilon_{j'} a_i b_j a_{i'} b_{j'}$, where $i \ne j$ and $i' \ne j'$. In order for all the $\varepsilon$-powers to be even, either $i = i'$ and $j = j'$, or $i = j'$ and $j = i'$. Thus we are left with: $\mathrm{Var}(S(A)S(B)) = \sum_{i \ne j} a_i^2 b_j^2 + \sum_{i \ne j} a_i b_i a_j b_j$. Note that $\sum_{i \ne j} a_i^2 b_j^2 \le \sum_i a_i^2 \sum_j b_j^2$ (we just drop the requirement that $i \ne j$), and this equals $\|A\|^2 \|B\|^2$.

Proof of Variance III Also, $\sum_{i \ne j} a_i b_i a_j b_j \le \sum_i a_i b_i \sum_j a_j b_j = (\sum_i a_i b_i)^2$, and this equals $\langle A, B \rangle^2$ (the square of the inner product of $A$ and $B$). By the Cauchy-Schwarz inequality, $\langle A, B \rangle^2 \le \|A\|^2 \|B\|^2$. Thus $\mathrm{Var}(S(A)S(B)) \le 2\,\|A\|^2\,\|B\|^2$.
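Putting the three steps together, the whole variance bound in one display:

```latex
\begin{align*}
\mathrm{Var}(S(A)S(B))
  &= \sum_{i \ne j} a_i^2 b_j^2 \;+\; \sum_{i \ne j} a_i b_i\, a_j b_j \\
  &\le \Big(\sum_i a_i^2\Big)\Big(\sum_j b_j^2\Big) + \Big(\sum_i a_i b_i\Big)^2
   = \|A\|^2\|B\|^2 + \langle A, B\rangle^2 \\
  &\le 2\,\|A\|^2\,\|B\|^2 \quad \text{(Cauchy-Schwarz)}.
\end{align*}
```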

Repeated trials If we want constant relative error with high probability, we can repeat the experiment multiple times and take the mean of the results. Specifically, we need $c \cdot \|A\|^2 \|B\|^2 / b^2$ trials, where $c > 2$ is a constant determined by the desired accuracy and confidence, and $b$ is a lower bound on the join size. (The $b^2$ comes from Chebyshev: the mean of $s$ independent trials has variance at most $2\,\|A\|^2\|B\|^2/s$, and we need this to be small relative to the squared join size, which is at least $b^2$.)
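A sketch of this repeated-trials wrapper, following the $c\,\|A\|^2\|B\|^2/b^2$ trial count (`estimate_join_size` is an illustrative name; it reuses `self_join_size` and `tug_of_war_estimate` from the earlier sketches):

```python
import random
import statistics

def estimate_join_size(A, B, domain, b, c=3, seed=0):
    """Mean of c * ||A||^2 * ||B||^2 / b^2 independent tug-of-war trials,
    where b is a lower bound on the true join size."""
    trials_needed = max(1, c * self_join_size(A) * self_join_size(B) // (b * b))
    trials = [tug_of_war_estimate(A, B, domain, random.Random(seed + k))
              for k in range(trials_needed)]
    return statistics.mean(trials)

print(estimate_join_size([1, 1, 2, 3], [1, 2, 2, 3], range(1, 4), b=4))  # true join size is 5
```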

Future Work Exploring other approaches; extending the analysis to three-way joins; experimental results for the “tug-of-war” join scheme.