
Join Size The join size is the number of tuples produced by joining two relations. Let $t$ be the number of possible values the join attribute can take on. Let $a_i$ be the number of items in relation $A$ with value $i$, and let $b_i$ be defined similarly for $B$. Then the join size is $\sum_{i=1}^{t} a_i b_i$. Note that for a self-join, $a_i = b_i$, and thus this reduces to $\sum_i a_i a_i = \sum_i a_i^2$.
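For concreteness, here is a small Python sketch of these two quantities (the helper names `join_size` and `self_join_size` are ours, not from the original slides):

```python
from collections import Counter

def join_size(A, B):
    """Exact join size: sum over join-attribute values v of a_v * b_v,
    where a_v and b_v count occurrences of v in each relation."""
    a, b = Counter(A), Counter(B)
    return sum(count * b[v] for v, count in a.items())  # Counter returns 0 for missing v

def self_join_size(A):
    """Self-join size: sum of squared value counts, i.e. ||A||^2."""
    return sum(c * c for c in Counter(A).values())

print(join_size([1, 1, 2], [1, 2, 2]))  # 2*1 + 1*2 = 4
print(self_join_size([1, 1, 2]))        # 2^2 + 1^2 = 5
```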

Motivation The standard method for estimating the join size uses random samples as “signatures”: each tuple of the two relations is sampled independently with probability $p$. The join size of the two signatures is then computed and scaled by $p^{-2}$. In order to get a good estimate with high probability, we need $cn^2/B$ samples (with $n$ the number of tuples per relation), where $B$ is a lower bound on the join size and $c > 3$ is a constant.
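A minimal sketch of this sampling baseline, assuming independent Bernoulli sampling of each tuple (`sampled_join_estimate` is a hypothetical name; `join_size` is from the sketch above):

```python
import random

def sampled_join_estimate(A, B, p, seed=0):
    """Baseline: Bernoulli-sample each relation with probability p,
    join the two samples ("signatures"), and scale by p**-2."""
    rng = random.Random(seed)
    sig_a = [v for v in A if rng.random() < p]
    sig_b = [v for v in B if rng.random() < p]
    return join_size(sig_a, sig_b) / (p * p)
```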

Using Self Join Estimates Instead, we use a generalization of the “tug-of-war” self-join size estimator that we saw earlier. We simply compute a tug-of-war sketch for each of the two relations and multiply the two sketch values together to obtain the join size estimate. Recall that in the self-join algorithm we squared the sketch value, because we were joining a relation to itself.

Algorithm Let i be four-wise independent {-1,1}-valued random variables. Let S(A) = i i ai and S(B) = i i bi . Our estimator is then S(A) S(B). Note that we use the same i for both sums.

Expectation and Variance Expected value of the estimate: $E[S(A) \cdot S(B)]$ is exactly the join size of $A$ and $B$. Recall that if we view $A$ as the vector of the $a_i$'s, then the self-join size of the relation is $\|A\|^2$ (the squared magnitude). Variance of the estimate: $\mathrm{Var}(S(A) \cdot S(B)) \le 2\,\|A\|^2\,\|B\|^2$.
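Both claims are easy to check empirically with the sketch above (a quick Monte Carlo sanity check, not part of the original slides):

```python
import random
import statistics

# Reuses tug_of_war_estimate from the sketch above.
A, B = [1, 1, 2, 3], [1, 2, 2, 3]   # true join size = 5; ||A||^2 = ||B||^2 = 6
domain = range(1, 4)
est = [tug_of_war_estimate(A, B, domain, random.Random(k)) for k in range(20000)]
print(statistics.mean(est))      # close to 5 (the estimator is unbiased)
print(statistics.variance(est))  # at most 2 * ||A||^2 * ||B||^2 = 72
```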

Proof of Expected Value $E[S(A) \cdot S(B)] = E[\sum_i \varepsilon_i^2 a_i b_i + \sum_{i \ne j} \varepsilon_i \varepsilon_j a_i b_j]$. Note that if $i \ne j$, then $E[\varepsilon_i \varepsilon_j] = 0$, and if $i = j$, then $E[\varepsilon_i \varepsilon_j] = E[\varepsilon_i^2] = 1$. Thus we get $E[S(A) \cdot S(B)] = E[\sum_i a_i b_i]$, and this is just $\sum_i a_i b_i$, which is the join size.

Proof of Variance Let $X = S(A)S(B) - E[S(A)S(B)] = \sum_{i \ne j} \varepsilon_i \varepsilon_j a_i b_j$. Then $E[X^2] = \mathrm{Var}(S(A)S(B))$. Note that when we square the above summation, any term which contains an $\varepsilon$ raised to an odd power has expectation 0 (this is where four-wise independence is needed), so these terms are eliminated. In the remaining terms, the product of the $\varepsilon$'s is always 1.

Proof of Variance II Every term looks like $\varepsilon_i \varepsilon_j \varepsilon_{i'} \varepsilon_{j'} a_i b_j a_{i'} b_{j'}$, where $i \ne j$ and $i' \ne j'$. In order for all the $\varepsilon$-powers to be even, either $i = i'$ and $j = j'$, or $i = j'$ and $j = i'$. Thus we are left with: $\mathrm{Var}(S(A)S(B)) = \sum_{i \ne j} a_i^2 b_j^2 + \sum_{i \ne j} a_i b_i a_j b_j$. Note that $\sum_{i \ne j} a_i^2 b_j^2 \le \sum_i a_i^2 \sum_j b_j^2$ (we just drop the requirement that $i \ne j$), and this equals $\|A\|^2 \|B\|^2$.

Proof of Variance III Also, $\sum_{i \ne j} a_i b_i a_j b_j \le \sum_i a_i b_i \sum_j a_j b_j = (\sum_i a_i b_i)^2$, and this equals $\langle A, B \rangle^2$ (the square of the inner product of $A$ and $B$). By the Cauchy-Schwarz inequality, $\langle A, B \rangle^2 \le \|A\|^2 \|B\|^2$. Thus $\mathrm{Var}(S(A)S(B)) \le 2\,\|A\|^2\,\|B\|^2$.
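Putting the three steps together, the whole variance bound in one display:

```latex
\begin{align*}
\mathrm{Var}(S(A)S(B))
  &= \sum_{i \ne j} a_i^2 b_j^2 \;+\; \sum_{i \ne j} a_i b_i\, a_j b_j \\
  &\le \Big(\sum_i a_i^2\Big)\Big(\sum_j b_j^2\Big) + \Big(\sum_i a_i b_i\Big)^2
   = \|A\|^2\|B\|^2 + \langle A, B\rangle^2 \\
  &\le 2\,\|A\|^2\,\|B\|^2 \quad \text{(Cauchy-Schwarz)}.
\end{align*}
```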

Repeated trials If we want constant relative error with high probability, we can repeat the experiment multiple times and take the mean of the results. Specifically, we need $c \cdot \|A\|^2 \|B\|^2 / b^2$ trials, where $c > 2$ is a constant determined by the desired accuracy and confidence, and $b$ is a lower bound on the join size. (The $b^2$ comes from Chebyshev: the mean of $s$ independent trials has variance at most $2\,\|A\|^2\|B\|^2/s$, and we need this to be small relative to the squared join size, which is at least $b^2$.)
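A sketch of this repeated-trials wrapper, following the $c\,\|A\|^2\|B\|^2/b^2$ trial count (`estimate_join_size` is an illustrative name; it reuses `self_join_size` and `tug_of_war_estimate` from the earlier sketches):

```python
import random
import statistics

def estimate_join_size(A, B, domain, b, c=3, seed=0):
    """Mean of c * ||A||^2 * ||B||^2 / b^2 independent tug-of-war trials,
    where b is a lower bound on the true join size."""
    trials_needed = max(1, c * self_join_size(A) * self_join_size(B) // (b * b))
    trials = [tug_of_war_estimate(A, B, domain, random.Random(seed + k))
              for k in range(trials_needed)]
    return statistics.mean(trials)

print(estimate_join_size([1, 1, 2, 3], [1, 2, 2, 3], range(1, 4), b=4))  # true join size is 5
```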

Future Work Exploring other approaches; extending the analysis to three-way joins; experimental results for the “tug-of-war” join scheme.