OLAP: Blitzkrieg Introduction
Three characteristics of OLAP cubes:
Large data sets (~GBs, TBs)
Expected query: aggregation
Infrequent updates
Star schema: hierarchical dimensions

Attributes and Measures
Attributes are columns with values from a fixed domain (foreign keys). Measures are numerical columns.

Imprecision and Uncertainty
Imprecision in a tuple refers to an attribute instantiated by a set of values from the domain instead of a single value. Uncertainty refers to a measure represented by a pdf over the domain instead of a single value.
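To make the two notions concrete, here is a minimal sketch of what such a fact could look like in code; the field names and example values (location, repairs) are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    """Illustrative fact: one imprecise attribute, one uncertain measure."""
    location: frozenset  # imprecise: a set of possible domain values
    repairs: dict        # uncertain: a pdf over the measure domain

# A fact known only to be somewhere in Maharashtra, with an uncertain count
f = Fact(location=frozenset({"Mumbai", "Pune"}),
         repairs={0: 0.7, 1: 0.3})
```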

Hierarchical Domains: Star Schema
(Figure: example hierarchy for the Location dimension. Location splits into Maharashtra and Madhya Pradesh; Maharashtra into Mumbai and Pune, Madhya Pradesh into Bhopal and Indore.)

Restriction on Imprecision
We restrict the sets of values in an imprecise fact to either:
1. a singleton set consisting of a leaf-level member of the hierarchy, or
2. the set of all leaf-level members under some non-leaf-level member of the hierarchy.

Cells and Regions
A region is a vector of attribute values, one from the (possibly imprecise) domain of each dimension of the cube. A cell is a region in which all values are leaf-level members. Let reg(R) denote the set of cells in a region R.

Queries on Precise Data
A query Q = (R, M, A) refers to a region R, a measure M, and an aggregate function A, e.g. (…, Repairs, Sum). The result of the query on a precise database is obtained by applying A to the measure M of all cells in R. For the example above, the result is P1 + P2.
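As a quick illustration, a query over precise data is just an aggregate over the cells of a region. The sketch below assumes a toy cube stored as a cell-to-measures mapping; the cell keys and values are invented:

```python
def precise_query(cube, region, measure, aggregate):
    # cube: dict mapping cell -> {measure name: value}
    # region: iterable of cells; aggregate: e.g. sum, min, max
    return aggregate(cube[cell][measure] for cell in region)

cube = {("Mumbai", "Civic"): {"Repairs": 120},  # stand-in for P1
        ("Pune", "Civic"): {"Repairs": 80}}     # stand-in for P2
print(precise_query(cube, cube.keys(), "Repairs", sum))  # 200
```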

Queries on Imprecise Data
Consider the query region in the figure. It overlaps two imprecise facts, P4 and P5. Here we must choose among three strategies:
None: ignore both P4 and P5 because of their imprecision.
Contains: take P5, because it is contained inside the query region.
Overlaps: take P5 and, somehow, P4 as well.

Contains Option: Consistency
Intuitively, consistency means that the answer to a query should agree with the aggregates over the individual partitions of the query. Using the Contains option can give rise to inconsistent results. For example, compare the Sum aggregate of the query above with the sums over its individual cells: an imprecise fact contained in the whole region is not contained in any single cell, so the individual results need not add up to the collective one.

None Option
Essentially, the None option ignores the imprecise facts, even when a fact lies completely inside the query region. This defeats the whole point of recording imprecise facts.

Overlaps option : Possible Worlds

Query Semantics over Possible Worlds
Assign each possible world i a weight w_i such that the weights sum to 1. Intuitively, the weight of a world is the probability that it is the correct underlying data. Given a query Q, we can compute its result v_i in each world i, and thus return a pdf over the answer Z:
P[Z = o] = Σ_{i : v_i = o} w_i
A neat short answer is the expected value of Z:
E[Z] = Σ_i w_i · v_i
The problem with this: the number of possible worlds is exponential in the number of imprecise facts!
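A brute-force sketch of these semantics for a Sum query. The facts, cells, weights, and measure values below are invented for illustration, and the world enumeration is exactly the exponential blowup the slide warns about:

```python
from itertools import product

# Each imprecise fact maps its candidate cells to weights.
imprecise = [
    {"Mumbai": 0.5, "Pune": 0.5},    # stand-in for fact P4
    {"Bhopal": 0.5, "Indore": 0.5},  # stand-in for fact P5
]
values = [3, 7]                      # measure values of P4, P5
query_cells = {"Mumbai", "Bhopal"}   # cells covered by the query region

# Enumerate every possible world: one cell choice per imprecise fact.
answer_pdf = {}
for world in product(*(f.items() for f in imprecise)):
    w, v = 1.0, 0
    for (cell, p), value in zip(world, values):
        w *= p                       # world weight: product of choices
        if cell in query_cells:
            v += value               # Sum-query answer in this world
    answer_pdf[v] = answer_pdf.get(v, 0.0) + w

print(answer_pdf)                                  # P[Z = o] per outcome o
print(sum(o * w for o, w in answer_pdf.items()))   # E[Z] = 5.0
```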

Solution: Extended Data Model
With each cell c in a region r we store a probability p_{c,r}, called the allocation of r to c. The probability of a possible world w then becomes the product of the allocations of regions to the cells they populate in that world: P[w] = Π_r p_{c_w(r), r}, where c_w(r) is the cell occupied by fact r in world w. This amounts to a (reasonable) restriction on the kinds of probability distributions allowed over possible worlds.

Advantages of the EDM
No extra infrastructure is required for representing imprecision, and there are efficient algorithms for aggregate queries:
SUM and COUNT: linear-time algorithms.
AVERAGE: a slightly more complicated algorithm running in O(m + n³) for m precise facts and n imprecise facts.
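For intuition on why SUM is linear: by linearity of expectation, each fact contributes its value weighted by the allocation mass it places inside the query region, so one pass suffices. A minimal sketch under an assumed fact layout (not the paper's code):

```python
def expected_sum(facts, query_cells):
    """E[SUM] over the allocation-weighted possible worlds.
    facts: list of (value, alloc), where alloc maps cell -> p_{c,r};
    a precise fact is simply ({its cell: 1.0})."""
    total = 0.0
    for value, alloc in facts:
        # Contribution = value * allocation mass inside the query region.
        total += value * sum(p for cell, p in alloc.items()
                             if cell in query_cells)
    return total
```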

Allocation Policies
For every region r in the database, we want to assign an allocation p_{c,r} to each cell c in reg(r) such that Σ_{c ∈ reg(r)} p_{c,r} = 1. Three ways of doing so:
1. Uniform: assign each cell c in a region r equal probability, p_{c,r} = 1 / |reg(r)|.

Allocation Policies (contd.)
However, we can do better: some cells are naturally inclined to carry more probability than others. E.g., Mumbai will clearly have more repairs than Bhopal. We can capture this automatically by giving more probability to cells containing more precise facts.
2. Count-based: p_{c,r} = N_c / Σ_{c' ∈ reg(r)} N_{c'}, where N_c is the number of precise facts in cell c.
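A sketch of these first two policies; the fall-back to uniform when a region contains no precise facts is my assumption, not something the slides specify:

```python
def uniform_allocation(region_cells):
    # Policy 1: every cell of reg(r) gets the same mass.
    p = 1.0 / len(region_cells)
    return {c: p for c in region_cells}

def count_based_allocation(region_cells, precise_counts):
    # Policy 2: mass proportional to N_c, the number of precise facts in c.
    total = sum(precise_counts.get(c, 0) for c in region_cells)
    if total == 0:
        return uniform_allocation(region_cells)  # assumed fall-back
    return {c: precise_counts.get(c, 0) / total for c in region_cells}

print(count_based_allocation({"Mumbai", "Bhopal"},
                             {"Mumbai": 3, "Bhopal": 1}))
# {'Mumbai': 0.75, 'Bhopal': 0.25}
```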

Allocation Policies (contd.)
Again, we can arguably do better still by looking not just at the count but at the actual values of the measure in question.
3. Measure-based: next slide.

Measure-Based Allocation
This assumes the following model: the given database D with imprecise facts was generated by randomly injecting imprecision into a precise database D'. D' assigns value o to a cell c according to some unknown pdf P(o, c). If we could determine this pdf, the allocation would simply be
p_{c,r} = P(c) / Σ_{c' ∈ reg(r)} P(c').

Maximum Likelihood Principle
A reasonable estimate for this function P is the one that maximises the probability of generating the given imprecise data set D. Example: suppose the pdf depends only on the cells and is independent of the measure values, so that it is a mapping θ : C → ℝ, where C is the set of cells. This pdf can be found by maximising the likelihood function
ℒ(θ) = Π_{r ∈ D} Σ_{c ∈ reg(r)} θ(c)

EM Algorithm
The Expectation-Maximization algorithm provides a standard way of maximizing the likelihood when some variables in the observation set are unknown.
Expectation step (compute the data): calculate the expected values of the unknown variables, given the current estimates of the parameters.
Maximization step (compute the generator): calculate the distribution that maximizes the probability of the currently estimated data set.

EM Algorithm: Example
Estimate the mean of the data [4, 10, ?, ?], filling the unknowns with the current mean at each step.
Initialization: initial mean 0; new data [4, 10, 0, 0].
Step 1: new mean 3.5; new data [4, 10, 3.5, 3.5].
Step 2: new mean 5.25; new data [4, 10, 5.25, 5.25].
Step 3: new mean 6.125; new data [4, 10, 6.125, 6.125].
Step 4: new mean 6.5625; new data [4, 10, 6.5625, 6.5625].
Step 5: new mean 6.78125; new data [4, 10, 6.78125, 6.78125].
Result: the mean converges to 7.
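The same toy iteration as code, a minimal sketch of EM-style mean imputation; the fixed point is 7, since m = (14 + 2m)/4 gives m = 7:

```python
def em_mean_imputation(observed, n_missing, iters=50):
    """E-step: fill missing entries with the current mean estimate.
    M-step: recompute the mean over the completed data."""
    mean = 0.0  # initial guess, as on the slide
    for _ in range(iters):
        completed = observed + [mean] * n_missing  # E-step
        mean = sum(completed) / len(completed)     # M-step
    return mean

print(em_mean_imputation([4, 10], 2))  # -> 6.999..., converging to 7
```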

EM Algorithm: Application
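A sketch of how EM could be applied to the likelihood ℒ(θ) above; this is my reading of the E/M steps for the count-only model, not necessarily the paper's exact algorithm. The E-step quantities are precisely the allocations p_{c,r}:

```python
def em_allocation(regions, cells, iters=100):
    # regions: list of reg(r), each a set of cells; theta: pdf over cells.
    theta = {c: 1.0 / len(cells) for c in cells}  # uniform start
    for _ in range(iters):
        # E-step: allocate each region r over its cells under theta.
        gamma = []
        for r in regions:
            z = sum(theta[c] for c in r)
            gamma.append({c: theta[c] / z for c in r})
        # M-step: theta(c) proportional to the allocation mass c receives.
        theta = {c: 0.0 for c in cells}
        for g in gamma:
            for c, p in g.items():
                theta[c] += p / len(regions)
    return theta

# Two precise facts in Mumbai plus one fact imprecise over Maharashtra:
# Mumbai ends up with nearly all the probability mass.
print(em_allocation([{"Mumbai"}, {"Mumbai"}, {"Mumbai", "Pune"}],
                    ["Mumbai", "Pune"]))
```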

Experiments: Allocation run time

Experiments: Query run time

Experiments: Accuracy

Summary
Model for ambiguity: imprecision, uncertainty.
Querying imprecise data: None vs. Contains vs. Overlaps; consistency, faithfulness.
Possible-worlds interpretation: size blowup.
Extended data model: allocation; aggregation algorithms on extended databases.
Allocation policies: uniform, count-based, measure-based (EM algorithm).
Experiments: allocation time, query time, accuracy.

References:
1. Doug Burdick et al. OLAP over uncertain and imprecise data. The VLDB Journal (2007) 16:123–144.
2. Doug Burdick et al. OLAP over uncertain and imprecise data. VLDB 2005.
3. Wikipedia: Expectation–maximization algorithm, https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm