OLAP over Uncertain and Imprecise Data

OLAP over Uncertain and Imprecise Data
Doug Burdick, Prasad Deshpande, T. S. Jayram, Raghu Ramakrishnan, Shivakumar Vaithyanathan
Presented by Raghav Sagar

OLAP Overview
Online Analytical Processing (OLAP): interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion.
Databases configured for OLAP use a multidimensional data model:
  Measures: numerical facts that can be measured and aggregated.
  Dimensions: categorize the measures (each dimension defines a property of the measure).

OLAP Data Hypercube (No. of Dimensions = 3)

Motivation
Generalize the OLAP model to address imprecise dimension values and uncertain measure values.
Answer aggregation queries over ambiguous data.

Definitions
Uncertain Domains: an uncertain domain U over a base domain O is the set of all possible probability distribution functions (PDFs) over O.
Imprecise Domains: an imprecise domain I over a base domain B is a subset of the power set of B with ∅ ∉ I (elements of I are called imprecise values).
Hierarchical Domains: a hierarchical domain H over a base domain B is an imprecise domain over B such that H contains every singleton set and, for any pair of elements h1, h2 ∈ H, either one contains the other (h1 ⊇ h2 or h2 ⊇ h1) or h1 ∩ h2 = ∅.
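
The imprecise and hierarchical domain conditions can be checked mechanically. The following is a minimal sketch, assuming imprecise values are encoded as Python frozensets; the Location values (MA, NY, TX, CA) and the groupings East, West and ALL are hypothetical illustrations, not data from the slides.

```python
from itertools import combinations

def is_imprecise_domain(domain, base):
    """Every imprecise value is a non-empty subset of the base domain."""
    return all(len(value) > 0 and value <= base for value in domain)

def is_hierarchical_domain(domain, base):
    """Imprecise domain that contains every singleton and is nested-or-disjoint."""
    if not is_imprecise_domain(domain, base):
        return False
    if any(frozenset([b]) not in domain for b in base):
        return False
    return all(
        h1 >= h2 or h2 >= h1 or not (h1 & h2)
        for h1, h2 in combinations(domain, 2)
    )

# Hypothetical Location dimension: four states, two regions, and ALL.
base = frozenset({"MA", "NY", "TX", "CA"})
location = {frozenset([b]) for b in base} | {
    frozenset({"MA", "NY"}),   # East
    frozenset({"TX", "CA"}),   # West
    base,                      # ALL
}

# Adding a value that straddles East and West breaks the hierarchy condition.
straddling = location | {frozenset({"NY", "TX"})}

print(is_hierarchical_domain(location, base))
print(is_hierarchical_domain(straddling, base))
```

The straddling domain is still a valid imprecise domain; it only fails the nested-or-disjoint requirement.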

Hierarchy Domains

Definitions
Fact Table Schemas: a fact table schema is <A1, A2, .. , Ak; M1, .. , Mn>, where
  Ai are dimension attributes, i ∈ {1, .. k}
  Mj are measure attributes, j ∈ {1, .. n}
Cells: a vector <c1, c2, .. , ck> is called a cell if every ci is an element of the base domain of Ai, i ∈ {1, .. k}.
Region: the region of a dimension vector <a1, a2, .. , ak> is the set of cells {<c1, c2, .. , ck> | ci ∈ ai, i ∈ {1, .. k}}; reg(r) denotes the region associated with a fact r.

Example of a Fact Table

Definitions
Queries: a query Q over a database D with schema <A1, A2, .. , Ak; M1, .. , Mn> has the form Q(a1, .. , ak; Mi, A), where
  a1, .. , ak describes the k-dimensional region being queried
  Mi describes the measure of interest
  A is an aggregation function
Query Results: the result of Q is obtained by applying the aggregation function A to a set of 'relevant' facts in D.

OLAP Data Hypercube (No. of Dimensions = 2)

Finding Relevant Facts
All precise facts within the query region are naturally included.
For imprecise facts, there are 3 options:
  None: ignore all imprecise facts.
  Contains: include only those imprecise facts whose region is contained in the query region.
  Overlaps: include all imprecise facts whose region overlaps the query region.
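
The three options can be sketched as a small filter over fact regions. This is an illustration only: regions are modelled as sets of cells built as a Cartesian product, and the fact ids, dimension values and measure values below are hypothetical.

```python
from itertools import product

def region(dim_vector):
    """Set of cells covered by a dimension vector (one set of base values per dimension)."""
    return frozenset(product(*dim_vector))

def relevant_facts(facts, query_region, option):
    """Return ids of facts used to answer a query under None / Contains / Overlaps."""
    selected = []
    for fact_id, (reg, _value) in facts.items():
        if len(reg) == 1:                      # precise fact: a single cell
            keep = reg <= query_region
        elif option == "none":
            keep = False
        elif option == "contains":
            keep = reg <= query_region
        elif option == "overlaps":
            keep = bool(reg & query_region)
        else:
            raise ValueError(f"unknown option: {option}")
        if keep:
            selected.append(fact_id)
    return selected

# Hypothetical facts over (Location, Automobile); values are made-up measures.
facts = {
    "p1": (region([{"MA"}, {"Civic"}]), 100),                # precise, inside query
    "p2": (region([{"MA", "NY"}, {"Civic"}]), 50),           # imprecise, contained
    "p3": (region([{"NY"}, {"Civic", "Camry"}]), 70),        # imprecise, overlaps only
    "p4": (region([{"TX"}, {"F150"}]), 30),                  # precise, outside query
}
query = region([{"MA", "NY"}, {"Civic"}])

for option in ("none", "contains", "overlaps"):
    print(option, relevant_facts(facts, query, option))
```

Each option selects a superset of the previous one, so the choice directly changes which facts feed the aggregation function.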

Aggregating Uncertain Measures
Aggregating PDFs is closely related to opinion pooling (providing a consensus opinion from a set of opinions).
LinOp(θ) provides a consensus PDF $\bar{P}$ that is a weighted linear combination of the PDFs in θ:
  $\bar{P}(x) = \sum_{P \in \theta} w_P \cdot P(x)$
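
A linear opinion pool over discrete PDFs can be sketched in a few lines. The sketch assumes each PDF is a dict over the same base domain and that the weights are non-negative and sum to 1 (an assumption, needed for the result to be a PDF); the two example opinions are hypothetical.

```python
def lin_op(pdfs, weights):
    """Weighted linear combination of discrete PDFs (dicts: outcome -> probability)."""
    if len(pdfs) != len(weights):
        raise ValueError("one weight per PDF is required")
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 for w in weights):
        raise ValueError("weights are assumed non-negative and summing to 1")
    outcomes = set().union(*pdfs)
    return {
        x: sum(w * pdf.get(x, 0.0) for pdf, w in zip(pdfs, weights))
        for x in outcomes
    }

# Two hypothetical opinions about a yes/no measure, pooled with equal weights.
opinions = [{"yes": 0.8, "no": 0.2}, {"yes": 0.4, "no": 0.6}]
consensus = lin_op(opinions, [0.5, 0.5])
print(consensus)
```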

Consistency
α-consistency: let a query Q be partitioned into Q1, .. , Qp such that
  reg(Q) = ∪i reg(Qi)
  reg(Qi) ∩ reg(Qj) = ∅ for every i ≠ j
α-consistency is satisfied w.r.t. A if the predicate α(q, q1, .. , qp) holds for every database D and for every such collection of queries Q, Q1, .. , Qp, where q and qi denote the answers to Q and Qi.

Consistency
Sum-consistency: $q = \sum_i q_i$ (the notion of consistency for SUM and COUNT).
Boundedness-consistency: $\min_i q_i \le q \le \max_i q_i$ (the notion of consistency for AVERAGE).
Consequences: the Contains option is unsuitable for handling imprecision, as it violates Sum-consistency.
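
A small counterexample shows why Contains breaks Sum-consistency: an imprecise fact that spans two sub-queries is counted in the parent query but in neither child. The facts and values below are hypothetical, and a single Location dimension is assumed for brevity.

```python
def sum_contains(facts, query_region):
    """SUM under the Contains option: use facts whose region lies inside the query."""
    return sum(value for reg, value in facts if reg <= query_region)

# (region, measure) pairs over one hypothetical dimension.
facts = [
    (frozenset({"MA"}), 100),          # precise
    (frozenset({"NY"}), 150),          # precise
    (frozenset({"MA", "NY"}), 50),     # imprecise: somewhere in the East
]

q = sum_contains(facts, frozenset({"MA", "NY"}))   # parent query
q1 = sum_contains(facts, frozenset({"MA"}))        # first partition
q2 = sum_contains(facts, frozenset({"NY"}))        # second partition

print(q, q1 + q2)  # prints "300 250"
```

The parent answer exceeds the sum of the partition answers by exactly the measure of the imprecise fact.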

Faithfulness
Measure-Similar Databases (D and D'): D' is obtained from database D by modifying (only) the dimension attribute values.
Identically Precise Databases (D and D'): for a query Q, for every pair of corresponding facts r ∈ D and r' ∈ D', either
  both reg(r) and reg(r') are contained in reg(Q), or
  both reg(r) and reg(r') are disjoint from reg(Q).
Basic faithfulness: identical answers for every pair of measure-similar databases D and D' that are identically precise with respect to Q.

Faithfulness
Consequences: the None option is unsuitable for handling imprecision, as it violates Basic faithfulness for Sum and Average.
Partial Order $\preceq_Q$: $I_Q(D, D')$ is a predicate that holds when D and D' are identical, except for a single pair of facts r ∈ D and r' ∈ D' with
  reg(r') = reg(r) ∪ {c}, where c ∉ reg(Q) ∪ reg(r).
The partial order $\preceq_Q$ is the reflexive, transitive closure of $I_Q$.

Faithfulness
β-faithfulness: satisfied w.r.t. an aggregate A if the predicate β(q1, .. , qp) holds for any query Q and any set of databases with $D_1 \preceq_Q D_2 \preceq_Q \cdots \preceq_Q D_p$, where qi is the answer to Q on Di.
Sum-faithfulness: if $D_i \preceq_Q D_j$, then $q_{D_i} \ge q_{D_j}$.

Possible Worlds
The possible worlds of an imprecise database D are a set of true databases {D1, D2, .. , Dp} derived from D.

Extended Data Model
Allocation: for a fact r in database D and a cell c ∈ reg(r),
  $p_{c,r}$ = probability that r is completed to c
  $\sum_{c \in \mathrm{reg}(r)} p_{c,r} = 1$
If there are k imprecise facts in D, (r1, .. , rk), the weight of a possible world D' in which each ri is completed to cell ci is
  $w = \prod_{i=1}^{k} p_{c_i, r_i}$
For all possible worlds {D1, .. , Dm}, $\sum_{i=1}^{m} w_i = 1$.
The procedure for assigning $p_{c,r}$ is referred to as an allocation policy.
The allocated database D* contains another table with schema <Id(r), r, c, $p_{c,r}$>.

Example:
  $p_{c_{(1,3)},\,p9} = 0.3$, $p_{c_{(2,3)},\,p9} = 0.7$
  $p_{c_{(3,3)},\,p10} = 0.4$, $p_{c_{(3,4)},\,p10} = 0.6$
  $w_1 = 0.3 \cdot 0.4$, $w_2 = 0.3 \cdot 0.6$, $w_3 = 0.7 \cdot 0.4$, $w_4 = 0.7 \cdot 0.6$
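
The weights in this example can be reproduced by enumerating every completion of the imprecise facts. This is a minimal sketch using the allocations for facts p9 and p10 given above; encoding cells as (row, column) tuples is an assumption made for illustration.

```python
from itertools import product

# Allocations p_{c,r} from the example: fact id -> {cell: probability}.
allocations = {
    "p9": {(1, 3): 0.3, (2, 3): 0.7},
    "p10": {(3, 3): 0.4, (3, 4): 0.6},
}

def possible_worlds(allocations):
    """Yield (completion, weight) for every way of completing the imprecise facts."""
    facts = list(allocations)
    choices = [
        [(cell, p) for cell, p in allocations[r].items() if p > 0]
        for r in facts
    ]
    for combo in product(*choices):
        completion = {r: cell for r, (cell, _p) in zip(facts, combo)}
        weight = 1.0
        for _cell, p in combo:
            weight *= p          # w = product of the chosen allocations
        yield completion, weight

worlds = list(possible_worlds(allocations))
weights = [w for _completion, w in worlds]

for completion, w in worlds:
    print(completion, round(w, 2))
print("total weight:", round(sum(weights), 2))
```

With two imprecise facts of two candidate cells each, there are 2 × 2 = 4 possible worlds, and their weights sum to 1.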

Summarizing Possible Worlds
Consider possible worlds (D1, .. , Dm) with weights (w1, .. , wm).
If the answers to query Q in these worlds form the multiset (v1, .. , vm), the answer variable Z is distributed as
  $P[Z = v_i] = \sum_{j \,:\, v_j = v_i} w_j$, with $i, j \in \{1, \ldots, m\}$
Basic faithfulness is satisfied by $E[Z]$.
But the number of possible worlds (m) is exponential:
  $m = \prod_i c_i$, where $c_i = \mathrm{Cardinality}(r_i)$

Summarizing Possible Worlds
Definitions:
  Set of cells to which fact r has positive allocations: $C(r) = \{c \mid p_{c,r} > 0\}$
  Set of candidate facts for the query Q: $R(Q) = \{r \mid C(r) \cap q \ne \emptyset\}$, where $q = \mathrm{reg}(Q)$
  For a candidate fact r, $Y_r$ is the 0-1 indicator random variable for the event that r is completed to a cell in q:
    $P[Y_r = 1] = \sum_{c \in C(r) \cap q} p_{c,r}$
  $E[Y_r]$ is the allocation of r to the query Q: $E[Y_r] = P[Y_r = 1]$

Summarizing Possible Worlds
Step 1: identify the set of candidate facts r ∈ R(Q) and compute the corresponding allocations $E[Y_r]$ to Q.
Step 2: apply aggregation as per the aggregation operator (this step depends on the operator type).

Summarizing Possible Worlds
Sum: $Z = \sum_{r \in R(Q)} v_r \cdot Y_r$
  $E[Z]$ satisfies Sum-consistency
  $E[Z]$ does not guarantee β-faithfulness for arbitrary allocation policies
Monotone Allocation Policy: let databases D and D' be identical, except for a single pair of facts r ∈ D and r' ∈ D' with reg(r') = reg(r) ∪ {c*}. The allocation policy is monotone if
  $p_{c,r} \ge p_{c,r'}$ for every $c \ne c^*$
A monotone allocation policy guarantees β-faithfulness for Sum.
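
For SUM, the expected answer can be computed from the allocations alone, without enumerating possible worlds. The sketch below computes the expected sum both ways and compares them; the cell encoding, the precise fact p1 and all measure values are hypothetical, and v_r is taken to be the measure value of fact r.

```python
from itertools import product

def expected_sum(allocations, values, query_region):
    """E[Z] = sum over facts of v_r * E[Y_r], where E[Y_r] is r's allocation to the query."""
    total = 0.0
    for r, alloc in allocations.items():
        e_y = sum(p for cell, p in alloc.items() if cell in query_region)
        total += values[r] * e_y
    return total

def expected_sum_by_enumeration(allocations, values, query_region):
    """Same quantity, computed the slow way over all possible worlds."""
    facts = list(allocations)
    total = 0.0
    for combo in product(*(allocations[r].items() for r in facts)):
        weight, answer = 1.0, 0.0
        for r, (cell, p) in zip(facts, combo):
            weight *= p
            if cell in query_region:
                answer += values[r]
        total += weight * answer
    return total

allocations = {
    "p1": {(1, 1): 1.0},                       # precise fact
    "p9": {(1, 3): 0.3, (2, 3): 0.7},
    "p10": {(3, 3): 0.4, (3, 4): 0.6},
}
values = {"p1": 10.0, "p9": 100.0, "p10": 200.0}   # made-up measures
query_region = {(1, 1), (1, 3), (3, 3)}

fast = expected_sum(allocations, values, query_region)
slow = expected_sum_by_enumeration(allocations, values, query_region)
print(fast, slow)
```

The allocation-based version touches each candidate fact once, whereas the enumeration grows with the product of the region sizes.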

Monotone Allocation Policy: $p_{c,r} \ge p_{c,r'}$

Summarizing Possible Worlds
Average: $Z = \dfrac{\sum_{r \in R(Q)} v_r \cdot Y_r}{\sum_{r \in R(Q)} Y_r}$
  n = number of partially allocated facts, m = number of completely allocated facts
  $E[Z]$ can be computed in $O(m + n^3)$ time
  Satisfies Basic faithfulness
  Violates Boundedness-consistency

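Summarizing Possible Worlds
Approximate Average: $Z' = \dfrac{E\left[\sum_{r \in R(Q)} v_r \cdot Y_r\right]}{E\left[\sum_{r \in R(Q)} Y_r\right]}$
  $E[Z']$ can be computed in $O(m + n)$ time
  $E[Z'] \cong E[Z]$ when $n \ll m$
  Satisfies Basic faithfulness
  Satisfies Boundedness-consistency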
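
The difference between the expected average and the approximate average can be seen on a tiny input. This sketch assumes each candidate fact is given as a (value, allocation to the query) pair, that facts are completed independently (as implied by the product form of the world weights), and that at least one fact is completely allocated so the average is always defined. The exact version is a brute-force enumeration, not the O(m + n³) algorithm; all numbers are hypothetical.

```python
from itertools import product

def exact_expected_average(facts):
    """E[sum(v*Y) / sum(Y)] by enumerating every 0/1 outcome of the indicators Y_r."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(facts)):
        weight, numerator, count = 1.0, 0.0, 0
        for y, (value, prob) in zip(outcome, facts):
            weight *= prob if y else 1.0 - prob
            numerator += value * y
            count += y
        if weight == 0.0:
            continue                     # impossible outcome
        total += weight * numerator / count
    return total

def approximate_average(facts):
    """Ratio of expectations: E[sum(v*Y)] / E[sum(Y)]."""
    return sum(v * p for v, p in facts) / sum(p for _v, p in facts)

# Two completely allocated facts (prob 1.0) and one partially allocated fact.
facts = [(10.0, 1.0), (20.0, 1.0), (100.0, 0.3)]

exact = exact_expected_average(facts)
approx = approximate_average(facts)
print(exact, approx)
```

Here the partially allocated fact is a large share of the input, so the two values differ noticeably; the slide's claim is that they agree closely when n ≪ m.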

Expectation of Average violates Boundedness-Consistency

Summarizing Possible Worlds
Uncertain Measures: consider possible worlds (D1, .. , Dm) with weights (w1, .. , wm).
W(r) is the set of i's such that the cell to which r is mapped in Di belongs to reg(Q).
  $Z = \dfrac{\sum_{r \in R(Q)} \sum_{i \in W(r)} v_r \cdot w_i}{\sum_{r \in R(Q)} \sum_{i \in W(r)} w_i}$
The distribution Z is called AggLinOp.

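Allocation Policies
Dimension-independent Allocation: suppose $\mathrm{reg}(r) = C_1 \times C_2 \times \cdots \times C_k$. The allocation is dimension-independent if
  $\forall i\; \forall b \in C_i\; \exists\, \gamma_i(b)$ with $\sum_{b \in C_i} \gamma_i(b) = 1$, and
  for every cell $c = (c_1, c_2, \ldots, c_k)$, $p_{c,r} = \prod_i \gamma_i(c_i)$
Uniform Allocation Policy:
  $p_{c_i,r} = p_{c_j,r}$ for all r and all $c_i, c_j \in \mathrm{reg}(r)$ with $c_i \ne c_j$, i.e. $p_{c,r} = 1 / |\mathrm{reg}(r)|$
  A dimension-independent and monotone allocation policy
  The number of cells with positive allocation becomes very large for imprecise facts with large regions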

Allocation Policies
Measure-oblivious Allocation: given a database D, let D' be obtained from D such that only measure attributes are changed. The allocation is measure-oblivious if the allocations to D and D' are identical.
Count-based Allocation Policy: let $N_c$ denote the number of precise facts that map to cell c.
  $p_{c,r} = \dfrac{N_c}{\sum_{c' \in \mathrm{reg}(r)} N_{c'}}$
  A measure-oblivious and monotone allocation policy
  "Rich get richer" effect
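
The uniform and count-based policies can be sketched side by side. The cells and precise facts below are hypothetical; falling back to uniform allocation when no precise fact lies in the region is an assumption of this sketch, since the slides do not say how that case is handled.

```python
from collections import Counter

def uniform_allocation(region):
    """Every cell in reg(r) receives the same allocation 1 / |reg(r)|."""
    return {cell: 1.0 / len(region) for cell in region}

def count_based_allocation(region, precise_cells):
    """p_{c,r} = N_c / sum of N_c' over reg(r), with N_c = number of precise facts in c."""
    counts = Counter(cell for cell in precise_cells if cell in region)
    total = sum(counts.values())
    if total == 0:
        # Assumption: with no precise facts in the region, fall back to uniform.
        return uniform_allocation(region)
    return {cell: counts[cell] / total for cell in region}

# Region of a hypothetical imprecise fact, and the cells of the precise facts.
region = [("MA", "Civic"), ("NY", "Civic")]
precise_cells = [
    ("MA", "Civic"), ("MA", "Civic"), ("MA", "Civic"),
    ("NY", "Civic"),
    ("TX", "F150"),          # outside the region, ignored
]

print(uniform_allocation(region))
print(count_based_allocation(region, precise_cells))
```

Neither function looks at measure values, which is what makes both policies measure-oblivious; the count-based one pushes more weight to cells that are already dense, which is the "rich get richer" effect.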

Allocation Policies
Correlation-Preserving Allocation:
  Correlation Dist = $\Delta\left(\mathrm{corr}(D_0),\; \sum_i w_i \cdot \mathrm{corr}(D_i)\right)$
An allocation policy A is correlation-preserving if, for every database D, the correlation distance of A w.r.t. D is the minimum.
Specifically:
  Correlation Dist = $D_{KL}\left(P_0,\; \sum_i w_i \cdot P_i\right)$
  $D_{KL}$: Kullback-Leibler divergence
  $P_i = \mathrm{corr}(D_i)$, $i \in \{0, 1, \ldots, m\}$
  $\mathrm{corr}(\cdot)$ is a PDF over the dimension and measure attributes $(A_1, A_2, \ldots, A_k, M)$

Allocation Policies
Uncertain Domain:
  Likelihood function: $\sum_r D_{KL}\left(v_r,\; \dfrac{\sum_{c \in \mathrm{reg}(r)} P_c}{|\mathrm{reg}(r)|}\right)$
Expectation Maximization:
  E-step: for all facts r, cells c ∈ reg(r), and base domain elements o,
    $Q(c \mid r, o) = \dfrac{P_c^{t}(o)}{\sum_{c' \in \mathrm{reg}(r)} P_{c'}^{t}(o)}$
  M-step: for all cells c and base domain elements o,
    $P_c^{t+1}(o) = \dfrac{\sum_{r \,:\, c \in \mathrm{reg}(r)} v_r(o) \cdot Q(c \mid r, o)}{\sum_{o'} \sum_{r \,:\, c \in \mathrm{reg}(r)} v_r(o') \cdot Q(c \mid r, o')}$

Allocation Policies
Calculating parameters:
  $p_{c,r} = Q(c \mid r) := \sum_o \dfrac{P_c^{\infty}(o)}{\sum_{c' \in \mathrm{reg}(r)} P_{c'}^{\infty}(o)} \cdot v_r(o)$
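
The E-step, M-step and final parameter calculation can be sketched for a toy database with uncertain measures. This is an illustration under stated assumptions, not the authors' implementation: every P_c is initialized to the uniform PDF, a fixed iteration count stands in for convergence (the ∞ superscript), and the facts, cells and yes/no base domain are hypothetical.

```python
def em_allocation(facts, domain, iterations=50):
    """facts: id -> (list of cells in reg(r), PDF v_r as dict over the base domain)."""
    cells = sorted({c for reg, _v in facts.values() for c in reg})
    # Assumption: start every cell PDF P_c at the uniform distribution.
    P = {c: {o: 1.0 / len(domain) for o in domain} for c in cells}

    def q(c, reg, o):
        """E-step quantity Q(c | r, o) for a fact with region reg."""
        denom = sum(P[c2][o] for c2 in reg)
        return P[c][o] / denom if denom > 0 else 1.0 / len(reg)

    for _ in range(iterations):
        new_P = {}
        for c in cells:
            # M-step numerator for each o, using Q computed from the current P.
            mass = {
                o: sum(v[o] * q(c, reg, o) for reg, v in facts.values() if c in reg)
                for o in domain
            }
            total = sum(mass.values())
            new_P[c] = {o: mass[o] / total for o in domain} if total > 0 else P[c]
        P = new_P

    # Final allocations: p_{c,r} = sum over o of Q(c | r, o) * v_r(o).
    return {
        r: {c: sum(q(c, reg, o) * v[o] for o in domain) for c in reg}
        for r, (reg, v) in facts.items()
    }

domain = ["yes", "no"]
facts = {
    "f1": (["c1"], {"yes": 0.9, "no": 0.1}),          # precise, mostly "yes"
    "f2": (["c2"], {"yes": 0.1, "no": 0.9}),          # precise, mostly "no"
    "f3": (["c1", "c2"], {"yes": 0.8, "no": 0.2}),    # imprecise, resembles f1
}

alloc = em_allocation(facts, domain)
print(alloc["f3"])
```

In this toy input the imprecise fact f3 has a measure PDF close to that of the precise fact in c1, so most of its allocation goes to c1.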

Experiments Scalability of the Extended Data Model

Experiments Quality of the Allocation Policies

Conclusion
Handling of uncertain measures as probability distribution functions (PDFs)
Consistency requirements on aggregation operators, relating queries at different levels of the hierarchy
Faithfulness requirements, relating the degree of precision directly to the quality of query results
Correlation-preserving requirements, to keep a strong, meaningful correlation between measures and dimensions
Study of scalability vs. quality trade-offs between different allocation techniques