Tracing the Lineage of View Data in a Warehousing Environment
Seminar: “Digital Information Curation”, Winter Term 2005/2006
Siniša Avramović


Outline
- The Lineage Problem: Introduction and Definition
- Motivating Examples
- Basic Assumptions and Definitions
- Derivation Tracing Algorithms for SPJ and ASPJ Views

Outline II
- Derivation Tracing in a DWH Environment
  * Handling intermediate results (materialization vs. recomputation)
  * The multisource problem
  * Example: DWH with derivation tracing support
- Discussion

Introduction
Data Warehouse: integration of data from multiple, heterogeneous, distributed, and possibly very large data sources.
Queries posed to the DW:
- for analytical purposes (OLAP, OLAM)
- often of high complexity (join, aggregation)
Materialized Views: intermediate results of query processing, stored in the DW to improve query efficiency.

Introduction II
In general, a view definition is a mapping from base data to view data.

Introduction II
View Data Lineage: the inverse mapping from a view data item to the base relation data that produced it.
* In this presentation, we deal with SPJ (select-project-join) and ASPJ (aggregate-select-project-join) views.

Introduction III
Unfortunately, the inverse mapping is not as straightforward as computing the view: the view definition alone is not enough, and some additional information is required.
A data warehousing environment introduces additional challenges to the lineage problem.
Questions to be answered:
* How to trace lineage in a distributed database environment
* How to deal with inaccessible or inconsistent sources

Motivating Examples
Base relations:
  item(item_id, item_name, category)
  store(store_id, store_name, city, state)
  sales(store_id, item_id, price, num_sold)
View definition:
  CREATE VIEW California AS
  SELECT store.store_name, item.item_name, sales.num_sold
  FROM store, item, sales
  WHERE sales.store_id = store.store_id
    AND sales.item_id = item.item_id
    AND store.state = 'CA'
In relational algebra: California = π store_name, item_name, num_sold (σ state='CA' (store ⋈ item ⋈ sales))

Motivating Examples
store table:
  store_id  store_name  city  state
  001       Target      PA    CA
  002       Target      AL    NY
  003       Macy's      SF    CA
  004       Macy's            NY
item table:
  item_id  item_name  category
  0001     Binder     stationery
  0002     Pencil     stationery
  0003     Shirt      clothing
  0004     Pants      clothing
  0005     Pot        kitchenware
sales table:
  store_id  item_id  price  num_sold
California view:
  store_name  item_name  num_sold
  Target      Binder     1000
  Target      Pencil     3000
  Target      Pants      600
  Macy's      Shirt      1500
  Macy's      Pants      600
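The example can be reproduced with an in-memory SQLite database. This is a minimal sketch: the schema, the store and item rows, and the view definition follow the slides, while the sales rows (including all prices and the sales of the New York stores) are illustrative assumptions chosen so that the view matches the California table shown here.

```python
import sqlite3

# Schema, store rows and item rows follow the slides. The sales rows and
# their prices are ASSUMED for illustration; the slides do not preserve them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store(store_id TEXT, store_name TEXT, city TEXT, state TEXT);
CREATE TABLE item(item_id TEXT, item_name TEXT, category TEXT);
CREATE TABLE sales(store_id TEXT, item_id TEXT, price REAL, num_sold INTEGER);

INSERT INTO store VALUES
  ('001','Target','PA','CA'), ('002','Target','AL','NY'),
  ('003','Macy''s','SF','CA'), ('004','Macy''s',NULL,'NY');
INSERT INTO item VALUES
  ('0001','Binder','stationery'), ('0002','Pencil','stationery'),
  ('0003','Shirt','clothing'), ('0004','Pants','clothing'),
  ('0005','Pot','kitchenware');
INSERT INTO sales VALUES
  ('001','0001',4,1000), ('001','0002',1,3000), ('001','0004',30,600),
  ('002','0001',4,800), ('003','0003',45,1500), ('003','0004',60,600),
  ('004','0005',30,100);

CREATE VIEW California AS
SELECT store.store_name, item.item_name, sales.num_sold
FROM store, item, sales
WHERE sales.store_id = store.store_id
  AND sales.item_id = item.item_id
  AND store.state = 'CA';
""")

# Evaluate the view: only sales made in California stores appear.
california = conn.execute(
    "SELECT * FROM California ORDER BY store_name DESC, item_name"
).fetchall()
for row in california:
    print(row)
```

The view keeps neither store_id nor item_id, which is why a view tuple alone does not identify the base tuples it came from.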


Basic Assumptions
- Class of views considered: views defined over base relations using the relational algebra operators (σ, π, ⋈, α), i.e. SPJ and ASPJ views
- Set semantics considered
Basic Definitions
Derivation of a view tuple t: the tuples of the base relations that produce t. Those tuples are said to contribute to t.
* t: a materialized view tuple

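Basic Definitions: Tuple Derivation for an Operator
[Figure: operator tree of the California view (σ state='CA', π store_name, item_name, num_sold) over the inputs T1, T2, T3]
- Op: any relational operator (σ, π, ⋈, α) over T1, ..., T3
- T = Op(T1, ..., T3)
- For t ∈ T, t's derivation in T1, ..., T3 is
  $Op^{-1}_{\langle T_1, \ldots, T_3 \rangle}(t) = \langle T_1^*, \ldots, T_3^* \rangle$,
  where $T_1^*, \ldots, T_3^*$ are maximal subsets of $T_1, \ldots, T_3$ such that:
  a) the derivation tuple sets $T_i^*$ derive exactly t: $Op(T_1^*, \ldots, T_3^*) = \{t\}$
  b) each tuple in the derivation in fact contributes something to t: $\forall T_i^*, \forall t^* \in T_i^*: Op(T_1^*, \ldots, \{t^*\}, \ldots, T_3^*) \neq \emptyset$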
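For the three SPJ operators the definition reduces to simple lookups. The sketch below is an illustration under assumptions, not the presenter's code: tuples are modelled as Python dicts, relations as duplicate-free lists, the join is a natural join, and all function names are hypothetical.

```python
# Tuples are dicts; relations are lists of dicts without duplicates
# (set semantics). All names below are hypothetical helpers.

def trace_select(cond, table, t):
    """Derivation for selection: the tuple t itself, if it is in the input
    table and satisfies the selection condition."""
    return [r for r in table if r == t and cond(r)]

def trace_project(attrs, table, t):
    """Derivation for projection: every input tuple whose projection onto
    attrs equals t."""
    return [r for r in table if all(r[a] == t[a] for a in attrs)]

def trace_join(tables, t):
    """Derivation for a natural join: per input table, the tuples that agree
    with t on all of that table's attributes."""
    return [[r for r in table if all(t[a] == v for a, v in r.items())]
            for table in tables]

store = [
    {"store_id": "001", "store_name": "Target", "state": "CA"},
    {"store_id": "003", "store_name": "Macy's", "state": "CA"},
]
sales = [
    {"store_id": "001", "item_id": "0001", "num_sold": 1000},
    {"store_id": "003", "item_id": "0003", "num_sold": 1500},
]

# A tuple of store JOIN sales and its derivation in both inputs.
joined = {"store_id": "001", "store_name": "Target", "state": "CA",
          "item_id": "0001", "num_sold": 1000}
store_part, sales_part = trace_join([store, sales], joined)
print(store_part)
print(sales_part)
```

Each helper returns the maximal subset required by the definition: dropping any returned tuple would lose a contributor, and adding any other tuple would either produce extra output or contribute nothing.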

Basic Definitions: Tuple Derivation for Aggregation

Basic Definitions: Tuple Derivation for Views
- The derivation of a view tuple set T contains all tuples that contribute to any view tuple in T.
- Any view V with a complex definition can be broken into intermediate views; the derivation of its tuples is then computed recursively by tracing through the hierarchy of intermediate views.

SPJ View Derivation Tracing
Derivation tracing query: a query written for a specific view definition such that, when applied to the database D, it returns t's derivation in D.
Given a database D with base relations $R_1, \ldots, R_m$, a view definition v over $R_1, \ldots, R_m$, and a tuple $t \in v(R_1, \ldots, R_m)$, $TQ_{t,v}$ is a derivation tracing query for t and v if:
  $TQ_{t,v}(D) = v^{-1}_D(t)$
* $v^{-1}_D(t)$: t's derivation over D according to v

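SPJ View Derivation Tracing
SPJ canonical form: using a sequence of algebraic transformations, every SPJ view can be transformed into the form
  $\pi_A(\sigma_C(R_1 \bowtie \cdots \bowtie R_m))$
Split operator: given a table T with schema $\mathbf{T}$, Split breaks T into a list of tables, where each table in the list is the projection of T onto a set of attributes $A_i \subseteq \mathbf{T}$:
  $Split_{A_1, \ldots, A_m}(T) = \langle \pi_{A_1}(T), \ldots, \pi_{A_m}(T) \rangle$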
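The Split operator is easy to sketch in code. The version below assumes rows are dicts and attribute sets are given as lists of column names; the function name and the sample rows are illustrative only.

```python
def split(table, attr_lists):
    """Project table onto each attribute list in turn, dropping duplicate
    rows (set semantics). Returns one projected table per attribute list."""
    result = []
    for attrs in attr_lists:
        seen, projected = set(), []
        for row in table:
            key = tuple(row[a] for a in attrs)
            if key not in seen:
                seen.add(key)
                projected.append(dict(zip(attrs, key)))
        result.append(projected)
    return result

# Two joined rows that share the same store but differ in the sale.
joined = [
    {"store_id": "001", "store_name": "Target", "item_id": "0001", "num_sold": 1000},
    {"store_id": "001", "store_name": "Target", "item_id": "0002", "num_sold": 3000},
]
store_part, sales_part = split(
    joined,
    [["store_id", "store_name"], ["store_id", "item_id", "num_sold"]],
)
print(store_part)   # one store row
print(sales_part)   # two sales rows
```

Splitting a joined result back into per-relation tables is what turns one query over the join into a derivation, which is a list with one tuple set per base relation.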

SPJ View Derivation Tracing
California view:
  store_name  item_name  num_sold
  Target      Binder     1000
  Target      Pencil     3000
  Target      Pants      600
Store table:
  store_id  store_name  city  state
  001       Target      PA    CA
  002       Target      AL    NY
  003       Macy's      SF    CA
  004       Macy's            NY
Sales table:
  store_id  item_id  price  num_sold
Item table:
  item_id  item_name  category
  0001     Binder     stationery
  0002     Pencil     stationery
  0003     Shirt      clothing

SPJ View Derivation Tracing
Given a database D with relations $R_1, \ldots, R_m$ and the view $v(D) = \pi_A(\sigma_C(R_1 \bowtie \cdots \bowtie R_m))$:
Derivation of t:
For T ⊆ v(D):
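One formulation that fits the canonical form and the Split operator is $TQ_{t,v} = Split_{R_1, \ldots, R_m}(\sigma_{C \wedge A = t}(R_1 \bowtie \cdots \bowtie R_m))$; for a tuple set T, the condition A = t would become a semijoin with T. This is offered as an assumption, not as the slide's exact wording. The sketch below applies it to the California view with SQLite; the sales rows and prices are made up for illustration, and `trace_california` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store(store_id TEXT, store_name TEXT, city TEXT, state TEXT);
CREATE TABLE item(item_id TEXT, item_name TEXT, category TEXT);
CREATE TABLE sales(store_id TEXT, item_id TEXT, price REAL, num_sold INTEGER);
INSERT INTO store VALUES
  ('001','Target','PA','CA'), ('002','Target','AL','NY'), ('003','Macy''s','SF','CA');
INSERT INTO item VALUES
  ('0001','Binder','stationery'), ('0002','Pencil','stationery'), ('0004','Pants','clothing');
-- Sales rows are ASSUMED sample data.
INSERT INTO sales VALUES
  ('001','0001',4,1000), ('001','0002',1,3000), ('002','0002',1,3000), ('003','0004',60,600);
""")

# Join plus the view's selection condition C, extended with A = t.
TRACE_BODY = """
FROM store, item, sales
WHERE sales.store_id = store.store_id
  AND sales.item_id = item.item_id
  AND store.state = 'CA'
  AND store.store_name = ? AND item.item_name = ? AND sales.num_sold = ?
"""

def trace_california(conn, t):
    """Return t's derivation as one tuple list per base relation (the Split)."""
    return {
        table: conn.execute(
            f"SELECT DISTINCT {table}.* {TRACE_BODY}", t
        ).fetchall()
        for table in ("store", "item", "sales")
    }

lineage = trace_california(conn, ("Target", "Pencil", 3000))
for table, rows in lineage.items():
    print(table, rows)
```

The New York Target store also sold 3000 pencils in this sample data, but it is excluded from the derivation because the view's own selection condition is part of the tracing query.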

SPJ View Derivation Tracing: Optimization
- Optimization by pushing selection conditions below the join operator.
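The effect of this optimization can be illustrated in plain Python. The sketch assumes the California view and made-up sales rows; `trace_naive` filters after building the full join, `trace_pushed` filters each relation first, and both names are hypothetical.

```python
# Rows as tuples: store(store_id, store_name, city, state),
# item(item_id, item_name, category), sales(store_id, item_id, price, num_sold).
# The sales rows are ASSUMED sample data.
store = [("001", "Target", "PA", "CA"), ("002", "Target", "AL", "NY"),
         ("003", "Macy's", "SF", "CA")]
item = [("0001", "Binder", "stationery"), ("0002", "Pencil", "stationery"),
        ("0004", "Pants", "clothing")]
sales = [("001", "0001", 4, 1000), ("001", "0002", 1, 3000),
         ("002", "0002", 1, 3000), ("003", "0004", 60, 600)]

def join(stores, items, sale_rows):
    """Join the three relations on store_id and item_id."""
    return [(s, i, x) for s in stores for i in items for x in sale_rows
            if x[0] == s[0] and x[1] == i[0]]

def split(hits):
    """Split joined triples back into per-relation lists without duplicates."""
    out = ([], [], [])
    for combo in hits:
        for part, row in zip(out, combo):
            if row not in part:
                part.append(row)
    return out

def trace_naive(t):
    # Build the full join first, then apply every condition.
    hits = [(s, i, x) for s, i, x in join(store, item, sales)
            if s[3] == "CA" and (s[1], i[1], x[3]) == t]
    return split(hits)

def trace_pushed(t):
    # Apply each single-relation condition before joining.
    stores = [s for s in store if s[3] == "CA" and s[1] == t[0]]
    items = [i for i in item if i[1] == t[1]]
    sale_rows = [x for x in sales if x[3] == t[2]]
    return split(join(stores, items, sale_rows))

t = ("Target", "Pencil", 3000)
print(trace_naive(t))
print(trace_pushed(t))
```

Both functions return the same derivation, but the pushed version joins only the few rows that survive the per-relation filters.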

ASPJ View Derivation Tracing
[Figure: Clothing view example; labels: Clothing, Additional]

ASPJ View Derivation Tracing
Problems with ASPJ views:
- Most ASPJ views are not traceable without storing additional information (intermediate results).
- There is no simple canonical form, since selection operators cannot be pushed above or below aggregation operators.
Basic idea: ASPJ canonical form
- Transform general ASPJ views into sequences of α, π, σ, ⋈ operators (ASPJ segments) by commuting some SPJ operators.
- Each segment includes aggregation operators.
- A view defined by one ASPJ segment is a one-level ASPJ view.

ASPJ View Derivation Tracing
Given a one-level ASPJ view V:
and t ∈ T, t's derivation:
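A plausible reading, stated here as an assumption: for a one-level view $V = \alpha_{G, aggr(B)}(\pi_A(\sigma_C(R_1 \bowtie \cdots \bowtie R_m)))$, the derivation of t is $Split_{R_1, \ldots, R_m}(\sigma_{C \wedge G = t.G}(R_1 \bowtie \cdots \bowtie R_m))$, i.e. every base tuple that falls into t's group. The sketch below uses a hypothetical view `ClothingTotals` and made-up sales rows; neither appears in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store(store_id TEXT, store_name TEXT, state TEXT);
CREATE TABLE item(item_id TEXT, item_name TEXT, category TEXT);
CREATE TABLE sales(store_id TEXT, item_id TEXT, num_sold INTEGER);
INSERT INTO store VALUES
  ('001','Target','CA'), ('002','Target','NY'), ('003','Macy''s','CA');
INSERT INTO item VALUES
  ('0001','Binder','stationery'), ('0003','Shirt','clothing'), ('0004','Pants','clothing');
-- Sales rows are ASSUMED sample data.
INSERT INTO sales VALUES
  ('001','0001',1000), ('001','0004',600), ('002','0003',200),
  ('003','0003',1500), ('003','0004',600);

-- Hypothetical one-level ASPJ view: clothing units sold per store name.
CREATE VIEW ClothingTotals AS
SELECT store.store_name, SUM(sales.num_sold) AS total
FROM store, item, sales
WHERE sales.store_id = store.store_id AND sales.item_id = item.item_id
  AND item.category = 'clothing'
GROUP BY store.store_name;
""")

# SPJ part of the view, restricted to the group of the traced tuple.
SPJ_PART = """
FROM store, item, sales
WHERE sales.store_id = store.store_id AND sales.item_id = item.item_id
  AND item.category = 'clothing'
  AND store.store_name = ?
"""

def trace_group(conn, group_value):
    """Return, per base relation, the tuples in the traced tuple's group."""
    return {
        table: sorted(conn.execute(
            f"SELECT DISTINCT {table}.* {SPJ_PART}", (group_value,)
        ).fetchall())
        for table in ("store", "item", "sales")
    }

view = dict(conn.execute("SELECT store_name, total FROM ClothingTotals").fetchall())
lineage = trace_group(conn, "Target")
print(view)
print(lineage)
```

Only the group-by value of t is needed for tracing; the aggregate value itself can be recomputed from the traced sales rows as a consistency check.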

Recursive Derivation Tracing Algorithm for Multi-Level ASPJ Views
Problem: multiple queries are required for tracing computations.
Idea:
- Transform the view into ASPJ canonical form
- Divide it into a set of ASPJ segments
- Define an intermediate view for each segment

Recursive Derivation Tracing Algorithm for Multi-Level ASPJ Views
Problem: multiple queries are required for tracing computations.
Idea:
- Trace recursively through the hierarchy of intermediate views, top-down
- At each level, use a one-level ASPJ tracing query to compute derivations
- Concatenate the local results to form the derivation of the whole view tuple list
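The recursion can be sketched for a two-level view. Everything below is an illustrative assumption: the intermediate view `StoreTotals`, the top view counting stores above a threshold, the sample sales rows, and the simplification that each segment has exactly one input.

```python
from collections import defaultdict

# sales(store_id, item_id, num_sold); ASSUMED sample rows.
sales = [("001", "0001", 1000), ("001", "0002", 3000),
         ("002", "0001", 800), ("003", "0003", 1500)]

def store_totals(sales_rows):
    """Level 1 segment: total num_sold per store_id."""
    totals = defaultdict(int)
    for store_id, _item, num_sold in sales_rows:
        totals[store_id] += num_sold
    return sorted(totals.items())

def big_store_count(totals_rows, threshold=1000):
    """Level 2 segment: number of stores whose total exceeds the threshold."""
    return [(sum(1 for _s, total in totals_rows if total > threshold),)]

# One-level tracing procedures, one per segment.
def trace_big_store_count(t, totals_rows, threshold=1000):
    # Every intermediate row that passed the selection contributed to the count.
    return [r for r in totals_rows if r[1] > threshold]

def trace_store_totals(t, sales_rows):
    # All sales rows in the group of the intermediate tuple t.
    return [r for r in sales_rows if r[0] == t[0]]

# View hierarchy, top-down: (one-level tracer, name of the input it traces into).
LEVELS = [(trace_big_store_count, "StoreTotals"), (trace_store_totals, "sales")]

def trace_recursive(tuples, level, inputs):
    """Trace a list of view tuples down to base tuples, one segment at a time."""
    if level == len(LEVELS):
        return tuples
    tracer, input_name = LEVELS[level]
    derived = []
    for t in tuples:
        for r in tracer(t, inputs[input_name]):
            if r not in derived:      # keep set semantics
                derived.append(r)
    return trace_recursive(derived, level + 1, inputs)

intermediate = store_totals(sales)
inputs = {"StoreTotals": intermediate, "sales": sales}
top = big_store_count(intermediate)
lineage = trace_recursive(top, 0, inputs)
print(top)
print(lineage)
```

The intermediate result `StoreTotals` is needed at the first tracing step, which is exactly the information that has to be either recomputed or kept materialized.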

Derivation Tracing in a Warehousing Environment
Until now, we assumed that all base relations and intermediate results are accessible for tracing computations.
Viewing the DWH as an integration of heterogeneous and distributed data sources, the following problems may arise:
- Efficiency: querying remote sources with join and aggregation operations and recomputing intermediate results for each query is very inefficient in a distributed environment.
- Consistency: view refresh and view maintenance problems
- Legacy sources: views defined on inaccessible legacy sources are not traceable

Derivation Tracing in a Warehousing Environment
Solutions to the intermediate aggregation results problem:
The "Recomputation" approach:
- Recompute the intermediate results for each query.
- No permanent extra storage, but the tracing process takes much longer, especially in distributed environments.
The "Materialized View" approach:
- Maintain auxiliary materialized views with intermediate results in the DWH.
- Less computation required, but extra storage and maintenance costs.

Storing Derivation Views in a Multi-Source DWH
Various strategies for storing auxiliary views:
- Simple extreme solutions: store all base relations, or store less information at a higher tracing computation cost.
Goal: an intermediate scheme with low tracing cost and modest extra storage and maintenance costs.
Idea: derivation views
- Break the view definition down into ASPJ segments again.
- Only the views defined by the lowest segments are defined directly over base relations.
- Those views can be computed by a simple selection with the split operator.

Storing Derivation Views in a Multi-Source DWH
[Figure: auxiliary view All_clothing and derivation view DV_All_clothing]

Warehousing System Supporting Derivation Tracing I
- The auxiliary view AllClothing is maintained for tracing tuples from Clothing.
- AllClothing records the intermediate aggregation results.
- To trace tuples from AllClothing, the derivation view DV_AllClothing is maintained.

Tracing the Lineage of View Data in a Warehousing Environment
Questions?