Lineage Tracing for General Data Warehouse Transformations Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University Presentation.

Presentation transcript:

Lineage Tracing for General Data Warehouse Transformations Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University Presentation by Aaron St.Clair

Outline
- What is lineage tracing?
- Why is tracing lineage data important?
- How can we find lineage data?
- Performance results

Data Warehouses
- Integrate data from multiple sources
- Data undergoes a series of transformations
- Transformations vary in complexity
[Diagram: Data Source 1, Data Source 2, …, Data Source N; Transformation; Summarized Data]

Lineage Tracing
- Identifying the specific source data items from which a given warehouse data item was derived
- Allows
  - In-depth data analysis
  - Data mining
  - Authorization management
  - View update
  - Efficient warehouse recovery

An Example
Select items whose last-quarter sales are more than twice the average of the last three quarters' sales.
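
A minimal Python sketch of the selection described on this slide. The row layout (an item name plus four quarterly sales figures) and the reading of "last three quarters" as the three quarters before the last one are assumptions made for illustration; the slide does not give the schema.

```python
from typing import NamedTuple


class ItemSales(NamedTuple):
    # Hypothetical schema: one row per item, sales for four consecutive quarters.
    item: str
    q1: float
    q2: float
    q3: float
    q4: float  # last quarter


def select_jumping_items(rows):
    """Keep items whose last-quarter sales exceed twice the average
    of the three earlier quarters (assumed reading of the slide)."""
    selected = []
    for r in rows:
        earlier_avg = (r.q1 + r.q2 + r.q3) / 3
        if r.q4 > 2 * earlier_avg:
            selected.append(r.item)
    return selected


rows = [
    ItemSales("lamp", 10, 10, 10, 25),  # 25 > 2 * 10, so selected
    ItemSales("desk", 10, 20, 30, 35),  # 35 <= 2 * 20, so not selected
]
jumping = select_jumping_items(rows)
```

Tracing the lineage of "lamp" in the output would mean recovering the source sales rows that made it pass the test.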

An Example

Lineage Granularity
- Coarse-grained
  - Schema-level, attribute mapping
- Fine-grained
  - Set of source data items

Existing Work
- Mostly coarse-grained lineage
- Existing methods for fine-grained lineage
  - Extra annotation
  - Developer-defined weak inverses
  - Statistical estimation
- Can't handle complex procedural transformations

Tracing Lineage - Definitions
- Data set: set of data items without duplicates
- Transformation: any procedure that takes data sets as input and produces data sets as output
  - Stable (no spurious output)
  - Deterministic (under some conditions)
- Lineage of a data item: set of input data items that contribute to that item

Determining Contributions
Need to find relevant data items
- Easy for simple relational operators
- Difficult for procedural transformations
Select positives vs. aggregation and sum
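
The contrast between selecting positives and summing can be sketched as follows. This is an illustrative Python sketch, not the authors' algorithm; all function names and the sample data are hypothetical.

```python
from collections import defaultdict


def select_positives(values):
    """Filter: each output item comes from exactly one input item."""
    return [v for v in values if v > 0]


def lineage_of_selected(values, output_value):
    # The lineage of a selected value is the matching input item itself.
    return [v for v in values if v == output_value and v > 0]


def sum_by_key(pairs):
    """Aggregation: each output summarizes a whole group of inputs."""
    totals = defaultdict(int)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)


def lineage_of_sum(pairs, key):
    # Every input pair in the group contributes to that group's total.
    return [(k, a) for k, a in pairs if k == key]


values = [-2, 3, 5]
pairs = [("a", 1), ("a", 2), ("b", 5)]

positives = select_positives(values)
totals = sum_by_key(pairs)
lineage_3 = lineage_of_selected(values, 3)
lineage_a = lineage_of_sum(pairs, "a")
```

For the filter, the lineage is a single item; for the sum, it is the whole group, which is why the two kinds of transformation need different tracing procedures.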

Lineage Tracing
Use of hierarchical model
- Transformation classes
- Schema mappings
- Defined inverses

Transformation Classes
- The transformation class determines the lineage tracing procedure
- For a dispatcher:
  - Apply the transformation T to each input item i in turn
  - If the output item is in T({i}), add i to the lineage of that output item
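
The dispatcher procedure can be sketched in Python as below. This is a minimal sketch that assumes the transformation processes each input item independently; `trace_dispatcher` and `split_words` are hypothetical names, not taken from the presentation.

```python
def trace_dispatcher(transform, inputs, output_item):
    """Return the input items whose individual output contains output_item.

    transform: callable taking a list of input items and returning a list
    of output items; assumed to process each input item independently.
    """
    lineage = []
    for item in inputs:
        # Re-run the transformation on a single-item input.
        if output_item in transform([item]):
            lineage.append(item)
    return lineage


def split_words(lines):
    # Example dispatcher: each line independently yields its words.
    return [word for line in lines for word in line.split()]


lines = ["sales report", "quarterly sales", "inventory"]
lineage = trace_dispatcher(split_words, lines, "sales")
```

The sketch calls the transformation once per input item, so its cost grows with the size of the input set.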

Schema Mappings
- Defined schema for input and output of a transformation
- Backward key-maps: A key, g(B) (T1)
- Forward key-maps: f(A), B key (T4)
- Backward total-maps: A, g(B) (T5)
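
A sketch of how a forward key-map might be used for tracing. The slide's notation is incomplete, so this assumes one reading: a function f applied to input attribute A yields the key B of the output item that the input contributes to, so the lineage of an output item is every input whose f(A) equals that key. The function names, attribute names and data are hypothetical.

```python
def trace_forward_key_map(inputs, output_item, f, in_attr, out_key):
    """Return the inputs i with f(i[in_attr]) == output_item[out_key]."""
    target = output_item[out_key]
    return [i for i in inputs if f(i[in_attr]) == target]


def month_of(date):
    # Hypothetical mapping function f: "1999-01-15" -> "1999-01".
    return date[:7]


orders = [
    {"date": "1999-01-15", "amount": 10},
    {"date": "1999-01-30", "amount": 20},
    {"date": "1999-02-02", "amount": 5},
]
# Output item of an assumed monthly-total transformation, keyed by month.
january_total = {"month": "1999-01", "total": 30}

january_lineage = trace_forward_key_map(
    orders, january_total, month_of, "date", "month"
)
```

Under this reading, tracing needs only the mapping function and never re-runs the transformation itself.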

Provided Inverses/Tracing Procedures
- Best case: someone has defined a function mapping output items to the input items they were derived from
- Nothing is known about the efficiency of that function

Property Hierarchy

Finding Lineage
Recursively apply algorithms based on the transformation type until we reach the top level

Optimizations
- Indexing the input data set improves performance
- A functional index using the schema optimizes queries of the form F(i) = v
- Store auxiliary or intermediate views in the warehouse
- Reduce number by composing transformations
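
A functional index for lookups of the form F(i) = v can be sketched as a dictionary from F(i) to the items that produce it. This is an illustrative in-memory sketch with hypothetical names; a warehouse would build such an index inside the database rather than in application code.

```python
from collections import defaultdict


def build_functional_index(items, f):
    """Group items by f(item) so lookups for F(i) = v avoid a full scan."""
    index = defaultdict(list)
    for item in items:
        index[f(item)].append(item)
    return index


def month_of(order):
    # Hypothetical F: maps an order to its month, e.g. "1999-01".
    return order["date"][:7]


orders = [
    {"date": "1999-01-15", "amount": 10},
    {"date": "1999-01-30", "amount": 20},
    {"date": "1999-02-02", "amount": 5},
]

index = build_functional_index(orders, month_of)
january = index.get("1999-01", [])  # all items i with F(i) == "1999-01"
```

Building the index costs one pass over the input; each later trace is then a single dictionary lookup instead of a scan.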

Transformation Graphs
- Create a tracing sequence for each path from input to output in the graph
- Combine the results of each sequence
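
One way to picture this is to enumerate the paths, trace backward along each, and take the union. The Python sketch below assumes an acyclic graph and a per-edge tracing function that maps items at the downstream node back to items at the upstream node; the graph, the tracers and all names are hypothetical and much simpler than a real warehouse.

```python
def all_paths(graph, source, target, prefix=()):
    """Enumerate every path from source to target in an acyclic graph."""
    path = prefix + (source,)
    if source == target:
        return [path]
    paths = []
    for successor in graph.get(source, []):
        paths.extend(all_paths(graph, successor, target, path))
    return paths


def trace_along(path, edge_tracers, output_items):
    """Walk one path backward, applying each edge's tracing function."""
    items = set(output_items)
    for u, v in reversed(list(zip(path, path[1:]))):
        items = edge_tracers[(u, v)](items)
    return items


def trace_graph(graph, edge_tracers, source, target, output_items):
    """Combine the per-path results by taking their union."""
    lineage = set()
    for path in all_paths(graph, source, target):
        lineage |= trace_along(path, edge_tracers, output_items)
    return lineage


# Toy warehouse: T1 doubles each source number, T2 adds 10, out is their union.
src = {1, 2, 6}
t1_out = {2 * x for x in src}   # {2, 4, 12}
t2_out = {x + 10 for x in src}  # {11, 12, 16}
graph = {"src": ["T1", "T2"], "T1": ["out"], "T2": ["out"]}
edge_tracers = {
    ("T1", "out"): lambda items: items & t1_out,
    ("T2", "out"): lambda items: items & t2_out,
    ("src", "T1"): lambda items: {x // 2 for x in items if x % 2 == 0 and x // 2 in src},
    ("src", "T2"): lambda items: {x - 10 for x in items if x - 10 in src},
}

lineage_of_12 = trace_graph(graph, edge_tracers, "src", "out", {12})
```

Here the output item 12 is produced on both paths (from 6 via T1 and from 2 via T2), so its combined lineage contains both source items.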

Performance
- 1 GB warehouse
- Schema mappings perform better than transformation class-specific algorithms
- Indexing helps
- Combining attributes reduces trace time

Questions?