Data Warehouse View Maintenance Presented By: Katrina Salamon For CS561.

What is a Data Warehouse? A repository of integrated information. As information becomes available from a source, it is added to the repository.

What is a View? A function from a set of base tables to a derived table. It can be recreated every time the view is accessed.

What’s a Materialized View? A view whose tuples are stored in a database (or warehouse). Indexes can be created on it. It provides fast data access, similar to a cache.

What’s View Maintenance? View data becomes out of date when the base tables change. Updating the view to reflect these changes is called view maintenance.
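The difference between a view, a materialized view, and view maintenance can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `sales` table and the names `v_totals` and `mv_totals` are hypothetical, and because SQLite has no native materialized views, the materialized copy is emulated with an ordinary table that is refreshed by full recomputation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical base table.
cur.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10), ("west", 5)])

# Ordinary view: a stored query, re-evaluated every time it is accessed.
cur.execute(
    "CREATE VIEW v_totals AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Emulated materialized view: the view's tuples are stored in a table.
cur.execute("CREATE TABLE mv_totals AS SELECT * FROM v_totals")


def rows(table):
    """Return the (region, total) tuples of a table or view, sorted."""
    return sorted(cur.execute(f"SELECT region, total FROM {table}").fetchall())


def refresh():
    """View maintenance by full recomputation of the stored tuples."""
    cur.execute("DELETE FROM mv_totals")
    cur.execute("INSERT INTO mv_totals SELECT * FROM v_totals")


# A base-table change makes the stored copy out of date.
cur.execute("INSERT INTO sales VALUES ('east', 7)")
fresh = rows("v_totals")   # reflects the new row
stale = rows("mv_totals")  # still holds the old totals

refresh()
maintained = rows("mv_totals")
```

Here `stale` differs from `fresh` until `refresh()` runs, which is the gap that view maintenance has to close.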

Sounds Easy, Right?

Here’s Why... Data sources are typically legacy systems and do not understand views. Sources can tell the warehouse that there is new data, but they cannot determine whether any additional data is needed.

Examples. Ideal World: a new record is added to a base relation, and the view is notified and updated. The Real World: Maintenance Anomaly (trying to update a view while the underlying data is changing); Update Anomaly; Deletion Anomaly.

The (Possible) Solutions: recompute the view; store all relations involved in the warehouse; Eager Compensating Algorithm (ECA).

Recompute the View. When? Whenever an update occurs, or at a periodic interval. This is time- and resource-intensive, especially in a distributed environment (data must be transferred from one source to the other).

Storing Base Relations: keep up-to-date copies of all relations in the warehouse, so that queries can be evaluated locally and no anomalies occur. This takes up extra space in the warehouse, because it stores duplicate data. The copied relations still need to be updated.

Eager Compensating Algorithm: the most promising solution, with no duplicated base relations and no recomputation overhead. Every query sent has compensating queries added to it to offset concurrent updates to the source data.

ECA cont... Strongly Consistent: upon completion of activity, the view is consistent with the base relations; every view state has a corresponding state in the base relations, and the states are reached in the same order. Not Complete: not every source state is reflected in a view state (no direct mapping).

How ECA Works: 4 basic events. 1. The source executes an update (U) and a notification is sent to the warehouse. 2. The warehouse receives the update (U) and creates a query (Q) to be evaluated by the source. 3. The source evaluates the query (Q) against the base relations and sends the answer (A) to the warehouse. 4. The warehouse receives the query result and updates the view.
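The four events can be traced in a small simulation that also shows what goes wrong when two updates interleave and no compensation is applied. This is a sketch under assumptions, not the presenter's example: the relations `r1(W, X)` and `r2(X, Y)`, the view (the `W` column of their join, kept as a bag), and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical source relations r1(W, X) and r2(X, Y).
source = {"r1": [(1, 2)], "r2": []}
view = Counter()  # warehouse copy of the view: a bag of W values


def eval_view(r1, r2):
    """Project W from the join of r1 and r2 on X, keeping duplicates."""
    return Counter(w for (w, x1) in r1 for (x2, _y) in r2 if x1 == x2)


def make_query(rel, tup):
    """Event 2: the warehouse builds Q with the inserted tuple in place of
    the updated relation; other relations are read at the source later."""
    return {rel: [tup]}


def source_answer(query):
    """Event 3: the source evaluates Q against its *current* relations."""
    r1 = query.get("r1", source["r1"])
    r2 = query.get("r2", source["r2"])
    return eval_view(r1, r2)


# Event 1 (U1) and event 2 (Q1).
source["r2"].append((2, 3))
q1 = make_query("r2", (2, 3))

# U2 happens before the source gets around to evaluating Q1.
source["r1"].append((4, 2))
q2 = make_query("r1", (4, 2))

# Event 3: both queries now see both updates.
a1 = source_answer(q1)
a2 = source_answer(q2)

# Event 4: the warehouse naively adds each answer to the view.
view.update(a1)  # Counter.update adds counts
view.update(a2)

correct = eval_view(source["r1"], source["r2"])
```

After the run, `view` counts the value 4 twice while `correct` contains it once: Q1 was built as if U2 had not happened, yet it was evaluated after U2.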

Resolving Anomalies. Two Updates: Query1 is assumed to be computed before Update2 but is actually computed after Update2. ECA knows that this can happen and takes Update2 into account when updating the view, by using a compensating query for each query it creates.

Resolving Issues. When using compensating queries, we should not apply the results until all related answers have been received. If the view were updated after each query, it could temporarily be in an invalid state. To avoid invalid states, ECA collects the intermediate answers in a relation called Collect (initialized to the empty set).
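A minimal sketch of compensating queries and the Collect relation follows, restricted to insertions. It assumes hypothetical relations `r1(W, X)` and `r2(X, Y)`, a view that keeps the `W` column of their join as a bag, and a query representation as signed terms; none of these names come from the slides. Each new query subtracts, for every query still pending, that query with the new tuple substituted in.

```python
from collections import Counter

source = {"r1": [(1, 2)], "r2": []}  # hypothetical r1(W, X), r2(X, Y)
VIEW_DEF = [(+1, {})]  # one positive term with no relation bound yet
view, collect = Counter(), Counter()
pending = {}  # queries sent to the source but not yet answered


def eval_view(r1, r2):
    """Project W from the join of r1 and r2 on X, keeping duplicates."""
    return Counter(w for (w, x1) in r1 for (x2, _y) in r2 if x1 == x2)


def substitute(terms, rel, tup):
    """Bind `rel` to the inserted tuple; terms already bound on it vanish."""
    return [(s, {**b, rel: [tup]}) for s, b in terms if rel not in b]


def on_update(qid, rel, tup):
    """Warehouse: build the query for this update plus its compensation."""
    query = substitute(VIEW_DEF, rel, tup)
    for other in pending.values():
        query += [(-s, b) for s, b in substitute(other, rel, tup)]
    pending[qid] = query
    return query


def source_answer(query):
    """Source: evaluate every signed term against the current relations."""
    answer = Counter()
    for sign, bound in query:
        r1 = bound.get("r1", source["r1"])
        r2 = bound.get("r2", source["r2"])
        for w, n in eval_view(r1, r2).items():
            answer[w] += sign * n
    return answer


def on_answer(qid, answer):
    """Warehouse: accumulate in Collect; install only when nothing is pending."""
    for w, n in answer.items():
        collect[w] += n
    del pending[qid]
    if not pending:
        for w, n in collect.items():
            view[w] += n
        collect.clear()


# Two insertions interleave: both reach the source before either query runs.
source["r2"].append((2, 3))
q1 = on_update("Q1", "r2", (2, 3))
source["r1"].append((4, 2))
q2 = on_update("Q2", "r1", (4, 2))  # carries a compensating term for Q1
on_answer("Q1", source_answer(q1))
on_answer("Q2", source_answer(q2))
```

The view is touched only once, after the last pending answer arrives, so it never passes through the intermediate state in which the value 4 would be double-counted.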

Example: three insertions into three base relations and their effect on the view that references them.

ECA-Key: used to streamline the algorithm when the keys of the base relations are available in the view. The Collect relation is initialized to the current view and becomes a working copy of the view.

ECA-Key Algorithms. Delete received: no query is sent; the delete is applied directly to Collect. Insert received: a query is sent, no compensating queries are created, the answers are added to Collect, and duplicate values are ignored because of the keys. Once completed, the tuples in Collect replace the tuples of the view.
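The delete and insert rules above can be sketched as follows. The setup is an assumption made for illustration: relations `r1(W, X)` with key `W` and `r2(X, Y)` with key `Y`, a view holding `(W, Y)` pairs so that both keys appear in it, and in-order message delivery between source and warehouse. A Python `set` stands in for the duplicate-free Collect relation.

```python
# Hypothetical source relations: r1(W, X) keyed on W, r2(X, Y) keyed on Y.
source = {"r1": [(1, 2), (6, 2)], "r2": [(2, 3)]}


def eval_view(r1, r2):
    """(W, Y) pairs of the join on X; both base-relation keys are kept."""
    return {(w, y) for (w, x1) in r1 for (x2, y) in r2 if x1 == x2}


view = eval_view(source["r1"], source["r2"])
collect = set(view)  # working copy, initialized to the current view
pending = set()      # ids of queries that have not been answered yet


def install_if_idle():
    """Replace the view's tuples with Collect once nothing is pending."""
    if not pending:
        view.clear()
        view.update(collect)


def on_delete(rel, tup):
    """No query is sent: remove every Collect tuple carrying the deleted key."""
    col, key = (0, tup[0]) if rel == "r1" else (1, tup[1])
    collect.difference_update({t for t in collect if t[col] == key})
    install_if_idle()


def on_insert(qid, rel, tup):
    """Send a plain query with the new tuple bound; no compensation."""
    pending.add(qid)
    return {rel: [tup]}


def source_answer(query):
    """Source: evaluate the query against the current relations."""
    r1 = query.get("r1", source["r1"])
    r2 = query.get("r2", source["r2"])
    return eval_view(r1, r2)


def on_answer(qid, answer):
    """Add the answer to Collect; the set drops duplicate tuples."""
    collect.update(answer)
    pending.discard(qid)
    install_if_idle()


# Two inserts and one delete reach the source before any query is evaluated.
source["r1"].append((4, 2))
q1 = on_insert("Q1", "r1", (4, 2))
source["r2"].append((2, 5))
q2 = on_insert("Q2", "r2", (2, 5))
source["r1"].remove((1, 2))
on_delete("r1", (1, 2))
on_answer("Q1", source_answer(q1))
on_answer("Q2", source_answer(q2))
```

Both answers contain the tuple `(4, 5)`; because Collect is duplicate-free, the second copy is simply ignored, which is why no compensating query is needed in this variant.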

ECA-Local: combines the compensating queries of ECA and the local updates of ECA-Key to create a more streamlined query. Maintaining the order of execution of local and non-local processes is complicated and creates greater overhead than the other algorithms. Future work is needed to determine whether this is a worthwhile approach.

Performance Comparison [Figures: Total Bytes Transferred vs. Cardinality of Relation; Total Bytes Transferred vs. # of Source Updates]

Review of ECA. Incremental updating approach: it doesn’t start from scratch every time. No additional burden is placed on the sources (no timestamps or locks). Compensating queries are only used when more than one update is occurring, keeping computation costs low.