GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

Slides:



Advertisements
Similar presentations
Examples of Physical Query Plan Alternatives
Advertisements

What is a Database By: Cristian Dubon.
Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.
Maintenance of Materialized Views: Problems, Techniques, and Applications Ashish Gupta IBM almaden Research Center Inderpal Singh Mumick AT&T Bell Laboratories.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 SQL: Queries, Programming, Triggers Chapter 5 Modified by Donghui Zhang.
CS 540 Database Management Systems
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.
TIMBER A Native XML Database Xiali He The Overview of the TIMBER System in University of Michigan.
COMP 3715 Spring 05. Working with data in a DBMS Any database system must allow user to  Define data Relations Attributes Constraints  Manipulate data.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
INFS614, Fall 08 1 Relational Algebra Lecture 4. INFS614, Fall 08 2 Relational Query Languages v Query languages: Allow manipulation and retrieval of.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
Xyleme A Dynamic Warehouse for XML Data of the Web.
CS 290C: Formal Models for Web Software Lecture 10: Language Based Modeling and Analysis of Navigation Errors Instructor: Tevfik Bultan.
CS 4432query processing1 CS4432: Database Systems II.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
Data Cleaning and Transformation - Tools Helena Galhardas DEI IST.
PowerPoint Presentation for Dennis, Wixom & Tegarden Systems Analysis and Design Copyright 2001 © John Wiley & Sons, Inc. All rights reserved. Slide 1.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Chapter 14: Advanced Topics: DBMS, SQL, and ASP.NET
SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas.
10/3/2000SIMS 257: Database Management -- Ray Larson Relational Algebra and Calculus University of California, Berkeley School of Information Management.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
Advanced Database CS-426 Week 2 – Logic Query Languages, Object Model.
Semantic Interoperability Jérôme Euzenat INRIA & LIG France Natasha Noy Stanford University USA.
XML-QL A Query Language for XML Charuta Nakhe
Efficient Evaluation of XQuery over Streaming Data Xiaogang Li Gagan Agrawal The Ohio State University.
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li.
DBSQL 14-1 Copyright © Genetic Computer School 2009 Chapter 14 Microsoft SQL Server.
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
Lecture 05 Structured Query Language. 2 Father of Relational Model Edgar F. Codd ( ) PhD from U. of Michigan, Ann Arbor Received Turing Award.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Object Persistence (Data Base) Design Chapter 13.
Object Persistence Design Chapter 13. Key Definitions Object persistence involves the selection of a storage format and optimization for performance.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
CS4432: Database Systems II Query Processing- Part 2.
7 Strategies for Extracting, Transforming, and Loading.
Dec. 13, 2002 WISE2002 Processing XML View Queries Including User-defined Foreign Functions on Relational Databases Yoshiharu Ishikawa Jun Kawada Hiroyuki.
Session 1 Module 1: Introduction to Data Integrity
Mining real world data RDBMS and SQL. Index RDBMS introduction SQL (Structured Query language)
Information Integration 15 th Meeting Course Name: Business Intelligence Year: 2009.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
Lecture 15: Query Optimization. Very Big Picture Usually, there are many possible query execution plans. The optimizer is trying to chose a good one.
Chapter 13: Query Processing
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
CS4432: Database Systems II Query Processing- Part 1 1.
1 Chengkai Li Kevin-Chen-Chuan Chang Ihab Ilyas Sumin Song Presented by: Mariam John CSE /20/2006 RankSQL: Query Algebra and Optimization for Relational.
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
Efficient Evaluation of XQuery over Streaming Data
Potter’s Wheel: An Interactive Data Cleaning System
Pig Latin - A Not-So-Foreign Language for Data Processing
Introduction to Query Optimization
Relational Algebra Chapter 4, Part A
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Data Integration for Relational Web
Implementation of Relational Operations
Evaluation of Relational Operations: Other Techniques
Evaluation of Relational Operations: Other Techniques
Relational Algebra Chpt 4a Xintao Wu Raghu Ramakrishnan
Presentation transcript:

GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

GTI 2007/08 H.Galhardas Existing technology Ad-hoc programs written in a programming language like C or Java or using an RDBMS proprietary language –Programs difficult to optimize and maintain RDBMS mechanisms for guaranteeing integrity constraints –Do not address important data instance problems Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool

GTI 2007/08 H.Galhardas Problems of data quality solutions (1) The semantics of some data transformations is defined in terms of their implementation algorithms App. Domain 1 App. Domain 2 App. Domain 3 Data cleaning transformations...

GTI 2007/08 H.Galhardas There is a lack of interactive facilities to tune a data cleaning application program Problems of data quality solutions (2) Dirty Data Cleaning process Clean dataRejected data

GTI 2007/08 H.Galhardas Motivating example (1) DirtyData(paper:String) Data Cleaning & Transformation Events(eventKey, name) Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year) Authors(authorKey, name) PubsAuthors(pubKey, authorKey)

GTI 2007/08 H.Galhardas Motivating example (2) [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996 [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self- maintianable for data warehousing, PDIS’95 DirtyData Data Cleaning & Transformation PDIS | Conference on Parallel and Distributed Information Systems Events QGMW96| Making Views Self-Maintainable for Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996 Publications Authors DQua | Dallan Quass AGup | Ashish Gupta JWid | Jennifer Widom ….. QGMW96 | DQua QGMW96 | AGup …. PubsAuthors

GTI 2007/08 H.Galhardas Modeling a data quality process A data quality process is modeled by a directed acyclic graph of data transformations DirtyData DirtyAuthors Authors Duplicate Elimination Extraction Standardization Formatting DirtyTitles...DirtyEvents Cities Tags

GTI 2007/08 H.Galhardas AJAX features An extensible data quality framework –Logical operators as extensions of relational algebra –Physical execution algorithms A declarative language for logical operators –SQL extension A debugger facility for tuning a data cleaning program application –Based on a mechanism of exceptions

GTI 2007/08 H.Galhardas AJAX features An extensible data quality framework –Logical operators as extensions of relational algebra –Physical execution algorithms A declarative language for logical operators –SQL extension A debugger facility for tuning a data cleaning program application –Based on a mechanism of exceptions

GTI 2007/08 H.Galhardas Logical level: parametric operators View: arbitrary SQL query Map: iterator-based one-to-many mapping with arbitrary user-defined functions Match: iterator-based approximate join Cluster: uses an arbitrary clustering function Merge: extends SQL group-by with user-defined aggregate functions Apply: executes an arbitrary user-defined algorithm Map Match Merge ClusterView Apply

GTI 2007/08 H.Galhardas Logical level DirtyData DirtyAuthors Authors Duplicate Elimination Extraction Standardization Formatting DirtyTitles... Cities Tags

GTI 2007/08 H.Galhardas Logical level DirtyData DirtyAuthors Map Cluster Match Merge Authors Map Duplicate Elimination Extraction Standardization Formatting DirtyTitles... Cities Tags DirtyData DirtyAuthors TC NL Authors SQL Scan Java Scan Physical level DirtyTitles... Java Scan Cities Tags

GTI 2007/08 H.Galhardas Match Input: 2 relations Finds data records that correspond to the same real object Calls distance functions for comparing field values and computing the distance between input tuples Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples

GTI 2007/08 H.Galhardas Example Cluster Match Merge Duplicate Elimination Authors DirtyAuthors MatchAuthors

GTI 2007/08 H.Galhardas Example CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Cluster Match Merge Duplicate Elimination Authors DirtyAuthors MatchAuthors

GTI 2007/08 H.Galhardas Example CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Input: DirtyAuthors(authorKey, name) 861|johann christoph freytag 822|jc freytag 819|j freytag 814|j-c freytag Output: MatchAuthors(authorKey1, authorKey2, name1, name2) 861|822|johann christoph freytag| jc freytag 822|814|jc freytag|j-c freytag... Cluster Match Merge Duplicate Elimination Authors DirtyAuthors MatchAuthors

GTI 2007/08 H.Galhardas Implementation of the match operator  s 1  S 1, s 2  S 2 (s 1, s 2 ) is a match if editDistance (s 1, s 2 ) < maxDist

GTI 2007/08 H.Galhardas Nested loop S1S1 S2S2... Very expensive evaluation when handling large amounts of data  Need alternative execution algorithms for the same logical specification editDistance

GTI 2007/08 H.Galhardas A database solution CREATE TABLE MatchAuthors AS SELECT authorKey1, authorKey2, distance FROM ( SELECT a1.authorKey authorKey1, a2.authorKey authorKey2, editDistance (a1.name, a2.name) distance FROM DirtyAuthors a1, DirtyAuthors a2) WHERE distance < maxDist;  No optimization supported for a Cartesian product with external function calls

GTI 2007/08 H.Galhardas Window scanning S n

GTI 2007/08 H.Galhardas Window scanning S n

GTI 2007/08 H.Galhardas Window scanning S n  May loose some matches

GTI 2007/08 H.Galhardas String distance filtering S1S1 S2S2 maxDist = 1 John Smith John Smit Jogn Smith John Smithe length length- 1 length length + 1 editDistance

GTI 2007/08 H.Galhardas Annotation-based optimization The user specifies types of optimization The system suggests which algorithm to use Ex: CREATE MATCHING MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET dist = editDistance(da1.name, da2.name) WHERE dist < maxDist % distance-filtering: map= length; dist = abs % INTO MatchAuthors

GTI 2007/08 H.Galhardas AJAX features An extensible data quality framework –Logical operators as extensions of relational algebra –Physical execution algorithms A declarative language for logical operators –SQL extension A debugger facility for tuning a data cleaning program application –Based on a mechanism of exceptions

GTI 2007/08 H.Galhardas DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paper KEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper } Declarative specification

GTI 2007/08 H.Galhardas DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paper KEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper } Graph of data transformations Declarative specification

GTI 2007/08 H.Galhardas AJAX features An extensible data quality framework –Logical operators as extensions of relational algebra –Physical execution algorithms A declarative language for logical operators –SQL extension A debugger facility for tuning a data cleaning program application –Based on a mechanism of exceptions

GTI 2007/08 H.Galhardas Management of exceptions Problem: to mark tuples not handled by the cleaning criteria of an operator Solution: to specify the generation of exception tuples within a logical operator –exceptions are thrown by external functions –output constraints are violated

GTI 2007/08 H.Galhardas Debugger facility Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions Supports the interactive data modification and, in the future, the incremental execution of logical operators

GTI 2007/08 H.Galhardas Debugging exceptions

GTI 2007/08 H.Galhardas Architecture

GTI 2007/08 H.Galhardas References Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian- Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: