GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)
Existing technology
Ad-hoc programs written in a programming language like C or Java, or in an RDBMS proprietary language
  – Programs are difficult to optimize and maintain
RDBMS mechanisms for guaranteeing integrity constraints
  – Do not address important data instance problems
Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool
Problems of data quality solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Figure: Data cleaning transformations; App. Domain 1, App. Domain 2, App. Domain 3, ...]
Problems of data quality solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Figure: Dirty data → Cleaning process → Clean data, Rejected data]
Motivating example (1)
Input: DirtyData(paper: String)
Data Cleaning & Transformation
Output:
  Events(eventKey, name)
  Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
  Authors(authorKey, name)
  PubsAuthors(pubKey, authorKey)
Motivating example (2)
DirtyData:
  [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
  [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95
Data Cleaning & Transformation
Events:
  PDIS | Conference on Parallel and Distributed Information Systems
Publications:
  QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors:
  DQua | Dallan Quass
  AGup | Ashish Gupta
  JWid | Jennifer Widom
  ...
PubsAuthors:
  QGMW96 | DQua
  QGMW96 | AGup
  ...
Modeling a data quality process
A data quality process is modeled by a directed acyclic graph of data transformations
[Figure: graph from DirtyData through DirtyAuthors to Authors, with stages Extraction, Standardization, Formatting, and Duplicate Elimination; other nodes include DirtyTitles, ..., DirtyEvents, Cities, Tags]
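A minimal Python sketch of running such a graph in dependency order. It is not AJAX code: the function `run_flow`, the toy transformations `extract` and `dedupe`, and the sample rows are assumptions made for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def run_flow(transformations, sources):
    """Run a DAG of transformations.

    transformations maps an output relation name to (input names, function);
    sources maps source relation names to their rows.
    """
    graph = {out: set(inputs) for out, (inputs, _) in transformations.items()}
    data = dict(sources)
    # static_order() yields every node after all of its inputs and raises
    # graphlib.CycleError if the graph is not acyclic.
    for node in TopologicalSorter(graph).static_order():
        if node in data:
            continue  # source relation, already available
        inputs, fn = transformations[node]
        data[node] = fn(*(data[name] for name in inputs))
    return data

# Hypothetical transformations standing in for Extraction and
# Duplicate Elimination.
def extract(rows):
    return [name.strip().lower() for row in rows for name in row.split(";")]

def dedupe(rows):
    return sorted(set(rows))

flow = {
    "DirtyAuthors": (["DirtyData"], extract),
    "Authors": (["DirtyAuthors"], dedupe),
}
result = run_flow(flow, {"DirtyData": ["Dallan Quass; Ashish Gupta", "dallan quass"]})
print(result["Authors"])  # ['ashish gupta', 'dallan quass']
```

Keeping each transformation as a plain function of its input relations is one simple way to let the same graph be re-run after a single step is changed.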
AJAX features
An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
A declarative language for logical operators
  – SQL extension
A debugger facility for tuning a data cleaning application
  – Based on a mechanism of exceptions
Logical level: parametric operators
View: arbitrary SQL query
Map: iterator-based one-to-many mapping with arbitrary user-defined functions
Match: iterator-based approximate join
Cluster: uses an arbitrary clustering function
Merge: extends SQL group-by with user-defined aggregate functions
Apply: executes an arbitrary user-defined algorithm
Logical level and physical level
[Figure: the graph from DirtyData through DirtyAuthors to Authors (with DirtyTitles, ..., Cities, Tags) shown at two levels. Logical level: Map, Match, Cluster, and Merge operators covering Extraction, Standardization, Formatting, and Duplicate Elimination. Physical level: SQL Scan, Java Scan, NL, and TC algorithms.]
Match
Input: 2 relations
Finds data records that correspond to the same real object
Calls distance functions for comparing field values and computing the distance between input tuples
Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
Example
    CREATE MATCH MatchDirtyAuthors
    FROM DirtyAuthors da1, DirtyAuthors da2
    LET distance = editDistance(da1.name, da2.name)
    WHERE distance < maxDist
    INTO MatchAuthors
Input: DirtyAuthors(authorKey, name)
    861 | johann christoph freytag
    822 | jc freytag
    819 | j freytag
    814 | j-c freytag
Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
    861 | 822 | johann christoph freytag | jc freytag
    822 | 814 | jc freytag | j-c freytag
    ...
[Figure: Duplicate Elimination (Match, Cluster, Merge): DirtyAuthors → MatchAuthors → ... → Authors]
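The Cluster and Merge steps that follow Match in duplicate elimination could look roughly like the Python sketch below. It assumes clustering by transitive closure over the matched key pairs and a user-defined aggregate that keeps the longest name. Both choices, and the names `cluster_by_transitive_closure`, `merge`, and `longest`, are illustrative assumptions rather than the AJAX implementation.

```python
def cluster_by_transitive_closure(keys, pairs):
    """Group keys connected by match pairs, using a small union-find."""
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for k in keys:
        clusters.setdefault(find(k), []).append(k)
    return sorted(sorted(c) for c in clusters.values())

def merge(clusters, names, choose):
    """Apply a user-defined aggregate to the names of each cluster."""
    return [choose([names[k] for k in cluster]) for cluster in clusters]

def longest(values):
    # Hypothetical aggregate: keep the longest spelling of the name.
    return max(values, key=len)

# Keys and match pairs taken from the example above.
names = {
    861: "johann christoph freytag",
    822: "jc freytag",
    819: "j freytag",
    814: "j-c freytag",
}
match_pairs = [(861, 822), (822, 814)]

clusters = cluster_by_transitive_closure(list(names), match_pairs)
print(clusters)                         # [[814, 822, 861], [819]]
print(merge(clusters, names, longest))  # ['johann christoph freytag', 'j freytag']
```

Only the two pairs listed explicitly in the example output are used here, so key 819 stays in a cluster of its own.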
Implementation of the match operator
For all $s_1 \in S_1$, $s_2 \in S_2$: $(s_1, s_2)$ is a match if $\mathrm{editDistance}(s_1, s_2) < \mathrm{maxDist}$
Nested loop
[Figure: editDistance computed between the tuples of S1 and S2]
Very expensive evaluation when handling large amounts of data
Need alternative execution algorithms for the same logical specification
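As an illustration of why the nested loop is expensive, here is a small Python sketch that calls an edit distance function on every pair of tuples. The Levenshtein implementation and the function names are assumptions for illustration; the slides do not say which edit distance variant AJAX uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def nested_loop_match(s1, s2, max_dist):
    """Compare every tuple of s1 with every tuple of s2: |s1| * |s2| calls."""
    matches = []
    for key1, name1 in s1:
        for key2, name2 in s2:
            distance = edit_distance(name1, name2)
            if distance < max_dist:
                matches.append((key1, key2, distance))
    return matches

s1 = [(822, "jc freytag")]
s2 = [(814, "j-c freytag"), (861, "johann christoph freytag")]
print(nested_loop_match(s1, s2, max_dist=2))  # [(822, 814, 1)]
```

The number of distance computations grows with the product of the two input sizes, which is the cost the alternative algorithms try to avoid.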
A database solution
    CREATE TABLE MatchAuthors AS
    SELECT authorKey1, authorKey2, distance
    FROM (SELECT a1.authorKey authorKey1,
                 a2.authorKey authorKey2,
                 editDistance(a1.name, a2.name) distance
          FROM DirtyAuthors a1, DirtyAuthors a2)
    WHERE distance < maxDist;
No optimization supported for a Cartesian product with external function calls
Window scanning
[Figure: window of size n scanning S]
May lose some matches
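One common reading of window scanning is the sorted-neighborhood idea: sort the relation on a key and compare each tuple only with the next few tuples in that order. The Python sketch below follows that reading. Sorting on the name itself, the Levenshtein distance, and the sample tuples are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def window_match(records, max_dist, window):
    """Compare each tuple with the next (window - 1) tuples in sorted order."""
    ordered = sorted(records, key=lambda record: record[1])
    matches = []
    for i, (key1, name1) in enumerate(ordered):
        for key2, name2 in ordered[i + 1 : i + window]:
            distance = edit_distance(name1, name2)
            if distance < max_dist:
                matches.append((key1, key2, distance))
    return matches

records = [(1, "john smith"), (2, "jogn smith"), (3, "john smit")]

# Sorted order is: jogn smith, john smit, john smith.
print(window_match(records, max_dist=2, window=2))  # [(3, 1, 1)]
print(window_match(records, max_dist=2, window=3))  # [(2, 1, 1), (3, 1, 1)]
```

With a window of 2, the pair (jogn smith, john smith) is never compared because another tuple sorts between them, which is how matches can be lost.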
String distance filtering
Example with maxDist = 1: for "John Smith" in S1, editDistance is computed only against the strings in S2 whose length is length - 1, length, or length + 1:
    John Smit (length - 1)
    Jogn Smith (length)
    John Smithe (length + 1)
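The filter relies on the fact that the edit distance between two strings is at least the difference of their lengths, so pairs whose lengths differ by more than maxDist cannot match. A hedged Python sketch follows: the bucketing by length, the Levenshtein distance, and all names are assumptions for illustration. The candidate range uses lengths within maxDist, as in the slide, which is slightly wider than the strict `<` predicate needs but never drops a match.

```python
from collections import defaultdict

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def filtered_match(s1, s2, max_dist):
    """Match s1 against s2, calling edit_distance only on length-compatible pairs."""
    by_length = defaultdict(list)
    for key, name in s2:
        by_length[len(name)].append((key, name))

    matches, calls = [], 0
    for key1, name1 in s1:
        # Candidate lengths: len(name1) - max_dist .. len(name1) + max_dist.
        for length in range(len(name1) - max_dist, len(name1) + max_dist + 1):
            for key2, name2 in by_length.get(length, []):
                calls += 1
                distance = edit_distance(name1, name2)
                if distance < max_dist:
                    matches.append((key1, key2, distance))
    return matches, calls

s1 = [(1, "john smith")]
s2 = [(2, "john smit"), (3, "jogn smith"), (4, "john smithe"),
      (5, "johann christoph freytag")]
matches, calls = filtered_match(s1, s2, max_dist=2)
print(matches)  # [(1, 2, 1), (1, 3, 1), (1, 4, 1)]
print(calls)    # 3 distance computations instead of 4
```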
Annotation-based optimization
The user specifies types of optimization
The system suggests which algorithm to use
Example:
    CREATE MATCHING MatchDirtyAuthors
    FROM DirtyAuthors da1, DirtyAuthors da2
    LET dist = editDistance(da1.name, da2.name)
    WHERE dist < maxDist
    % distance-filtering: map= length; dist = abs %
    INTO MatchAuthors
Declarative specification
    DEFINE FUNCTIONS AS
      Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
      Generate.generateId(INTEGER) RETURN STRING
      Normal.removeCitationTags(STRING) RETURN STRING (600)
    DEFINE ALGORITHMS AS
      TransitiveClosure
      SourceClustering(STRING)
    DEFINE INPUT DATA FLOWS AS
      TABLE DirtyData (paper STRING (400) );
      TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn;
    DEFINE TRANSFORMATIONS AS
      CREATE MAPPING mapKeDiDa
      FROM DirtyData Dd
      LET keyKdd = generateId(1)
      {SELECT keyKdd AS paperKey, Dd.paper AS paper
       KEY paperKey
       CONSTRAINT NOT NULL mapKeDiDa.paper }
[Figure: Graph of data transformations]
Management of exceptions
Problem: to mark tuples not handled by the cleaning criteria of an operator
Solution: to specify the generation of exception tuples within a logical operator. An exception tuple is generated when:
  – an exception is thrown by an external function
  – an output constraint is violated
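The two sources of exception tuples can be sketched in Python as a map-style operator that routes problem tuples to a separate output instead of stopping. The function `map_with_exceptions`, the year-parsing function, and the constraint are hypothetical and only illustrate the idea.

```python
def map_with_exceptions(rows, fn, constraint):
    """Apply fn to each row; collect rows that fail instead of aborting.

    A row becomes an exception tuple if fn raises, or if the result
    violates the output constraint.
    """
    output, exceptions = [], []
    for row in rows:
        try:
            result = fn(row)
        except Exception as exc:
            exceptions.append((row, f"function raised {type(exc).__name__}"))
            continue
        if constraint(result):
            output.append(result)
        else:
            exceptions.append((row, "output constraint violated"))
    return output, exceptions

# Hypothetical external function and output constraint.
def parse_year(text):
    return int(text)

def plausible_year(year):
    return 1900 <= year <= 2100

years, rejected = map_with_exceptions(["1996", "PDIS'95", "95"],
                                      parse_year, plausible_year)
print(years)     # [1996]
print(rejected)  # [("PDIS'95", 'function raised ValueError'), ('95', 'output constraint violated')]
```

Keeping the original input row in each exception tuple is what lets a user later inspect and correct the tuples the operator could not handle.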
Debugger facility
Supports backward and forward data derivation of tuples with respect to an operator, in order to debug exceptions
Supports interactive data modification and, in the future, the incremental execution of logical operators
Debugging exceptions
Architecture
References
Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001