2
SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas
3
Plan
Context
Problem statement
Contributions
Our data cleaning solution
Validation
Related solutions
Conclusions
4
Application context
– Eliminate errors and duplicates within a single source
– Integrate data from different sources
– Migrate poorly structured data into structured data
5
Typical architecture
[Figure labels: SOURCE DATA, Data Extraction, Data Transformation, Data Loading, TARGET DATA, Human Knowledge, Metadata, Dictionaries, Data Analysis, Schema Integration, ...]
6
Data cleaning
The activity of transforming source data into target data that is free of errors, duplicates, and inconsistencies
7
Motivating example (1)
Input: DirtyData(paper: String)
Output of data cleaning:
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
PubsAuthors(pubKey, authorKey)
8
Motivating example (2)
DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95
After data cleaning:
Events:
PDIS | Conference on Parallel and Distributed Information Systems
Publications:
QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors:
DQua | Dallan Quass
AGup | Ashish Gupta
JWid | Jennifer Widom
…..
PubsAuthors:
QGMW96 | DQua
QGMW96 | AGup
….
9
Plan
Context
Problem statement
Contributions
Our data cleaning solution
Validation
Related solutions
Conclusions
10
Modeling a data cleaning process
A data cleaning process is modeled by a directed acyclic graph of data transformations
[Figure labels: DirtyData, DirtyAuthors, Authors, DirtyTitles..., DirtyEvents, Cities, Tags; transformations: Extraction, Standardization, Formatting, Duplicate Elimination]
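The graph model can be illustrated with a small Python sketch that runs transformations in dependency order. This is not AJAX code: `run_cleaning_graph`, `standardize`, and `deduplicate` are hypothetical names, and the two toy transformations only stand in for the Standardization and Duplicate Elimination steps named on the slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_cleaning_graph(steps, inputs):
    """Run each transformation once all of its upstream relations exist.

    steps:  relation name -> (list of upstream relation names, function)
    inputs: relation name -> source data (relations with no producer)
    """
    graph = {name: set(deps) for name, (deps, _) in steps.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in results:  # source relation, nothing to compute
            continue
        deps, transform = steps[name]
        results[name] = transform(*(results[d] for d in deps))
    return results


# Hypothetical stand-ins for two transformations of the graph.
def standardize(rows):
    return [r.strip().lower() for r in rows]


def deduplicate(rows):
    return sorted(set(rows))


steps = {
    "DirtyAuthors": (["DirtyData"], standardize),
    "Authors": (["DirtyAuthors"], deduplicate),
}
results = run_cleaning_graph(
    steps, {"DirtyData": [" Dallan Quass", "dallan quass ", "Ashish Gupta"]}
)
print(results["Authors"])
```

Because the graph is acyclic, a topological order always exists, so every transformation sees complete inputs.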
11
Existing technology
Ad-hoc code
– difficult to maintain
Extraction Transformation Loading (ETI, Informatica, Sagent)
– limited cleaning functionality
Data Reengineering (Integrity)
– fixed implementation for certain operators
Specific-domain cleaning (idCentric, PureIntegrate)
– names and addresses
Duplicate elimination (DataCleanser, matchIt)
– finds/eliminates duplicates
12
Problems of existing solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Figure labels: App. Domain 1, App. Domain 2, App. Domain 3, Data cleaning transformations, ...]
13
Problems of existing solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Figure labels: Dirty Data, Cleaning process, Clean data, Rejected data]
14
AJAX
An extensible data cleaning framework
A declarative language for logical operators
Efficient implementation of the match operator
A debugger facility for tuning a data cleaning application program
15
Data cleaning framework
Logical level: set of logical operators to express the cleaning criteria enclosed in each data transformation
Physical level: set of algorithms that implement the logical operators
16
Logical level: parametric operators
View: arbitrary SQL query
Map: iterator-based one-to-many mapping with arbitrary user-defined functions
Match: iterator-based approximate join
Cluster: uses an arbitrary clustering function
Merge: extends SQL group-by with user-defined aggregate functions
Apply: executes an arbitrary user-defined algorithm
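To make one of these operators concrete, here is a minimal Python sketch of the idea behind Merge: group rows by a key, then apply a user-defined aggregate to each group. It is not AJAX syntax. The function names and cluster identifiers are hypothetical, and picking the longest name is only one possible aggregate.

```python
from collections import defaultdict


def merge(rows, key, aggregate):
    """Group rows by key(row), then reduce each group with aggregate."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: aggregate(members) for k, members in groups.items()}


def longest_name(members):
    # Illustrative user-defined aggregate: keep the most complete spelling.
    return max((name for _, name in members), key=len)


# (cluster id, name) pairs; the cluster ids are made up for illustration.
clustered = [
    (1, "johann christoph freytag"),
    (1, "jc freytag"),
    (1, "j-c freytag"),
    (2, "jennifer widom"),
    (2, "j widom"),
]
merged = merge(clustered, key=lambda row: row[0], aggregate=longest_name)
print(merged)
```

A plain SQL group-by only offers built-in aggregates such as COUNT or MAX. The point of the sketch is that the aggregate is an arbitrary function over the whole group.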
17
Logical level
[Figure labels: DirtyData, DirtyAuthors, Authors, DirtyTitles..., Cities, Tags; transformations: Extraction, Standardization, Formatting, Duplicate Elimination]
18
Logical level
[Figure labels: DirtyData, DirtyAuthors, Authors, DirtyTitles..., Cities, Tags; transformations: Extraction, Standardization, Formatting, Duplicate Elimination; operators: Map, Map, Match, Cluster, Merge]
Physical level
[Figure labels: DirtyData, DirtyAuthors, Authors, DirtyTitles..., Cities, Tags; algorithms: Java Scan, Java Scan, SQL Scan, NL, TC]
19
Contributions
An extensible data cleaning framework
A declarative language for logical operators
Efficient implementation of the match operator
A debugger facility for tuning a data cleaning application program
20
Match
Input: 2 relations
Finds data records that correspond to the same real object
Calls distance functions for comparing field values and computing the distance between input tuples
Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
21
Example
[Figure labels: Duplicate Elimination (Match, Cluster, Merge), DirtyAuthors, MatchAuthors, Authors]
22
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
[Figure labels: Duplicate Elimination (Match, Cluster, Merge), DirtyAuthors, MatchAuthors, Authors]
23
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Input: DirtyAuthors(authorKey, name)
861|johann christoph freytag
822|jc freytag
819|j freytag
814|j-c freytag
Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861|822|johann christoph freytag|jc freytag
822|814|jc freytag|j-c freytag
...
[Figure labels: Duplicate Elimination (Match, Cluster, Merge), DirtyAuthors, MatchAuthors, Authors]
24
Implementation of the match operator
For $s_1 \in S_1$ and $s_2 \in S_2$, $(s_1, s_2)$ is a match if $\mathrm{editDistance}(s_1, s_2) < \mathrm{maxDist}$
25
Nested loop
[Figure: editDistance evaluated between the tuples of S1 and the tuples of S2]
Very expensive evaluation when handling large amounts of data
Need alternative execution algorithms for the same logical specification
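The cost of the nested loop is easy to see in a minimal Python sketch. This is an illustration, not the AJAX implementation: the names `edit_distance` and `nested_loop_match` are ours, Levenshtein distance is assumed for editDistance, and the sample tuples are made up. The strict `<` follows the WHERE clause of the match specification.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def nested_loop_match(s1, s2, max_dist):
    """Compare every tuple of s1 with every tuple of s2: |S1| * |S2| calls."""
    matches = []
    for key1, name1 in s1:
        for key2, name2 in s2:
            if edit_distance(name1, name2) < max_dist:
                matches.append((key1, key2, name1, name2))
    return matches


s1 = [(1, "John Smith")]
s2 = [(10, "John Smit"), (11, "Jogn Smith"), (12, "John Smithe"), (13, "Jane Doe")]
matches = nested_loop_match(s1, s2, max_dist=2)
print(matches)
```

Each distance call is itself quadratic in the string lengths, which is why the full Cartesian product becomes expensive on large relations.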
26
A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
No optimization supported for a Cartesian product with external function calls
27
Window scanning
[Figure labels: S, n]
29
Window scanning
[Figure labels: S, n]
May lose some matches
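The slides do not spell out the algorithm. The Python sketch below assumes the common reading of window scanning: sort the relation on a key, then compare each tuple only with the next few tuples inside a sliding window. The function names, the choice of sort key (the name itself), and the sample data are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def window_scan_match(records, window, max_dist):
    """Sort by name, then compare each record with the next window - 1 records."""
    ordered = sorted(records, key=lambda r: r[1])
    matches = []
    for i, (key1, name1) in enumerate(ordered):
        for key2, name2 in ordered[i + 1 : i + window]:
            if edit_distance(name1, name2) < max_dist:
                matches.append((key1, key2))
    return matches


records = [(1, "john smith"), (2, "kohn smith"), (3, "john smit"), (4, "karl meyer")]
narrow = window_scan_match(records, window=2, max_dist=2)
wide = window_scan_match(records, window=len(records), max_dist=2)
print(narrow, wide)
```

With a window of 2, the pair "john smith" / "kohn smith" is never compared, because the typo in the first letter puts the two strings far apart in sort order. This is the sense in which the method may lose matches. A window as large as the relation finds the pair, at nested-loop cost.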
30
String distance filtering
[Figure: relations S1 and S2, maxDist = 1, editDistance]
John Smith: length
John Smit: length - 1
Jogn Smith: length
John Smithe: length + 1
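The filter rests on a simple bound: the edit distance between two strings is at least the difference of their lengths, so pairs whose lengths differ too much can be skipped without calling editDistance. The Python sketch below is one way to exploit this and not necessarily how AJAX implements it. The function names and data are hypothetical. The code keeps the strict `distance < maxDist` test of the match specification, so with `max_dist=2` it inspects lengths from length - 1 to length + 1.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def length_filtered_match(s1, s2, max_dist):
    """Call edit_distance only on pairs whose lengths differ by less than max_dist."""
    by_length = defaultdict(list)
    for key2, name2 in s2:
        by_length[len(name2)].append((key2, name2))
    matches, comparisons = [], 0
    for key1, name1 in s1:
        n = len(name1)
        # |len(name1) - len(name2)| <= edit_distance < max_dist
        for length in range(n - max_dist + 1, n + max_dist):
            for key2, name2 in by_length.get(length, []):
                comparisons += 1
                if edit_distance(name1, name2) < max_dist:
                    matches.append((key1, key2))
    return matches, comparisons


s1 = [(1, "John Smith")]
s2 = [(10, "John Smit"), (11, "Jogn Smith"), (12, "John Smithe"),
      (13, "J Smith"), (14, "Jonathan Smithson")]
matches, comparisons = length_filtered_match(s1, s2, max_dist=2)
print(matches, comparisons)
```

Unlike window scanning, this filter is exact: it returns the same matches as the nested loop and only avoids distance computations that could not have succeeded.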
31
Annotation-based optimization
The user specifies types of optimization
The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map= length; dist = abs %
INTO MatchAuthors
32
Contributions
An extensible data cleaning framework
A declarative language for logical operators
Efficient implementation of the match operator
A debugger facility for tuning a data cleaning application program
33
Management of exceptions
Problem: to mark tuples not handled by the cleaning criteria of an operator
Solution: to specify the generation of exceptional tuples within a logical operator
– exceptions are thrown by external functions
– output constraints are violated
34
Example (1)
CREATE MAP ExtractionCities
FROM StandardizedDirtyData dd
LET city = extractCities(dd.paper, Cities),
{ SELECT dd.paperKey AS pubKey, city AS city
  INTO ExtractedCities
  CONSTRAINT NOT NULL city }
[Figure labels: Map (Extraction), StandardizedDirtyData(pubKey, paper), Cities, ExtractedCities(pubKey, city)]
35
Example (2)
[Figure labels: ExtractionCities, Cities, ExtractedCities, StandardizedDirtyData, StandardizedDirtyData exc]
StandardizedDirtyData:
4|y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park
StandardizedDirtyData exc:
4| ManyDifferentCities
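The exception mechanism can be sketched in Python as a map step with two outputs: one for regular tuples and one for tuples that either raise an exception in the external function or violate the output constraint. This illustrates the idea and is not AJAX's implementation. `extract_city` is a hypothetical stand-in for extractCities that uses naive substring lookup, and the tuples with keys 7 and 9 are invented.

```python
class ManyDifferentCities(Exception):
    """Raised when more than one known city appears in a paper string."""


def extract_city(paper, cities):
    # Hypothetical stand-in for the external function extractCities.
    found = [c for c in cities if c in paper]
    if len(found) > 1:
        raise ManyDifferentCities()
    return found[0] if found else None


def map_extraction_cities(tuples, cities):
    """Return (extracted, exceptions); no input tuple is silently dropped."""
    extracted, exceptions = [], []
    for pub_key, paper in tuples:
        try:
            city = extract_city(paper, cities)
        except ManyDifferentCities as exc:
            exceptions.append((pub_key, type(exc).__name__))
            continue
        if city is None:  # output constraint: NOT NULL city
            exceptions.append((pub_key, "NOT NULL city violated"))
            continue
        extracted.append((pub_key, city))
    return extracted, exceptions


cities = ["madison", "college park", "miami beach"]
data = [
    (4, "y ioannidis r ng k shim and t sellis parametric query optimization "
        "technical report univ of wisconsin madison and univ of maryland college park"),
    (7, "proceedings of pdis miami beach florida usa"),
    (9, "a report with no known city"),
]
extracted, exceptions = map_extraction_cities(data, cities)
print(extracted, exceptions)
```

Tuple 4 lands in the exception output tagged ManyDifferentCities, as on the slide, so it can be inspected and corrected later.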
36
Debugger facility
Supports backward and forward data derivation of tuples with respect to an operator, to debug exceptions
Supports interactive data modification and the incremental execution of some logical operators
37
Backward/forward data derivation
[Figure labels: KeyDirtyData, StandardizeData, StandardizedDirtyData, ExtractionCities, Cities, ExtractedCities, StandardizedDirtyData exc, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, BackwardDerivation, ForwardDerivation]
Tuples shown:
4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992
4| ManyDifferentCities
4|Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland
38
Interactive data correction (1)
[Figure labels: KeyDirtyData, StandardizeData, StandardizedDirtyData, ExtractionCities, Cities, ExtractedCities, StandardizedDirtyData exc, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents]
Tuples shown:
4| ManyDifferentCities
4|Technical Report, Univ. Of Wisconsin and Univ. Of Maryland
4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
39
Interactive data correction (2)
[Figure labels: KeyDirtyData, StandardizeData, StandardizedDirtyData, ExtractionCities, Cities, ExtractedCities, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, incremental]
Tuples shown:
4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
4| Technical Report, Univ. Of Wisconsin
101| Technical Report, Univ. Of Maryland
4| Madison
101| College Park
40
AJAX Architecture
41
AJAX Demo