SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas.


2 SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas

3 SAD Tagus 2 Plan: → Context; Problem statement; Contributions; Our data cleaning solution; Validation; Related solutions; Conclusions

4 SAD Tagus 3 Application context –Eliminate errors and duplicates within a single source –Integrate data from different sources –Migrate poorly structured data into structured data

5 SAD Tagus 4 Typical architecture [Figure labels: SOURCE DATA, Data Extraction, Data Transformation, Data Loading, TARGET DATA, Human Knowledge, Metadata, Dictionaries, Data Analysis, Schema Integration, ...]

6 SAD Tagus 5 Data cleaning Activity of transforming source data into target data without errors, duplicates, and inconsistencies

7 SAD Tagus 6 Motivating example (1) DirtyData(paper:String) Data Cleaning Events(eventKey, name) Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year) Authors(authorKey, name) PubsAuthors(pubKey, authorKey)

8 SAD Tagus 7 Motivating example (2)
DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self- maintianable for data warehousing, PDIS’95
Output of Data Cleaning:
Events: PDIS | Conference on Parallel and Distributed Information Systems
Publications: QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors: DQua | Dallan Quass; AGup | Ashish Gupta; JWid | Jennifer Widom; …
PubsAuthors: QGMW96 | DQua; QGMW96 | AGup; …

9 SAD Tagus 8 Plan: Context; → Problem statement; Contributions; Our data cleaning solution; Validation; Related solutions; Conclusions

10 SAD Tagus 9 Modeling a data cleaning process: a data cleaning process is modeled by a directed acyclic graph of data transformations. [Figure labels: tables DirtyData, DirtyAuthors, DirtyTitles, ..., DirtyEvents, Cities, Tags, Authors; transformations Extraction, Standardization, Formatting, Duplicate Elimination]
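As an illustration of the DAG model, the sketch below orders a handful of transformations so that each one runs only after the transformations it reads from. The transformation names come from the slide's figure, but the edges between them are an assumption made for this example, and `execution_order` is a hypothetical helper, not part of AJAX.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: transformation -> transformations whose output it reads.
# Names are taken from the slide; the edges are hypothetical.
dependencies = {
    "Extraction": set(),
    "Standardization": {"Extraction"},
    "Formatting": {"Standardization"},
    "Duplicate Elimination": {"Formatting"},
}


def execution_order(deps):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph must be acyclic, a cyclic specification is rejected before any transformation runs.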

11 SAD Tagus 10 Existing technology: Ad-hoc code – difficult to maintain; Extraction Transformation Loading (ETI, Informatica, Sagent) – limited cleaning functionality; Data Reengineering (Integrity) – fixed implementation for certain operators; Domain-specific cleaning (idCentric, PureIntegrate) – names and addresses; Duplicate elimination (DataCleanser, matchIt) – finds/eliminates duplicates

12 SAD Tagus 11 Problems of existing solutions (1): The semantics of some data transformations is defined in terms of their implementation algorithms. [Figure labels: App. Domain 1, App. Domain 2, App. Domain 3, Data cleaning transformations, ...]

13 SAD Tagus 12 Problems of existing solutions (2): There is a lack of interactive facilities to tune a data cleaning application program. [Figure labels: Dirty Data, Cleaning process, Clean data, Rejected data]

14 SAD Tagus 13 AJAX An extensible data cleaning framework A declarative language for logical operators Efficient implementation of the match operator A debugger facility for tuning a data cleaning program application

15 SAD Tagus 14 Data cleaning framework Logical level: set of logical operators to express cleaning criteria enclosed in each data transformation Physical level: set of algorithms that implement the logical operations

16 SAD Tagus 15 Logical level: parametric operators. View: arbitrary SQL query; Map: iterator-based one-to-many mapping with arbitrary user-defined functions; Match: iterator-based approximate join; Cluster: uses an arbitrary clustering function; Merge: extends SQL group-by with user-defined aggregate functions; Apply: executes an arbitrary user-defined algorithm
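To make two of these operators concrete, here is a minimal sketch of a one-to-many Map and a Merge with a user-defined aggregate. The function names (`map_operator`, `merge_operator`, `split_authors`, `longest_name`) and the row layouts are assumptions for illustration; they are not the AJAX implementation.

```python
from collections import defaultdict


def map_operator(rows, fn):
    """Map: apply a user-defined function that may return several rows per input row."""
    out = []
    for row in rows:
        out.extend(fn(row))
    return out


def merge_operator(rows, key, aggregate):
    """Merge: group rows by key(row), then reduce each group with a user-defined aggregate."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: aggregate(group) for k, group in groups.items()}


def split_authors(paper):
    # Hypothetical user-defined function: one paper row -> one row per author.
    return [(paper["pubKey"], name.strip()) for name in paper["authors"].split(",")]


def longest_name(group):
    # Hypothetical aggregate: keep the longest spelling found in a cluster.
    return max((name for _, name in group), key=len)


papers = [{"pubKey": 1, "authors": "D. Quass, A. Gupta, I. Mumick, J. Widom"}]
author_rows = map_operator(papers, split_authors)

# (clusterId, name) rows; the cluster assignment is assumed to come from a Cluster step.
clustered = [(1, "johann christoph freytag"), (1, "jc freytag"), (1, "j freytag")]
merged = merge_operator(clustered, key=lambda r: r[0], aggregate=longest_name)
print(author_rows, merged)
```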

17 SAD Tagus 16 Logical level [Figure labels: tables DirtyData, DirtyAuthors, DirtyTitles, ..., Cities, Tags, Authors; transformations Extraction, Standardization, Formatting, Duplicate Elimination]

18 SAD Tagus 17 Logical level [Figure labels: operators Map, Cluster, Match, Merge, Map; transformations Extraction, Standardization, Formatting, Duplicate Elimination; tables DirtyData, DirtyAuthors, DirtyTitles, ..., Cities, Tags, Authors] Physical level [Figure labels: TC, NL, SQL Scan, Java Scan, Java Scan; tables DirtyData, DirtyAuthors, DirtyTitles, ..., Cities, Tags, Authors]

19 SAD Tagus 18 Contributions: An extensible data cleaning framework; → A declarative language for logical operators; Efficient implementation of the match operator; A debugger facility for tuning a data cleaning program application

20 SAD Tagus 19 Match Input: 2 relations Finds data records that correspond to the same real object Calls distance functions for comparing field values and computing the distance between input tuples Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples

21 SAD Tagus 20 Example [Figure labels: Duplicate Elimination; operators Cluster, Match, Merge; tables DirtyAuthors, MatchAuthors, Authors]

22 SAD Tagus 21 Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
[Figure labels: Duplicate Elimination; operators Cluster, Match, Merge; tables DirtyAuthors, MatchAuthors, Authors]

23 SAD Tagus 22 Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Input: DirtyAuthors(authorKey, name)
861|johann christoph freytag
822|jc freytag
819|j freytag
814|j-c freytag
Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861|822|johann christoph freytag|jc freytag
822|814|jc freytag|j-c freytag
...
[Figure labels: Duplicate Elimination; operators Cluster, Match, Merge; tables DirtyAuthors, MatchAuthors, Authors]

24 SAD Tagus 23 Implementation of the match operator: $\forall s_1 \in S_1,\ s_2 \in S_2$, $(s_1, s_2)$ is a match if $\mathrm{editDistance}(s_1, s_2) < \mathit{maxDist}$

25 SAD Tagus 24 Nested loop: each tuple of S1 is compared with each tuple of S2 using editDistance. Very expensive evaluation when handling large amounts of data → Need alternative execution algorithms for the same logical specification
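The nested-loop evaluation can be sketched as follows, using the four DirtyAuthors rows from the example slide. The Levenshtein implementation, the value `max_dist = 2`, and the decision to compare each unordered pair only once are assumptions for this sketch; the slides do not say which maxDist or which variant of editDistance produced the output shown there, so the result here is not meant to reproduce it.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute, each with cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def nested_loop_match(rows, max_dist):
    """Compare every pair of rows once; n rows cost n*(n-1)/2 distance computations."""
    matches = []
    for i, (key1, name1) in enumerate(rows):
        for key2, name2 in rows[i + 1:]:
            if edit_distance(name1, name2) < max_dist:
                matches.append((key1, key2))
    return matches


dirty_authors = [
    (861, "johann christoph freytag"),
    (822, "jc freytag"),
    (819, "j freytag"),
    (814, "j-c freytag"),
]
matches = nested_loop_match(dirty_authors, max_dist=2)
print(matches)
```

The quadratic number of calls to `edit_distance` is what makes this strategy expensive on large inputs.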

26 SAD Tagus 25 A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (
  SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,
         editDistance(a1.name, a2.name) distance
  FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
→ No optimization supported for a Cartesian product with external function calls

27 SAD Tagus 26 Window scanning S n

28 SAD Tagus 27 Window scanning S n

29 SAD Tagus 28 Window scanning S n → May lose some matches
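A minimal sketch of window scanning, assuming it works like a sorted-neighborhood pass: the relation S is sorted (here on the name itself, which is an assumption) and each row is compared only with the next n - 1 rows. The helper names and `max_dist = 2` are hypothetical. The example shows how a small window can miss a pair that a larger window finds.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute, each with cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def window_match(rows, max_dist, window):
    """Sort rows by name, then compare each row with the following window - 1 rows."""
    ordered = sorted(rows, key=lambda r: r[1])
    matches = []
    for i, (key1, name1) in enumerate(ordered):
        for key2, name2 in ordered[i + 1:i + window]:
            if edit_distance(name1, name2) < max_dist:
                matches.append((key1, key2))
    return matches


dirty_authors = [
    (861, "johann christoph freytag"),
    (822, "jc freytag"),
    (819, "j freytag"),
    (814, "j-c freytag"),
]
narrow = window_match(dirty_authors, max_dist=2, window=2)
wide = window_match(dirty_authors, max_dist=2, window=3)
print(narrow, wide)
```

With `window=2` the pair (819, 822) is missed because "j-c freytag" sorts between the two names; `window=3` recovers it at the cost of more comparisons.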

30 SAD Tagus 29 String distance filtering (maxDist = 1) [Figure: John Smith in S1 (length) is compared via editDistance only with the S2 strings of length - 1 (John Smit), length (Jogn Smith), and length + 1 (John Smithe)]
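The filtering idea rests on the fact that the edit distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ too much can be skipped without calling editDistance. The sketch below is one possible realization under assumptions: S2 is partitioned by length, the strict predicate `distance < max_dist` is used with `max_dist = 2`, and the string "Jonathan Smithson" is a made-up row added to show pruning.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute, each with cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def filtered_match(s1, s2, max_dist):
    """Only compare strings whose length difference is below max_dist."""
    by_length = defaultdict(list)
    for name in s2:
        by_length[len(name)].append(name)
    matches, calls = [], 0
    for a in s1:
        # |len(a) - len(b)| <= edit_distance(a, b), so other lengths cannot match.
        for length in range(len(a) - max_dist + 1, len(a) + max_dist):
            for b in by_length.get(length, []):
                calls += 1
                if edit_distance(a, b) < max_dist:
                    matches.append((a, b))
    return matches, calls


s1 = ["John Smith"]
s2 = ["John Smit", "Jogn Smith", "John Smithe", "Jonathan Smithson"]  # last row is hypothetical
matches, calls = filtered_match(s1, s2, max_dist=2)
print(matches, calls)
```

Here three distance computations are made instead of four; the saving grows with the spread of string lengths in the input.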

31 SAD Tagus 30 Annotation-based optimization: The user specifies types of optimization; the system suggests which algorithm to use. Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map= length; dist = abs %
INTO MatchAuthors

32 SAD Tagus 31 Contributions: An extensible data cleaning framework; A declarative language for logical operators; Efficient implementation of the match operator; → A debugger facility for tuning a data cleaning program application

33 SAD Tagus 32 Management of exceptions Problem: to mark tuples not handled by the cleaning criteria of an operator Solution: to specify the generation of exceptional tuples within a logical operator –exceptions are thrown by external functions –output constraints are violated

34 SAD Tagus 33 Example (1)
CREATE MAP ExtractionCities
FROM StandardizedDirtyData dd
LET city = extractCities(dd.paper, Cities),
{ SELECT dd.paperKey AS pubKey, city AS city
  INTO ExtractedCities
  CONSTRAINT NOT NULL city }
[Figure labels: Map, Extraction; tables StandardizedDirtyData (pubKey, paper), Cities, ExtractedCities (pubKey, city)]

35 SAD Tagus 34 Example (2) [Figure labels: ExtractionCities, Cities, ExtractedCities, StandardizedDirtyData exc]
StandardizedDirtyData exc: 4| ManyDifferentCities
StandardizedDirtyData: 4|y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park
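The exception mechanism can be sketched as a Map that routes each input tuple either to the output table or to an exception table. Everything below is an assumption for illustration: `extract_city` is a toy stand-in for the external function extractCities, the `cities` list is made up, and rows 5 and 6 are hypothetical. Only row 4 and the exception name ManyDifferentCities come from the slides.

```python
def extract_city(paper, cities):
    """Toy external function: raises when more than one known city occurs in the text."""
    found = [c for c in cities if c in paper]
    if len(found) > 1:
        raise ValueError("ManyDifferentCities")
    return found[0] if found else None


def map_with_exceptions(rows, cities):
    """Return (extracted, exceptions); a tuple lands in exactly one of the two."""
    extracted, exceptions = [], []
    for pub_key, paper in rows:
        try:
            city = extract_city(paper, cities)
        except ValueError as err:
            # Case 1: the external function threw an exception.
            exceptions.append((pub_key, str(err)))
            continue
        if city is None:
            # Case 2: the output constraint NOT NULL city is violated.
            exceptions.append((pub_key, "NOT NULL city violated"))
        else:
            extracted.append((pub_key, city))
    return extracted, exceptions


cities = ["madison", "college park", "miami beach"]  # hypothetical dictionary
rows = [
    (4, "y ioannidis r ng k shim and t sellis parametric query optimization technical "
        "report univ of wisconsin madison and univ of maryland college park"),
    (5, "making views self maintainable for data warehousing miami beach"),  # hypothetical
    (6, "a paper with no known city"),                                       # hypothetical
]
extracted, exceptions = map_with_exceptions(rows, cities)
print(extracted, exceptions)
```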

36 SAD Tagus 35 Debugger facility: Supports the (backward and forward) data derivation of tuples with respect to an operator, to debug exceptions. Supports the interactive data modification and the incremental execution of some logical operators

37 SAD Tagus 36 Backward/forward data derivation [Figure labels: KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, ExtractionCities, Cities, ExtractedCities, StandardizedDirtyData exc, BackwardDerivation, ForwardDerivation]
Tuples shown:
4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992
4| ManyDifferentCities
4|Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland
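One way to picture backward and forward derivation is to follow a tuple's key through the tables of the cleaning graph. The sketch below is a simplification under stated assumptions: the table names and tuple 4 come from the slides, but the edges in `DERIVED_FROM`, the underscore in `StandardizedDirtyData_exc`, and the premise that the key is preserved by every operator are all hypothetical. It does not describe how the AJAX debugger is actually implemented.

```python
# Assumed data flow: table -> tables it is derived from (edges are inferred, not from the slides).
DERIVED_FROM = {
    "StandardizedDirtyData": ["KeyDirtyData"],
    "ExtractedCities": ["StandardizedDirtyData"],
    "StandardizedDirtyData_exc": ["StandardizedDirtyData"],
}

TABLES = {
    "KeyDirtyData": {
        4: "Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. "
           "Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992"
    },
    "StandardizedDirtyData": {
        4: "y ioannidis r ng k shim and t sellis parametric query optimization technical "
           "report univ of wisconsin madison and univ of maryland college park"
    },
    "ExtractedCities": {},
    "StandardizedDirtyData_exc": {4: "ManyDifferentCities"},
}


def backward_derivation(table, key):
    """Walk upstream from `table`, collecting (table, tuple) pairs that carry `key`."""
    trail = []
    for parent in DERIVED_FROM.get(table, []):
        if key in TABLES.get(parent, {}):
            trail.append((parent, TABLES[parent][key]))
        trail.extend(backward_derivation(parent, key))
    return trail


def forward_derivation(table, key):
    """Walk downstream from `table`, collecting (table, tuple) pairs that carry `key`."""
    trail = []
    for child, parents in DERIVED_FROM.items():
        if table in parents:
            if key in TABLES.get(child, {}):
                trail.append((child, TABLES[child][key]))
            trail.extend(forward_derivation(child, key))
    return trail


back = [name for name, _ in backward_derivation("StandardizedDirtyData_exc", 4)]
fwd = [name for name, _ in forward_derivation("KeyDirtyData", 4)]
print(back, fwd)
```

Starting from the exception tuple, the backward walk leads to the original citation in KeyDirtyData, which is the tuple a user would then correct.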

38 SAD Tagus 37 Interactive data correction (1) [Figure labels: KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, ExtractionCities, Cities, ExtractedCities, StandardizedDirtyData exc]
Tuples shown:
4| ManyDifferentCities
4|Technical Report, Univ. Of Wisconsin and Univ. Of Maryland
4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992

39 SAD Tagus 38 Interactive data correction (2) [Figure labels: KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, ExtractionCities, Cities, ExtractedCities, incremental]
Tuples shown:
4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
4| Technical Report, Univ. Of Wisconsin
101| Technical Report, Univ. Of Maryland
4| Madison
101| College Park

40 SAD Tagus 39 AJAX Architecture

41 SAD Tagus 40 AJAX Demo

