Using AutoMed Metadata in Data Warehousing Environments Hao Fan, Alexandra Poulovassilis School of Computer Science & Information Systems, Birkbeck College, University of London ACM International Workshop on Data Warehousing and OLAP, 7th November 2003
Outline What is AutoMed? Creating AutoMed DW Metadata Using AutoMed DW Metadata Comparison of AutoMed and Conceptual Data Model (CDM) approaches Conclusion
What is AutoMed? HDM (Hypergraph Data Model): schemas consist of a set of Nodes, Edges and Constraints. Transformation pathways: add/extend, delete/contract, rename. IQL language (see the technical report on the AutoMed Intermediate Query Language).
A Data Integration Example 1. addRel ( >, >); 2. addAtt ( >, >); 3. addAtt ( >, gc sum [(d,s)|(i,j,s) >; (i',j',d) >; i=i'; j=j']); 4. delEdge ( >, >); 5. delNode ( >,[n|(d,n) >); 6. delNode ( >, >); 7. contractHierarchy ( >); 8. contractHierarchy ( >); 9. contractAtt ( >); 10. contractAtt ( >); 11. contractFact ( >); 12. contractAtt ( >); 13. contractDim ( >); 14. contractAtt ( >); 15. contractDim ( >);
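Step 3 above derives an attribute from a comprehension wrapped in `gc sum`. As a rough illustration only, the Python sketch below mimics that shape: join two collections on (i, j), emit (d, s) pairs, then group by d and sum s. The names `measures` and `lookup` and their sample tuples are hypothetical, since the scheme names are not legible in the slide, and `gc sum` is assumed here to group pairs by their first component and sum the second.

```python
from collections import defaultdict

def gc_sum(pairs):
    """Group (key, value) pairs by key and sum the values (assumed reading of `gc sum`)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical extents: (i, j, s) triples and (i', j', d) triples.
measures = [(1, 1, 10), (1, 2, 5), (2, 1, 7)]
lookup = [(1, 1, "a"), (1, 2, "a"), (2, 1, "b")]

# [(d, s) | (i, j, s) <- measures; (i', j', d) <- lookup; i = i'; j = j']
joined = [(d, s)
          for (i, j, s) in measures
          for (i2, j2, d) in lookup
          if i == i2 and j == j2]

derived_extent = gc_sum(joined)
```

The comprehension maps almost one-to-one onto the IQL generators and filters, which is why a list comprehension was chosen over an explicit loop.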
Data Transformation/Integration
Creating AutoMed Metadata Create AutoMed metadata repository: any DBMS supporting JDBC. Specify data models: all data models used in DW schemas, e.g., RDB, XML, Multi-Dim, etc. Extract data source schemas. Define transformation pathways: manually or automatically.
Creating AutoMed Transformation Pathways AutoMed transformation pathways can be used for the following data warehousing activities: 1) Transforming 2) Single-source cleansing 3) Multi-source cleansing 4) Integration 5) Summarizing 6) Creating data marts
Data Cleansing The general pathway used for data cleansing: adds a new construct `temp to the schema, whose extent consists of clean data; contracts the dirty construct, C, which is being cleaned; adds a new construct, C, derived from the data in `temp; deletes or contracts the `temp construct.
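The four steps of the general cleansing pathway can be sketched in Python as follows. This is not AutoMed's API: the dictionary-based schema, the function name `cleanse`, the construct name `temp` and the sample data are all assumptions made for illustration.

```python
def cleanse(schema, construct, clean_fn):
    """Apply the four-step cleansing pathway to one construct.

    `schema` maps construct names to extents (lists of tuples); this
    representation is an illustrative assumption, not AutoMed's own.
    Returns the new schema and a log of the steps taken.
    """
    schema = dict(schema)  # work on a copy, leave the input untouched
    steps = []

    # 1. add a temporary construct whose extent is the clean data
    schema["temp"] = [clean_fn(t) for t in schema[construct]]
    steps.append(("add", "temp"))

    # 2. contract the dirty construct
    del schema[construct]
    steps.append(("contract", construct))

    # 3. add the construct again, derived from the temporary one
    schema[construct] = list(schema["temp"])
    steps.append(("add", construct))

    # 4. delete the temporary construct
    del schema["temp"]
    steps.append(("delete", "temp"))

    return schema, steps

# Hypothetical dirty data: (id, city) pairs with inconsistent formatting.
dirty = {"city": [(1, " london "), (2, "PARIS")]}
cleaned, pathway = cleanse(dirty, "city", lambda t: (t[0], t[1].strip().title()))
```

Recording each step in `pathway` mirrors the idea that the cleansing activity is itself kept as metadata rather than applied and forgotten.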
Single-source Cleansing Person (id, name, address, zip, city, country) addRel ( >, toolCall 'QuickAddress Batch' ' >' ' ' ' >'); contractAtt ( >); addAtt ( >, [(i,z)|(i,a,z) >]); addAtt ( >, [(i,a)|(i,a,z) >]); deleteRel ( >, [(i,a,z)|(i,a) >; (i',z) >;i=i']);
Multi-source Cleansing Person (id, maritalStatus) Emp (id, name, maritalStatus) addAtt ( >, >-- [(i,s)|(i,s) >; (i',s') >; i = i'; not (s = s')]); contractAtt ( >); renameAtt ( >, >);
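The query in the addAtt step above subtracts (`--`) a comprehension that selects (i, s) pairs whose id also appears in another extent with a different status. A minimal Python sketch of that logic follows, assuming the two operands are the maritalStatus attributes of Person and Emp (the scheme names are not legible in the slide) and treating `--` as list difference; the sample data is invented.

```python
# Hypothetical extents of (id, maritalStatus) pairs from the two sources.
person_status = [(1, "single"), (2, "married"), (3, "single")]
emp_status = [(1, "single"), (2, "divorced")]

# [(i, s) | (i, s) <- person_status; (i', s') <- emp_status; i = i'; not (s = s')]
conflicts = [(i, s)
             for (i, s) in person_status
             for (i2, s2) in emp_status
             if i == i2 and s != s2]

# person_status -- conflicts: keep only the pairs that are not in conflict.
consistent = [pair for pair in person_status if pair not in conflicts]
```

Under these assumptions, ids that appear in only one source are kept, and only values contradicted by the other source are dropped.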
Using AutoMed Metadata Incremental View Maintenance Data Lineage Tracing
Using AutoMed Metadata for IVM (Incremental View Maintenance) [Figure: transformation pathway TP = tp_1, …, tp_r, with diagram labels S, GS, D and V] See H. Fan. Incremental view maintenance and data lineage tracing in heterogeneous database environments. In Proc. BNCOD02 PhD Summer School, Sheffield, 2002.
Using AutoMed Metadata for DLT (Data Lineage Tracing) Algorithms: Fully Materialized Pathway, Fully Virtual Pathway, Partially Materialized Pathway. Data Lineage: Affect-Pool, Origin-Pool. DLT formulae: q s AP (t), q s OP (t). See H. Fan and A. Poulovassilis. Tracing data lineage using schema transformation pathways. In Knowledge Transformation for the Semantic Web, IOS Press, 2003.
AutoMed vs. CDM approach
Discussion Conceptual Data Model: semantic mismatches; tightly coupled with the CDM; not straightforward to reuse the integration effort if a source schema is changed. AutoMed: no semantic mismatch; possible to extend data warehouse views into a different data model; easy to reuse the transformation and integration efforts if a source schema is changed (see Section 5 of the paper).
Conclusion AutoMed metadata can be used for expressing data warehousing activities, including data cleansing; AutoMed metadata can be used for incrementally maintaining the DW data and for data lineage tracing; compared with CDM, AutoMed has several advantages; in contrast to commercial ETL tools, AutoMed metadata provides sufficient information for IVM and DLT. Limitations: not all data warehouse metadata can be captured by AutoMed; currently, transformation pathways are created manually, but we are investigating automatic/semi-automatic generation techniques.
Acknowledgement Thank you!