DATA-DRIVEN UNDERSTANDING AND REFINEMENT OF SCHEMA MAPPINGS Data Integration and Service Computing ITCS 6010
INTRODUCTION
USER
– Finding the correct mappings for an application is difficult
– Schema mappings are complex, and the subtleties involved must be communicated effectively
– Understanding source data is difficult, hence the need for a facility for schema and data exploration
– Mappings are complex, and the differences between alternative mappings are subtle
– Users must reason about complex non-associative operators
– The volume of data is increasing, and so is the need to integrate data from multiple sources
– Integration requires mappings between these schemas
– But some issues still need to be addressed
ILLUSTRATIONS
“The ultimate goal of schema mapping is not building correct queries but extracting correct data from the source to populate the target schema.”
– The user is expected to have a thorough understanding of the data
– The user is expected to debug complex SQL queries or procedural transformations
– Clio makes this easier
ILLUSTRATIONS Source: Ling Ling Yan, Renée J. Miller, Laura M. Haas, and Ronald Fagin. Data-driven understanding and refinement of schema mappings. SIGMOD Rec. 30, 2 (May 2001), DOI= /
MAPPINGS
A mapping is a query on the source schema that produces a subset of a target relation.
Mapping involves three main activities:
– Determining correspondences
– Data linking
– Data trimming
– Let 𝒜 be a set of attributes, with A ∈ 𝒜
– A relation on schema S is a named, finite set of tuples on S
– t[A] ∈ dom(A) is the value of tuple t on attribute A
– Assumption: relations in the source database do not contain any tuple that is null on any attribute
A predicate P over schema S maps tuples on S to true or false
– Join predicate
– Selection predicate
A predicate is strong if it evaluates to false for every tuple that is null for all attributes in S.
– A join predicate is a strong predicate
– A selection predicate is not required to be strong
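The strong-predicate condition can be illustrated with a small Python sketch. It is not taken from the paper: tuples are modeled as dictionaries, null as `None`, and the relation and attribute names (`Children.mid`, `Parents.id`, `Children.age`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A tuple is modeled as a dict from attribute name to value; None stands for null.
Row = Dict[str, Optional[object]]


def join_predicate(t: Row) -> bool:
    """Hypothetical join predicate: Children.mid = Parents.id (false if either side is null)."""
    a, b = t["Children.mid"], t["Parents.id"]
    return a is not None and b is not None and a == b


def selection_predicate(t: Row) -> bool:
    """Hypothetical selection predicate that accepts a null age, so it is not strong."""
    age = t["Children.age"]
    return age is None or age < 7


def false_on_all_null(pred: Callable[[Row], bool], attrs: List[str]) -> bool:
    """Evaluate pred on the tuple that is null on every attribute in attrs."""
    return pred({a: None for a in attrs}) is False


join_is_strong = false_on_all_null(join_predicate, ["Children.mid", "Parents.id"])
selection_is_strong = false_on_all_null(selection_predicate, ["Children.age"])
```

Under these assumptions the join predicate passes the check and the selection predicate does not, which matches the statement that only join predicates are required to be strong.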
Correspondence to Target
– Specifies which attributes contribute to the target relation and how their values should appear in it
– E.g.: Kids.FamilyIncome = parents.salary + parents2.salary (ref
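A correspondence like the one above can be read as a function from source tuples to a target value. The sketch below assumes tuples are dictionaries and that a null salary makes the sum null (as SQL arithmetic would); both choices are assumptions for illustration, not something the slides specify.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[int]]


def family_income(parents: Row, parents2: Row) -> Optional[int]:
    """Kids.FamilyIncome = parents.salary + parents2.salary.

    Assumption: if either salary is null (None), the result is null,
    mirroring SQL's null-propagating arithmetic.
    """
    s1, s2 = parents.get("salary"), parents2.get("salary")
    if s1 is None or s2 is None:
        return None
    return s1 + s2


income = family_income({"salary": 50000}, {"salary": 30000})
```

Here `parents` and `parents2` stand for two copies of the same source relation, one per parent, which is why the correspondence names the relation twice.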
DATA LINKING
DATA TRIMMING
– Not all tuples produced by query graph G may be semantically meaningful
– Data associations in some categories may be too incomplete to include
– The user may decide to exclude some categories because their coverage is incomplete
MAPPING DEFINITION
A mapping defines the relationship between a target relation and a set of source relations. It has three main components:
– A query graph G
– A set V of value correspondences
– Two sets of filters, C_S and C_T, defining the conditions that source and target tuples should satisfy
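The three components can be sketched as a small Python data structure. This is an illustrative model only, not Clio's implementation: the class name `Mapping`, its fields, and the example relations are hypothetical, and the data associations produced by the query graph are passed in precomputed rather than derived from the graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]  # None stands for null


@dataclass
class Mapping:
    nodes: List[str]                      # query graph G: source relations
    edges: List[Tuple[str, str, str]]     # query graph G: (relation, relation, join predicate)
    correspondences: Dict[str, Callable[[Row], Optional[object]]]  # V
    source_filters: List[Callable[[Row], bool]] = field(default_factory=list)  # C_S
    target_filters: List[Callable[[Row], bool]] = field(default_factory=list)  # C_T

    def apply(self, associations: List[Row]) -> List[Row]:
        """Turn data associations into target tuples, honoring both filter sets."""
        out = []
        for a in associations:
            if not all(f(a) for f in self.source_filters):
                continue
            t = {attr: fn(a) for attr, fn in self.correspondences.items()}
            if all(f(t) for f in self.target_filters):
                out.append(t)
        return out


mapping = Mapping(
    nodes=["Children", "Parents"],
    edges=[("Children", "Parents", "Children.mid = Parents.id")],
    correspondences={
        "ID": lambda a: a["Children.id"],
        "FamilyIncome": lambda a: a["Parents.salary"],
    },
    source_filters=[lambda a: a["Children.age"] is not None and a["Children.age"] < 7],
    target_filters=[lambda t: t["FamilyIncome"] is not None],
)

associations = [
    {"Children.id": 1, "Children.age": 5, "Parents.salary": 40000},
    {"Children.id": 2, "Children.age": 9, "Parents.salary": 50000},  # fails C_S
    {"Children.id": 3, "Children.age": 4, "Parents.salary": None},   # fails C_T
]
result = mapping.apply(associations)
```

Only the first association survives both filter sets, so `result` holds a single target tuple.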
MAPPING EXAMPLES
– A positive example shows how source tuples are combined and successfully contribute to the target relation
– A negative example shows source tuples that are combined correctly but fail to contribute to the target relation
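One way to picture the distinction is to split correctly combined source tuples by whether they pass the mapping's filters. The sketch below assumes that failing a filter is what keeps an association out of the target; the function name and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def split_examples(
    associations: List[Row], satisfies_filters: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Return (positive, negative) examples for a mapping.

    Assumption: an association is positive when it passes the mapping's
    filters and negative when it is well-formed but filtered out.
    """
    positive = [a for a in associations if satisfies_filters(a)]
    negative = [a for a in associations if not satisfies_filters(a)]
    return positive, negative


associations = [
    {"Children.id": 1, "Parents.salary": 40000},
    {"Children.id": 2, "Parents.salary": None},  # child with no known parent salary
]
positive, negative = split_examples(
    associations, lambda a: a["Parents.salary"] is not None
)
```

Showing both lists side by side lets a user see what a filter excludes as well as what it keeps.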
MAPPING OPERATORS
– Correspondence operators: permit users to change value correspondences
– Data trimming operators: modify the source and target filters of a mapping; they do not change the query graph of a mapping
– Data linking operators: directly change the query graph of a mapping. They are of two types:
  – Data walk
  – Data chase
DATA WALK
In a data walk, the user knows where the missing data resides in the source, or more specifically, which source relation(s) contain this data. A data walk makes use of Clio’s knowledge of the source schema (which is gathered from schema and constraint definitions and from mining the source data, views, stored queries and metadata).
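A data walk can be pictured as path enumeration over the known join conditions of the source schema. The sketch below is an assumption-laden illustration, not Clio's algorithm: the schema knowledge is a hand-written adjacency list, and the relation names, join conditions, and `max_len` bound are hypothetical.

```python
from typing import Dict, List, Tuple

# Schema knowledge: relation -> list of (neighbor relation, join condition).
SchemaGraph = Dict[str, List[Tuple[str, str]]]
Path = List[Tuple[str, str, str]]


def data_walk(schema: SchemaGraph, start: str, goal: str, max_len: int = 3) -> List[Path]:
    """Enumerate simple join paths of at most max_len edges from start to goal."""
    paths: List[Path] = []

    def extend(node: str, visited: frozenset, path: Path) -> None:
        if node == goal:
            paths.append(path)
            return
        if len(path) == max_len:
            return
        for nxt, cond in schema.get(node, []):
            if nxt not in visited:
                extend(nxt, visited | {nxt}, path + [(node, nxt, cond)])

    extend(start, frozenset({start}), [])
    return paths


schema: SchemaGraph = {
    "Children": [
        ("Parents", "Children.mid = Parents.id"),
        ("Parents", "Children.fid = Parents.id"),
    ],
    "Parents": [("PhoneDir", "Parents.id = PhoneDir.id")],
}
walks = data_walk(schema, "Children", "PhoneDir")
```

Each returned path is a candidate extension of the query graph; here there are two, one through the mother and one through the father, and the user would pick between them by inspecting examples.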
DATA CHASE
In a data chase, the user does not know where the missing data resides and may not know which relations to include in the extended query graph. The chase permits the user to explore the source data incrementally to locate the desired data.
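One plausible reading of a single chase step is: take a value seen in an example and look for every place it occurs in the source. The sketch below assumes that reading; the in-memory database layout and all relation and attribute names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Database = Dict[str, List[Row]]  # relation name -> rows


def chase_value(database: Database, value: str) -> List[Tuple[str, str]]:
    """Return the sorted (relation, attribute) pairs where value occurs."""
    hits = set()
    for relation, rows in database.items():
        for row in rows:
            for attr, v in row.items():
                if v == value:
                    hits.add((relation, attr))
    return sorted(hits)


database: Database = {
    "Children": [{"id": "002", "name": "Maya"}],
    "Parents": [{"id": "201", "name": "Ann"}],
    "SBPS": [{"ID": "002", "time": "8:15"}],
}
occurrences = chase_value(database, "002")
```

Each occurrence outside the current query graph suggests a relation the user might link in next, which is what makes the exploration incremental.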
CLIO FOR LARGE MAPPINGS
– Clio must manage and manipulate multiple (possible) mappings while the user explores the data, creates new correspondences and extends the query graph
– The more complex the relationship between source and target, the more (possible) mappings must be handled
– Large schemas are a source of complexity
– Large volumes of data need to be transformed
– With unfamiliar data sources, the amount of data itself might be an obstacle to mapping
CLIO MAPPING FRAMEWORK
Clio provides:
– Target Viewer: gives a “What You See Is What You Get” flavor to the mapping
– Source Viewer: serves as a palette from which users can choose the relations with which they want to work or explicitly select an edge to follow; provides a visualization of the query graph being constructed
– A set of workspaces, each associated with a single mapping alternative
COMPLEX MAPPINGS
– Many of the single-target mappings a user creates will have a great deal of overlap, differing only in a few correspondences or a small portion of the query graph
– The decisions made in creating one mapping can be stored and made available to the user in order to reduce the burden and overhead of re-creating the bulk of each mapping from scratch
CLIO FOR COMPLEX MAPPINGS Clio automatically computes both possible mappings and the user can accept one or several, adding filters as needed. Clio’s rich framework supports the user in specifying complex target mappings.
SUMMARY
– The paper presents a new framework that uses examples drawn from source data to illustrate complex schema mappings
– It provides formal definitions of mappings, mapping examples and mapping operators, and shows how they can be used to help a user understand the data and develop mappings
QUESTIONS?