INTEGRATION USING UNITY
Ramon Lawrence, University of Iowa
Ken Barker, University of Calgary
Summary
The Unity prototype tackles the schema integration problem by constructing an integrated, global view bottom-up.
  - Constructing a global view in this manner requires describing data source semantics using a dictionary and an XML-based language.
The extraction process, which is semi-automatic, is separated from the integration process.
  - Thus, the integration process is automatic, and no global human integrator is required.
Systematic naming using a dictionary allows global queries to be constructed graphically without specifying joins between global relations.
  - The global view produced has properties similar to those of a dynamically constructed Universal Relation.
Benefits and Contributions
The architecture automatically integrates relational schemas into a global view for querying.
Unique contributions:
  - Synthesizing a global view bottom-up instead of top-down improves integration scalability.
  - Organizing the global view as a hierarchy of concepts instead of relations or predicates simplifies querying, because the user does not have to name specific relations or join conditions. This is called Querying by Context (QBC).
  - Query processing is achieved by dynamically discovering extraction rules based on the naming of fields and tables.
      - The discovered rules are similar to the extraction rules of global-as-view (GAV) systems.
Unity Overview
Unity is a software package that performs bottom-up integration with a GUI.
  - Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC).
Unity allows the user to:
  - Construct and modify standard dictionaries.
  - Build X-Specs to describe data sources, including extracting metadata using ODBC and mapping system names to dictionary terms.
  - Integrate X-Specs into an integrated view.
  - Transparently query integrated systems using ODBC, with SQL queries generated automatically.
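The metadata extraction and name-mapping step can be sketched as follows. This is a minimal illustration, not Unity's implementation: Unity is a C++/MFC tool that reads metadata through ODBC, whereas this sketch uses Python's built-in sqlite3 catalog as a stand-in. The table and column names, the `[Order] Id` naming notation, and the helpers `extract_schema` and `describe` are all hypothetical.

```python
import sqlite3

def extract_schema(conn):
    """Return {table name: [column names]} read from the database catalog."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        schema[table] = [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
    return schema

def describe(schema, term_map):
    """Attach a dictionary-based name to each system name.

    Unmapped names stay None so that a person can fill them in,
    which is the semi-automatic part of the capture step.
    """
    return {
        (table, column): term_map.get((table, column))
        for table, columns in schema.items()
        for column in columns
    }

# Hypothetical legacy table with cryptic system names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORD_TBL (ONUM INTEGER PRIMARY KEY, CNAME TEXT)")

schema = extract_schema(conn)
described = describe(schema, {("ORD_TBL", "ONUM"): "[Order] Id"})
```

Here `CNAME` has no dictionary-based name yet, so it maps to `None` and is left for the person describing the source.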
Architecture Components
The architecture consists of four components:
  - A standard dictionary (SD) to capture data semantics
      - SD terms are used to build semantic names that describe the semantics of schema elements.
  - X-Specs for storing data source descriptions
      - Relational database information is stored and transmitted using XML.
      - Stores semantic names to describe schema elements.
  - Integration Algorithm
      - Identical concepts in different databases are identified by similar semantic names.
      - Produces an integrated view of all database concepts.
  - Query Processor
      - Allows the user to formulate queries on the view.
      - Translates from semantic names in the integrated view to SQL queries, then integrates and formats the results.
          - This involves determining the correct field and table mappings and discovering join conditions and join paths.
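To make the idea of an XML source description concrete, here is a small sketch that builds one in Python. The element and attribute names (`xspec`, `table`, `field`, `sysName`, `semName`) and the bracketed semantic-name notation are assumptions made for illustration; the slides do not show the actual X-Spec format.

```python
import xml.etree.ElementTree as ET

def build_xspec(database, tables):
    """Build a hypothetical X-Spec-like XML description of one data source.

    `tables` maps a system table name to a pair:
    (semantic name of the table, {system field name: semantic name}).
    """
    root = ET.Element("xspec", {"database": database})
    for table_name, (table_sem, fields) in tables.items():
        table_el = ET.SubElement(
            root, "table", {"sysName": table_name, "semName": table_sem})
        for field_name, field_sem in fields.items():
            ET.SubElement(
                table_el, "field", {"sysName": field_name, "semName": field_sem})
    return root

# Hypothetical order database: system names on the left, semantic names on the right.
orders_spec = build_xspec(
    "OrderDB",
    {"ORD_TBL": ("[Order]", {"ONUM": "[Order] Id",
                             "CNAME": "[Order;Customer] Name"})},
)
xml_text = ET.tostring(orders_spec, encoding="unicode")
```

The point of the sketch is only that each schema element carries both its system name and a semantic name built from dictionary terms, so that a later step can match elements across sources by semantic name alone.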
Querying by Context (QBC)
Querying by context (QBC) is a methodology for querying relational databases by semantics.
  - Querying is performed by selecting semantic names that represent query concepts from the integrated view.
  - The integrated context view contains all concepts present in the databases, referenced by semantic names.
Query by Context performs dynamic closure, relating concepts for the user as they browse the integrated view.
  - This allows a limited form of recursive queries and eliminates the need for the user to specify joins.
The query processor maps the user’s selections and criteria to an actual SQL query.
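The mapping from selected semantic names to SQL can be sketched for a single database as follows. This is one plausible approach under stated assumptions, not the algorithm Unity uses: the mappings, the join conditions, and the helpers `find_join_path` and `build_sql` are hypothetical, and the join path is found with a plain breadth-first search.

```python
from collections import deque

def find_join_path(joins, start, goal):
    """Breadth-first search for a chain of join conditions linking two tables."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for (left, right), condition in joins.items():
            if table in (left, right):
                nxt = right if table == left else left
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(nxt, condition)]))
    return None

def build_sql(selected, mappings, joins):
    """Translate selected semantic names into a SELECT with discovered joins."""
    columns = [mappings[name] for name in selected]
    tables = []
    for table, _ in columns:
        if table not in tables:
            tables.append(table)
    from_clause = tables[0]
    joined = {tables[0]}
    for table in tables[1:]:
        if table in joined:
            continue
        path = find_join_path(joins, tables[0], table)
        if path is None:
            raise ValueError(f"no join path to {table}")
        for nxt, condition in path:
            if nxt not in joined:
                from_clause += f" JOIN {nxt} ON {condition}"
                joined.add(nxt)
    select_list = ", ".join(f"{table}.{field}" for table, field in columns)
    return f"SELECT {select_list} FROM {from_clause}"

# Hypothetical mappings from semantic names to (table, field).
mappings = {
    "[Customer] Name": ("Customer", "name"),
    "[Order] Id": ("Orders", "id"),
    "[Product] Name": ("Product", "name"),
}
# Hypothetical join conditions between pairs of tables.
joins = {
    ("Customer", "Orders"): "Customer.id = Orders.cust_id",
    ("Orders", "OrderItem"): "Orders.id = OrderItem.order_id",
    ("OrderItem", "Product"): "OrderItem.prod_id = Product.id",
}

sql = build_sql(["[Customer] Name", "[Product] Name"], mappings, joins)
```

The user picks only two concepts, yet the generated query passes through `Orders` and `OrderItem`, tables the user never named.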
References
Publications:
  - Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, Jan
  - Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages , Oct
  - Integrating Relational Database Schemas using a Standardized Dictionary, SAC’ ACM Symposium on Applied Computing, pages , March
  - Querying Relational Databases without Explicit Joins, DASWIS International Workshop on Data Semantics in Web Information Systems (with ER'2001), Nov
Further Information:
Integration Example
BodyWorks Systems
[Figure labels: Web Server, Custom Accounting Package, Shipment Tracking Software; Customer Order Database, Invoice Database, Shipment Database]
BodyWorks is a fictional company with 3 legacy databases that must be integrated for management reporting.
[Figure labels: Query-Driven Data Extraction; Invoice Database, Order Database, Shipment Database; Unity Software; ODBC Querying; Integrated Context View; Query Processor and ODBC Manager; X-Spec Editor; Standard Dictionary; Integration Algorithm]
Integration Processes
Integration is performed with 3 separate processes:
  - Capture process: independently extracts database schema information into an XML document called an X-Spec.
      - This process is a semi-automatic description using a dictionary.
  - Integration process: combines X-Specs into a structurally neutral hierarchy of database concepts called an integrated context view.
      - This process performs automatic name matching, but imprecision may occur.
  - Query process: allows the user to formulate queries on the integrated view. The query processor maps them to structural queries (SQL), executes them using ODBC, and combines the results using global keys.
      - Users do not have to specify joins when querying the global view.
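The last step of the query process, combining per-source results using global keys, can be sketched as follows. This is an illustration only: the slides do not say how Unity merges rows, so the sketch assumes behavior like a full outer join on one key, and the helper `combine_by_key`, the column names, and the sample rows are hypothetical.

```python
def combine_by_key(key, *result_sets):
    """Merge rows from several sources.

    Rows that share the same value for `key` become a single row;
    rows that appear in only one source are kept as they are.
    """
    merged = {}
    for rows in result_sets:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

# Hypothetical results returned by two different databases.
order_rows = [
    {"[Order] Id": 10, "[Order;Customer] Name": "Ada"},
]
invoice_rows = [
    {"[Order] Id": 10, "[Invoice] Total": 99.5},
    {"[Order] Id": 11, "[Invoice] Total": 12.0},
]

combined = combine_by_key("[Order] Id", order_rows, invoice_rows)
```

Order 10 appears in both sources and ends up as one row with the customer name and the invoice total; order 11 appears only in the invoice source and keeps just its own columns.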
The Unity Prototype
What is the open problem?
The GAV and LAV approaches are both viable methods for solving data integration. However, the open problem is that neither approach performs schema integration, that is, the construction of the global view (GV) itself.
  - GAV: the GV is constructed (schema integration is performed) by a global designer when specifying extraction rules.
  - LAV: the GV is pre-defined using some previous integration process (most likely manual in nature).
  - Both methods rely on the concept of a global user to create the global schema.
How Unity is Different
Our integration architecture, called Unity, is different because it approaches the integration problem from a different perspective:
How can we automate, or semi-automate, the construction of the global view by extracting information from the local data sources?
Thus, the integration problem is tackled from a different set of starting assumptions:
  - Do not assume a pre-existing or manually created GV.
  - However, assume we have a dictionary and a language for describing schema and data element semantics.
  - Attempt to automatically build a GV from the descriptions of each data source.
The Unity Approach
Given a set of data sources, a dictionary, and a language to describe data semantics:
  - 1) Semi-automatically extract and represent data source semantics in the language using the dictionary.
  - 2) Automatically match concepts across data sources by using the dictionary to determine related concepts.
      - This process effectively builds the global-level relations or objects that other approaches assume or create up front.
      - However, since there is no manual intervention, the precision of global view construction is affected by inconsistencies in the descriptions of the data sources and in the matching of concepts.
  - 3) Automatically generate queries that the user specifies using dictionary terms (not structures), and map the user's query to the appropriate data elements in the local sources.
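Step 2 can be sketched as a merge of source descriptions into one concept hierarchy. This is a simplified illustration, not Unity's integration algorithm: it matches names by exact equality of dictionary terms, represents a semantic name as a tuple of terms, and uses hypothetical database, table, and field names.

```python
def integrate(specs):
    """Merge source descriptions into one concept hierarchy.

    `specs` maps database name -> {(table, field): semantic name as a tuple of terms}.
    Each node in the result holds its child concepts and the source
    fields that were described with exactly that semantic name.
    """
    root = {"children": {}, "sources": []}
    for db, mappings in specs.items():
        for (table, field), terms in mappings.items():
            node = root
            for term in terms:
                node = node["children"].setdefault(
                    term, {"children": {}, "sources": []})
            node["sources"].append((db, table, field))
    return root

# Hypothetical descriptions of two legacy databases.
specs = {
    "OrderDB": {
        ("ORD_TBL", "ONUM"): ("Order", "Id"),
        ("ORD_TBL", "CNAME"): ("Order", "Customer", "Name"),
    },
    "InvoiceDB": {
        ("INV", "ORDER_NO"): ("Order", "Id"),
        ("INV", "TOTAL"): ("Invoice", "Total"),
    },
}

view = integrate(specs)
order_id_sources = view["children"]["Order"]["children"]["Id"]["sources"]
```

Both databases described their order number as `("Order", "Id")`, so the two fields land on the same node. Had one source been described as `("Order", "Number")` instead, the fields would sit on separate nodes, which is the kind of imprecision the slide warns about.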
What is wrong with SQL?
There is nothing wrong with SQL. However, SQL is not a simple query language, for several reasons:
  - Querying by structure does not hide the complexities introduced by database normalization.
  - Structures (fields and tables) may be assigned poor names that do not adequately describe their semantics.
  - The notion of a “join” is confusing for beginners, especially when multiple joins are present.
  - SQL forces structural access, which does not provide logical query transparency and restricts logical schema evolution.
  - Querying multiple databases (without a global view) using SQL variants is complex because naming and structural conflicts must be resolved during query formulation.
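A small runnable example shows the first three points together. The schema below is hypothetical and uses Python's built-in sqlite3 module: a normalized design with terse names forces the user to know three tables and two join conditions just to ask which customer bought which product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust   (cid INTEGER PRIMARY KEY, nm TEXT);
CREATE TABLE orders (oid INTEGER PRIMARY KEY, cid INTEGER REFERENCES cust(cid));
CREATE TABLE item   (oid INTEGER REFERENCES orders(oid), pname TEXT);
INSERT INTO cust   VALUES (1, 'Ada');
INSERT INTO orders VALUES (10, 1);
INSERT INTO item   VALUES (10, 'Widget');
""")

# The user wants "customer name" and "product name", but must spell out
# the structure: which tables hold them and how those tables connect.
rows = conn.execute("""
    SELECT cust.nm, item.pname
    FROM cust
    JOIN orders ON orders.cid = cust.cid
    JOIN item   ON item.oid = orders.oid
""").fetchall()
```

Nothing in the names `nm` or `pname` tells a newcomer what they mean, and the `orders` table appears in the query only because normalization put it between the two concepts of interest.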