Data integration mediation system “… The mountain is a mountain, The mountain is not a mountain, The mountain is a mountain.” Presented by Taras Mahlin Heterogeneous reasoning and mediator system
Problem: Mountain is not a mountain The past few decades have witnessed a spectacular explosion in the quantity of data available in one electronic form or another. This vast quantity of data has been gathered, organized, and stored by a small army of individuals, working for different organizations on varied problems.
Solution: Mountain is a mountain Synergetic approach: the whole is much more than all its components together. Integration of disparate data sources by pooling fragmented data together, resolving data conflicts, and transforming the data into information objects. All this happens while users continue to use existing systems for the routine functions of add, change, and delete.
Integrating Heterogeneous Data Sources Advantages: Integration alleviates the burden of duplicating data-gathering efforts. Synergetic effect: it enables the extraction of information that would otherwise be impossible. For example: –Law enforcement agencies (Interpol) –Insurance companies –Medical researchers and epidemiologists
Integrating Heterogeneous Reasoning Paradigms Alongside the ability to integrate a variety of data sources comes the need to integrate diverse forms of reasoning. Access to such reasoning systems gives mediators sophisticated abilities to extract and produce new information from existing data. For example: –Problem of terrain reasoning: determining where resources can be physically situated. Integrating multiple forms of reasoning that may include logical inference, numerical optimization, planning, pattern recognition, scheduling, and learning.
Mediator technology Seamless integration of information located across multiple, heterogeneous computer platforms and recorded in multiple, heterogeneous electronic formats. –relational database management systems, –other non-relational database management systems, –flat files, text files etc. Mediator technology defines a structure and architecture that allows software applications to be independent of the underlying data resources.
Mediator technology cont. Mediators provide: –Intelligence for understanding, selecting, accessing, merging, and manipulating data. –New level of knowledge. –Consistent responses to questions regardless of who asks the question. –Seamless integration of information from multiple existing sources without having to redesign existing databases (i.e., legacy data) or change existing operational systems.
Mediator technology - summary Mediators perform “mediation” between applications and databases. Mediators are software modules that occupy an explicit, active layer between an end-user application and the data sources the application is accessing. In this way, the mediator forms a distinct middle layer, making user applications independent of data sources. Mediators capture knowledge from the data experts so that the common user can find the information. Mediators do not create a new database. A mediator creates a “virtual” database that supplies data contained in the existing database(s). Mediators use existing databases and require no redesign or changes in these databases or existing operational systems. Mediators provide easy access to information. They support a heterogeneous computing environment (i.e., multiple hardware, software, and databases). They provide a cost-effective means to integrate data from heterogeneous information systems.
Mediators - goals and implementation The aim of the system is to develop a principled methodology for –integrating multiple data sources and –reasoning systems, –and to propose a mediator language within which access to the data sources and reasoning systems can be expressed uniformly. There are two important aspects to constructing a mediator: domain integration and semantic integration. –Domain integration is the physical linking of the data sources and reasoning systems. –Semantic integration is the coherent extraction and combination of the information provided by the data and reasoning sources, serving a given purpose.
Domain integration Goal: –Adding a new source of data or reasoning system to an existing mediated system (or one being developed). Requirements: –resources provided by the new system, whether new data, new representations of data, or a corpus of new reasoning algorithms, may be accessed by the various mediators –no recompilation of the whole system is needed –integrity of the system is preserved
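The requirements above can be illustrated with a small run-time registry. This is only a sketch under stated assumptions, not how any particular mediator is built: `SourceRegistry`, `register`, and `fetch` are hypothetical names, a source is modeled as a plain function from a query to a list of rows, and the employee ID shown is made up.

```python
from typing import Callable, Dict, List

# Assumption: a source is a function from a query dict to a list of row dicts.
Source = Callable[[dict], List[dict]]


class SourceRegistry:
    """Hypothetical registry: sources are plugged in at run time."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        # Preserve integrity: never silently overwrite an existing source.
        if name in self._sources:
            raise ValueError(f"source {name!r} is already registered")
        self._sources[name] = source

    def names(self) -> List[str]:
        return sorted(self._sources)

    def fetch(self, name: str, query: dict) -> List[dict]:
        return self._sources[name](query)


registry = SourceRegistry()
registry.register("personnel", lambda q: [{"emp_id": 1, "name": "Mark Smith"}])
# A new source is added later without touching or recompiling existing code.
registry.register("benefits", lambda q: [])
```

Any mediator holding a reference to `registry` can reach the new source by name as soon as it is registered.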
Semantic integration Semantic integration is the process of specifying methods –to resolve conflicts, –pool information together, –and define new, compositional operations based on existing operations in the individual data sources.
Data Integration and Mediation System DIMS is an implementation of “intelligent middleware” that resides between user applications and independent data sources. Data sources can reside on multiple, heterogeneous computer platforms and may be recorded in a variety of formats. DIMS creates a “virtual object database” so that the user application sees the data retrieved from the various sources as though it were returned from a single, integrated database.
System major functions DIMS performs five major functions: –query decomposition/routing –object unification and fusion –removal of data redundancies –identification/resolution of data inconsistencies –advanced data integration techniques Although DIMS performs query decomposition/routing to multiple, heterogeneous data sources, DIMS’s main advantage is its data instance integration functionality.
Query processing example Query: retrieve information about Employees and their associated Dependents. We assume that the Employee and Dependent information is spread across three disparate data sources: –Personnel database –Payroll database –Benefits database. The Employee information is distributed across the Personnel and Payroll databases. The Dependent information is contained in the Benefits database.
Query processing example cont. Initially, a single query for Employee and their associated Dependent(s) information is sent from a user application to DIMS.
Query processing example cont. First retrieve the Employee objects which meet the specified constraint. Based upon domain-specific knowledge, we know that the Personnel database can supply the Employee name and title information, whereas the Payroll database can supply the Employee name and salary information. In both cases, DIMS automatically “knows” to also retrieve the Employee ID which will be needed for later data integration functions.
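The decomposition step described here can be sketched as a lookup against a catalog of which source supplies which attributes. This is a minimal illustration, not the DIMS implementation: `CATALOG`, `KEY`, and `decompose` are hypothetical names, and `emp_id` is an assumed name for the Employee ID attribute.

```python
# Hypothetical catalog of domain knowledge: which source supplies which
# Employee attributes (as stated in the example).
CATALOG = {
    "personnel": {"name", "title"},
    "payroll": {"name", "salary"},
}
KEY = "emp_id"  # assumed attribute name for the Employee ID


def decompose(requested):
    """Split one request into per-source attribute lists.

    The key attribute is always added so that results can be joined later.
    """
    plan = {}
    for source, provided in CATALOG.items():
        wanted = sorted(provided & set(requested))
        if wanted:
            plan[source] = [KEY] + wanted
    return plan


plan = decompose(["name", "title", "salary"])
```

Here `plan` maps each source to the attributes it should be asked for, with the Employee ID included in both sub-queries.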
Query processing example cont. Tabular results are returned from the Personnel and Payroll databases to DIMS. Note that Mark Smith is returned only from the Personnel database and Jane Peterson is returned only from the Payroll database.
Query processing - data retrieving DIMS performs object unification based on the data returned from the data sources. Object unification is the combining of the data into object instances. Notice that the “Mark Smith” and the “Jane Peterson” objects have empty attributes since their information was returned from single sources with only partial information.
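Object unification can be sketched as grouping rows from all sources by Employee ID. The rows below are illustrative only: the IDs, titles, and salaries are made up, and `unify` is a hypothetical helper rather than a DIMS function. Each attribute keeps every (value, source) pair so that later steps can see where a value came from.

```python
from collections import defaultdict

# Illustrative rows: IDs, titles, and salaries are invented for this sketch.
personnel_rows = [
    {"emp_id": 1, "name": "Mark Smith", "title": "Engineer"},
    {"emp_id": 2, "name": "Tim Andrews", "title": "Analyst"},
]
payroll_rows = [
    {"emp_id": 2, "name": "Tim Andrews", "salary": 50000},
    {"emp_id": 3, "name": "Jane Peterson", "salary": 60000},
]


def unify(results, key="emp_id"):
    """Group rows from all sources by key, keeping (value, source) pairs."""
    objects = defaultdict(lambda: defaultdict(list))
    for source, rows in results.items():
        for row in rows:
            for attr, value in row.items():
                if attr != key:
                    objects[row[key]][attr].append((value, source))
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {k: dict(v) for k, v in objects.items()}


employees = unify({"personnel": personnel_rows, "payroll": payroll_rows})
```

In this sketch the “Mark Smith” object has no salary and the “Jane Peterson” object has no title, which corresponds to the empty attributes noted on the slide.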
Query processing - redundancy elimination Once the object instances have been created, DIMS removes any extraneous data redundancies. In this example, the “Tim Andrews” object has the same name listed twice. Assume that the domain object model specifies that each Employee object should have only one name attribute. The “Tim Andrews” object therefore has an extraneous, redundant name attribute that should not exist.
Query processing - redundancy elimination The system will automatically remove the second, redundant occurrence of the name for the “Tim Andrews” object.
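A minimal sketch of this step, assuming an attribute is held as a list of (value, source) pairs; `remove_redundancies` is a hypothetical helper, not a DIMS function. It keeps the first occurrence of each value and drops exact repeats.

```python
def remove_redundancies(values):
    """Drop repeated values, keeping the first occurrence and its source."""
    seen = set()
    kept = []
    for value, source in values:
        if value not in seen:
            seen.add(value)
            kept.append((value, source))
    return kept


# The same name was returned by both sources.
tim_names = [("Tim Andrews", "personnel"), ("Tim Andrews", "payroll")]
deduped = remove_redundancies(tim_names)
```

Values that differ are left alone by this step, so a genuine conflict between sources is still visible afterwards.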
Inconsistency resolving DIMS will then identify data inconsistencies within the objects. It can also provide resolutions to these data inconsistencies. In this example, there is a data inconsistency in the “Sarah Jones/Kaiser” object because –the Personnel database returned the name as “Sarah Jones” –whereas the Payroll database returned the name as “Sarah Kaiser” for the same Employee ID.
Inconsistency resolving cont. DIMS identifies the data inconsistency. DIMS will then flag the identified inconsistencies within an object. DIMS can also provide the source information associated with each data inconsistency to allow further automated and/or manual inconsistency handling.
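Flagging can be sketched as a check on attributes that the domain model declares single-valued. This is an illustration under assumptions: `find_inconsistencies` and the `single_valued` parameter are hypothetical, and each attribute is assumed to hold (value, source) pairs so that the source information travels with the flag.

```python
def find_inconsistencies(obj, single_valued=("name",)):
    """Return {attr: [(value, source), ...]} for each single-valued
    attribute that holds more than one distinct value."""
    flagged = {}
    for attr in single_valued:
        values = obj.get(attr, [])
        if len({value for value, _ in values}) > 1:
            flagged[attr] = values
    return flagged


sarah = {"name": [("Sarah Jones", "personnel"), ("Sarah Kaiser", "payroll")]}
flags = find_inconsistencies(sarah)
```

Because each flagged value keeps its source, a later automated rule or a human reviewer can decide which one to trust.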
Data inconsistency rules Data inconsistency rules can be defined in DIMS for a specific domain. DIMS uses a rules-based expert system to apply the rules over the data. In this example, assume a rule has been defined that says: if there is an inconsistency in an employee’s name attribute, use the data from the Payroll database.
Data inconsistency rules cont. Based upon the example’s rule, DIMS will remove the “Sarah Jones” name that came from the Personnel database from the “Sarah Jones/Kaiser” object.
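The example rule can be approximated with a plain lookup table. This is a deliberately simplified stand-in for a rules-based expert system, not the DIMS rule engine: `RULES` and `resolve` are hypothetical names, and attributes are assumed to hold (value, source) pairs.

```python
# Hypothetical rule table: attribute -> source to trust when values conflict.
RULES = {"name": "payroll"}


def resolve(obj, rules=RULES):
    """Apply 'prefer this source' rules to conflicting attributes."""
    resolved = {}
    for attr, values in obj.items():
        distinct = {value for value, _ in values}
        preferred = rules.get(attr)
        if len(distinct) > 1 and preferred is not None:
            trusted = [(v, s) for v, s in values if s == preferred]
            # If the trusted source supplied nothing, leave the conflict as is.
            resolved[attr] = trusted or values
        else:
            resolved[attr] = values
    return resolved


sarah = {"name": [("Sarah Jones", "personnel"), ("Sarah Kaiser", "payroll")]}
clean = resolve(sarah)
```

Attributes without a rule, or without a conflict, pass through unchanged.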
Getting dependent information After the Employee objects have been integrated, DIMS sends another query for the Dependent information associated with each of these Employees. This example assumes that only the Benefits database contains Dependent information. Based on the domain-specific knowledge, DIMS “knows” that each Employee object is associated with its Dependent object(s) via the Employee ID attribute. DIMS therefore uses this information to constrain the new query.
Getting dependent information cont. The Dependent information is returned from the Benefits database.
Object unification DIMS again performs object unification on the new results. However, instead of making totally independent objects for the Dependents, DIMS integrates the Dependent objects with the appropriate Employee objects. Since the Dependent objects contain neither redundant data nor data inconsistencies, no further processing is needed on the Dependent information.
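Constraining the follow-up query by Employee ID and nesting the results can be sketched as follows. All data here is invented for illustration (IDs and dependent names are placeholders), and `dependent_constraint` and `attach_dependents` are hypothetical helpers, not DIMS functions.

```python
# Already-integrated Employee objects, keyed by an assumed Employee ID.
employees = {
    2: {"name": "Tim Andrews", "dependents": []},
    4: {"name": "Sarah Kaiser", "dependents": []},
}

# Rows as a Benefits source might return them (placeholder values).
benefits_rows = [
    {"emp_id": 2, "dependent": "Dependent A"},
    {"emp_id": 4, "dependent": "Dependent B"},
    {"emp_id": 4, "dependent": "Dependent C"},
]


def dependent_constraint(employees):
    """Employee IDs used to constrain the follow-up Dependent query."""
    return sorted(employees)


def attach_dependents(employees, rows):
    """Nest each Dependent under its Employee instead of creating
    independent top-level objects. Modifies `employees` in place."""
    for row in rows:
        owner = employees.get(row["emp_id"])
        if owner is not None:
            owner["dependents"].append(row["dependent"])
    return employees


ids = dependent_constraint(employees)
attach_dependents(employees, benefits_rows)
```

Rows whose Employee ID does not match any integrated Employee are simply ignored in this sketch.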
Composing the result Finally, DIMS returns all the Employee objects and their associated Dependent objects to the user application as a single, packaged, integrated response. The user application never had to “know” anything about the extra processing that DIMS performed: it simply sent one query to DIMS and received one “clean”, integrated response.
Advanced integration –Units conversion –Data abstraction –Data aggregations –Expert rules