CIS607, Fall 2005: Semantic Information Integration
Article Name: Clio Grows Up: From Research Prototype to Industrial Tool
Name: DH (Dong Hwi) Kwak
Date: 10/26/05 4:00 pm

Introduction
- Mappings between different representations of data are important.
- They are needed to integrate and exchange data residing at multiple sites, in different formats (or schemas), and even under different data models (such as relational or XML).

Data Exchange vs. Data Integration
- Data Exchange (or data translation): the task of restructuring data from a source format (or schema) into a target format (or schema).
- Data Integration (or federation): the ability to query a set of heterogeneous data sources via a virtual unified target schema.

Relationship or Mapping
- In both cases (data exchange, data integration), a relationship or mapping must first be established between the source schema and the target schema.
- There are two complementary levels:
  - Syntactic: schema matching
  - Operational: relating instances over the source schema to instances over the target schema
    - Move actual data from a source to a target
    - Answer queries

Clio System
- Two important components
  - Mapping Generation Component
    - Takes as input correspondences between the source and target schemas
    - Generates a schema mapping consisting of logical mappings that provide an interpretation of the given correspondences
  - Query Generation Component
    - Converts a set of logical mappings into an executable transformation script
    - Supported languages: SQL, SQL/XML, XQuery and XSLT
- The user can interact with the system through a GUI during design
  - The user can view, add, and remove correspondences

Clio System
- Two main components
  - Mapping Generation
  - Query Generation

Mapping & Query Generation
- Figure 2 illustrates an actual Clio screenshot showing portions of two gene expression schemas and the correspondences between them
  - Left: source schema (relational schema)
  - Right: target schema (XML schema)

Mapping & Query Generation
- Mapping Generation
  - Consider any two related elements in the source schema for which there exist correspondences into two related elements in the target schema.
  - FACTOR_NAME and BIOLOGY_DESC: there is a foreign key that links the EXPERIMENTFACTOR table to the EXPERIMENTSET table.
  - biology_desc and factor_name: the latter is a child of the former in the XML schema.
  - Therefore there will be a mapping that maps related instances of FACTOR_NAME and BIOLOGY_DESC into related instances of biology_desc and factor_name.

Mapping & Query Generation
- Generation of tableaux
  - The first step of the algorithm is to generate all the basic ways in which elements relate to each other within one schema
  - Figure 3 shows several of the source and target tableaux that are generated for our example
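
The tableau step can be pictured with a small sketch. It assumes a purely relational schema, given as a list of table names and a dictionary of foreign keys; the function name `source_tableaux` and this representation are hypothetical and are not Clio's actual data structures. Each tableau starts from one table and follows foreign keys until nothing new is reached, a simplified stand-in for the chase.

```python
def source_tableaux(tables, foreign_keys):
    """Return one tableau per table: the table itself plus every table
    reachable by repeatedly following foreign keys (a simplified chase)."""
    tableaux = {}
    for start in tables:
        members = [start]
        frontier = [start]
        while frontier:
            current = frontier.pop()
            for referenced in foreign_keys.get(current, []):
                if referenced not in members:
                    members.append(referenced)
                    frontier.append(referenced)
        tableaux[start] = sorted(members)
    return tableaux


# Toy version of the running example: EXPERIMENTFACTOR references EXPERIMENTSET.
tables = ["EXPERIMENTSET", "EXPERIMENTFACTOR"]
foreign_keys = {"EXPERIMENTFACTOR": ["EXPERIMENTSET"]}
tableaux = source_tableaux(tables, foreign_keys)
```

In this toy schema the tableau rooted at EXPERIMENTFACTOR also contains EXPERIMENTSET, which is how the relationship between FACTOR_NAME and BIOLOGY_DESC is captured.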

Mapping & Query Generation
- Generation of logical mappings
  - The second step of the algorithm is the generation of logical mappings
  - The basic algorithm pairs all the existing tableaux in the source with the existing tableaux in the target, and finds the correspondences that are covered by each pair
  - M1 is obtained from the pair (S2, T2)
  - M3 is obtained from the pair (S4, T4)
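
A minimal sketch of the pairing step, under the assumption that a tableau can be reduced to the set of schema elements it contains. The tableau names (`S_set`, `S_factor`, `T_set`, `T_factor`) and the function `candidate_mappings` are invented for illustration and do not correspond to the tableaux in Figure 3.

```python
from itertools import product


def candidate_mappings(source_tableaux, target_tableaux, correspondences):
    """Pair every source tableau with every target tableau and keep the
    pairs that cover at least one correspondence."""
    mappings = []
    pairs = product(source_tableaux.items(), target_tableaux.items())
    for (s_name, s_elems), (t_name, t_elems) in pairs:
        covered = [(s, t) for s, t in correspondences
                   if s in s_elems and t in t_elems]
        if covered:
            mappings.append((s_name, t_name, covered))
    return mappings


# Hypothetical tableaux, each reduced to the elements it contains.
source = {"S_set": {"BIOLOGY_DESC"},
          "S_factor": {"BIOLOGY_DESC", "FACTOR_NAME"}}
target = {"T_set": {"biology_desc"},
          "T_factor": {"biology_desc", "factor_name"}}
correspondences = [("BIOLOGY_DESC", "biology_desc"),
                   ("FACTOR_NAME", "factor_name")]
mappings = candidate_mappings(source, target, correspondences)
```

Only the pair (`S_factor`, `T_factor`) covers both correspondences; the other three pairs cover a single one. Several of these candidates are redundant, and this sketch does not prune them.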

Mapping & Query Generation
- Mapping language
  - Skolem functions: these functions can explicitly represent target elements for which no source value is given
  - For example, the mapping m1 will not specify a value for the @id attribute under exp_factor
  - A Skolem function creates a unique value for this attribute
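
One way to think of a Skolem function is as a generator of identifiers that returns the same value for the same arguments and distinct values for distinct arguments. The class below is a hedged illustration of that behaviour only; the name `Skolem`, the prefix `SK_id`, and the choice of arguments are assumptions, not the syntax of Clio's mapping language.

```python
class Skolem:
    """Toy Skolem function: one stable, distinct identifier per argument tuple."""

    def __init__(self, name):
        self.name = name
        self._ids = {}

    def __call__(self, *args):
        # First time these arguments are seen: mint a fresh identifier.
        if args not in self._ids:
            self._ids[args] = f"{self.name}{len(self._ids) + 1}"
        return self._ids[args]


# Hypothetical use: an id for exp_factor derived from the source values
# that the mapping does have.
sk_id = Skolem("SK_id")
first = sk_id("biology of sample A", "temperature")
again = sk_id("biology of sample A", "temperature")
other = sk_id("biology of sample A", "dosage")
```

Here `first` and `again` are equal while `other` differs, so repeated runs over the same source values would not create duplicate target elements.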

Mapping & Query Generation
- Query Generation
  - Each logical mapping is compiled into a query graph
  - For each logical mapping, the query generators walk the relevant part of the target schema and create the necessary join and grouping conditions
  - In Figure 5, the XQuery fragment produces the target

Mapping & Query Generation
- Lines 1-4 in Figure 5 implement the join of the two tables EXPERIMENTFACTOR and EXPERIMENTSET
- Lines 7-10 output the attributes within exp_set
- Lines 25-31 produce an actual exp_factor element
- Line 30 creates a value for the id element
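
As a rough analogue of the join-then-nest behaviour described for Figure 5, written in Python rather than XQuery: the sketch joins the two tables and groups the factors under their experiment set. The join column `SET_ID` and the function name `nest_factors` are hypothetical; only BIOLOGY_DESC, FACTOR_NAME, exp_factor, biology_desc and factor_name come from the slides.

```python
def nest_factors(experiment_sets, experiment_factors):
    """Join EXPERIMENTFACTOR with EXPERIMENTSET and nest the factors of
    each set under it, mimicking the shape of the XML target."""
    result = []
    for exp_set in experiment_sets:
        factors = [{"factor_name": factor["FACTOR_NAME"]}
                   for factor in experiment_factors
                   if factor["SET_ID"] == exp_set["SET_ID"]]
        if factors:  # a join only keeps sets that have a matching factor
            result.append({"biology_desc": exp_set["BIOLOGY_DESC"],
                           "exp_factor": factors})
    return result


sets = [{"SET_ID": 1, "BIOLOGY_DESC": "liver tissue"},
        {"SET_ID": 2, "BIOLOGY_DESC": "no factors recorded"}]
factors = [{"SET_ID": 1, "FACTOR_NAME": "temperature"},
           {"SET_ID": 1, "FACTOR_NAME": "dosage"}]
nested = nest_factors(sets, factors)
```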

Practical Challenges (Mapping Generation)
- Redundancy Check
  - If there is a one-to-one mapping of the variables of T1 into the variables of T2, T1 is a sub-tableau of T2
  - This check can significantly reduce the number of irrelevant mappings
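
The sub-tableau test can be sketched as a search for an injective assignment of variables. The sketch assumes each tableau is a dictionary from a variable to the schema element it ranges over and ignores join conditions, which a real check would also need to preserve; `is_sub_tableau` is a hypothetical name.

```python
def is_sub_tableau(t1, t2):
    """True if every variable of t1 can be mapped to a distinct variable
    of t2 that ranges over the same schema element."""

    def extend(remaining, used):
        if not remaining:
            return True
        var, rest = remaining[0], remaining[1:]
        for candidate, element in t2.items():
            if candidate not in used and element == t1[var]:
                if extend(rest, used | {candidate}):
                    return True
        return False

    return extend(list(t1), frozenset())


small = {"s": "EXPERIMENTSET"}
large = {"f": "EXPERIMENTFACTOR", "s2": "EXPERIMENTSET"}
small_in_large = is_sub_tableau(small, large)
large_in_small = is_sub_tableau(large, small)
```

Here `small_in_large` is true and `large_in_small` is false.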

Practical Challenges (Mapping Generation)
- Hybrid Algorithm
  - There are two phases in mapping generation: precomputation and insertion of correspondences
  - This separation speeds up the addition of correspondences in the GUI
  - Its disadvantage shows up when the schemas are large
  - A large amount of memory may be needed to hold all the data structures

Practical Challenges (Mapping Generation)
- Hybrid Algorithm
  - The main idea behind this algorithm is to precompute only a bounded number of source tableaux and target tableaux
  - Sometimes a correspondence may fail to match any precomputed tableau because its elements lie deeper in the schema tree
  - In that case, generate a source tableau that includes all the set-type elements. The tableaux are closed under the chase, thus including all the other schema elements associated via foreign keys
  - The data structures holding the sub-tableaux and the sub-skeleton relationship are then updated
  - The algorithm may lose its completeness (a small price)
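
The bounded precomputation with an on-demand fallback can be illustrated with a small cache. This is a sketch under strong simplifications: a tableau is identified by a single root element, `build` stands in for whatever procedure actually constructs a tableau, and the class name `HybridTableauStore` is invented.

```python
class HybridTableauStore:
    """Precompute at most `limit` tableaux; build any other one on demand."""

    def __init__(self, roots, build, limit):
        self.build = build
        # Bounded precomputation phase.
        self.cache = {root: build(root) for root in roots[:limit]}
        self.on_demand = 0

    def tableau_for(self, root):
        # Called when a correspondence is inserted.
        if root not in self.cache:
            self.cache[root] = self.build(root)
            self.on_demand += 1
        return self.cache[root]


store = HybridTableauStore(["exp_set", "exp_factor"],
                           build=lambda root: (root,),
                           limit=1)
store.tableau_for("exp_set")      # already precomputed
store.tableau_for("exp_factor")   # built on demand
store.tableau_for("exp_factor")   # now served from the cache
```

The bound trades memory for occasional extra work when a correspondence touches a tableau that was not precomputed.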

Practical Challenges (Mapping Generation)
- Performance evaluation: mapping MAGE-ML
  - MAGE-ML is a complex XML schema
  - Two experiments are performed
    - First: control the nesting level of the precomputed tableaux (maximum 6 nested levels of sets), with no limit on the total number of precomputed tableaux (lower bound)
    - Second: control the nesting level of the precomputed tableaux (maximum 6 nested levels of sets) and the total number of precomputed tableaux (maximum 110 per schema) (actual improvement)

Practical Challenges (Mapping Generation)
- First experiment
  - Load the MAGE-ML (source): less than 1 sec
  - Precompute all (1030) tableaux: 2.6 sec
  - Computing the subtableaux relationship: 74 sec
  - Memory to hold the data structures: 335 MB
  - Loading the MAGE-ML schema: ran out of memory
- Second experiment
  - Precomputation of the tableaux (116 now): 0.5 sec
  - Computing the subtableaux relationship: 0.7 sec
  - Memory to hold the data structures: 163 MB
  - The amount of memory needed to hold everything: 251 MB
- Overall, the performance of the hybrid algorithm is quite acceptable.

Practical Challenges (Query Generation: Deep Union)
- There are two drawbacks
  - There is no duplicate removal within and among query fragments
  - There is no grouping of data among query fragments
- Example with (OrderID, ItemID)
  - Input data: {(o1, i1), (o1, i2)}
  - Output data: {(o1, (i1, i2)), (o1, (i1, i2))}
  - Second query: {(o1, (i3))}
- We would expect this second tuple to be merged with the previous result, producing only one tuple for o1: {(o1, (i1, i2, i3))}
- We call this special union operation deep union
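
The (OrderID, ItemID) example can be reproduced with a minimal sketch of deep union for one level of nesting. It assumes each query fragment yields (order, items) pairs; the function name `deep_union` and this encoding are illustrative assumptions, not Clio's implementation.

```python
def deep_union(*fragments):
    """Merge nested tuples that share the same key and drop duplicate items,
    both within one fragment and across fragments."""
    merged = {}
    for fragment in fragments:
        for order_id, items in fragment:
            bucket = merged.setdefault(order_id, [])
            for item in items:
                if item not in bucket:
                    bucket.append(item)
    return [(order_id, tuple(items)) for order_id, items in merged.items()]


# Output of the first query fragment (with its duplicate) and of the second.
first_query = [("o1", ("i1", "i2")), ("o1", ("i1", "i2"))]
second_query = [("o1", ("i3",))]
result = deep_union(first_query, second_query)
```

With the data from the slide, `result` holds a single tuple for o1 with items i1, i2 and i3. Deeper nesting would need the same merge applied recursively.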

Remaining Challenges
- Complex mappings need a more expressive correspondence selection mechanism than that supported by Clio
- Exploring the need for logical mappings that nest other logical mappings inside them
- Mapping adaptation issues when source and target schemas change
