Grid Based Data Integration with Automatic Wrapper Generation


1 Grid Based Data Integration with Automatic Wrapper Generation
Xuan Zhang, Gagan Agrawal
Ohio State University

2 Overall Goal
Tools for data integration driven by:
  Data explosion
    Data size & number of data sources
  New analysis tools and need for workflows
  Autonomous resources
    Heterogeneous data representation & various interfaces

3 Motivation (Contd.)
Other Issues:
  Frequent updates to data formats
  Flat-file datasets
  Ad-hoc sharing of data

4 Current Approaches
Manually written wrappers
  Problems
    O(N^2) wrappers needed, O(N) for a single update
    Portability of wrappers in a distributed environment
Mediator-based integration systems
  Need a common intermediate format
  Unnecessary data transformation
Integration using web/grid services
  Needs all tools to be web services (all data in XML?)
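The wrapper-count argument on this slide can be made concrete with a small Python sketch. It assumes one directed wrapper per ordered pair of formats, which is one reading of the O(N^2) claim, and compares that with one layout descriptor per resource. The function names are illustrative and do not come from the presentation.

```python
def pairwise_wrappers(n: int) -> int:
    """Hand-written wrappers if every format is converted to every other one.

    Assumption: one directed wrapper per ordered pair of formats.
    """
    return n * (n - 1)


def wrappers_touched_by_update(n: int) -> int:
    """Wrappers that read or write a single format whose layout changed."""
    return 2 * (n - 1)


def descriptors(n: int) -> int:
    """Layout descriptors if each resource is described exactly once."""
    return n


for n in (5, 10, 50):
    print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), descriptors(n))
```

With ten formats, this reading gives 90 hand-written wrappers, 18 of which are affected when one format changes, against ten descriptors with one to update.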

5 Our Approach
Automatically generate wrappers
  One layout descriptor per resource
  Stand-alone wrapper programs
    For integrated DBs, (grid) workflow systems
Transform data in files of arbitrary formats
  No domain- or format-specific heuristics
  Layout information provided by users

6 Our Approach (Contd.)
Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005)
Particularly attractive for
  Data grid environments and workflows
  Flat-file datasets
  Ad hoc data sharing

7 Our Approach: Advantages
No need to write wrappers while integrating data or creating workflows
  Only one descriptor per resource needed
No unnecessary transformations / storage
New resources can be integrated on-the-fly

8 Our Approach: Challenges
Description language
  Format and logical view of data in flat files
  Easy to interpret and write
Wrapper generation and execution
  Correspondence between data items
  Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
  What data mining techniques to use? (DILS 2005, BIBE 2005)

9 Wrapper Generation System Overview
[Figure: system overview diagram. Labeled components: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]

10 Suitability for a Grid Environment
Wrapper analysis can be implemented as a grid service
  Very low execution costs
Wrapper execution modules are task-independent
  Just need to port three modules to different systems

11 Assumptions for the Current Prototype
One tabular, the other semi-structured
Both datasets are stored record-wise
Order of records not disturbed
Suitable for bioinformatics
  Semi-structured, tabular

12 Layout Description Language
Goal
  To describe data in arbitrary flat file format
  Easy to interpret and write
Components:
  Schema description
  Layout description
Example: FASTA

13 Layout Description Language
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …
Key observations on data layout
  Strings of variable length
  Delimiters widely used
  Data fields divided into variables
  Repetitive structures
Key tokens
  "constant string"
  LINESIZE
  [optional]
  <repeating>

14 Layout Description Language
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …
Component I: Schema Description
  [FASTA]               //Schema Name
  ID = string           //Data type definitions
  DESCRIPTION = string
  SEQ = string

15 Layout Description Language
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …
Component II: Layout Description
  LOOP ENTRY 1:EOF:1 {
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  }
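To show what the FASTA layout description asks a reader to do, here is a minimal hand-written Python sketch that follows the same structure: a ">" marker, an ID up to the first space, a DESCRIPTION up to the end of the line, then one or more SEQ lines joined together. It is not the wrapper the system generates, and the names FastaEntry and read_fasta are hypothetical.

```python
from typing import Iterator, List, NamedTuple


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: str


def _make_entry(header: str, seq_lines: List[str]) -> FastaEntry:
    # ">" ID " " DESCRIPTION: split the header on the first space only.
    ident, _, desc = header.partition(" ")
    # < "\n" SEQ >: the repeated sequence lines form one SEQ value.
    return FastaEntry(ident, desc.strip(), "".join(seq_lines))


def read_fasta(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per '>' header line, until the end of the input."""
    header = None
    seq_lines: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_entry(header, seq_lines)
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield _make_entry(header, seq_lines)


SAMPLE = """>seq1 comment1
ASTPGHTIIYEAVCLHNDRTTIP
>seq2 comment2
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
"""

entries = list(read_fasta(SAMPLE))
for entry in entries:
    print(entry.id, entry.description, len(entry.seq))
```

The point of the descriptor approach is that this per-format logic would be derived from the layout description instead of being written by hand for each format.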

16 Mapping Cardinality
TRANSFAC
  …
  FA factor1_name
  RA reference1.1_authors
  RA reference1.2_authors
  RA reference1.3_authors
Reference table
  FA              RA
  factor1_name    reference1.1_authors
                  reference1.2_authors
                  reference1.3_authors
Legend: One-to-multiple data field; One-to-one data field
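The cardinality issue in this example can be sketched in Python: a record carries a single FA line and several RA lines, and the tabular target needs one row per RA value. The sketch assumes that the FA value is repeated in every output row and that FA precedes its RA lines; both are assumptions, since the slide does not spell them out. The function name is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def transfac_to_reference(record_lines: Iterable[str]) -> List[Tuple[Optional[str], str]]:
    """Turn the FA/RA tagged lines of one record into (FA, RA) rows.

    Assumption: the single FA value is paired with every RA value.
    """
    factor = None
    rows = []
    for line in record_lines:
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "FA":
            factor = value                 # appears once per record
        elif tag == "RA":
            rows.append((factor, value))   # appears once per reference
    return rows


record = [
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
]
rows = transfac_to_reference(record)
for row in rows:
    print(row)
```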

17 Analyzing Application
Goals - WRAPINFO
  Summarize all application-related information necessary for the wrapper
  Represent the information in look-up tables and constant parameters
  Represent the information in a platform-independent format, XML
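As an illustration of storing look-up tables and constant parameters in XML, the following Python sketch serializes both and reads them back with the standard library. The presentation does not show the actual WRAPINFO schema, so every element name, attribute name, and sample value below is invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

Tables = Dict[str, Dict[str, str]]


def build_wrapinfo(constants: Dict[str, str], lookup_tables: Tables) -> str:
    """Serialize constants and look-up tables (hypothetical layout)."""
    root = ET.Element("wrapinfo")
    consts = ET.SubElement(root, "constants")
    for name, value in constants.items():
        ET.SubElement(consts, "param", name=name, value=value)
    for table_name, table in lookup_tables.items():
        node = ET.SubElement(root, "table", name=table_name)
        for key, value in table.items():
            ET.SubElement(node, "entry", key=key, value=value)
    return ET.tostring(root, encoding="unicode")


def load_wrapinfo(xml_text: str) -> Tuple[Dict[str, str], Tables]:
    """Read back what build_wrapinfo wrote."""
    root = ET.fromstring(xml_text)
    constants = {p.get("name"): p.get("value") for p in root.find("constants")}
    tables = {
        t.get("name"): {e.get("key"): e.get("value") for e in t}
        for t in root.findall("table")
    }
    return constants, tables


# Sample values are placeholders, not taken from the real system.
constants = {"source_schema": "TRANSFAC", "target_schema": "Reference"}
tables = {"field_mapping": {"FA": "FA", "RA": "RA"}}
xml_text = build_wrapinfo(constants, tables)
print(xml_text)
```

Keeping this information in a text format is what lets the analysis step run on one machine and the execution modules on another.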

18 Wrapper Generated
[Figure: generated wrapper diagram. Labels: Input dataset, Dataset buffer, DataReader, Value buffer, one_to_one_values, one_to_multiple_values, FA, DataWriter, Output dataset, Synchronizer; control signals: load, run, halt]

19 Wrapper Generated
Suitable for data grid
Three general modules
  DataReader
    Extract one data field value
    Write value to the value buffer if useful
  DataWriter
    Write one data field value
    Remove value from list in the value buffer
  Synchronizer
    Switch between calling DataReader and DataWriter
    Manage dataset buffer
Application-specific information in WRAPINFO
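The division of labor between the three modules can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the generated wrapper: the source is modeled as a stream of (field name, value) pairs, the target as a fixed cyclic order of field names, and only one-to-one fields are handled. All class, method, and field names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

Buffer = Dict[str, Deque[str]]


class DataReader:
    def __init__(self, fields: Iterable[Tuple[str, str]], wanted: List[str]):
        self._fields = iter(fields)
        self._wanted = set(wanted)

    def read_one(self, buffer: Buffer) -> bool:
        """Extract one field value; buffer it only if the target uses it."""
        try:
            name, value = next(self._fields)
        except StopIteration:
            return False  # source exhausted
        if name in self._wanted:
            buffer[name].append(value)
        return True


class DataWriter:
    def __init__(self, order: List[str]):
        self._order = order  # field order expected by the target layout
        self._pos = 0
        self.output: List[Tuple[str, str]] = []

    def can_write(self, buffer: Buffer) -> bool:
        return bool(buffer[self._order[self._pos]])

    def write_one(self, buffer: Buffer) -> None:
        """Write one field value and remove it from the value buffer."""
        name = self._order[self._pos]
        self.output.append((name, buffer[name].popleft()))
        self._pos = (self._pos + 1) % len(self._order)


def synchronize(reader: DataReader, writer: DataWriter, wanted: List[str]):
    """Alternate between writing what is ready and reading one more field."""
    buffer: Buffer = {name: deque() for name in wanted}
    more = True
    while more:
        while writer.can_write(buffer):
            writer.write_one(buffer)
        more = reader.read_one(buffer)
    return writer.output


target_order = ["ID", "DESCRIPTION", "SEQ"]
source = [("ID", "seq1"), ("JUNK", "x"), ("SEQ", "AST"), ("DESCRIPTION", "comment1")]
result = synchronize(DataReader(source, target_order), DataWriter(target_order), target_order)
print(result)
```

In this sketch the value buffer is what lets the writer wait for DESCRIPTION even though SEQ arrives first in the source, and the unused JUNK field is never stored.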

20 Experimental Results
TRANSFAC-to-Reference Problem (in logarithm)

21 Experimental Results
SWISSPROT-to-FASTA Problem

22 Summary
Automatically generated wrappers can perform well
Wrapper task analysis and wrapper execution can be separated
Key Open Question:
  How hard is it to write layout descriptors?
  Can we make the process semi-automatic?
  Data mining techniques seem quite promising (DILS 2005, BIBE 2005)

