Grid Based Data Integration with Automatic Wrapper Generation

Xuan Zhang, Gagan Agrawal
Ohio State University

Overall Goal
Tools for data integration, driven by:
- Data explosion
  - Data size and number of data sources
- New analysis tools and the need for workflows
- Autonomous resources
  - Heterogeneous data representations and various interfaces

Motivation (Contd.)
Other issues:
- Frequent updates to data formats
- Flat-file datasets
- Ad hoc sharing of data

Current Approaches
- Manually written wrappers
  - Problems:
    - O(N^2) wrappers needed, O(N) for a single update
    - Portability of wrappers in a distributed environment
- Mediator-based integration systems
  - Need a common intermediate format
  - Unnecessary data transformation
- Integration using web/grid services
  - Needs all tools to be web services (all data in XML?)

Our Approach
- Automatically generate wrappers
  - One layout descriptor per resource
  - Stand-alone wrapper programs
  - For integrated DBs, (grid) workflow systems
- Transform data in files of arbitrary formats
  - No domain- or format-specific heuristics
  - Layout information provided by users

Our Approach (Contd.)
- Help write layout descriptors using data mining techniques (dils 2005, bibe 2005)
- Particularly attractive for:
  - Data grid environments and workflows
  - Flat-file datasets
  - Ad hoc data sharing

Our Approach: Advantages
- No need to write wrappers while integrating data or creating workflows
- Only one descriptor per resource needed
- No unnecessary transformations / storage
- New resources can be integrated on-the-fly
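To make the wrapper-count argument concrete, here is a small arithmetic sketch. It assumes that hand-written wrappers are directed (one per ordered pair of formats), which is one reading of the O(N^2) claim. The function names are illustrative only.

```python
def pairwise_wrappers(n):
    """Directed, hand-written wrappers if every format converts to every other."""
    return n * (n - 1)


def wrappers_affected_by_update(n):
    """Wrappers that read or write one changed format: n-1 inbound plus n-1 outbound."""
    return 2 * (n - 1)


def layout_descriptors(n):
    """With generated wrappers, one layout descriptor per resource."""
    return n


for n in (5, 20, 100):
    print(n, pairwise_wrappers(n), wrappers_affected_by_update(n), layout_descriptors(n))
```

Under these assumptions, a format change means revising a single descriptor rather than O(N) wrappers.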

Our Approach: Challenges
- Description language
  - Format and logical view of data in flat files
  - Easy to interpret and write
- Wrapper generation and execution
  - Correspondence between data items
  - Separating wrapper analysis and execution
- Interactive tools for writing layout descriptors
  - What data mining techniques to use? (dils 2005, bibe 2005)

Wrapper Generation System Overview
[Figure: system architecture. Labeled components: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]

Suitability for a Grid Environment
- Wrapper analysis can be implemented as a grid service
  - Very low execution costs
- Wrapper execution modules are task-independent
  - Just need to port three modules to different systems

Assumptions for the Current Prototype
- One dataset is tabular, the other semi-structured
- Both datasets are stored record-wise
- Order of records is not disturbed
- Suitable for bioinformatics
  - Semi-structured tabular

Layout Description Language
- Goal
  - To describe data in arbitrary flat-file formats
  - Easy to interpret and write
- Components:
  - Schema description
  - Layout description
- Example: FASTA

  …
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …
Key observations on data layout
- Strings of variable length
- Delimiters widely used
- Data fields divided into variables
- Repetitive structures
Key tokens
- “constant string”
- LINESIZE
- [optional]
- <repeating>
- …

Component I: Schema Description (same FASTA sample as on the previous slide)
  [FASTA]                 // Schema name
  ID = string             // Data type definitions
  DESCRIPTION = string
  SEQ = string

Component II: Layout Description (same FASTA sample as on the previous slides)
  …
  LOOP ENTRY 1:EOF:1 {
    “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
  }
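As an illustration of what the layout description encodes, the following is a hand-written Python sketch of a reader for the same structure: each entry starts with ">", followed by ID, a space, DESCRIPTION, and then one or more newline-separated SEQ segments. This is not the generated wrapper from the talk. The names `FastaEntry` and `read_fasta` are hypothetical, and joining the SEQ segments into one string is an assumption.

```python
from typing import Iterator, NamedTuple


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: str


def read_fasta(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per '>' header, mirroring the LOOP ENTRY structure."""
    entry_id = None
    description = ""
    seq_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if any.
            if entry_id is not None:
                yield FastaEntry(entry_id, description, "".join(seq_lines))
            # ">" ID " " DESCRIPTION
            entry_id, _, description = line[1:].partition(" ")
            seq_lines = []
        elif entry_id is not None:
            # < "\n" SEQ > : repeated sequence segments
            seq_lines.append(line)
    if entry_id is not None:
        # EOF closes the last entry.
        yield FastaEntry(entry_id, description, "".join(seq_lines))


sample = ">seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n>seq2 comment2\nASQKRP\nNMYKDS\n"
for entry in read_fasta(sample):
    print(entry.id, entry.description, len(entry.seq))
```

The point of the descriptor approach is that this kind of format-specific logic is derived from the layout description instead of being written by hand for each format.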

Mapping Cardinality
[Figure: a TRANSFAC entry mapped to a Reference table]
TRANSFAC entry:
  …
  FA factor1_name
  RA reference1.1_authors
  RA reference1.2_authors
  RA reference1.3_authors
Reference table:
  Columns: FA, RA
  Values: factor1_name; reference1.1_authors, reference1.2_authors, reference1.3_authors
Figure labels: One-to-multiple data field, One-to-one data field

Analyzing Application
Goals - WRAPINFO
- Summarize all application-related information necessary for the wrapper
- Represent the information in look-up tables and constant parameters
- Represent the information in a platform-independent format, XML
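The slides do not show the WRAPINFO format itself. The sketch below only illustrates the general idea of serializing a look-up table and constant parameters to XML using Python's standard library. All element and attribute names (`wrapinfo`, `lookup_table`, `entry`, `constants`, `param`) and the example values are hypothetical and not taken from the actual system.

```python
import xml.etree.ElementTree as ET


def build_wrapinfo(field_map, constants):
    """Serialize a field look-up table and constant parameters to an XML string."""
    root = ET.Element("wrapinfo")
    table = ET.SubElement(root, "lookup_table", name="field_mapping")
    for source, target in field_map.items():
        ET.SubElement(table, "entry", source=source, target=target)
    params = ET.SubElement(root, "constants")
    for name, value in constants.items():
        ET.SubElement(params, "param", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


# Hypothetical mapping for the TRANSFAC-to-Reference example.
xml_text = build_wrapinfo({"FA": "FA", "RA": "RA"}, {"record_end": "//"})
print(xml_text)
```

Because the result is plain XML, any execution module that can parse XML could load it, which is the platform-independence property the slide asks for.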

Wrapper Generated
[Figure: structure of the generated wrapper. Labeled components: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_one_values, one_to_multiple_values; example value FA), DataWriter, Output dataset, Synchronizer. Control labels: load, run, run, halt]

Wrapper Generated
- Suitable for data grid
- Three general modules
  - DataReader
    - Extract one data field value
    - Write value to the value buffer if useful
  - DataWriter
    - Write one data field value
    - Remove value from list in the value buffer
  - Synchronizer
    - Switch between calling DataReader and DataWriter
    - Manage dataset buffer
- Application-specific information in WRAPINFO
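The division of labor among the three modules can be sketched roughly as follows, using the TRANSFAC-to-Reference example. This is a simplified illustration, not the generated wrapper. It assumes that FA is the one-to-multiple field (its value repeated in every output row), that RA is the one-to-one field, and that a line containing "//" ends a record. The `WRAPINFO` dictionary is a hypothetical stand-in for the real WRAPINFO.

```python
from collections import deque

# Hypothetical stand-in for WRAPINFO: application-specific facts only.
WRAPINFO = {
    "one_to_multiple": "FA",  # assumed: value reused for every output row
    "one_to_one": "RA",       # assumed: each value yields exactly one row
    "record_end": "//",       # assumed record terminator
}


class ValueBuffer:
    def __init__(self):
        self.one_to_multiple_values = {}
        self.one_to_one_values = deque()


class DataReader:
    def read(self, line, buffer):
        """Extract one field value; keep it only if the application needs it."""
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == WRAPINFO["one_to_multiple"]:
            buffer.one_to_multiple_values[tag] = value
        elif tag == WRAPINFO["one_to_one"]:
            buffer.one_to_one_values.append(value)


class DataWriter:
    def write(self, buffer, output):
        """Write one output row and remove the consumed one-to-one value."""
        ra = buffer.one_to_one_values.popleft()
        fa = buffer.one_to_multiple_values.get(WRAPINFO["one_to_multiple"], "")
        output.append((fa, ra))


class Synchronizer:
    def __init__(self):
        self.reader = DataReader()
        self.writer = DataWriter()

    def run(self, lines):
        """Alternate between reading fields and writing rows, record by record."""
        buffer, output = ValueBuffer(), []
        for raw in lines:
            line = raw.strip()
            if line == WRAPINFO["record_end"]:
                while buffer.one_to_one_values:
                    self.writer.write(buffer, output)
                buffer = ValueBuffer()
            elif line:
                self.reader.read(line, buffer)
        while buffer.one_to_one_values:  # flush a final unterminated record
            self.writer.write(buffer, output)
        return output


rows = Synchronizer().run([
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "//",
])
print(rows)
```

Note that the three classes contain no TRANSFAC-specific logic beyond what they read from `WRAPINFO`, which reflects the task-independence claim made for the execution modules.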

Experimental Results TRANSFAC-to-Reference Problem (in logarithm)

Experimental Results SWISSPROT-to-FASTA Problem

Summary
- Automatically generated wrappers can perform well
- Wrapper task analysis and wrapper execution can be separated
- Key open question: How hard is it to write layout descriptors?
  - Can we make the process semi-automatic?
  - Data mining techniques seem quite promising (dils 2005, bibe 2005)
