Xuan Zhang Kaushik Sinha Ruoming Jin Gagan Agrawal

Using (and Generating) Metadata for Automatic Wrapper Generation
Xuan Zhang, Kaushik Sinha, Ruoming Jin, Gagan Agrawal
April 19

Overall Goal
Informatics tools for data integration, driven by:
- Data explosion
  - Data size & number of data sources
- New analysis tools
- Autonomous resources
  - Heterogeneous data representation & various interfaces
  - Frequent updates
Common situations:
- Flat-file datasets
- Ad hoc sharing of data

Current Approaches
- Manually written wrappers
  - Problems: O(N^2) wrappers needed, O(N) for a single update
- Mediator-based integration systems
  - Need a common intermediate format
  - Unnecessary data transformation
- Integration using web/grid services
  - Needs all tools to be web services (all data in XML?)

Our Approach
- Use metadata for capturing layout information
- Automatically generate wrappers
  - Stand-alone programs
  - For integrated DBs, (grid) workflow systems
- Help write layout descriptors using data mining techniques
- Particularly attractive for
  - flat-file datasets
  - ad hoc data sharing
  - data grid environments

Our Approach: Advantages
- No DB or query support required
- One descriptor per resource needed
- No unnecessary transformation
- New resources can be integrated on the fly

Our Approach: Challenges
- Description language
  - Format and logical view of data in flat files
  - Easy to interpret and write
- Wrapper generation and execution
  - Correspondence between data items
  - Separating wrapper analysis and execution
- Interactive tools for writing layout descriptors
  - What data mining techniques to use?

Layout Description Language
- Goal
  - To describe data in arbitrary flat-file formats
  - Easy to interpret and write
- Components:
  - Schema description
  - Layout description
- Example: FASTA

FASTA example:
    …
    >seq1 comment1 \n
    ASTPGHTIIYEAVCLHNDRTTIP \n
    >seq2 comment2 \n
    ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
    NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
    KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
    >seq3 …

Component I: Schema Description
    [FASTA]               // Schema name
    ID = string           // Data type definitions
    DESCRIPTION = string
    SEQ = string

Key observations on data layout (same FASTA example as above)
- Strings of variable length
- Delimiters widely used
- Data fields divided into variables
- Repetitive structures
Key tokens
- "constant string"
- LINESIZE
- [optional]
- <repeating>
- …

Component II: Layout Description (same FASTA example as above)
    …
    LOOP ENTRY 1:EOF:1 {
        ">" ID " " DESCRIPTION
        < "\n" SEQ >
        "\n" | EOF
    }
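To make the layout descriptor concrete, here is a minimal hand-written Python sketch of a reader that follows the same structure: one `">" ID " " DESCRIPTION` header per entry, then repeated sequence lines until the next header or EOF. It is not the generated wrapper from the talk. The names `FastaEntry` and `read_fasta` are hypothetical, and joining the repeated SEQ lines into one string is an assumption about the intended logical view.

```python
from typing import Iterator, NamedTuple


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: str


def read_fasta(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per iteration of the LOOP in the layout descriptor."""
    entry_id, description, seq_lines = None, "", []
    for line in text.split("\n"):
        if line.startswith(">"):
            # A new ">" closes the previous entry, so emit it first.
            if entry_id is not None:
                yield FastaEntry(entry_id, description, "".join(seq_lines))
            # ">" ID " " DESCRIPTION: the first space separates the two fields.
            entry_id, _, description = line[1:].partition(" ")
            seq_lines = []
        elif line.strip() and entry_id is not None:
            # < "\n" SEQ >: every further non-empty line is one repetition of SEQ.
            seq_lines.append(line.strip())
    if entry_id is not None:
        # EOF closes the last entry.
        yield FastaEntry(entry_id, description, "".join(seq_lines))


sample = ">seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n>seq2 comment2\nASQKRPSQ\nNMYKDSHH\n"
entries = list(read_fasta(sample))
```

The point of the approach in the talk is that code of this kind would be generated from the descriptor rather than written by hand for each format.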

Wrapper Generation System Overview
[Figure: system overview. Labels in the diagram: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]

Assumptions for the Current Prototype
- One tabular, the other semi-structured
- Both datasets are stored record-wise
- Order of records not disturbed
- Semi-structured tabular

Layout Parser
Key observations:
- Many repetitive structures
- DLM-VAR pairs
FASTA example:
    …
    LOOP ENTRY 1:EOF:1 {
        ">" ID " " DESCRIPTION
        < "\n" SEQ >
        "\n" | EOF
    }

Mapping Cardinality
TRANSFAC:
    …
    FA factor1_name
    RA reference1.1_authors
    RA reference1.2_authors
    RA reference1.3_authors
    …
Reference table:
    FA             RA
    factor1_name   reference1.1_authors
                   reference1.2_authors
                   reference1.3_authors
[Figure legend: one-to-multiple data field; one-to-one data field]

Analyzing Application
Goals - WRAPINFO
- Summarize all application-related information necessary for the wrapper
- Represent the information in look-up tables and constant parameters
- Represent the information in a platform-independent format, XML

Wrapper Generated
[Figure: structure of the generated wrapper. Labels in the diagram: Input dataset, Dataset buffer, DataReader, Value buffer, one_to_one_values, one_to_multiple_values, FA, DataWriter, Output dataset, Synchronizer; control labels: load, run, run, halt]

Wrapper Generated
- Suitable for data grid
- Three general modules
  - DataReader
    - Extract one data field value
    - Write value to the value buffer if useful
  - DataWriter
    - Write one data field value
    - Remove value from list in the value buffer
  - Synchronizer
    - Switch between calling DataReader and DataWriter
    - Manage dataset buffer
- Application-specific information in WRAPINFO
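The division of labour between the three modules can be sketched in Python as follows. This is an illustration under stated assumptions, not the generated wrapper: the function names, the dictionary used as a value buffer, and the hard-coded `USEFUL_TAGS` (standing in for a WRAPINFO look-up table) are all hypothetical. It also assumes the TRANSFAC-to-Reference case in which each RA value yields one output row paired with the most recent FA value.

```python
from collections import deque

# Hypothetical stand-in for a WRAPINFO look-up table: the fields the target needs.
USEFUL_TAGS = {"FA", "RA"}


def data_reader(line, value_buffer):
    """Extract one field value and buffer it only if the target uses it."""
    tag, _, value = line.partition(" ")
    if tag not in USEFUL_TAGS:
        return
    if tag == "FA":
        value_buffer["FA"] = value          # single current value
    else:
        value_buffer["RA"].append(value)    # list of pending values


def data_writer(value_buffer, output):
    """Write one output row and remove the consumed value from its list."""
    reference = value_buffer["RA"].popleft()
    output.append((value_buffer["FA"], reference))


def synchronizer(lines):
    """Alternate between reading and writing until the input is exhausted."""
    value_buffer = {"FA": None, "RA": deque()}
    output = []
    for line in lines:
        data_reader(line, value_buffer)
        # Switch to the writer whenever a complete row is available.
        while value_buffer["FA"] is not None and value_buffer["RA"]:
            data_writer(value_buffer, output)
    return output


rows = synchronizer([
    "FA factor1_name",
    "XX ignored_field",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
])
```

Keeping the application-specific part in a table such as `USEFUL_TAGS` is what lets the three modules stay generic across applications.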

Experimental Results TRANSFAC-to-Reference Problem (in logarithm)

Experimental Results SWISSPROT-to-FASTA Problem

Observations
- Automatically generated wrappers can perform well
- Wrapper task analysis and wrapper execution can be separated
Key open questions:
- How hard is it to write layout descriptors?
- Can we make the process semi-automatic?

Learning Layout of a Flat-File
- In general: intractable
- Try to learn the layout, then have a domain expert verify it
- Key issue: what delimiters are being used?

Finding Delimiters
- Difficult problem
  - Some knowledge from a domain expert is required (semi-automatic)
- Naïve approaches
  - Frequency counting
    - Counts frequently occurring single tokens (words separated by spaces)
  - Sequence mining
    - Counts frequently occurring sequences of tokens
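Both naïve approaches amount to counting token n-grams: frequency counting is the case n = 1, and sequence mining covers longer contiguous sequences. A minimal Python sketch follows. The function name `count_token_sequences`, the `max_len` bound, and the three sample lines are made up for illustration and are not taken from a real dataset.

```python
from collections import Counter


def count_token_sequences(lines, max_len=3):
    """Count every contiguous token sequence of length 1..max_len."""
    freq = Counter()
    for line in lines:
        tokens = line.split()  # tokens are whitespace-separated words
        for n in range(1, max_len + 1):
            for start in range(len(tokens) - n + 1):
                freq[tuple(tokens[start:start + n])] += 1
    return freq


# Made-up lines, only to exercise the counter.
lines = ["#=GF AC PF00001", "#=GF DE first family", "#=GF AC PF00002"]
freq = count_token_sequences(lines)
most_common_single = max(
    (seq for seq in freq if len(seq) == 1), key=freq.__getitem__
)
```

Single-token counts correspond to frequency counting; the entries keyed by longer tuples are what sequence mining works with.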

Frequency Counting
Problems
- Some tokens that appear very frequently are not delimiters
- A delimiter could be a sequence of tokens rather than a single token
Possible solution
- Use the frequency of a token sequence and of all its subsequences to decide on possible delimiter sequences

Sequence Mining Example
- For any sequence of tokens s, f(s) represents the frequency of s
- Let's say A, B, C are tokens
- Case 1: f(ABC)=10, f(AB)=10, f(BC)=10, f(CA)=10
  - Information about AB, BC, CA is already embedded in ABC
  - ABC is a possible delimiter but AB, BC, CA are not
- Case 2: f(ABC)=10, f(AB)=20, f(BC)=10, f(CA)=10
  - BC and CA occur less frequently than AB
  - ABC cannot be a delimiter
  - AB is a possible delimiter

Limitations of Sequence Mining
- Does not work very well if token frequencies are distributed in a skewed manner
- An example where it does not work (Pfam dataset)
  - \n, #=GF, AC are tokens with
    - f(\n,#=GF) >> f(#=GF,AC)
    - f(\n,#=GF) >> f(\n,#=GF,AC)
  - \n #=GF is concluded to be a possible delimiter
  - In reality, \n #=GF AC is a delimiter

Can we do better?
- Biological datasets are written for humans to read
- It is very unlikely that delimiters will be scattered all around, in different places in a line
- The position of the possible delimiters might provide useful information
- A combination of positional and frequency information might be a better choice

Positional Weight
- Let P be the set of positions in a line where a token can appear
- For each position i ∈ P, tot_seq^j_i represents the total number of token sequences of length j starting at position i
- For each position i ∈ P, tot_unique_seq^j_i represents the total number of unique token sequences of length j starting at position i
- For any tuple (i,j), p_ratio(i,j) is defined from these quantities
  [Equation: definition of p_ratio(i,j), shown on the original slide and not recovered in this transcript]
- p_ratio(i,j) can be log-normalized to get the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)

Delimiter score (d_score)
- The frequency weight f_wt(s^j_i) of a token sequence s^j_i, with length j and starting at position i, is obtained by log-normalizing its frequency f(s^j_i)
- f_wt(s^j_i) ∈ (0,1)
- Positional and frequency weights can now be combined to get d_score:
    d_score(s^j_i) = α * p_wt(i,j) + (1-α) * f_wt(s^j_i), where α ∈ (0,1)
- Thus d_score has the following two properties:
  - d_score(s^j_i) ∈ (0,1)
  - d_score(s^j_i) > d_score(s^j_k) implies s^j_i is more likely to be a delimiter than s^j_k
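The combination step is a convex mix of the two weights, which is why the score stays inside (0,1) whenever both inputs do. A small Python sketch follows. The `d_score` function follows the formula on the slide, while `log_normalize` is only one possible normalization, assumed here because the slides do not spell out the exact form. The example weights are invented.

```python
import math


def log_normalize(count, max_count):
    """Assumed normalization: maps a count in [1, max_count] into (0, 1)."""
    return math.log(1 + count) / math.log(2 + max_count)


def d_score(p_wt, f_wt, alpha):
    """d_score = alpha * p_wt + (1 - alpha) * f_wt, with alpha in (0, 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * p_wt + (1 - alpha) * f_wt


# Invented weights for two candidate sequences of the same length.
score_a = d_score(p_wt=0.8, f_wt=0.4, alpha=0.5)
score_b = d_score(p_wt=0.2, f_wt=0.6, alpha=0.5)
more_likely = "a" if score_a > score_b else "b"
```

Raising `alpha` gives more influence to position and lowering it gives more influence to raw frequency, which is the knob varied in the results later in the talk.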

Finding delimiters using d_score
- Since the delimiter sequence length is not known in advance, an iterative algorithm is used to get a superset S of potential delimiters, where:
  - At any iteration i, c_i represents the cut-off value, which is determined by observing a substantial difference in the sorted d_score values
  - All token sequences above c_i are called N_i
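One way to pick a cut-off from sorted scores is to look for the largest gap between neighbouring values. The Python sketch below covers a single iteration only and takes "substantial difference" to mean "largest gap", which is an assumption and not necessarily the rule used in the talk. The function names and the example scores are hypothetical.

```python
def cutoff_by_largest_gap(scores):
    """Return a cut-off placed in the middle of the largest gap in sorted scores."""
    ordered = sorted(scores.values(), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two scores to find a gap")
    gaps = [ordered[k] - ordered[k + 1] for k in range(len(ordered) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return (ordered[k] + ordered[k + 1]) / 2


def select_above(scores, cutoff):
    """Token sequences whose score lies above the cut-off."""
    return {seq for seq, value in scores.items() if value > cutoff}


# Invented d_score values for four candidate token sequences.
scores = {
    ("#=GF", "AC"): 0.91,
    ("#=GF",): 0.88,
    ("PF00001",): 0.35,
    ("family",): 0.30,
}
cutoff = cutoff_by_largest_gap(scores)
candidates = select_above(scores, cutoff)
```

Repeating this for each sequence length and taking the union of the selected sets would give a superset of candidates for a domain expert to review.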

Generating layout descriptor
- Once the delimiters are identified, an NFA can be built by scanning the whole database, where the delimiters are the states of the NFA
- This NFA can be used to generate a layout descriptor, since it nicely represents optional and repeating states
- The following figure shows an NFA where A, B, C, D, and E are delimiters, with B being an optional delimiter and C D being repeating delimiters
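A rough Python sketch of the scanning step: each record is reduced to the order in which delimiters occur, and every consecutive pair becomes a transition. The two records are invented to mirror the example above (B optional, C D repeating). The names `build_transitions`, `START`, and `END` are hypothetical, and the sketch only collects the transition relation rather than emitting a layout descriptor.

```python
from collections import defaultdict

START, END = "<start>", "<end>"


def build_transitions(records):
    """Collect delimiter-to-delimiter transitions observed across all records."""
    transitions = defaultdict(set)
    for delimiters in records:
        path = [START, *delimiters, END]
        for src, dst in zip(path, path[1:]):
            transitions[src].add(dst)
    return dict(transitions)


# Invented records: the second one skips B and repeats the pair C D.
records = [list("ABCDE"), list("ACDCDE")]
nfa = build_transitions(records)

# A can be followed by B or directly by C, so B is optional.
b_is_optional = {"B", "C"} <= nfa["A"]
# D can lead back to C, so the pair C D can repeat.
cd_repeats = "C" in nfa["D"] and "D" in nfa["C"]
```

Branches that bypass a state correspond to `[optional]` parts of a descriptor and back edges correspond to `<repeating>` parts.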

Results
- By suitably varying α, a tight superset of possible delimiters is found
- A domain expert can then help to identify the true delimiters
- Results from 3 different flat-file datasets are as follows

Comparison with naïve approaches
- The d_score-based approach does a better job than the naïve approaches
- The following table shows the improvement

Summary
- Semi-automatic tool for learning the layout of a flat-file dataset
- Automatic tool for wrapper generation
  - Once the layout descriptor is known
- Can ease integration of new/updated sources
Ongoing work
- Integrating and transforming data from multiple sources
- Supporting querying conditions
- Learning optional delimiters using contrast analysis
- Learning layout of a directory structure
