New (Applications of) Compiler Techniques for Data Grids
Gagan Agrawal
Outline
Automatic Data Virtualization
  SQL Implementation
  XML/XQuery
Automatic Wrapper Generation
  Data Integration in Bioinformatics
Compiling XML Query Language XQuery
  Issues with streaming data
Data Virtualization
[Diagram labels: An abstract view of data; Data Virtualization; Data Service; dataset]
Scientific data being shared on Web/Grids
Low-level layouts
Need for efficient storage and processing
Our Approach: Automatic Data Virtualization
Automatically create data services
  A new application of compiler technology
A meta-data descriptor describes the layout of data in a repository
An abstract view is exposed to the users
Two implementations:
  Relational/SQL-based (HPDC 2004, LCPC 2004)
  XML/XQuery-based (ICS 2003, LCPC 2003)
SQL/Relational Implementation
SELECT <Data Elements>
FROM <Dataset Name>
WHERE … AND Filter(<Data Element>);
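The query form above can be illustrated with a minimal Python sketch. It is not the generated data service described in the talk; it only shows the idea of exposing a low-level layout as a virtual table and evaluating a WHERE predicate together with a user-defined Filter. The record layout (three little-endian doubles), the field names, and the functions scan, select and make_demo_data are all assumptions made for illustration.

```python
import struct
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical low-level layout: each record is three little-endian doubles.
RECORD = struct.Struct("<ddd")
FIELDS = ("x", "y", "value")  # hypothetical data element names

Row = Dict[str, float]


def scan(raw: bytes) -> Iterator[Row]:
    """Expose the flat byte layout as a stream of named rows (the virtual table)."""
    for offset in range(0, len(raw) - RECORD.size + 1, RECORD.size):
        yield dict(zip(FIELDS, RECORD.unpack_from(raw, offset)))


def select(
    raw: bytes,
    columns: Sequence[str],
    where: Callable[[Row], bool],
    user_filter: Callable[[Row], bool],
) -> List[Tuple[float, ...]]:
    """SELECT columns FROM raw WHERE where(row) AND Filter(row)."""
    return [
        tuple(row[c] for c in columns)
        for row in scan(raw)
        if where(row) and user_filter(row)
    ]


def make_demo_data() -> bytes:
    """Build a small in-memory dataset in the assumed binary layout."""
    rows = [(0.0, 0.0, 1.5), (1.0, 2.0, 7.0), (3.0, 4.0, 9.5)]
    return b"".join(RECORD.pack(*r) for r in rows)


if __name__ == "__main__":
    result = select(
        make_demo_data(),
        ["x", "value"],
        where=lambda r: r["value"] > 2.0,   # ordinary WHERE predicate
        user_filter=lambda r: r["x"] < 2.0,  # user-defined Filter(...)
    )
    print(result)
```

The user never sees the byte offsets; only the column names and predicates appear in the query, which is the point of the abstract view.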
XML/XQuery Implementation
[Diagram labels: XQuery; XML; ???; HDF5; NetCDF; TEXT; RMDB; …]
Approach / Contributions
Use of XML Schemas to provide high-level abstractions on complex datasets
Using XQuery with these Schemas to specify processing
Issues in Translation
  High-level to low-level code
  Data-centric transformations for locality in low-level codes
Issues specific to XQuery
  Recognizing recursive reductions
  Type inferencing and translation
Wrappers
Goal: to provide the integration system transparent access to data sources
Challenges
  Development cost
  Performance
    Scripting languages can be slow
  Updates
    Data formats can change frequently
Our Approach
Machine-interpretable metadata
  A layout descriptor associated with each dataset
Wrappers generated on the fly
Applied to several bioinformatics examples
Layout Descriptor
DATASET “FASTAData” {
  DATATYPE {FASTA}
  DATASPACE LINESIZE=80 {
    LOOP ENTRY 1:EOF:1 {
      “>” ID “ “ DESCRIPTION
      < “\n” SEQ > “\n” | EOF
    }
  }
  DATA {osu/fasta}
}
Annotations: dataset name (“FASTAData”), schema name (FASTA), file layout (the DATASPACE block), file location (DATA {osu/fasta})
Example file (each header line holds ID and DESCRIPTION; each following line is a SEQ):
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDA
DYDGFKTNCSNVSVVHCTNL
MNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKH
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDT…
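To make the descriptor concrete, here is a hand-written Python sketch of the kind of reader a wrapper for this layout would need to contain. It is an assumption-laden illustration, not output of the wrapper generator described in the talk: the function name parse_fasta and the dictionary keys are hypothetical, chosen to mirror the ID, DESCRIPTION and SEQ elements of the descriptor.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union

Entry = Dict[str, Union[str, List[str]]]


def parse_fasta(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one entry per '>' header, following the LOOP ENTRY layout.

    Each entry is: ">" ID " " DESCRIPTION, then one or more SEQ lines,
    terminated by the next header or end of file.
    """
    entry: Optional[Entry] = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if entry is not None:
                yield entry  # previous entry ends where the next header starts
            ident, _, description = line[1:].partition(" ")
            entry = {"ID": ident, "DESCRIPTION": description, "SEQ": []}
        elif line and entry is not None:
            entry["SEQ"].append(line)  # one SEQ element per line
    if entry is not None:
        yield entry  # last entry is terminated by EOF


if __name__ == "__main__":
    sample = [
        ">Example1 envelope protein",
        "ELRLRYCAPAGFALLKCNDA",
        "DYDGFKTNCSNVSVVHCTNL",
    ]
    for e in parse_fasta(sample):
        print(e["ID"], "|", e["DESCRIPTION"], "|", "".join(e["SEQ"]))
```

Writing such a reader by hand for every format is the development cost the approach aims to avoid; with a layout descriptor, only the descriptor changes when the format does.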
XQuery on Streaming Data
Infinite data streams
All processing must be single pass
Interesting compiler questions:
  How do I transform code to execute in a single pass?
  How can I tell that it can be executed correctly in a single pass?
Addressed this problem for XML streams and the XML query language XQuery
Appears in VLDB 2005
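The single-pass constraint can be illustrated with a small Python sketch. It does not show the compiler analysis or transformation from the talk; it only shows what a single-pass form of a simple aggregate looks like: an average that would naively need the whole document is computed with a running sum and count while the XML is read once. The element name price and the function average_price are assumptions for illustration; xml.etree.ElementTree.iterparse is the standard-library incremental parser.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO


def average_price(stream: BinaryIO) -> float:
    """Compute the mean of all <price> values in one pass over the stream."""
    total = 0.0
    count = 0
    # iterparse delivers each element as soon as its end tag has been read,
    # so the document never has to be fully loaded before processing starts.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "price":
            total += float(elem.text)
            count += 1
        elem.clear()  # release the element's contents once it has been handled
    return total / count if count else 0.0


if __name__ == "__main__":
    doc = (
        b"<items>"
        b"<item><price>2.0</price></item>"
        b"<item><price>4.0</price></item>"
        b"</items>"
    )
    print(average_price(io.BytesIO(doc)))
```

The only state carried across elements is the pair (total, count), which is what makes the computation legal on a stream; a query that needed to revisit earlier elements could not be rewritten this way without buffering.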