New (Applications of) Compiler Techniques for Data Grids


1 New (Applications of) Compiler Techniques for Data Grids
Gagan Agrawal

2 Outline
- Automatic Data Virtualization
  - SQL Implementation
  - XML/XQuery
- Automatic Wrapper Generation
  - Data Integration in Bioinformatics
- Compiling XML Query Language XQuery
  - Issues with streaming data

3 Data Virtualization
- An abstract view of data
- Scientific data being shared on Web/Grids
- Low-level layouts
- Need for efficient storage and processing
[Figure labels: dataset, Data, Data Service]

4 Our Approach: Automatic Data Virtualization
- Automatically create data services
- A new application of compiler technology
- A meta-data descriptor describes the layout of data in a repository
- An abstract view is exposed to the users
- Two implementations:
  - Relational/SQL-based (HPDC 2004, LCPC 2004)
  - XML/XQuery-based (ICS 2003, LCPC 2003)

5 SQL/Relational Implementation
    SELECT <Data Elements>
    FROM <Dataset Name>
    WHERE … AND Filter(<Data Element>);
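To make the query form above concrete, here is a minimal sketch of how such a query might be evaluated directly against a low-level layout. It is not the generated data service from the talk: the record layout, the field names (`x`, `y`, `value`), and the `scan` helper are all assumptions made for illustration.

```python
import io
import struct

# Hypothetical low-level layout: fixed-size binary records of (x, y, value),
# stored little-endian as two 32-bit ints and one 64-bit float.
RECORD = struct.Struct("<iid")
FIELDS = ("x", "y", "value")


def scan(raw, columns, where, user_filter):
    """Yield the selected columns of each record that passes both predicates."""
    while True:
        chunk = raw.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of data (or a trailing partial record)
        row = dict(zip(FIELDS, RECORD.unpack(chunk)))
        if where(row) and user_filter(row):
            yield tuple(row[c] for c in columns)


# A small in-memory buffer standing in for a file in the repository.
raw = io.BytesIO(b"".join(RECORD.pack(x, y, x * 1.5 + y)
                          for x in range(4) for y in range(3)))

# Roughly: SELECT x, value FROM dataset WHERE y = 2 AND Filter(value);
rows = list(scan(raw, ("x", "value"),
                 where=lambda r: r["y"] == 2,
                 user_filter=lambda r: r["value"] > 3.0))
```

The user sees only column names and predicates; the byte-level record format stays inside `scan`, which is the separation the abstract view is meant to provide.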

6 XML/XQuery Implementation
[Figure: XQuery over an XML view, with "???" marking the mapping to the underlying formats: HDF5, NetCDF, TEXT, RMDB]

7 Approach / Contributions
- Use of XML Schemas to provide high-level abstractions on complex datasets
- Using XQuery with these Schemas to specify processing
- Issues in translation
  - High-level to low-level code
  - Data-centric transformations for locality in low-level codes
- Issues specific to XQuery
  - Recognizing recursive reductions
  - Type inferencing and translation
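As an illustration of the "recursive reductions" item, the sketch below shows the general idea in Python rather than XQuery: a functional language without mutable variables tends to express an accumulation as a recursive function, and a compiler that recognizes the pattern can emit an equivalent loop. Both function names are hypothetical, and this is not the recognition algorithm from the talk.

```python
def accumulate_recursive(items, acc=0.0):
    """Reduction written in functional style: one recursive call per element."""
    if not items:
        return acc
    return accumulate_recursive(items[1:], acc + items[0])


def accumulate_iterative(items):
    """The loop form a compiler could emit once the reduction is recognized."""
    acc = 0.0
    for item in items:
        acc += item
    return acc


values = [1.5, 2.0, 4.25]
recursive_result = accumulate_recursive(values)
iterative_result = accumulate_iterative(values)
```

Both versions add the elements in the same left-to-right order, so they produce the same result; the loop avoids the per-element call and list copy.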

8 Wrappers
- Goal: to provide the integration system transparent access to data sources
- Challenges
  - Development cost
  - Performance: scripting languages can be slow
  - Updates: data formats can change frequently

9 Our Approach
- Machine-interpretable metadata
- A layout descriptor associated with each dataset
- Wrappers generated on the fly
- Applied to several bioinformatics examples

10 Layout Descriptor

    DATASET "FASTAData" {                  (dataset name)
      DATATYPE {FASTA}                     (schema name)
      DATASPACE LINESIZE=80 {              (file layout)
        LOOP ENTRY 1:EOF:1 {
          ">" ID " " DESCRIPTION
          < "\n" SEQ > "\n" | EOF
        }
      }
      DATA {osu/fasta}                     (file location)
    }

Example data, with the ID and DESCRIPTION on each header line and SEQ on the lines that follow:

    >Example1 envelope protein
    ELRLRYCAPAGFALLKCNDA
    DYDGFKTNCSNVSVVHCTNL
    MNTTVTTGLLLNGSYSENRT
    QIWQKHRTSNDSALILLNKH
    >Example2 synthetic peptide
    HITREPLKHIPKERYRGTNDT…
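To show what a wrapper for this layout has to do, here is a minimal hand-written sketch in Python. It is not the automatically generated wrapper described in the talk: the `Entry` class and `read_entries` function are hypothetical names, and the reading of the descriptor (a ">" header with ID and DESCRIPTION, followed by SEQ lines until the next header or end of file) is an assumption based on the slide.

```python
import io
from dataclasses import dataclass


@dataclass
class Entry:
    id: str
    description: str
    seq: str


def read_entries(stream):
    """Yield one Entry per ">" header line, joining the SEQ lines after it."""
    current = None
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if current is not None:
                yield current
            # ">" ID " " DESCRIPTION: split on the first space only.
            ident, _, desc = line[1:].partition(" ")
            current = Entry(ident, desc, "")
        elif current is not None and line:
            current.seq += line
    if current is not None:
        yield current


# Sample text based on the slide; only the visible part of the truncated
# second sequence is used.
sample = io.StringIO(
    ">Example1 envelope protein\n"
    "ELRLRYCAPAGFALLKCNDA\n"
    "DYDGFKTNCSNVSVVHCTNL\n"
    ">Example2 synthetic peptide\n"
    "HITREPLKHIPKERYRGTNDT\n"
)
entries = list(read_entries(sample))
```

Writing and maintaining code like this by hand for every format is the development and update cost that generating wrappers from a layout descriptor is meant to avoid.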

11 XQuery on Streaming Data
- Infinite data streams
- All processing must be single pass
- Interesting compiler questions:
  - How can code be transformed to execute in a single pass?
  - How can we tell that it executes correctly in a single pass?
- Addressed this problem for XML streams and the XML query language XQuery
- Appears in VLDB 2005
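The single-pass constraint can be illustrated with a small Python sketch. A query that conceptually needs several aggregates would naively traverse the data once per aggregate; on a stream that is impossible, so the traversals are fused into one loop with constant state. The `readings` stream and `single_pass_stats` function are hypothetical, and this is not the analysis from the VLDB 2005 paper.

```python
def readings():
    """Hypothetical stream; a generator can be consumed only once."""
    for value in (3.0, 9.0, 6.0, 2.0):
        yield value


def single_pass_stats(stream):
    """Fuse count, sum, and max into one traversal with constant state."""
    count, total, largest = 0, 0.0, None
    for value in stream:
        count += 1
        total += value
        if largest is None or value > largest:
            largest = value
    mean = total / count if count else None
    return count, mean, largest


stats = single_pass_stats(readings())
```

Computing `sum(stream)` and then `max(stream)` on the same generator would fail, because the second call sees an exhausted stream; the fused loop is the single-pass form of the same query.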

