Supporting High-Performance Data Processing on Flat-Files

Xuan Zhang, Gagan Agrawal
Ohio State University

Motivation
- Challenges of bioinformatics integration
  - Data volume: overwhelming
    - DNA sequence: 100 gigabases (August, 2005)
  - Data growth: exponential
    - Figure provided by PDB

Existing Solutions
- (Relational) Databases
  - Support for indexing and high-level queries
  - Not suitable for biological data
- Flat Files with Scripts
  - Compact, Perl scripts available
  - Lack indexing and high-level query processing
- Web-services
  - Significant overhead

Our Approach
Enhance information integration systems on:
- Functionality
  - On-the-fly data incorporation
  - Flat file data processing
- Usability
  - Declarative interface
  - Low programming requirement
- Performance
  - Incorporate indexing support

Approach Summary
- Metadata
  - Declarative description of data
  - Data mining algorithms for semi-automatic writing
  - Reusable by different requests on same data
- Code generation
  - Request analysis and execution separated
  - General modules with plug-in data module

System Overview
[Figure: Information Integration System, shown in two phases, Understand Data and Process Data. Labels: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor, Schema Descriptor), User Request, Code Generation, Request Processor, Answer.]

Advantages
- Simple interface
  - At metadata level, declarative
- General data model
  - Semi-structured data
  - Flat file data
- Low human involvement
  - Semi-automatic data incorporation
  - Low maintenance cost
- OK performance
  - Linear scale guaranteed
  - Can improve by using indexing

System Components
- Understand data
  - Layout mining
  - Schema mining
- Process data
  - Wrapper generation
  - Query processing
  - Query processing with indices

Data Process Overview
- Automatic code generation approach
- Input
  - Metadata about datasets involved
  - Optional: implicit data transformation task
  - Request by users
  - Indexing functions
- Output
  - Executable programs
    - General modules
    - Task-specific data module

Metadata Description
- Two aspects of data in flat files
  - Logical view of the data
  - Physical data organization
- Two components of every data descriptor
  - Schema description
  - Layout description
- Design goals
  - Powerful
  - Easy to write and interpret

Schema Descriptors
- Follow XML DTD standard for semi-structured data

    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT SEQ (#PCDATA)>

- Simple attribute list for relational data

    [FASTA]                 //Schema Name
    ID = string             //Data type definitions
    DESCRIPTION = string
    SEQ = string
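To make the FASTA schema concrete, here is a minimal hand-written Python sketch of a reader that produces records with the three schema fields (ID, DESCRIPTION, SEQ). It is not the code the system generates. The names `FastaRecord` and `parse_fasta` are hypothetical, and the rule that the ID is the first whitespace-delimited token of the header line is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class FastaRecord:
    """One record shaped like the FASTA schema: ID, DESCRIPTION, SEQ."""
    id: str
    description: str
    seq: str


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    # Assumption: the ID is the first whitespace-delimited token of the
    # header; whatever follows on that line is the description.
    parts = header.split(None, 1)
    rec_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaRecord(rec_id, description, "".join(chunks))


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per '>' header line and its sequence lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence text may span several lines
    if header is not None:
        yield _make_record(header, chunks)


# Made-up sample data, for illustration only.
sample = [">seq1 example protein", "ACGT", "TTGA", ">seq2", "GGCC"]
records = list(parse_fasta(sample))
```

A schema descriptor lets the system derive this kind of field mapping from metadata, so that it does not have to be hard-coded for every format.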

Layout Descriptors
- Overall structure (FASTA example)

    DATASET "FASTAData" {        //Dataset name
      DATATYPE {FASTA}           //Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}           //File location
    }

Wrapper Generation System Overview
[Figure: wrapper generation system. Labels: Layout Descriptor, Schema Descriptors, Layout Parser, Mapping Generator, Mapping File, Mapping Parser, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, and the generated wrapper (DataReader, DataWriter, Synchronizer).]

Query With Indices
- Motivation
- Goal
  - Improve the performance of query-proc program
    - Index
  - Maintain the advantages
    - Flat file based
    - Low requirement on programming

Challenges & Approaches
- Various indexing algorithms for various biological data
  - User-defined indexing functions
  - Standard function interfaces
- Flat file data
  - Values parsed implicitly and ready to be indexed
  - Byte offset as pointer
- Metadata about indices
  - Layout descriptor
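The "byte offset as pointer" idea can be sketched in a few lines of Python. This is a minimal illustration, not the system's index format. The function names `build_offset_index` and `fetch_sequence` are hypothetical, and the FASTA-like record layout (a `>` header line followed by sequence lines) is an assumption.

```python
import io
from typing import BinaryIO, Dict


def build_offset_index(f: BinaryIO) -> Dict[str, int]:
    """Map each record ID to the byte offset of its '>' header line."""
    index: Dict[str, int] = {}
    f.seek(0)
    while True:
        offset = f.tell()      # position of the line about to be read
        line = f.readline()
        if not line:
            break              # end of file
        if line.startswith(b">"):
            tokens = line[1:].split()
            if tokens:
                index[tokens[0].decode("ascii")] = offset
    return index


def fetch_sequence(f: BinaryIO, offset: int) -> str:
    """Seek straight to a stored offset and read that one record."""
    f.seek(offset)
    f.readline()               # skip the header line
    parts = []
    while True:
        line = f.readline()
        if not line or line.startswith(b">"):
            break              # next record or end of file
        parts.append(line.strip().decode("ascii"))
    return "".join(parts)


# Made-up in-memory "flat file", for illustration only.
data = io.BytesIO(b">g1 first\nACGT\nAC\n>g2 second\nTTTT\n")
index = build_offset_index(data)
seq_g2 = fetch_sequence(data, index["g2"])
```

Because the index stores only keys and offsets, the data stays in the original flat file and a look-up costs one seek instead of a full scan.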

System Revisited
[Figure: query processing system with index support, covering metadata collection, query analysis, and query execution. Labels: query, source/target names, Query parser, dataset descriptors, Descriptor parser, schema & layout information, mappings, Application analyzer, QUERYINFOR, source data files, target data file, DataReader, DataWriter, Synchronizer, index file, index functions.]

Language Enhancement
- Describe indices
  - Indexing is a property of dataset
  - Extend layout descriptors

    DATASET "name" {
      …
      INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
             [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
    }

- Maintain query format
  - New meaning of "=":
    - If index available, use index retrieving function
    - Else, compare values directly

    AUTOWRAP GNAMES
    FROM CHIPDATA, YEASTGENOME
    BY CHIPDATA.GENE = YEASTGENOME.ID
    WHERE …
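The new meaning of "=" can be sketched as a join that uses an index retrieval function when one is supplied and otherwise compares values directly. This Python sketch is only an illustration under assumptions: `equi_join`, `IndexLookup`, and the sample rows are hypothetical and are not the system's generated code or its function interface.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, str]
# Hypothetical retrieval-function shape: key -> matching rows of the
# indexed dataset.
IndexLookup = Callable[[str], List[Row]]


def equi_join(
    left: List[Row],
    right: List[Row],
    left_attr: str,
    right_attr: str,
    index_lookup: Optional[IndexLookup] = None,
) -> Iterator[Tuple[Row, Row]]:
    """Yield (left_row, right_row) pairs where left_attr = right_attr."""
    for l_row in left:
        key = l_row[left_attr]
        if index_lookup is not None:
            # Index available: use the index retrieving function.
            matches = index_lookup(key)
        else:
            # No index: compare values directly with a full scan.
            matches = [r for r in right if r[right_attr] == key]
        for r_row in matches:
            yield l_row, r_row


# Made-up rows standing in for CHIPDATA and YEASTGENOME.
chipdata = [{"GENE": "geneA"}, {"GENE": "geneX"}]
yeastgenome = [{"ID": "geneA", "DESC": "first"}, {"ID": "geneB", "DESC": "second"}]

# A dictionary plays the role of the index file in this sketch.
by_id: Dict[str, List[Row]] = {}
for row in yeastgenome:
    by_id.setdefault(row["ID"], []).append(row)

with_index = list(
    equi_join(chipdata, yeastgenome, "GENE", "ID", lambda k: by_id.get(k, []))
)
without_index = list(equi_join(chipdata, yeastgenome, "GENE", "ID"))
```

Both paths return the same pairs, which is why the query text can stay unchanged whether or not an index is declared in the layout descriptor.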

System Enhancement
- Metadata Descriptor Parser
  + parse index information
- Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing

Microarray Gene Information Look-up
- Goal: gather information about genes (120)
- Query: microarray output join genome database
- Index: gene names in genome

BLAST-ENHANCE Query
- Goal: add extra information to BLAST output
- Query: BLAST output join Swiss-Prot database
- Index: protein ID in Swiss-Prot

OMIM-PLUS Query
- Goal: add Swiss-Prot link to OMIM
- Query: OMIM join Swiss-Prot
- Index: protein ID in Swiss-Prot

Homology Search Query
- Goal: find similar sequences
- Query: query sequence list * sequence database
- Indexing algorithm
  - Sequence-based
  - Transformation of sub-string composition
  - Indexing n-D numerical values
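As a rough illustration of turning sub-string composition into n-D numerical values, the Python sketch below maps each sliding window of a DNA sequence to a 4-D vector of base counts and summarizes a group of vectors by one bounding box. It is a simplified stand-in, not either of the published indexing algorithms named in this talk. The window width, the function names, and the sample sequence are all assumptions.

```python
from typing import List, Tuple

ALPHABET = "ACGT"          # assumed DNA alphabet; gives 4-D vectors
Vector = Tuple[int, ...]


def composition(window: str) -> Vector:
    """Count how often each base occurs in one window."""
    return tuple(window.count(ch) for ch in ALPHABET)


def window_vectors(seq: str, width: int) -> List[Vector]:
    """Composition vector of every sliding window of the given width."""
    return [composition(seq[i:i + width]) for i in range(len(seq) - width + 1)]


def bounding_rectangle(vectors: List[Vector]) -> Tuple[Vector, Vector]:
    """Per-dimension minimum and maximum over a group of vectors."""
    lows = tuple(min(col) for col in zip(*vectors))
    highs = tuple(max(col) for col in zip(*vectors))
    return lows, highs


# Made-up sequence, for illustration only.
vectors = window_vectors("AACGTT", 4)
box = bounding_rectangle(vectors)
```

Once sequences are reduced to numeric vectors like these, a spatial index over the vectors or their bounding boxes can prune candidates before any expensive sequence comparison.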

Homology Search (1)
- Index (Singh's algorithm)
  - Wavelet coefficients
  - Minimum bounding rectangles
- Data: yeast genome

Homology Search (2)
- Index (Ferhatosmanoglu's algorithm)
  - Wavelet coefficients
  - Scalar quantization
  - R-tree
- Data: GenBank

Conclusions
- A framework and a set of tools for on-the-fly flat file data integration
- New data source understood semi-automatically by data mining tools
- New data processed automatically by generated programs
- Support for indexing incorporated flexibly
