Supporting High-Performance Data Processing on Flat-Files
  Xuan Zhang, Gagan Agrawal
  Ohio State University
Motivation
  Challenges of bioinformatics integration
    –Data volume: overwhelming
      DNA sequence: 100 gigabases (August 2005)
    –Data growth: exponential
  [Figure provided by PDB]
Existing Solutions
  –(Relational) databases
    Support for indexing and high-level queries
    Not suitable for biological data
  –Flat files with scripts
    Compact, Perl scripts available
    Lack indexing and high-level query processing
  –Web services
    Significant overhead
Our Approach
  Enhance information integration systems on
    –Functionality
      On-the-fly data incorporation
      Flat-file data processing
    –Usability
      Declarative interface
      Low programming requirement
    –Performance
      Incorporate indexing support
Approach Summary
  Metadata
    –Declarative description of data
    –Data mining algorithms for semi-automatic writing of descriptors
    –Reusable by different requests on the same data
  Code generation
    –Request analysis and execution separated
    –General modules with plug-in data module
System Overview
  [Diagram of the Information Integration System. Labels:
   Understand Data: Layout Miner, Schema Miner;
   Metadata Description: Layout Descriptors, Schema Descriptors;
   Process Data: Code Generation, Request Processor;
   Inputs and outputs: Data File, User Request, Answer]
Advantages
  Simple interface
    –Declarative, at the metadata level
  General data model
    –Semi-structured data
    –Flat-file data
  Low human involvement
    –Semi-automatic data incorporation
    –Low maintenance cost
  Acceptable performance
    –Linear scaling guaranteed
    –Can be improved by using indexing
System Components
  Understand data
    –Layout mining
    –Schema mining
  Process data
    –Wrapper generation
    –Query processing
    –Query processing with indices
Data Processing Overview
  Automatic code generation approach
  Input
    –Metadata about datasets involved
    –Optional:
      Implicit data transformation task
      Request by users
      Indexing functions
  Output
    –Executable programs
      General modules
      Task-specific data module
Metadata Description
  Two aspects of data in flat files
    –Logical view of the data
    –Physical data organization
  Two components of every data descriptor
    –Schema description
    –Layout description
  Design goals
    –Powerful
    –Easy to write and interpret
Schema Descriptors
  Follow XML DTD standard for semi-structured data
  Simple attribute list for relational data
    [FASTA]               // Schema name
    ID = string           // Data type definitions
    DESCRIPTION = string
    SEQ = string
Layout Descriptors
  Overall structure (FASTA example)
    DATASET "FASTAData" {          // Dataset name
      DATATYPE {FASTA}             // Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}             // File location
    }
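To make the schema/layout split concrete, here is a minimal Python sketch of a reader for data matching the FASTA schema (ID, DESCRIPTION, SEQ). It is not the generated code from the system. It assumes the common FASTA convention that a record starts with a ">" header line whose first token is the ID and whose remainder is the description, followed by sequence lines. The names `FastaRecord` and `read_fasta` are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastaRecord(NamedTuple):
    """Logical view of one entry, mirroring the [FASTA] schema fields."""
    ID: str
    DESCRIPTION: str
    SEQ: str


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    # Assumption: first whitespace-delimited token is the ID, the rest is the description.
    ident, _, desc = header.partition(" ")
    return FastaRecord(ident, desc.strip(), "".join(chunks))


def read_fasta(handle: TextIO) -> Iterator[FastaRecord]:
    """Yield one record per '>' header; sequence lines are concatenated."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Tiny in-memory example (made-up data).
sample = io.StringIO(">gene1 first test gene\nACGT\nTTGA\n>gene2\nGGCC\n")
records = list(read_fasta(sample))
for rec in records:
    print(rec.ID, rec.DESCRIPTION, rec.SEQ)
```

The physical detail (how lines are wrapped) stays inside the reader, while callers only see the three schema attributes.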
Wrapper Generation System Overview
  [Diagram of the wrapper generation system. Labels:
   Inputs: Schema Descriptors, Layout Descriptor, Mapping File;
   Analysis: Mapping Generator, Mapping Parser, Layout Parser, Application Analyzer;
   Intermediate data: Schema Mapping, Data Entry Representation, WRAPINFO;
   Wrapper: DataReader, DataWriter, Synchronizer;
   Data: Source Dataset, Target Dataset]
Query With Indices: Motivation
  Goal
    –Improve the performance of the query-proc program by using indices
    –Maintain the advantages
      Flat-file based
      Low programming requirement
Challenges & Approaches
  Various indexing algorithms for various biological data
    –User-defined indexing functions
    –Standard function interfaces
  Flat-file data
    –Values parsed implicitly and ready to be indexed
    –Byte offset as pointer
  Metadata about indices
    –Layout descriptor
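The "byte offset as pointer" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the system's index format: it indexes a FASTA-style flat file by record ID, stores the byte offset of each header line in a plain dictionary, and later seeks straight to a record. The function names `build_offset_index` and `fetch_record` and the demo file are hypothetical.

```python
import os
import tempfile
from typing import Dict


def build_offset_index(path: str) -> Dict[str, int]:
    """Map each record ID to the byte offset of its '>' header line."""
    index: Dict[str, int] = {}
    with open(path, "rb") as f:          # binary mode so offsets are exact bytes
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(path: str, offset: int) -> str:
    """Seek to a stored offset and read one record (header plus sequence lines)."""
    with open(path, "rb") as f:
        f.seek(offset)
        lines = [f.readline()]
        for line in f:
            if line.startswith(b">"):    # next record begins
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Demo on a small made-up file.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.fasta")
with open(path, "wb") as f:
    f.write(b">gene1 first test gene\nACGT\nTTGA\n>gene2\nGGCC\n")

index = build_offset_index(path)
print(fetch_record(path, index["gene2"]))
```

Because the index stores only offsets, the data stays in the original flat file and no copy or reload into a database is needed.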
System Revisited
  [Diagram of the query system with index support. Labels:
   Query analysis: query, Query parser, source/target names, Metadata collection, Dataset descriptors, Descriptor parser, schema & layout information, mappings, Application analyzer, QUERYINFOR;
   Query execution: DataReader, DataWriter, Synchronizer, source data files, target data file, index file, index functions]
Language Enhancement
  Describe indices
    –Indexing is a property of the dataset
    –Extend layout descriptors
    –Maintain query format
  DATASET "name" {
    …
    INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
           [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
  }
  AUTOWRAP GNAMES
  FROM CHIPDATA, YEASTGENOME
  BY CHIPDATA.GENE = YEASTGENOME.ID
  WHERE …
  New meaning of "=":
    If an index is available, use the index retrieving function
    Else, compare values directly
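The two-way meaning of "=" can be illustrated with a small Python sketch. It is a simplification under assumptions: records are dictionaries, the index retrieving function is any callable that maps a key to matching records, and the fallback is a linear scan with direct value comparison. The names `join_equal` and `retrieve`, and the sample records, are hypothetical and not part of the system.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Record = Dict[str, str]


def join_equal(
    left: Sequence[Record],
    left_attr: str,
    right: Sequence[Record],
    right_attr: str,
    retrieve: Optional[Callable[[str], List[Record]]] = None,
) -> List[Tuple[Record, Record]]:
    """Join on left_attr = right_attr, using an index function when one is given."""
    results: List[Tuple[Record, Record]] = []
    for l in left:
        key = l[left_attr]
        if retrieve is not None:
            matches = retrieve(key)                               # index available
        else:
            matches = [r for r in right if r[right_attr] == key]  # compare directly
        results.extend((l, r) for r in matches)
    return results


# Made-up stand-ins for CHIPDATA and YEASTGENOME.
chipdata = [{"GENE": "G1"}, {"GENE": "G3"}]
yeastgenome = [{"ID": "G1", "GNAME": "alpha"}, {"ID": "G2", "GNAME": "beta"}]

# A dictionary plays the role of the index file in this sketch.
lookup: Dict[str, List[Record]] = {}
for rec in yeastgenome:
    lookup.setdefault(rec["ID"], []).append(rec)

with_index = join_equal(chipdata, "GENE", yeastgenome, "ID",
                        retrieve=lambda k: lookup.get(k, []))
without_index = join_equal(chipdata, "GENE", yeastgenome, "ID")
print(with_index == without_index)
```

Both paths return the same pairs; only the cost differs, which is why the query text itself does not need to change when an index is added.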
System Enhancement
  Metadata Descriptor Parser
    + parse index information
  Application Analyzer
    + index information: index look-up table
    + test condition: compare_field_indexing
Microarray Gene Information Look-up
  Goal: gather information about genes (120)
  Query: microarray output join genome database
  Index: gene names in genome
BLAST-ENHANCE Query
  Goal: add extra information to BLAST output
  Query: BLAST output join Swiss-Prot database
  Index: protein ID in Swiss-Prot
OMIM-PLUS Query
  Goal: add Swiss-Prot link to OMIM
  Query: OMIM join Swiss-Prot
  Index: protein ID in Swiss-Prot
Homology Search Query
  Goal: find similar sequences
  Query: query sequence list * sequence database
  Indexing algorithm
    –Sequence-based
    –Transformation of sub-string composition
    –Indexing n-D numerical values
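As a rough illustration of turning sub-string composition into n-D numerical values, the Python sketch below counts the bases A, C, G, T in each sliding window of a sequence, giving one 4-D vector per window. The window size, the DNA alphabet, and the names `composition_vector` and `window_vectors` are assumptions chosen for the example; the indexing algorithms referenced in the talk may use a different transformation.

```python
from collections import Counter
from typing import List, Tuple

ALPHABET = "ACGT"  # assumption: DNA sequences over four bases


def composition_vector(window: str) -> Tuple[int, ...]:
    """Count how often each base occurs in the window."""
    counts = Counter(window)
    return tuple(counts.get(base, 0) for base in ALPHABET)


def window_vectors(seq: str, w: int) -> List[Tuple[int, ...]]:
    """One composition vector per sliding window of length w."""
    return [composition_vector(seq[i:i + w]) for i in range(len(seq) - w + 1)]


vectors = window_vectors("AACGT", 4)
print(vectors)
```

Once every window is a point in a fixed-dimensional space, a multidimensional index can be built over those points instead of over the raw strings.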
Homology Search (1)
  Index (Singh's algorithm)
    –Data: yeast genome
    –Wavelet coefficients
    –Minimum bounding rectangles
Homology Search (2)
  Index (Ferhatosmanoglu's algorithm)
    –Data: GenBank
    –Wavelet coefficients
    –Scalar quantization
    –R-tree
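The building blocks named on these two slides can be sketched generically in Python. This is not an implementation of either cited algorithm: it shows one level of an unnormalized Haar wavelet step, a minimum bounding rectangle over a set of points, and uniform scalar quantization of a single value. All function names and parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def haar_step(values: Sequence[float]) -> List[float]:
    """One level of an unnormalized Haar transform: pairwise averages, then differences."""
    if len(values) % 2 != 0:
        raise ValueError("length must be even")
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details


def bounding_rectangle(points: Sequence[Sequence[float]]) -> Tuple[tuple, tuple]:
    """Smallest axis-aligned box (low corner, high corner) containing all points."""
    lows = tuple(min(coord) for coord in zip(*points))
    highs = tuple(max(coord) for coord in zip(*points))
    return lows, highs


def quantize(value: float, low: float, high: float, levels: int) -> int:
    """Uniform scalar quantization of value in [low, high] into `levels` buckets."""
    if high == low:
        return 0
    step = (high - low) / levels
    return min(int((value - low) / step), levels - 1)


coeffs = haar_step([2, 1, 1, 0])
box = bounding_rectangle([(2, 1, 1, 0), (1, 1, 1, 1)])
print(coeffs, box, quantize(0.5, 0.0, 2.0, 4))
```

A bounding rectangle summarizes a group of nearby vectors, and quantization shrinks each coefficient to a small integer code. Both reduce index size at the cost of an extra filtering step at query time.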
Conclusions
  A framework and a set of tools for on-the-fly flat-file data integration
    –New data sources understood semi-automatically by data mining tools
    –New data processed automatically by generated programs
    –Support for indexing incorporated flexibly