Supporting High-Performance Data Processing on Flat-Files

Presentation transcript:

Supporting High-Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University

Motivation
  Challenges of bioinformatics integration
    Data volume: overwhelming
      DNA sequence: 100 gigabases (August 2005)
    Data growth: exponential
      (Figure provided by PDB)

Existing Solutions
  (Relational) databases
    Support for indexing and high-level queries
    Not suitable for biological data
  Flat files with scripts
    Compact, Perl scripts available
    Lack indexing and high-level query processing
  Web services
    Significant overhead

Our Approach
  Enhance information integration systems in three areas
    Functionality
      On-the-fly data incorporation
      Flat-file data processing
    Usability
      Declarative interface
      Low programming requirement
    Performance
      Incorporate indexing support

Approach Summary
  Metadata
    Declarative description of data
    Data mining algorithms for semi-automatic writing
    Reusable by different requests on same data
  Code generation
    Request analysis and execution separated
    General modules with plug-in data module

System Overview
  [Figure: diagram of the Information Integration System in two phases, Understand Data and Process Data. Labels: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor, Schema Descriptor), User Request, Code Generation, Request Processor, Answer.]

Advantages
  Simple interface
    At metadata level, declarative
  General data model
    Semi-structured data
    Flat file data
  Low human involvement
    Semi-automatic data incorporation
    Low maintenance cost
  OK performance
    Linear scale guaranteed
    Can improve by using indexing

System Components
  Understand data
    Layout mining
    Schema mining
  Process data
    Wrapper generation
    Query processing
    Query processing with indices

Data Process Overview
  Automatic code generation approach
  Input
    Metadata about datasets involved
    Optional: implicit data transformation task
    Request by users
    Indexing functions
  Output
    Executable programs
      General modules
      Task-specific data module

Metadata Description
  Two aspects of data in flat files
    Logical view of the data
    Physical data organization
  Two components of every data descriptor
    Schema description
    Layout description
  Design goals
    Powerful
    Easy to write and interpret

Schema Descriptors
  Follow XML DTD standard for semi-structured data
    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT SEQ (#PCDATA)>
  Simple attribute list for relational data
    [FASTA]                //Schema Name
    ID = string            //Data type definitions
    DESCRIPTION = string
    SEQ = string
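
The attribute-list form of a schema descriptor is simple enough to read with a few lines of code. The following is a minimal sketch, not the authors' parser: the function name `parse_schema` and the treatment of `//` as a comment marker are assumptions made for illustration.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor.

    Returns (schema_name, [(attribute, type), ...]) with attributes in file order.
    Assumption: '//' starts a comment that runs to the end of the line.
    """
    name = None
    attrs = []
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()          # e.g. [FASTA]
        elif "=" in line:
            attr, typ = (part.strip() for part in line.split("=", 1))
            attrs.append((attr, typ))
    return name, attrs


DESCRIPTOR = """
[FASTA]                //Schema Name
ID = string            //Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_attrs = parse_schema(DESCRIPTOR)
```

Keeping the attributes as an ordered list rather than a dictionary preserves the field order, which a layout descriptor would presumably rely on.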

Layout Descriptors
  Overall structure (FASTA example)
    DATASET "FASTAData" {          //Dataset name
      DATATYPE {FASTA}             //Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}             //File location
    }

Wrapper Generation System Overview
  [Figure: diagram of the wrapper generation system. Labels: Layout Descriptor, Schema Descriptors, Layout Parser, Mapping Generator, Mapping File, Mapping Parser, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer, wrapper.]

Query With Indices: Motivation
  Goal
    Improve the performance of the query-processing program
      Index
    Maintain the advantages
      Flat file based
      Low requirement on programming

Challenges & Approaches
  Various indexing algorithms for various biological data
    User-defined indexing functions
    Standard function interfaces
  Flat file data
    Values parsed implicitly and ready to be indexed
    Byte offset as pointer
  Metadata about indices
    Layout descriptor
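
The "byte offset as pointer" idea can be illustrated on a small FASTA-like file. This is a sketch under stated assumptions: the helper names `build_offset_index` and `fetch_record` and the sample data are hypothetical, and an in-memory dictionary stands in for whatever index structure a user-defined indexing function would actually build.

```python
import io


def build_offset_index(f):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = f.tell()          # position before reading the line
        line = f.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(f, offset):
    """Seek to a stored offset and read one record as (id, description, sequence)."""
    f.seek(offset)
    header = f.readline()[1:].decode("ascii").rstrip("\r\n")
    rec_id, _, description = header.partition(" ")
    seq_parts = []
    for line in iter(f.readline, b""):
        if line.startswith(b">"):   # next record begins
            break
        seq_parts.append(line.strip().decode("ascii"))
    return rec_id, description, "".join(seq_parts)


# Made-up sample data; a real flat file would be opened in binary mode instead.
SAMPLE = b">g1 first gene\nACGT\nTTGA\n>g2 second gene\nGGCC\n"
flat_file = io.BytesIO(SAMPLE)
offsets = build_offset_index(flat_file)
record = fetch_record(flat_file, offsets["g2"])
```

Because the index stores only offsets, the data stays in the flat file and a look-up costs one seek plus the read of a single record.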

System Revisited
  [Figure: diagram of the query system with index support. Labels: query, Source/target names, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Schema & Layout information, mappings, Application analyzer, Query analysis, Query execution, QUERYINFOR, Source data files, Target data file, DataReader, DataWriter, Synchronizer, Index file, Index functions.]

Language Enhancement
  Describe indices
    Indexing is a property of dataset
    Extend layout descriptors
      DATASET "name" {
        …
        INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
               [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
      }
  Maintain query format
    New meaning of "=":
      If index available, use index retrieving function
      Else, compare values directly
    AUTOWRAP GNAMES FROM CHIPDATA, YEASTGENOME BY CHIPDATA.GENE = YEASTGENOME.ID WHERE …
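
The two ideas on this slide, the INDEX clause and the index-aware meaning of "=", can be sketched as follows. Everything here is illustrative rather than the authors' implementation: `parse_index_clause`, `join_equal`, the clause values (`yeast.idx`, `gen_hash`, `retr_hash`, `idxlib`) and the rows are hypothetical, and the parser assumes that no field contains a colon or comma.

```python
def parse_index_clause(clause):
    """Parse 'INDEX {attr:file:gen_fun:retr_fun:fun_loc, ...}' into a dict keyed by attribute."""
    body = clause[clause.index("{") + 1 : clause.rindex("}")]
    keys = ("index_file_loc", "index_gen_fun", "index_retr_fun", "fun_loc")
    entries = {}
    for item in body.split(","):
        fields = [field.strip() for field in item.split(":")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 colon-separated fields, got {item!r}")
        entries[fields[0]] = dict(zip(keys, fields[1:]))
    return entries


def join_equal(left_rows, left_attr, right_rows, right_attr, lookup=None):
    """Evaluate left.attr = right.attr.

    If `lookup` (a stand-in for an index retrieving function) is given, use it;
    otherwise compare values directly by scanning the right-hand rows.
    """
    pairs = []
    for left in left_rows:
        if lookup is not None:
            matches = lookup(left[left_attr])
        else:
            matches = [r for r in right_rows if r[right_attr] == left[left_attr]]
        pairs.extend((left, r) for r in matches)
    return pairs


# Hypothetical clause and made-up rows.
index_info = parse_index_clause("INDEX {ID:yeast.idx:gen_hash:retr_hash:idxlib}")

chip = [{"GENE": "g1"}, {"GENE": "g9"}]
genome = [{"ID": "g1", "NAME": "alpha"}, {"ID": "g2", "NAME": "beta"}]

by_id = {}
for row in genome:
    by_id.setdefault(row["ID"], []).append(row)

indexed = join_equal(chip, "GENE", genome, "ID", lookup=lambda v: by_id.get(v, []))
scanned = join_equal(chip, "GENE", genome, "ID")
```

Both calls return the same pairs, which is the point of keeping the query format unchanged: the index changes how "=" is evaluated, not what it means.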

System Enhancement
  Metadata Descriptor Parser
    + parse index information
  Application Analyzer
    + index information: index look-up table
    + test condition: compare_field_indexing

Microarray Gene Information Look-up
  Goal: gather information about genes (120)
  Query: microarray output join genome database
  Index: gene names in genome

BLAST-ENHANCE Query
  Goal: add extra information to BLAST output
  Query: BLAST output join Swiss-Prot database
  Index: protein ID in Swiss-Prot

OMIM-PLUS Query
  Goal: add Swiss-Prot link to OMIM
  Query: OMIM join Swiss-Prot
  Index: protein ID in Swiss-Prot

Homology Search Query
  Goal: find similar sequences
  Query: query sequence list * sequence database
  Indexing algorithm
    Sequence-based
    Transformation of sub-string composition
    Indexing n-D numerical values
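
One common way to turn sub-string composition into an n-D numerical value is to count k-mers. The sketch below illustrates that general idea only; the slide does not say which transformation the system uses, and the function name `kmer_vector`, the choice k=2 and the DNA alphabet are assumptions.

```python
from itertools import product


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Count every length-k substring of seq, giving a vector of len(alphabet)**k entries."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    position = {kmer: i for i, kmer in enumerate(kmers)}
    vector = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = position.get(seq[i : i + k])
        if j is not None:          # skip substrings with symbols outside the alphabet
            vector[j] += 1
    return vector


vec = kmer_vector("ACGTAC")
```

Sequences with similar composition map to nearby points, so a multidimensional index over these vectors can narrow the candidates before any sequence comparison is run.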

Homology Search (1)
  Index (Singh's algorithm)
    Wavelet coefficients
    Minimum bounding rectangles
  Data: yeast genome
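
The two building blocks named on this slide can be shown in isolation. This is not Singh's algorithm, whose details the slides do not give; it is a minimal sketch of an unnormalised Haar wavelet decomposition and of a minimum bounding rectangle over a set of points, with hypothetical function names `haar` and `mbr`.

```python
def haar(values):
    """Unnormalised Haar decomposition: [overall average, then detail coefficients, coarse to fine].

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        diffs = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = diffs + details   # coarser details go in front
        current = averages
    return current + details


def mbr(points):
    """Minimum bounding rectangle of n-D points as (lower corner, upper corner)."""
    lows = tuple(min(coords) for coords in zip(*points))
    highs = tuple(max(coords) for coords in zip(*points))
    return lows, highs


coeffs = haar([4, 2, 6, 8])
box = mbr([(1, 5), (3, 2), (2, 4)])
```

Keeping only the first few coefficients gives a low-dimensional summary of each vector, and grouping nearby summaries under one rectangle is what lets an index discard whole groups at once.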

Homology Search (2)
  Index (Ferhatosmanoglu's algorithm)
    Wavelet coefficients
    Scalar quantization
    R-tree
  Data: GenBank

Conclusions
  A framework and a set of tools for on-the-fly flat-file data integration
  New data sources understood semi-automatically by data mining tools
  New data processed automatically by generated programs
  Support for indexing incorporated flexibly