Supporting High-Performance Data Processing on Flat-Files
Xuan Zhang, Gagan Agrawal
Ohio State University



Motivation
Challenges of bioinformatics integration
–Data volume: overwhelming
  DNA sequence: 100 gigabases (August, 2005)
–Data growth: exponential
  Figure provided by PDB

Existing Solutions
–(Relational) Databases
  Support for indexing and high-level queries
  Not suitable for biological data
–Flat Files with Scripts
  Compact, Perl scripts available
  Lack indexing and high-level query processing
–Web-services
  Significant overhead

Our Approach
Enhance information integration systems on
–Functionality
  On-the-fly data incorporation
  Flat-file data processing
–Usability
  Declarative interface
  Low programming requirement
–Performance
  Incorporate indexing support

Approach Summary
Metadata
–Declarative description of data
–Data mining algorithms for semi-automatic writing
–Reusable by different requests on same data
Code generation
–Request analysis and execution separated
–General modules with plug-in data module

System Overview
[Diagram. Labels: Information Integration System; Understand Data; Process Data; Data File; User Request; Answer; Metadata Description (Layout Descriptor, Schema Descriptor); Layout Miner; Schema Miner; Code Generation; Request Processor]

Advantages
Simple interface
–At metadata level, declarative
General data model
–Semi-structured data
–Flat file data
Low human involvement
–Semi-automatic data incorporation
–Low maintenance cost
OK Performance
–Linear scale guaranteed
–Can improve by using indexing

System Components
Understand data
–Layout mining
–Schema mining
Process data
–Wrapper generation
–Query Process
–Query Process with indices

Data Process Overview
Automatic code generation approach
Input
–Metadata about datasets involved
–Optional: implicit data transformation task; request by users; indexing functions
Output
–Executable programs
  General modules
  Task-specific data module

Metadata Description
Two aspects of data in flat files
–Logical view of the data
–Physical data organization
Two components of every data descriptor
–Schema description
–Layout description
Design goals
–Powerful
–Easy to write and interpret

Schema Descriptors
Follow XML DTD standard for semi-structured data
Simple attribute list for relational data

[FASTA]                 //Schema Name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
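The attribute-list form above is simple enough to read with a few lines of code. The following is a minimal sketch, not the authors' parser: the function name `parse_schema` and the handling of `//` comments are assumptions made for illustration.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (name, {attribute: type}).

    Hypothetical helper: assumes one entry per line, a [NAME] header line,
    and optional trailing // comments, as in the FASTA example on the slide.
    """
    name = None
    fields = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comment
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
        elif "=" in line:
            attr, typ = (part.strip() for part in line.split("=", 1))
            fields[attr] = typ
    return name, fields


FASTA_SCHEMA = """
[FASTA]                 // Schema name
ID = string             // Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_fields = parse_schema(FASTA_SCHEMA)
```

Keeping the schema as plain name/type pairs is what lets the same descriptor be reused by different requests over the same data.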

Layout Descriptors
Overall structure (FASTA example)

DATASET "FASTAData" {              //Dataset name
  DATATYPE {FASTA}                 //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}                 //File location
}
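As an illustration of how the outer fields of such a descriptor might be picked up by a tool, here is a small sketch. It only extracts the four items visible on the slide (dataset name, schema name, line size, file location) and does not attempt the elided DATASPACE body. The function name `parse_layout_header`, the regular expressions, and the use of straight double quotes around the dataset name are assumptions, not the system's actual grammar.

```python
import re

LAYOUT_TEXT = '''
DATASET "FASTAData" {            // Dataset name
  DATATYPE {FASTA}               // Schema name
  DATASPACE LINESIZE=80 {
    // ---- file layout details are not shown on the slide ----
  }
  DATA {osu/fasta}               // File location
}
'''


def parse_layout_header(text):
    """Extract the top-level fields of a layout descriptor (hypothetical helper)."""
    # Remove // comments so they cannot interfere with matching.
    body = "\n".join(line.split("//", 1)[0] for line in text.splitlines())
    patterns = {
        "dataset": r'DATASET\s+"([^"]+)"',
        "schema": r'DATATYPE\s*\{\s*([^}\s]+)\s*\}',
        "linesize": r'LINESIZE\s*=\s*(\d+)',
        # \bDATA followed directly by "{" skips DATASET, DATATYPE, DATASPACE.
        "location": r'\bDATA\s*\{\s*([^}\s]+)\s*\}',
    }
    info = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, body)
        info[key] = match.group(1) if match else None
    if info["linesize"] is not None:
        info["linesize"] = int(info["linesize"])
    return info


layout_info = parse_layout_header(LAYOUT_TEXT)
```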

Wrapper Generation System Overview
[Diagram. Labels: Wrapper generation system; Layout Descriptor; Layout Parser; Data Entry Representation; Schema Descriptors; Mapping Generator; Schema Mapping; Mapping File; Mapping Parser; Application Analyzer; WRAPINFO; wrapper (DataReader, DataWriter, Synchronizer); Source Dataset; Target Dataset]

Query With Indices: Motivation
Goal
–Improve the performance of query-proc program
  Index
–Maintain the advantages
  Flat file based
  Low requirement on programming

Challenges & Approaches
Various indexing algorithms for various biological data
–User defined indexing functions
–Standard function interfaces
Flat file data
–Values parsed implicitly and ready to be indexed
–Byte offset as pointer
Metadata about indices
–Layout descriptor
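The "byte offset as pointer" idea can be illustrated with a small sketch: an index maps a key to the byte position of its record in the flat file, and a lookup seeks straight to that position instead of scanning. This assumes FASTA-style records (a header line starting with `>` followed by sequence lines); the function names and the in-memory sample data are invented for the example and are not part of the described system.

```python
import io


def build_offset_index(f):
    """Map each record ID to the byte offset of its header line."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()          # position before reading the line
        line = f.readline()
        if not line:
            break
        if line.startswith(b">"):
            parts = line[1:].split()
            if parts:
                index[parts[0].decode("ascii")] = offset
    return index


def read_record(f, offset):
    """Seek to a stored offset and return (header, sequence) for one record."""
    f.seek(offset)
    header = f.readline().decode("ascii").rstrip("\n")
    seq_lines = []
    while True:
        line = f.readline()
        if not line or line.startswith(b">"):
            break
        seq_lines.append(line.decode("ascii").strip())
    return header[1:], "".join(seq_lines)


# Made-up sample data standing in for a flat file opened in binary mode.
data = io.BytesIO(b">g1 first gene\nACGT\nTTGA\n>g2 second gene\nGGCC\n")
offsets = build_offset_index(data)
record = read_record(data, offsets["g2"])
```

Because the index stores only keys and offsets, the data itself stays in the original flat file, which matches the goal of keeping the approach flat-file based.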

System Revisited
[Diagram. Labels: Query analysis; Query execution; query; Query parser; Source/target names; Metadata collection; Dataset descriptors; Descriptor parser; Schema & Layout information; mappings; Application analyzer; QUERYINFOR; DataReader; DataWriter; Synchronizer; Source data files; Target data file; Index file; Index functions]

Language Enhancement
Describe indices
–Indexing is a property of dataset
–Extend layout descriptors
–Maintain query format

DATASET "name" {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
         [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …

New meaning of "=":
  If index available, use index retrieving function
  Else, compare values directly
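The two-way meaning of "=" can be sketched as a join that consults an index retrieving function when one is supplied and otherwise falls back to comparing values directly. This is only an illustration of the rule stated on the slide: the function `join_equal`, the dictionary-based index, and the sample rows are hypothetical and do not reflect the generated code.

```python
def join_equal(left_rows, right_rows, left_key, right_key, index_lookup=None):
    """Yield (left, right) pairs whose key fields are equal.

    If index_lookup is given it plays the role of the index retrieving
    function; otherwise every right-hand row is compared directly.
    """
    for left in left_rows:
        value = left[left_key]
        if index_lookup is not None:
            matches = index_lookup(value)
        else:
            matches = [r for r in right_rows if r[right_key] == value]
        for right in matches:
            yield left, right


# Made-up rows standing in for CHIPDATA and YEASTGENOME entries.
chipdata = [{"GENE": "geneA", "LEVEL": 2.5}, {"GENE": "geneZ", "LEVEL": 0.1}]
yeastgenome = [{"ID": "geneA", "NAME": "alpha"}, {"ID": "geneB", "NAME": "beta"}]

# A toy in-memory index keyed on YEASTGENOME.ID.
id_index = {}
for row in yeastgenome:
    id_index.setdefault(row["ID"], []).append(row)

scanned = list(join_equal(chipdata, yeastgenome, "GENE", "ID"))
indexed = list(join_equal(chipdata, yeastgenome, "GENE", "ID",
                          index_lookup=lambda v: id_index.get(v, [])))
```

Both paths return the same pairs; only the cost differs, which is why the query text itself does not need to change when an index is added.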

System Enhancement
Metadata Descriptor Parser
+ parse index information
Application Analyzer
+ index information: index look-up table
+ test condition: compare_field_indexing

Microarray Gene Information Look-up
Goal: gather information about genes (120)
Query: microarray output join genome database
Index: gene names in genome

BLAST-ENHANCE Query
Goal: add extra information to BLAST output
Query: BLAST output join Swiss-Prot database
Index: protein ID in Swiss-Prot

OMIM-PLUS Query
Goal: add Swiss-Prot link to OMIM
Query: OMIM join Swiss-Prot
Index: protein ID in Swiss-Prot

Homology Search Query
Goal: find similar sequences
Query: query sequence list * sequence database
Indexing algorithm
–Sequence-based
–Transformation of sub-string composition
–Indexing n-D numerical values
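One common way to turn sub-string composition into an n-dimensional numerical value is to count how often each length-k substring occurs in a sequence. The sketch below shows that general idea only; the slides do not specify the exact transformation used, so the function name, the choice of k, and the DNA alphabet are assumptions.

```python
from itertools import product


def composition_vector(seq, k=2, alphabet="ACGT"):
    """Count occurrences of every length-k substring over the alphabet.

    Returns a list with len(alphabet) ** k entries in lexicographic order
    of the substrings; substrings with other characters are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    return [counts[m] for m in kmers]


single = composition_vector("ACGTAC", k=1)   # counts of A, C, G, T
pairs = composition_vector("ACGTAC", k=2)    # 16-dimensional vector
```

Each sequence (or window of a sequence) then becomes a point in n-D space, which is the kind of value a multidimensional index can store.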

Homology Search (1)
Index (Singh's algorithm)
–Data: yeast genome
–Wavelet coefficients
–Minimum bounding rectangles
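To make the two ingredients named above concrete, here is a generic sketch of an unnormalized Haar wavelet transform and of a minimum bounding rectangle over a set of points. It is not a reconstruction of the algorithm cited on the slide; the function names, the choice of the Haar wavelet, and the sample numbers are assumptions for illustration.

```python
def haar(values):
    """Unnormalized Haar transform of a list whose length is a power of two.

    Each pass replaces the leading block with pairwise averages followed
    by pairwise half-differences, then recurses on the averages.
    """
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out


def bounding_rectangle(points):
    """Return (lows, highs): per-dimension minima and maxima of the points."""
    lows = [min(column) for column in zip(*points)]
    highs = [max(column) for column in zip(*points)]
    return lows, highs


coeffs = haar([2, 2, 1, 1])
mbr = bounding_rectangle([[1.5, 0.5], [1.0, -0.25], [2.0, 0.0]])
```

Keeping only the first few coefficients of each vector gives a low-dimensional point, and grouping nearby points under one rectangle lets a search discard whole groups at once.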

Homology Search (2)
Index (Ferhatosmanoglu's algorithm)
–Data: GenBank
–Wavelet coefficients
–Scalar quantization
–R-tree

Conclusions
A framework and a set of tools for on-the-fly flat file data integration
–New data source understood semi-automatically by data mining tools
–New data processed automatically by generated programs
–Support for indexing incorporated flexibly