June 15 SRS - A Backbone for Genome Information and Data Grid Systems Don Gilbert Indiana University


June 15 SRS - Genomes and Grids

Overview
Search/Retrieval in Genome Information systems
Efficiency and complexity: RDBMS, SRS †, others
Genome data federation: local and distributed
Directories of data: automated S/R and the Grid
SRS, LDAP and future biodata grids
† Sequence Retrieval System, Lion Bioscience

Indiana U. using SRS
Bio-info archiving and distribution
– IUBio Archive: public molecular biology data / software archive
– Bio-Mirrors: sequence and related biology databanks
Genome information systems
– FlyBase: genome infosystem of Drosophila fruitfly
– euGenes: infosystem for 8 important eukaryotes with 180,000 genes
Bio-Data Grids
– experimental distributed computing

Genome Information Systems
FlyBase, euGenes (SRS, Perl/Java)
WormBase (AceDB > RDBMS, BioPerl)
Mouse GD, Sacc. GD (RDBMS)
GeneCards (Glimpse > XMLquery)
Ensembl (RDBMS, BioPerls)
Nascent: many newly developing organism genome systems

euGenes
8 eukaryote genomes in common summary data format
Describes 180,000 known, predicted and orphan genes
Gene homologies with comparative summaries
Genome map views and feature annotations
Gene Ontology function, process and cell location integration
Efficient information search and retrieval methods
Extends FlyBase information system technology
Updated (semi) automatically from several sources

Genome attributes in euGenes, July 2002
Genes as extracted from genome project sources. These counts differ from true gene numbers because of orphan gene records, prediction artifacts, unmerged predicted/experimental records, and unfinished sequencing gaps.

Search/Retrieval for Genome DBs
Separating database management from public search/retrieval has advantages in flexibility and speed
Indexing methods for text databases (or RDBMS exports) are accurate, efficient for high-volume data, and easy to implement for complexly structured biology data
Sequence Retrieval System (SRS) is used in FlyBase and euGenes; GeneCards uses Glimpse and similar methods; Google and digital library methods are related


Anatomy of a Genome Info. System
Information structure
– Complex document structure; tabular data; etc.
– Organize: Table of contents, Reports, Indexing
– Browse contents; Search / retrieve from biological questions
– Bulk data search / retrieve for bioinformatics
Information content
– Literature (abstracted and curated), Sequence and feature analyses, maps, controlled vocabulary/ontologies, people, biologics, contacts, etc.
– Metadata describing primary data, along with protocols, notes, sources
Informatics / software
– Backend database, data collection, management, analyses
– Front-end services (hypertext web, document search/retrieval); ease of understanding and usage (HCI)
– Middleware glue code, interfaces, software, etc.
– Specialized for genome data: maps, blast searches, ontologies

Single DB vs. Federated Info. S/R

FlyBase/euGenes Query System

FlyBase Query Results
FlyBase Genes query results
Query: ( [libs={FBgn PFgn}-all:wing] or [libs-syn:wing] ) and [libs-org:Dmel], No. matches = 1437
Bookmark FBquery: ( [libs={FBgn PFgn}-all:wing] | [libs-syn:wing] ) & [libs-org:Dmel]
Result columns: #, Symbol, Name, Map, Alleles, Stocks, Refs, DNA, Date
1, 18w, 18 wheeler, 56F, May 02
2, 2R-F, May
Act42A, Actin 42A, 42A, May 02
20, Act5C, Actin 5C, 5C, May
Page and Sort results
Batch Download
– Fetch items: x All Items […]
– Format: [Spreadsheet]
– Report content: [Summary]
– Report only Select fields: [ Field list ]
Refine query or find items in related data
– Refine query ( [libs={FBgn PFgn}-all:wing] or [libs-syn:wing] ) and [libs-org:Dmel] [ and ] [other fields] matches [ ….. ]
– Search Genes, retrieve Related Data Classes (alleles, aberrations, transcripts, insertions, sequences …)
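The bookmark query on this slide combines bracketed field terms with `|` (or) and `&` (and). As a rough illustration only, the Python sketch below composes the same string from its parts. The helper names `term`, `any_of` and `all_of` are hypothetical and not part of SRS or FlyBase; the bracket syntax is copied from the slide and nothing beyond that is assumed about the SRS query language.

```python
def term(field, value, libs=None):
    """Format one bracketed term as it appears on the slide.

    With `libs` given, the term names specific databanks: [libs={A B}-field:value].
    Without it, the term refers to the current set: [libs-field:value].
    """
    if libs:
        return "[libs={%s}-%s:%s]" % (" ".join(libs), field, value)
    return "[libs-%s:%s]" % (field, value)


def any_of(*terms):
    """Join terms with '|' (or), parenthesized as on the slide."""
    return "( " + " | ".join(terms) + " )"


def all_of(*terms):
    """Join terms with '&' (and)."""
    return " & ".join(terms)


# "wing" in any field of FBgn/PFgn, or as a synonym, restricted to Dmel.
query = all_of(
    any_of(term("all", "wing", libs=["FBgn", "PFgn"]), term("syn", "wing")),
    term("org", "Dmel"),
)
print(query)
```

Building the string from small parts keeps the refine-query step simple: adding another condition is one more `term(...)` passed to `all_of`.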

Efficiency of SRS versus RDB
Drosophila genome annotations, searched with SRS or the Gadfly DB relational database
[Figure: Web search time (shorter is better; two computers - O, F)]

[-- Genomes to Grids --]

Science Data Grids
Infrastructure for distributed analyses
– analyses distributed among 1000s of commodity computers
– high-volume data distribution
– data resource directories (catalogs)
– security, authenticated use
– peer-to-peer sharing and collaborations
– Data grid infrastructure still needs work
Links
– globus.org; eu-datagrid.org; ivdgl.org

BioGrid Client-Server Aspects
Grid-aware client software
Data and software resource directories
Grid of processing computers

Moving Data on the Grid
1. biodirectory "find protein coding sequences for species X,Y,Z"
2. biodirectory "get locators split 100 ways"
3. for i ( ) { copydata(realdata[i],gridcpu[i]); runapp(gridcpu[i]) }
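The three steps on this slide (query the directory, split the resulting locators, then copy and run on each grid CPU) can be sketched in Python as follows. This is only an illustration under stated assumptions: `split_locators`, `run_on_node` and `dispatch` are hypothetical names, the locator strings are made up, and a local thread pool stands in for real grid nodes. The slide's `copydata` and `runapp` are replaced by a stub that just counts records.

```python
from concurrent.futures import ThreadPoolExecutor


def split_locators(locators, ways):
    """Split a list of data locators into `ways` nearly equal chunks."""
    base, extra = divmod(len(locators), ways)
    chunks, start = [], 0
    for i in range(ways):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(locators[start:start + size])
        start += size
    return chunks


def run_on_node(node_id, chunk):
    """Stand-in for copydata + runapp on one grid CPU (assumption: just counts)."""
    return node_id, len(chunk)


def dispatch(locators, ways):
    """Send one chunk to each (simulated) grid CPU and collect the results."""
    chunks = split_locators(locators, ways)
    with ThreadPoolExecutor(max_workers=min(ways, 8)) as pool:
        return list(pool.map(run_on_node, range(ways), chunks))


# Made-up locators standing in for the directory's answer to step 1.
locators = ["seq:%05d" % n for n in range(1050)]
results = dispatch(locators, 100)
print(len(results), sum(count for _, count in results))
```

The point of splitting locators rather than data is that only the small directory answer passes through the client; each node fetches its own share of the primary data.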

Design of bio-data directories
1. Develop a schema describing directory objects and attributes. Essential fields include ID/accession, data class/category, and update time. Start with minimal directory descriptions.
2. Create directories of data records with existing backend software such as SRS, RDBMS, Entrez, and others.
3. Replicate directories among data centers; use them to determine which primary data to fetch or mirror.
4. Common exchange formats, schema, and directory query syntax are necessary; implementation details are left to each data center.
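A minimal directory entry of the kind described in step 1, and the comparison implied by step 3, might look like the Python sketch below. It is an assumption-laden illustration: the class `DirectoryEntry`, the `locator` field, the function names and the accessions `A1` to `A3` are all hypothetical; only the three essential fields (ID/accession, data class, update time) come from the slide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DirectoryEntry:
    accession: str     # ID/accession of the primary record
    data_class: str    # data class / category, e.g. "gene"
    updated: date      # update time of the primary record
    locator: str = ""  # hypothetical: where the primary data can be fetched


def needs_refresh(local, remote):
    """Fetch primary data if we have no local entry or the remote one is newer."""
    return local is None or remote.updated > local.updated


def stale_accessions(local_dir, remote_dir):
    """Compare a replicated remote directory with the local one."""
    return sorted(
        acc for acc, entry in remote_dir.items()
        if needs_refresh(local_dir.get(acc), entry)
    )


local = {
    "A1": DirectoryEntry("A1", "gene", date(2002, 5, 1)),
    "A3": DirectoryEntry("A3", "gene", date(2002, 6, 1)),
}
remote = {
    "A1": DirectoryEntry("A1", "gene", date(2002, 7, 1)),      # updated remotely
    "A2": DirectoryEntry("A2", "sequence", date(2002, 6, 1)),  # new record
    "A3": DirectoryEntry("A3", "gene", date(2002, 6, 1)),      # unchanged
}
to_fetch = stale_accessions(local, remote)
print(to_fetch)
```

Because the directory holds only a few small fields per record, replicating it is cheap, and the bulky primary data moves only for the accessions that actually changed.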

Directories of Genome data
For genome data, "broad and shallow" directories can federate the "narrow and deep" databases
Science Grid computing
– Needs efficient, authenticated discovery and distribution of high-volume data
LDAP directories
– Mature, efficient for high volumes, allow federated queries over distributed directories, and work well for SRS databanks and genome annotations
– As functional as BioDAS (distributed annotation); broader in scope, with generic client/server software

LDAP? Why not xxx?
Why LDAP for bio-data directories?
– Available now with many features needed
Web/XML?
– Web/SOAP/WSDL/UDDI: SOAP for communication of directory requests, WSDL for an interface to the directory repository, UDDI to locate the service (some assembly required…)
– DSML: a direct conversion of LDAP to XML, for Web/XML interoperability with LDAP; supported by industry (Microsoft, Sun, others)
CORBA? SQL? Wgetz? FTP?

Lightweight Directory Features (LDAP)
Flexible, hierarchical directory of objects with identifiers to community definitions
Objects are simple or complex. Each has attributes (fields) composed of strings, numbers, binaries and complex structures
Can use many backend systems (including SRS); can be added to search/retrieval systems relatively easily
Globally distributed searches of many directories
Schema are documented: objects and attributes have unique identifiers and definitions (e.g. IETF RFC documents)
Schema search/retrieval for directory 'discovery'
Computable search, browse, retrieval; referrals to other servers and remote objects; extension mechanisms for new object types
Replication of directories; mechanisms for peer group updates
Security mechanisms for data transport, access and updates

SRS6 - LDAP gateway
Experimental SRS6 backend search compiled with OpenLDAP server
– ldap://iubio.bio.indiana.edu:3895/srv=srs
Acts like “getz” or “wgetz”, with LDAP query input and output
Efficient and functional as a network getz; surpasses wgetz in programmability and efficiency
Issue 1: convert LDAP to SRS query
Issue 2: [.. must be something.. ]
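Issue 1, converting an LDAP query to an SRS query, can be illustrated with a small Python sketch. The slides do not describe how the gateway actually does this, so everything here is an assumption: the function `ldap_to_srs` is hypothetical, it handles only the `&`, `|` and `attr=value` forms of LDAP's prefix filter notation, and it maps each `attr=value` pair to a bracketed `[libs-attr:value]` term in the style of the FlyBase query shown earlier in the talk.

```python
def ldap_to_srs(filter_text, libs="libs"):
    """Translate a small subset of LDAP filter syntax into an SRS-style query.

    Supported: (attr=value), (&(...)(...)), (|(...)(...)). Anything else
    raises ValueError. The attribute-to-field mapping is an assumption.
    """
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(filter_text) or filter_text[pos] != "(":
            raise ValueError("expected '(' at position %d" % pos)
        pos += 1
        if pos >= len(filter_text):
            raise ValueError("unexpected end of filter")
        op = filter_text[pos]
        if op in ("&", "|"):
            pos += 1
            parts = []
            # Each operand of & or | is itself a parenthesized filter.
            while pos < len(filter_text) and filter_text[pos] == "(":
                parts.append(parse())
            if not parts:
                raise ValueError("operator %r has no operands" % op)
            if len(parts) == 1:
                result = parts[0]
            else:
                result = "(" + (" %s " % op).join(parts) + ")"
        else:
            end = filter_text.index(")", pos)  # raises ValueError if missing
            attr, sep, value = filter_text[pos:end].partition("=")
            if not (sep and attr and value):
                raise ValueError("expected attr=value at position %d" % pos)
            result = "[%s-%s:%s]" % (libs, attr, value)
            pos = end
        if pos >= len(filter_text) or filter_text[pos] != ")":
            raise ValueError("expected ')' at position %d" % pos)
        pos += 1
        return result

    query = parse()
    if pos != len(filter_text):
        raise ValueError("trailing text after filter")
    return query


print(ldap_to_srs("(&(org=Dmel)(|(all=wing)(syn=wing)))"))
```

LDAP filters put the operator first and SRS-style queries put it between terms, so the conversion is mostly a matter of walking the nested parentheses and re-joining the operands.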

SRS-LDAP efficiency
Queries
Q1: 3 libs, 20K ids, 60 Mb
Q2: 1 lib, 340K ids, 1.5 Gb
Q3: 1 lib, 1.2M ids, 4.7 Gb
[Chart of query times; axis ticks at 1hr and 2hr. * estimated time for getz/wgetz]

Wrap-up
Beyond sequence retrieval with SRS to genome and biological information systems
Federation of disparate data is “easy”
Efficiency is high, an important factor in information systems
Grid and future distributed computing need flexible, efficient technology such as SRS.

End of SRS - Genomes and Grids
Eugenes fulgens (Magnificent Hummingbird, Costa Rica)
Don Gilbert, Indiana University