A Grid-based System for Microbial Genome Comparison and Analysis
Anil Wipat, University of Newcastle upon Tyne, UK
Motivation: Genome Comparison
- The past decade has seen the emergence of whole genome sequencing
- Whole genome sequences can reveal a great deal about the biology of an organism
- Comparing genomes is one of the most effective ways to exploit genome sequence information
- Establishes the differences and similarities at the genetic level
- Aids biologists in understanding pathogenicity, evolution, ecology, metabolism, etc.
Microbial genome comparison is commonly applied at different levels:
- Nucleotide sequence comparison (whole genome): DNA (..atcggatcgtacgagcgatc..) against DNA (..atcccatcgaacgagcgatc..)
- All-against-all amino acid sequence comparisons between proteins: protein (MCSAKMQTR..) against protein (MSAKMPTR..)
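To make the nucleotide-level comparison concrete, here is a minimal sketch that compares the two DNA fragments shown on the slide position by position. It is only an illustration: the function names are hypothetical, and real comparison tools such as Blastn or MUMmer perform alignment rather than a simple positional check.

```python
def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which the two sequences are identical."""
    return 100.0 * (len(a) - len(mismatch_positions(a, b))) / len(a)


# The two nucleotide fragments shown on the slide.
dna_a = "atcggatcgtacgagcgatc"
dna_b = "atcccatcgaacgagcgatc"

print(mismatch_positions(dna_a, dna_b))  # [3, 4, 9]
print(percent_identity(dna_a, dna_b))    # 85.0
```

The two protein fragments on the slide differ in length (nine and eight residues), so comparing them would need a proper alignment step, which this sketch deliberately leaves out.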
Motivation: Genome Comparison
- The number of complete genome sequences is rapidly increasing as sequencing technology advances
  - e.g. ~200 whole genomes have been sequenced
- Sequence analysis and comparison is becoming more computationally intensive
  - Large-scale genome comparison is already beyond the capability of many laboratories
- How are we going to handle all these genomes?
  - New methods and technologies for genome comparison are required.
Microbase Project Overview
- Aims to create a scalable, Grid-enabled analytical system to support microbial genome comparison.
- Aims to support both the biological and bioinformatics communities.
- Funded by BBSRC Bioinformatics and e-Science & DTI
  - Started April
- Collaboration with microbiologists and industrial partners
  - Providing use cases.
Microbase: Functionality
- A system that utilises Grid resources to automatically perform genome comparisons at nucleotide and protein levels
- An information repository that:
  - maintains and exposes the results of these comparisons to users as a base level dataset
  - provides canned algorithms for analysis
- A Grid-enabled high-performance environment to execute remote user-specified computations
- Data integration with remote, Grid-enabled databases
  - e.g. Genomic, Metabolic, Protein Interaction, Gene Expression databases, etc.
MicrobaseLite: A Prototype
- The first prototype of the Microbase system
- Automatically performs all-against-all genome comparisons and exposes the resulting datasets
- Provides services for biologists to browse and query genome sequences and comparison results
- Helps with the specification of the entire Microbase system and the derivation of use cases
- Implemented using a component-based architecture with Web services interfaces
- Also uses existing Grid technology: the myGrid Notification Service
MicrobaseLite: Datasets
- Microbial genomes including
  - Bacteria, archaea, eukaryota
  - Held in the GenomePool component
- Results of all-against-all nucleotide sequence comparison
  - Blastn, MUMmer
- Results of all-against-all protein sequence comparison
  - Blastp, Ssearch, Promer
  - Held in the ComparisonPool component
- Object-oriented data model of interspecies genome rearrangements
  - The OGRE module component (current research)
MicrobaseLite: Architecture
[Figure: architecture diagram, divided into Client Side and Server Side. Labels recovered from the figure:]
- Client side: Client Proxies (Notification Proxy, Web Services Proxy), Graphical Viewer, User Tools
- Microbial Genome Pool: Genome Loader, BIOSQL, Notification Service (External Notification, Internal Notification)
- Genome Comparison Pool: Task Scheduler, Post-processing, DNA Comparison and Protein Comparison Database
- OGRE Module: Object Model Builder, Object-oriented Database, Query & Execution
- Other labels: Request Builder, Response Receiver, Data Processing, Web Services, Query
MicrobaseLite: Microbial Genome Pool
- Provides a Web / Grid service based information repository of microbial genomes
  - maintains a database of 170+ microbial genomes
- A web-service implementation of BioJava interfaces
- Uses the myGrid Notification Service to notify registered clients of new genomes
- Available for use now with a prototype API
[Figure labels: Microbial Genome Pool, Genome Loader, BIOSQL, Notification Service (External Notification, Internal Notification), Web Service API, Clients, Comparison Pool]
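The notification behaviour described on this slide follows a publish/subscribe pattern, which can be sketched roughly as below. This is an in-memory illustration only: `GenomePool`, `register` and `add_genome` are hypothetical names chosen for the example and are not the Microbase API or the myGrid Notification Service interface.

```python
from typing import Callable


class GenomePool:
    """Toy in-memory stand-in for a genome repository with notifications."""

    def __init__(self) -> None:
        self._genomes: dict[str, str] = {}
        self._listeners: list[Callable[[str], None]] = []

    def register(self, listener: Callable[[str], None]) -> None:
        """Register a callback that receives the ID of each newly added genome."""
        self._listeners.append(listener)

    def add_genome(self, genome_id: str, sequence: str) -> None:
        """Store a genome and notify listeners only if its ID was not known before."""
        is_new = genome_id not in self._genomes
        self._genomes[genome_id] = sequence
        if is_new:
            for listener in self._listeners:
                listener(genome_id)

    def get_genome(self, genome_id: str) -> str:
        """Return the stored sequence for a genome ID."""
        return self._genomes[genome_id]


pool = GenomePool()
received: list[str] = []
pool.register(received.append)  # a client such as the Comparison Pool would register here

pool.add_genome("genome_A", "atcggatcgtacgagcgatc")
pool.add_genome("genome_B", "atcccatcgaacgagcgatc")
pool.add_genome("genome_A", "atcggatcgtacgagcgatc")  # already known: no notification
print(received)  # ['genome_A', 'genome_B']
```

In the real system the callback would be replaced by a message delivered through the notification service, so that a registered client can fetch the new genome over the Web service API.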
MicrobaseLite: Genome Comparison Pool
- Retrieves genomes from the Microbial Genome Pool automatically on notification
- Executes a variety of genome comparison tools: Blast, MUMmer, Promer, MSPcrunch
- Incorporates a Task Scheduler for parallel processing
  - Uses N1 Grid Engine (batch system) to dispatch comparison tasks to run on Linux clusters
- Comparison outputs are processed and stored in a relational database (MySQL).
[Figure labels: Genome Comparison Pool, Task Scheduler, Post-processing, Protein & Nucleotide Comparison Database, N1 Grid Engine, Parallel Cluster(s)]
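The scheduling idea on this slide, splitting an all-against-all comparison into independent tasks and running them in parallel, can be sketched as follows. This is a simplified illustration under stated assumptions: a local thread pool stands in for N1 Grid Engine, `run_task` is a placeholder that does not invoke any real comparison tool, and the choice to enumerate unordered pairs without self-comparisons is an assumption rather than something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def all_against_all(genome_ids, tools):
    """Build one task per unordered genome pair per tool (self-comparisons omitted)."""
    return [
        (tool, a, b)
        for a, b in combinations(sorted(genome_ids), 2)
        for tool in tools
    ]


def run_task(task):
    """Placeholder for launching a comparison tool; it only returns a label."""
    tool, a, b = task
    return f"{tool}:{a}-vs-{b}"


def run_all(tasks, workers=4):
    """Run tasks on a local worker pool; results keep the order of `tasks`."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(run_task, tasks))


tasks = all_against_all(["g1", "g2", "g3"], ["blastn", "mummer"])
print(len(tasks))   # 6
results = run_all(tasks)
print(results[0])   # blastn:g1-vs-g2
```

Because each task is independent, the same task list could be handed to a batch system instead of a thread pool. Under the pairing assumption used here, ten genomes give 45 pairs per tool.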
Task Scheduler and scalability
[Figure: execution time (minutes) against number of processors for all-against-all comparisons with 10 microbial genomes (Blastp, Blastn, MSPcrunch, MUMmer and PROmer)]
MicrobaseLite: User Tools
- Demonstration graphical tools under development
- Genome Browser allows users to view genomes, the comparison results and the results of canned algorithms
- Deployed on the client side, operating via Web services
Vision for the full Microbase System
- Continue to explore scalability issues using MicrobaseLite as a platform
  - Towards seamless scalability
  - Harnessing of remote clusters on demand
- A system for the submission and enactment of remotely conceived code or workflows for user-defined comparative analysis
  - Investigating the integration of Taverna core to enact SCUFL workflows within Microbase
Conclusions
- Microbase aims to exploit Grid resources to provide a scalable system for microbial genome comparison
- MicrobaseLite produced as a prototype and demonstrator application for the biologist/bioinformatician
- Work now underway on the full Microbase: a system to support remotely conceived computations
Acknowledgements
- The Microbase Team:
  - Anil Wipat, Yudong Sun, Matthew Pocock, Keith Flanagan, Pete Lee, and Paul Watson
- The Microbase User Requirements/Use case contributors
- myGrid project (particularly Southampton and EBI)
- The industrial supporters: NonLinear Dynamics, NCIMB, Arrow Therapeutics, Angel Biotech, Complement Genomics, ACS Dobfar, AstraZeneca
- See
Microbial genome comparison is commonly applied at two levels:
- Nucleotide sequence comparison (whole genome): DNA (..atcggatcgtacgagcgatc..) against DNA (..atcccatcgaacgagcgatc..)
- All-against-all amino acid sequence comparisons between proteins: protein (MCSAKMQTR..) against protein (MSAKMPTR..)
OGRE: Object-oriented Genome REarrangements Model
- A dataset that captures genomic rearrangements between microorganisms
- Object-oriented (OO) concepts and formalism are being used to classify the results of the nucleotide sequence comparison
  - An ontology and OO conceptual model is being developed to describe chromosomal rearrangements and to define objects that can represent them
  - Algorithms developed to recognise defined rearrangement features in nucleotide sequence comparison data
  - Objects made persistent in an OO database
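As a rough illustration of what representing rearrangements as objects might look like, the sketch below models a match between two genomes and labels it with a simple rule. Everything here is an assumption made for the example: the `Match` and `Rearrangement` classes, their fields, the three labels and the `max_offset` threshold are hypothetical and are not taken from the OGRE ontology or its algorithms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One matching segment between a reference and a query genome (hypothetical fields)."""
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int
    same_strand: bool


@dataclass(frozen=True)
class Rearrangement:
    """A labelled match; `kind` is 'inversion', 'translocation' or 'collinear'."""
    kind: str
    match: Match


def classify(matches, max_offset=1000):
    """Label each match using two illustrative rules (not OGRE's actual criteria)."""
    result = []
    for m in matches:
        if not m.same_strand:
            kind = "inversion"          # match lies on the opposite strand
        elif abs(m.query_start - m.ref_start) > max_offset:
            kind = "translocation"      # match has moved far from its reference position
        else:
            kind = "collinear"          # same strand, roughly the same position
        result.append(Rearrangement(kind, m))
    return result


matches = [
    Match(0, 500, 0, 500, True),
    Match(600, 900, 900, 600, False),
    Match(1000, 1500, 5000, 5500, True),
]
print([r.kind for r in classify(matches)])  # ['collinear', 'inversion', 'translocation']
```

Objects of this kind could then be stored in an object database and queried by type, which is the role the slides assign to the OGRE module.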
MicrobaseLite: OGRE Module
- Performs object-oriented analysis and storage of genome rearrangements
  - An OO dataset captures genomic rearrangements revealed through nucleotide sequence comparison
  - Made persistent in an OO database
- Provides a Web services interface for external users to query and analyse the OO dataset
[Figure labels: OGRE Module, Object Model Builder, Object-oriented Database, Query & Execution, Web Services, Comparison Pool]