The Encyclopedia of Life (EOL) Project: An initiative to analyze and provide annotation for putative protein sequences from all publicly available genome sequences.

Presentation transcript:

The Encyclopedia of Life (EOL) Project: An initiative to analyze and provide annotation for putative protein sequences from all publicly available genome sequences

Baldridge, K.; Baru, C.; Bourne, P.; Clingman, E.; Cotofana, C.; Ferguson, C.; Fountain, A.; Greenberg, J.; Jermanis, D.; Li, W.; Matthews, J.; Miller, M.; Mitchell, J.; Mosley, M.; Pekurovsky, D.; Quinn, G.B.; Rowley, J.; Shindyalov, I.; Smith, C.; Stoner, D.; Veretnik, S.
San Diego Supercomputer Center, MC 0505, 9500 Gilman Drive, La Jolla, CA , USA

For further information about EOL, please visit us online at: or contact Mark Miller at

[Figure 1: The genomic analysis pipeline (iGAP). Protein sequences are screened for signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG). PSI-BLAST profiles are created for the protein sequences and domains are structurally assigned by PSI-BLAST against FOLDLIB; only sequences without a category-A prediction go on to structural assignment of domains by 123D against FOLDLIB. Assigned regions are stored in the database, and functional assignment is made from PFAM, NR, and PSI-Pred assignments. FOLDLIB is built from PDB chains, SCOP domains, PDP domains, and CE matches of PDB vs. SCOP, filtered to 90% sequence non-identity, a minimum size of 25 aa, and 90% coverage (gaps <30, ends <30). Domain locations are predicted from both sequence and structure information (SCOP, PDB).]

[Figure 2: The EOL Data Analysis and Delivery Model. Sequence data from genomic sequencing projects are processed by ported pipeline applications (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction). Load/update scripts deposit the pipeline data into a data warehouse and MySQL data mart(s), which are served through an application server and a SOAP/Web server. Web Services and an API are published in a UDDI directory, enabling automated data downloads to mirrors and researchers, incorporation of data into third-party Web pages, EOL Web pages served via JSP, and the EOL Notebook.]

The Need for Protein Annotation

Accompanying the massive supply of genomic data is a need to annotate proteins from structural and functional points of view. Questions that researchers look to answer using this flood of new genomic data include:
- What other genomic proteins are similar to the protein that I am researching?
- What level of conservation is there for a particular protein sequence across species?
- Which protein domains are common to various protein sequences?
- What is the likely cellular location of a specific protein or class of proteins?
On a limited basis, researchers are able to manually perform BLAST searches, sequence analysis, and data collation for small collections of protein sequences of interest, but for the very large numbers of sequences (10,000 to 15,000 or more) encoded in an individual genome, this becomes impractical. Key to large-scale genomic sequence analysis is therefore a reliable, automated software "pipeline" that handles both the analysis itself and the collation of the resulting output data.

The Sequence Analysis Pipeline

The Proteins of Arabidopsis thaliana (PAT) project was a prototype initiative to establish a reliable and accurate pipeline for genome annotation (iGAP) (Figure 1). Using homology modeling, iGAP provides functional annotations and predicts three-dimensional structures (where possible) for proteins encoded in the Arabidopsis thaliana genome. The results from iGAP (WU-BLAST, PSI-BLAST, 123D+, COILS, TMHMM, SignalP) were combined and organized into a relational database with a web-based GUI.
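The flow just described (per-sequence feature prediction, structural assignment by PSI-BLAST against FOLDLIB, and a 123D fold-recognition fallback for sequences without a confident assignment) can be pictured as a small orchestration routine. The sketch below is illustrative only: the names annotate_sequence, feature_predictors, psiblast_vs_foldlib, and fold_recognition_123d are hypothetical stand-ins for wrappers around the real tools, and Python is used purely for brevity; none of this is actual EOL/iGAP code.

```python
# Minimal sketch of the per-sequence flow in an iGAP-style pipeline.
# All of the callables passed in are hypothetical placeholders for the real
# tools (SignalP, TMHMM, COILS, SEG, PSI-BLAST, 123D); the poster does not
# specify how they are invoked.
from dataclasses import dataclass, field

@dataclass
class SequenceAnnotation:
    seq_id: str
    features: dict = field(default_factory=dict)   # signal peptide, TM, coils, low complexity
    domains: list = field(default_factory=list)    # structural/functional domain assignments

def annotate_sequence(seq_id, sequence, feature_predictors,
                      psiblast_vs_foldlib, fold_recognition_123d):
    """Run feature prediction, then structural assignment with a 123D fallback."""
    record = SequenceAnnotation(seq_id)

    # 1. Feature prediction: signal peptides, transmembrane regions,
    #    coiled coils, low-complexity regions.
    for name, predictor in feature_predictors.items():
        record.features[name] = predictor(sequence)

    # 2. Structural assignment of domains by PSI-BLAST against FOLDLIB.
    hits = psiblast_vs_foldlib(sequence)
    record.domains.extend(hits)

    # 3. Only sequences without a category-A ("certain") assignment are
    #    passed on to 123D fold recognition against FOLDLIB.
    if not any(hit.get("reliability") == "A" for hit in hits):
        record.domains.extend(fold_recognition_123d(sequence))

    return record
```

In the production pipeline the equivalent steps wrap command-line tools and write their results into the data warehouse rather than returning in-memory records; the sketch only shows the ordering and the fallback logic.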
Steps in Protein Annotation

- Structural assignment by sequence similarity and fold recognition:
  - Fold assignment.
  - Function assignment.
  - Modeling by alignment to a template.
- Functional assignment by sequence similarity.
- Assignment of special classes (filtering).
- Assignment of protein features.

An important issue in this process is automation and the automated quality assessment that must accompany it. In the pipeline model, this was addressed by:
- Introducing six reliability categories.
- Introducing a benchmark based on 1,000 non-redundant SCOP folds [Murzin, AG; Brenner, SE; Hubbard, T; Chothia, C. J. Mol. Biol., 1995, 247:536].
- Testing a variety of search conditions and methods within this benchmark.
Further information about the PAT project may be found at the PAT web site:

Reliability Categories (based on a selectivity benchmark):
A. Certain (99.9% of true positives among predicted positives)
B. Reliable (99%)
C. Probable (90%)
D. Possible (50%)
E. Potential (10%)
F. No annotation
Sensitivity = tp / (tp + fn)
Selectivity = tp / (tp + fp)

Large-Scale Computing Resources and Data Storage

Key to the success of the EOL project has been the ability to partner with computing projects that provide the resources needed to drive the software pipeline across more than 800 available genomes. Large-scale computing resources being recruited for the EOL project include the TeraGrid, the world's largest, fastest, and most comprehensive distributed infrastructure for open scientific research; PRAGMA, an open organization in which Pacific Rim institutions formally collaborate to develop grid-enabled applications and to deploy Grid infrastructure throughout the Pacific region; and NRAC resources, including SDSC's Blue Horizon, the University of Michigan's AMD cluster, and the University of Wisconsin Condor Flock. Another factor in the development of EOL has been the ability to deploy large-scale mass storage for the enormous amount of data generated by iGAP analyses and loaded into the EOL data warehouse schema and data marts. Ultimately, more than 10 terabytes of storage will be deployed for genome annotation alone.

Stages in EOL Data Processing and Delivery (Figure 2)

1. Publicly available genomic sequence data are obtained via a high-speed Internet2 connection from NCBI to the San Diego Supercomputer Center.
2. Sequence data are distributed to several large-scale computing resources, such as partner institutions (for example, the BII in Singapore) and the TeraGrid at SDSC (see above), to which the PAT software pipeline has been ported.
3. Data from the pipeline are deposited into a DB2-based, multi-species version of the PAT data warehouse schema and federated with data from a number of other local database projects.
4. Multiple complex queries are run on the data and the results are stored in the database.
5. Data are loaded into multiple data marts for fast, read-only query access and distribution, both to end users (via a Web interface and a SOAP-based Web services paradigm) and to EOL data mirror sites (a toy sketch of this warehouse-to-data-mart step follows the list).
6. Researchers throughout the world are able to access the data by pointing their Web browser to the EOL data Web site or one of its mirrors. Additionally, the World Wide Web Consortium (W3C) standards-based Web Services protocol allows peer-to-peer, automated computer access to the data for a variety of uses.
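Stages 3 to 5 describe a warehouse-then-data-mart pattern: pipeline results land in the warehouse, complex queries are run over them, and the results are materialized into read-only data marts for fast access. The toy sketch below illustrates only that pattern; it uses Python's built-in sqlite3 module in place of DB2 and MySQL, and the schema and example rows (domain_assignments, mart_fold_counts, the sample accessions) are invented for this example.

```python
# Toy illustration of the warehouse -> data-mart pattern described above.
# sqlite3 stands in for the DB2 warehouse and the MySQL data marts; the
# table names, columns, and rows are made up for the example.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE domain_assignments (
        genome TEXT, seq_id TEXT, scop_fold TEXT, reliability TEXT
    )""")
warehouse.executemany(
    "INSERT INTO domain_assignments VALUES (?, ?, ?, ?)",
    [("Arabidopsis thaliana", "At1g01010", "b.1", "A"),
     ("Arabidopsis thaliana", "At1g01020", "c.37", "B"),
     ("Homo sapiens", "NP_000537", "c.37", "A")])

# A "complex query" over the warehouse: per-genome fold counts, restricted
# to the high-reliability categories A and B.
rows = warehouse.execute("""
    SELECT genome, scop_fold, COUNT(*) AS n
    FROM domain_assignments
    WHERE reliability IN ('A', 'B')
    GROUP BY genome, scop_fold
    """).fetchall()

# Materialize the result in a separate data mart, which is then served
# read-only to end users and replicated to mirror sites.
mart = sqlite3.connect("fold_counts_mart.db")
mart.execute("CREATE TABLE IF NOT EXISTS mart_fold_counts "
             "(genome TEXT, scop_fold TEXT, n INTEGER)")
mart.executemany("INSERT INTO mart_fold_counts VALUES (?, ?, ?)", rows)
mart.commit()
```

The point is the pattern, not the engine: expensive queries are run once against the warehouse, and the precomputed results are copied into smaller, query-optimized stores for distribution.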
Multiple EOL Data Mirror Sites

Data mirrors will be a major component of the EOL data distribution system. A software package can be downloaded from the EOL interface that allows researchers to store selected EOL data on local machines and, if desired, to act as a public EOL data mirror. The mirror package will be based on a freely available relational database management system (MySQL) and application server (JBoss). This ensures the widest possible deployment of EOL mirror data repositories, from major university and biotech sites down to the smallest research institutions, even including high schools. The end-user experience of accessing data served in this way is fast, comprehensive, and flexible.

The EOL Model

The EOL model (Figure 2) applies the iGAP pipeline, proven by the PAT project, to all available (currently 800+) genome sequences. A key goal of the project is to provide the computational and storage resources necessary to accommodate the analysis of sequence data on this scale (current estimates are 300 CPU-years with available hardware). Ongoing efforts are aimed at obtaining more CPU resources and improving the efficiency of computational resource utilization.

Innovative Data Access

A unique aspect of the EOL model is its ability to deliver data through multiple routes. One arm of this data delivery system is the Web interface, driven by JavaServer Pages (JSP). Building on the "Encyclopedia of Life" concept, the interface provides fast access to EOL data through a book-metaphor design (Figure 3). Data are cataloged alphabetically by species, and the user is provided with multiple additional tools to search sequence data, including:
- BLAST search with a protein query sequence against the data of one or more specific species.
- Keyword search.
- Natural-language query search.
- Sequence identifier (accession ID) search.
- SCOP fold browser.
- Putative function browser.
Query results are returned in multiple forms, including a Web page summary at the genome, sequence, and structure levels, as well as links to the same information in XML, a printer-friendly PDF output, an EOL Notebook version (see below), and a narrated summary in Flash. The Web interfaces make extensive use of Scalable Vector Graphics (SVG) components to deliver fast, client-side graphical renderings of XML-encapsulated data according to W3C standards. An example is the SVG "chromosome mapper" shown in Figure 4. SVG molecular rendering is also used on the client side to provide fast, interactive, and visually informative molecular graphics.

[Figure 3: Book-metaphor Web interface.]
[Figure 4: Client-side data rendering using SVG.]

Web Services and the EOL Notebook

In addition to access via the Web, other components of data delivery include the publication of a Web Services-based API and the SDSC Blue Titan Web services network director system. Through Web Services, any researcher or data service is able to access EOL data automatically and with minimal programming effort. The EOL Notebook is a subproject within EOL (and bioinformatics.org) to create a Java-based application, distributed via JNLP, that acts as a local repository for EOL data. In addition to storing and searching data locally, the EOL Notebook will be a consumer of EOL Web Services and, via automation, will keep locally held data (stored in XML format for interoperability) in sync with the main EOL repository; a rough sketch of that synchronization idea follows.
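The synchronization behavior just described (local data stored as XML, refreshed automatically from the main repository through Web Services) can be sketched in a few lines. The EOL Notebook itself is a Java/JNLP application; the Python sketch below is only a language-neutral illustration, and fetch_remote_records, the record layout, and the eol-notebook.xml file name are assumptions made for this example rather than anything in the actual Notebook.

```python
# Rough sketch of the EOL Notebook sync idea: local records are kept as XML
# and refreshed from the main repository. fetch_remote_records() stands in
# for a real EOL Web Services call; the XML layout is invented here.
import xml.etree.ElementTree as ET
from pathlib import Path

LOCAL_STORE = Path("eol-notebook.xml")

def load_local():
    """Load the local XML store, or start an empty one."""
    if LOCAL_STORE.exists():
        return ET.parse(LOCAL_STORE).getroot()
    return ET.Element("records")

def sync(fetch_remote_records):
    """Merge newer remote records into the local XML store."""
    root = load_local()
    local = {rec.get("id"): rec for rec in root.findall("record")}
    for rec_id, version, payload in fetch_remote_records():
        existing = local.get(rec_id)
        if existing is not None and int(existing.get("version", "0")) >= version:
            continue  # local copy is already up to date
        if existing is not None:
            root.remove(existing)
        elem = ET.SubElement(root, "record", id=rec_id, version=str(version))
        elem.text = payload
    ET.ElementTree(root).write(LOCAL_STORE)

# Example: sync against a stubbed "Web service" that yields (id, version, data).
sync(lambda: [("At1g01010", 2, "putative fold: c.37; reliability: A")])
```

The essential design choice is the merge rule: a remote record replaces the local copy only when its version is newer, so repeated synchronizations are cheap and leave already up-to-date local data untouched.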