Overview of the Encyclopedia of Life (EOL) Project

Background
- Biology has become a data-driven science
- We have the blueprint (genomes) of over 800 organisms
- This number will increase rapidly, to the point in 5-10 years where your blueprint becomes a tool in your medical diagnosis
- First we must understand the buildings (proteins) that control life's processes
- EOL strives to be the 21st century "Britannica" that everyone will turn to

EOL Project Description
- The Encyclopedia of Life is a joint development of the San Diego Supercomputer Center (SDSC) and scientists and biological resources worldwide
- EOL involves SDSC staff from HPC, DAKS, Grids and clusters, and visualization
- EOL has three parts:
  1. Putative functional and 3-D structure assignment through the largest computation ever attempted
  2. True API level integration with key biological resources
  3. A focus for future collaborative developments via the EOL Notebook

Type of Questions to be Addressed by EOL
- If a knockout gene in Arabidopsis leads to an average phenotypic response of 10% increased growth, will the same likely happen in rice?
- Is protein X found in anthrax?
- Is protein X a drug target, that is, does it exist predominantly in pathogenic bacteria or is it found in eukaryotes also?
- Has caspase-1, a protein involved in cell death and aging, been identified in any plants? If so, in what species, and do the proposed protein structures look similar?
- Give me all available information on caspase-1

EOL Basic Topology
Diagram labels:
- Genomic Data
- Putative Functional and 3D Assignment
- Integration with Other Resources
- Public and Private Databases
- To Serve Thousands Worldwide

Some Technical Detail Mapped to the Topology
Diagram labels:
- TeraGrid
- Sequence data from genomic sequencing projects
- Ported applications
- Load/update scripts
- MySQL DataMart(s)
- Pipeline data: structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction
- Data warehouse: normalized DB2 schema
- Application Server
- Web/SOAP Server
- Retrieve Web pages & invoke SOAP methods

One Plant Genome Processed as a Prototype
http://arabidopsis.sdsc.edu

Current Genomic Pipeline
Inputs:
- Arabidopsis protein sequences
- Sequence info (NR, PFAM)
- Structure info (SCOP, PDB)
Pipeline steps, in slide order:
- Prediction of: signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
- Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
- Create PSI-BLAST profiles for protein sequences
- Structural assignment of domains by PSI-BLAST on FOLDLIB
- Only sequences w/out A-prediction
- Structural assignment of domains by 123D on FOLDLIB
- Only sequences w/out A-prediction
- Functional assignment by PFAM, NR, PSIPred assignments
- Domain location prediction by sequence FOLDLIB
- Store assigned regions in the DB
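The "only sequences w/out A-prediction" labels suggest a cascade: each later assignment method sees only the sequences that an earlier one did not confidently assign. The transcript does not define "A-prediction", so the sketch below simply treats it as "the stage returned a hit". The function names and the stub assigners are hypothetical stand-ins; they do not call PSI-BLAST or 123D.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: an assigner maps a protein sequence to a fold
# identifier, or returns None when it has no confident assignment.
Assigner = Callable[[str], Optional[str]]


def cascade_assign(
    sequences: Dict[str, str],
    stages: List[Tuple[str, Assigner]],
) -> Dict[str, Tuple[str, str]]:
    """Run stages in order; each stage sees only still-unassigned sequences."""
    assigned: Dict[str, Tuple[str, str]] = {}
    remaining = dict(sequences)
    for stage_name, assign in stages:
        still_unassigned = {}
        for seq_id, seq in remaining.items():
            hit = assign(seq)
            if hit is None:
                still_unassigned[seq_id] = seq
            else:
                assigned[seq_id] = (stage_name, hit)
        remaining = still_unassigned
    return assigned


# Toy stand-ins for the real tools (assumptions, for illustration only).
def psi_blast_stub(seq: str) -> Optional[str]:
    return "fold_1" if seq.startswith("M") else None


def threading_stub(seq: str) -> Optional[str]:
    # Reuses the slide's 25 aa minimum size purely as a toy threshold.
    return "fold_2" if len(seq) >= 25 else None


toy = {"a": "MKT", "b": "A" * 30, "c": "GG"}
result = cascade_assign(
    toy, [("PSI-BLAST", psi_blast_stub), ("123D", threading_stub)]
)
print(result)
```

Ordering the stages this way means a costlier method placed later runs on fewer sequences, which would matter at the multi-genome scale the project targets.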

Scale of Multi-genome Analysis
~800 genomes @ 10k-20k per genome = ~10^7 ORFs
Inputs:
- Genomes: protein sequences
- Sequence info (NR, PFAM)
- Structure info (SCOP, PDB)
Pipeline steps, in slide order (CPU-year figures appear where they were placed on the diagram):
- Prediction of: signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
- 4 CPU years
- Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
- 10^4 entries
- 228 CPU years
- Create PSI-BLAST profiles for protein sequences
- Structural assignment of domains by PSI-BLAST on FOLDLIB
- 3 CPU years
- Only sequences w/out A-prediction
- Structural assignment of domains by 123D on FOLDLIB
- 9 CPU years
- Only sequences w/out A-prediction
- 252 CPU years
- Functional assignment by PFAM, NR, PSIPred assignments
- 3 CPU years
- Domain location prediction by sequence FOLDLIB
- Store assigned regions in the DB
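As a quick check of the headline figure, the ORF count follows from the two numbers on the slide. This is plain arithmetic on the slide's round numbers, not a measured count.

```python
import math

GENOMES = 800                   # "~800 genomes" from the slide
ORFS_PER_GENOME_LOW = 10_000    # "10k-20k per" genome
ORFS_PER_GENOME_HIGH = 20_000

low = GENOMES * ORFS_PER_GENOME_LOW
high = GENOMES * ORFS_PER_GENOME_HIGH

# Order of magnitude of the geometric midpoint of the range.
order = round(math.log10(math.sqrt(low * high)))

print(f"{low:,} to {high:,} ORFs, on the order of 10^{order}")
```

The range works out to 8 to 16 million ORFs, which is consistent with the slide's ~10^7.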

TeraGrid application
Technical aspects:
- Excellent charter application for the TeraGrid project!
- Good demonstration of producing practical output from TeraGrid computing: scientific papers and an extensive web site and services will be produced
- Software pipeline now a proven technique and a sure bet
- Can be implemented in the fastest possible time; project already initialized

EOL Data Services
Diagram labels:
- Pipeline data: structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction
- Load/update scripts
- Data warehouse
- MySQL DataMart(s)
- Application server
- SOAP/Web Server
- Publish Web Services & API
- UDDI directory
- Web pages served via JSP
- EOL Notebook
- Data incorporated into third party web pages
- Automated data downloads to mirrors and researchers
- Encyclopedia of Life WWW
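To make the "SOAP/Web Server" and "Publish Web Services & API" labels concrete, here is a minimal sketch of how a client might build a SOAP 1.1 request envelope using only the Python standard library. The slides do not name any EOL operations, so the service namespace `urn:example:eol`, the method name `getAnnotation`, and the `proteinId` parameter are all hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
EOL_NS = "urn:example:eol"  # hypothetical service namespace (assumption)


def build_request(method: str, params: dict) -> bytes:
    """Serialize a SOAP 1.1 envelope that calls `method` with `params`."""
    ET.register_namespace("soap", SOAP_ENV)
    ET.register_namespace("eol", EOL_NS)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{EOL_NS}}}{method}")
    for name, value in params.items():
        # Parameters are emitted as unqualified child elements.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical call: the method and parameter names are illustrative only.
xml_bytes = build_request("getAnnotation", {"proteinId": "example-id"})
print(xml_bytes.decode("utf-8"))
```

A real client would POST these bytes to the service endpoint and parse the response envelope; a WSDL description, if one were published via the UDDI directory, would supply the actual operation and parameter names.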

Basic Web Interface
Encyclopedia of Life, browsers by platform:
- Microsoft Windows: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
- Apple Macintosh: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
- Linux: Netscape 4.7/6.1, Mozilla v1.0, Opera
- Win-CE and pen-based devices: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera

Local Data Mirrors
Diagram labels:
- SDSC: MySQL DataMart(s) (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction), SDSC SOAP Server
- Mirror Manager
- Request for bulk data streams
- Data Management Layer
- MySQL DataMart(s) (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction)
- BLAST server
- SOAP Server
- Web Interface

Local Data Mirrors
- Support for server platforms, i.e. Sparc Solaris, IRIX, Linux
- Based on MySQL + Apache because of availability
- Automated mirror registration and listing
- User-friendly admin for mirror maintenance
- Means of metering data usage per species data stream, to generate revenue from industry
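The metering item could be approached in many ways; the slides give no design. A minimal sketch, assuming usage is tracked as bytes served per client and per species data stream (the class name, client names, and numbers below are illustrative):

```python
from collections import defaultdict
from typing import Dict


class UsageMeter:
    """Accumulates bytes served, keyed by (client, species data stream)."""

    def __init__(self) -> None:
        self._bytes = defaultdict(int)

    def record(self, client: str, species: str, n_bytes: int) -> None:
        if n_bytes < 0:
            raise ValueError("n_bytes must be non-negative")
        self._bytes[(client, species)] += n_bytes

    def usage_for(self, client: str) -> Dict[str, int]:
        """Return bytes served to one client, broken down by species stream."""
        return {
            species: total
            for (who, species), total in self._bytes.items()
            if who == client
        }


meter = UsageMeter()
meter.record("mirror-a", "arabidopsis", 1_000)
meter.record("mirror-a", "arabidopsis", 500)
meter.record("mirror-a", "rice", 200)
meter.record("mirror-b", "rice", 50)
print(meter.usage_for("mirror-a"))
```

Keeping the species stream in the key is what allows per-stream billing; an in-memory dictionary is only for illustration, and a deployed mirror would presumably persist these counts, for example in its MySQL database.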

EOL Notebook
Diagram labels:
- Encyclopedia of Life: EOL DataMart (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction), SOAP Server
- EOL SOAP Queries
- Invoke
- Virtual community messaging
- Metadata sharing
- XML/RDF store
- BLAST Data
- Keyword data
- Scheduler
- Stored queries
- BLAST Annotations
- Keyword queries
- Session info

EOL Notebook
- Provides a consistent, advanced, cross-platform GUI to view data returned from queries to the EOL database via Web Services
- Provides persistence of both queries and returned data via a local XML database
- Provides a mechanism to enable unattended, scheduled, periodic queries
- Provides a means to annotate data and results and share those with others, in effect a scientific Napster
- Provides a means to create virtual communities
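The persistence and scheduling items can be sketched together: stored queries are written to XML, read back, and filtered for those that are due to run. The element and attribute names, the hours-based clock, and the example queries are assumptions; the slides do not describe the Notebook's actual storage format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def save_queries(queries: List[Dict]) -> str:
    """Serialize stored queries to an XML string (hypothetical layout)."""
    root = ET.Element("stored_queries")
    for q in queries:
        el = ET.SubElement(
            root,
            "query",
            name=q["name"],
            interval_hours=str(q["interval_hours"]),
            last_run=str(q["last_run"]),
        )
        el.text = q["text"]
    return ET.tostring(root, encoding="unicode")


def load_queries(xml_text: str) -> List[Dict]:
    """Parse the XML produced by save_queries back into dictionaries."""
    return [
        {
            "name": el.get("name"),
            "interval_hours": float(el.get("interval_hours")),
            "last_run": float(el.get("last_run")),
            "text": el.text or "",
        }
        for el in ET.fromstring(xml_text).findall("query")
    ]


def due_queries(queries: List[Dict], now_hours: float) -> List[Dict]:
    """Return the queries whose interval has elapsed since their last run."""
    return [
        q for q in queries
        if now_hours - q["last_run"] >= q["interval_hours"]
    ]


stored = [
    {"name": "daily-keyword", "interval_hours": 24.0, "last_run": 0.0,
     "text": "keyword: caspase-1"},
    {"name": "weekly-blast", "interval_hours": 168.0, "last_run": 0.0,
     "text": "blast: <sequence>"},
]
reloaded = load_queries(save_queries(stored))
print([q["name"] for q in due_queries(reloaded, now_hours=30.0)])
```

A scheduler loop would call `due_queries` periodically, send each due query to the SOAP server, store the result, and update `last_run`.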

Summary
1. EOL is a large-scale data analysis project, one of the largest biological computations attempted, whose results will be eagerly awaited by an enormous number of biologists
2. Core scientific analysis techniques well-proven in existing Arabidopsis project
3. It's a perfect choice as a charter application for the TeraGrid
   - Very large scale computation
   - Pipeline-type computations well suited to the Grid platform
   - High visibility and very practical use of TeraGrid results
   - TeraGrid name will become associated with high quality data analysis