High Throughput Proteomics and the Encyclopedia of Life. Mark A. Miller, Ph.D., Integrative BioScience Program, San Diego Supercomputer Center.


Biology in : how can we harness the data explosion to help us cross scales and disciplines?
Scales: Organisms, Organs, Cells, Organelles, Biopolymers, Atoms.
Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry.

Long Term Goal: data collected across scales becomes accessible across disciplines via GUIs as translators.
Diagram labels: Database; Users; Domain Specific GUI; "The GRID".
Scales: Organisms, Organs, Cells, Organelles, Biopolymers, Atoms.
Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry.

A Grand Challenge Uniting Novel Sequence/Structure Analysis Methods and Grid Computation

ENCYCLOPEDIA of LIFE: a collaborative project based at the San Diego Supercomputer Center (SDSC), open to scientists and biological resources worldwide.
The EOL project has three goals:
1. Putative functional and 3-D structure assignment through the largest computation ever attempted in biology.
2. True API-level integration with key biological resources.
3. A focus for future collaborative developments via the EOL Notebook.

Community works to improve individual protein sequence analysis tools.
Diagram labels: 1 genome's sequences; Tool 1, Tool 2, Tool 3, Tool 4; DATABASE.
Features: new tools for sequence annotation; new tools for structure analysis; new tools for structure prediction.
Limitations: annotation one genome at a time; single user runs; single program runs.

EOL: Basic Strategy
1. Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline.
2. Integration with Existing Local and International Data Resources.
3. Present to the International Community through a Vibrant and Creative Interface.

How Will EOL Use Grid Resources
1. Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
2. Integration with Existing Local and International Data Resources: massive amounts of data to move and use in analyses/simulations.
3. Present to the International Community through a Vibrant and Creative Interface: portals for individual operators; scientific Napster (collaboration).

Where are we now?
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
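"Embarrassingly parallel" here means that each protein sequence can be annotated without reference to any other, so the work splits cleanly across processors. A minimal sketch of that property follows. The `annotate` function is a hypothetical stand-in for one pipeline run, not EOL code, and a thread pool stands in for the cluster or grid nodes a real run would use.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(sequence: str) -> dict:
    """Hypothetical stand-in for one full pipeline run on one sequence."""
    return {
        "length": len(sequence),
        # Toy feature: flag sequences built from at most two residue types.
        "low_complexity": len(set(sequence)) <= 2,
    }


def annotate_all(sequences, workers=4):
    # No task reads another task's input or output, so the work can be
    # farmed out in any order; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, sequences))


if __name__ == "__main__":
    for result in annotate_all(["MKVLAT", "AAAAAA", "GSGSGS"]):
        print(result)
```

Because no task depends on another, doubling the number of workers should roughly halve the wall-clock time until data movement becomes the limit.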

Current Genomic Pipeline
Input: genome protein sequences.
Prediction of: signal peptides (SignalP, PSORT); transmembrane (TMHMM, PSORT); coiled coils (COILS); low complexity regions (SEG).
Structural assignment of domains by PSI-BLAST on FOLDLIB.
Only sequences w/out A-prediction: structural assignment of domains by 123D on FOLDLIB.
Create PSI-BLAST profiles for protein sequences.
DATABASE.
Functional assignment by PFAM, NR, PSIPred.
Data sources: FOLDLIB; NR, PFAM; SCOP, PDB.
Domain location prediction by sequence structure info.
Building FOLDLIB: PDB chains; SCOP domains; PDP domains; CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30).
Only sequences w/out A-prediction.
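The pipeline applies structure assignment as a cascade: PSI-BLAST against FOLDLIB runs first, and 123D is run only on sequences still lacking an "A-prediction". The sketch below shows that control flow under the assumption that "A-prediction" means a confident assignment. The function name `assign_folds` and the toy assigners are hypothetical and do not call the real tools.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An assigner returns a fold identifier, or None when it has no confident hit.
Assigner = Callable[[str], Optional[str]]


def assign_folds(
    sequences: Dict[str, str],
    methods: List[Tuple[str, Assigner]],
) -> Dict[str, Tuple[str, str]]:
    """Try each method in order; later methods see only unassigned sequences."""
    assigned: Dict[str, Tuple[str, str]] = {}
    remaining = dict(sequences)
    for method_name, assigner in methods:
        for seq_id, seq in list(remaining.items()):
            fold = assigner(seq)
            if fold is not None:
                assigned[seq_id] = (method_name, fold)
                del remaining[seq_id]  # never passed on to the next method
    return assigned


if __name__ == "__main__":
    # Toy assigners standing in for PSI-BLAST and 123D searches on FOLDLIB.
    toy_psi_blast = lambda s: "foldA" if s.startswith("M") else None
    toy_123d = lambda s: "foldB" if len(s) > 3 else None
    print(assign_folds(
        {"p1": "MKV", "p2": "AKVLL", "p3": "AK"},
        [("psi_blast", toy_psi_blast), ("123d", toy_123d)],
    ))
```

Putting the cheaper or more reliable method first means the later method runs on a smaller set, which matters when each sequence costs on the order of a CPU hour.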

Where are we now?
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
Current pipeline rate: about 1 CPU hour/sequence.
~800 genomes (and growing) = ~10^7 ORFs (and growing).
~10^7 CPU hours* = > 1000 CPU years (and growing).
Allocated BH CPU hours in an NRAC year: 4 × 10^6.
*for one pass through the pipeline!
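The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers quoted on the slide. The 8760-hour year is an assumption, and "BH" is read as the IBM Blue Horizon machine mentioned later in the talk.

```python
# Back-of-the-envelope cost of one pass through the pipeline.
CPU_HOURS_PER_SEQUENCE = 1.0     # "about 1 CPU hour/sequence"
ORF_COUNT = 10**7                # "~10^7 ORFs" across ~800 genomes
ALLOCATED_CPU_HOURS = 4 * 10**6  # BH allocation in one NRAC year
HOURS_PER_YEAR = 24 * 365        # assumption: non-leap year, one CPU busy all year


def one_pass_cost(orfs=ORF_COUNT, rate=CPU_HOURS_PER_SEQUENCE):
    """Return (cpu_hours, cpu_years, yearly_allocations) for one pipeline pass."""
    cpu_hours = orfs * rate
    cpu_years = cpu_hours / HOURS_PER_YEAR
    yearly_allocations = cpu_hours / ALLOCATED_CPU_HOURS
    return cpu_hours, cpu_years, yearly_allocations


if __name__ == "__main__":
    hours, years, allocations = one_pass_cost()
    print(f"{hours:.0e} CPU hours, about {years:.0f} CPU years, "
          f"{allocations:.1f} yearly allocations")
```

With these inputs a single pass needs about 1140 CPU years, which is 2.5 times the quoted yearly allocation. That gap is the case for spreading the run over partner resources.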

Annotation Strategy
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
Step: show value in the annotation pipeline in a manual 1 genome run.

arabidopsis.sdsc.edu: One Plant Genome Processed as a Prototype

Annotation Strategy
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
Steps: show value in the annotation pipeline in a manual 1 genome run; port the pipeline to local and partner resources.

Announcing the first EOL genome annotated by an international partner:
Annotation of the puffer fish genome (Takifugu rubripes) was completed recently. The team, led by Larry Ang and Atif Shahab at The BioInformatics Institute (BII), Singapore, used the iGAP pipeline. Data link: More genomes are currently being processed at BII.

WHY Puffer fish?
The human genome sequence requires three billion base pairs to encode all genes. The puffer fish Fugu rubripes has only 350 million base pairs (roughly 10-fold fewer) to encode a gene complement very similar to that of humans, and most of the junk DNA in the human genome is absent. "It's almost like the human genome written in shorthand."

Annotation Strategy
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
Steps:
1. Show value in the annotation pipeline in a manual 1 genome run.
2. Port the pipeline to local and partner resources.
3. Run the pipeline remotely on distributed local resources. APST: Globus and Condor friendly, but also Globus and Condor independent. Running on EOL Cluster, Sun Ultra, 4 Sun E10's; demo to follow.
4. Run the pipeline remotely on partner resources using APST.

Annotation Strategy
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
Run the pipeline remotely on partner resources using APST:
In production: SDSC.
In principle: BII, Singapore; TeraGrid, USA; PRAGMA, multi-national; U. Wisconsin Condor flock, USA; IPICyT, Mexico.
In discussion: Belfast E-Science center, Ireland; TITEC, Japan; UFCG, Brasil.

EOL Annotation – Lessons Learned
So far, the biggest hurdle is establishing resource access.
We contribute to grid development as users by pushing the specifications and interfaces of the tool developers.

How Will EOL Use Grid Resources
1. Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
2. Integration with Existing Local and International Data Resources: massive amounts of data to move and use in analyses/simulations.

Some Technical Details Mapped to the Topology
Diagram labels: genomic sequencing data; Ensembl!; Global Grid Partners; Ported applications; Pipeline data; Extraction, Transformation, Loading; Data Warehouse; OLAP; MySQL DataMart(s); Structure assignment by PSI-BLAST; Structure assignment by 123D; Domain location prediction; Application Server; Web/SOAP Server; Retrieve Web pages & Invoke SOAP methods.
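The extraction, transformation, and loading path from pipeline output into a data mart can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration: the tab-separated input format, the `assignment` table and its columns, and the use of an in-memory SQLite database in place of the MySQL data marts named on the slide.

```python
import sqlite3


def extract(raw_lines):
    """Extraction: parse tab-separated pipeline output (hypothetical format)."""
    for line in raw_lines:
        seq_id, method, fold, score = line.rstrip("\n").split("\t")
        yield seq_id, method, fold, score


def transform(rows, min_score=0.0):
    """Transformation: normalise method names, cast scores, drop weak hits."""
    for seq_id, method, fold, score in rows:
        value = float(score)
        if value >= min_score:
            yield seq_id, method.lower(), fold, value


def load(conn, rows):
    """Loading: insert the cleaned rows into a data-mart table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assignment "
        "(seq_id TEXT, method TEXT, fold TEXT, score REAL)"
    )
    conn.executemany("INSERT INTO assignment VALUES (?, ?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    raw = ["p1\tPSI-BLAST\tfoldA\t0.9\n", "p2\t123D\tfoldB\t0.1\n"]
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(raw), min_score=0.5))
    print(conn.execute("SELECT * FROM assignment").fetchall())
```

Keeping the three stages as separate generators lets each be swapped independently, for example a different parser per ported application feeding the same load step.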

How Will EOL Use Grid Resources
1. Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel.
2. Integration with Existing Local and International Data Resources: massive amounts of data to move and use in analyses/simulations.
3. Present to the International Community through a Vibrant and Creative Interface: portals for individual operators; scientific Napster (collaboration).

Encyclopedia of Life
Diagram labels: Pipeline data; Data warehouse; OLAP; MySQL DataMart(s); Structure assignment by PSI-BLAST; Structure assignment by 123D; Domain location prediction; Application server; SOAP/Web Server; UDDI directory; Publish Web Services & API; Automated data downloads to mirrors and researchers; WWW; Data incorporated into third party web pages; Web pages served via JSP; EOL Notebook.

Encyclopedia of Life
Diagram labels: EOL Notebook; Metadata sharing; Virtual community messaging; EOL SOAP Queries; SOAP Server; EOL DataMart; Structure assignment by PSI-BLAST; Structure assignment by 123D; Domain location prediction; XML/RDF store; BLAST data; Keyword data; Stored queries; Annotations; Session info; Scheduler; BLAST; Keyword queries; Invoke.

EOL Notebook
Provides a consistent, advanced, cross-platform GUI to view data returned from queries to the EOL database via Web Services.
Provides persistence of both queries and returned data via a local XML database.
Provides a mechanism to enable unattended, scheduled, periodic queries.
Provides a means to annotate data and results and share those with others, in effect a scientific Napster.
Provides a means to create virtual communities.
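Persisting queries together with their returned data in local XML can be sketched with the Python standard library. The element names (`notebook`, `query`, `result`) and both function names are hypothetical. The slides do not describe the Notebook's actual storage schema.

```python
import xml.etree.ElementTree as ET


def save_session(path, entries):
    """Write (query_text, [result, ...]) pairs to a local XML file."""
    root = ET.Element("notebook")
    for query_text, results in entries:
        query = ET.SubElement(root, "query", text=query_text)
        for result in results:
            ET.SubElement(query, "result").text = result
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_session(path):
    """Read the file back into the same list-of-pairs structure."""
    root = ET.parse(path).getroot()
    return [
        (query.get("text"), [r.text for r in query.findall("result")])
        for query in root.findall("query")
    ]


if __name__ == "__main__":
    save_session("notebook.xml", [("kinase", ["P12345", "Q67890"])])
    print(load_session("notebook.xml"))
```

Storing the query text next to its results is what would make an unattended, scheduled re-run possible: a scheduler only has to read the stored queries back and submit them again.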

Portal Applications
CE (Combinatorial Extension) is a structural similarity search algorithm developed by I.N. Shindyalov.
Beta version available via secure HTTP.
Access to IBM Blue Horizon (1024 processors).
NPACI (National Partnership for Advanced Computational Infrastructure) users get access by quota.
Anonymous usage is available in limited fashion.

PAT Interface:

EOL creates a high throughput environment and delivers content
The goal of EOL is to incorporate the best sequence analysis tools in an automated annotation process, and to use web tools to increase impact and serve the results to the community.
Features: annotation of all genomes by automated program portfolio; all runs stored in federated database; federation of local and public databases at API level; results served via SOAP server; interface facilitates novel queries; interface facilitates data management and exchange.
Diagram labels: ALL genome sequences; Annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4); DATABASE; SOAP Services.

What We Want From the Grid
1. Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline: high CPU requirements; so far embarrassingly parallel. Want: access to distributed resources of many types.
2. Integration with Existing Local and International Data Resources: massive amounts of data to move and use in analyses/simulations. Want: the ability to store, move and access data in a high performance modality.
3. Present to the International Community through a Vibrant and Creative Interface: portals for individual operators; scientific Napster (collaboration). Want: the ability to use the above in an interactive web interface.

Acknowledgements
SDSC-IBS: Philip E. Bourne, Ilya N. Shindyalov, Greg Quinn, Wilfred Li, Coleman Mosley, Dmitry Pekurovsky, Kim Baldridge, Jerry Rowley, Neil Cotofana, Vicente Reyes, Robert Byrnes, Celine Amoreira, Yohan Potier.
SDSC-GRAIL: Henri Casanova, Jim Hayes, Adam Birnbaum.
Ceres Inc.: Nickolai Alexandrov, Richard Flavell.
BII, Singapore: Larry Ang, Atif Shahab, Kishore Sakharkar.
EBI: Gareth Stockwell.
CE portal