High Throughput Proteomics and the Encyclopedia of Life
Mark A. Miller, Ph.D.
Integrative BioScience Program
San Diego Supercomputer Center
Biology in : how can we harness the data explosion to help us cross scales and disciplines?
- Scales: Organisms, Organs, Cells, Organelles, Biopolymers, Atoms
- Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry
Long Term Goal: data collected across scales becomes accessible across disciplines via GUIs as translators
- Diagram labels: Database Users, Domain Specific GUI, “The GRID”
- Scales: Organisms, Organs, Cells, Organelles, Biopolymers, Atoms
- Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry
A Grand Challenge Uniting Novel Sequence/Structure Analysis Methods and Grid Computation
ENCYCLOPEDIA of LIFE
A collaborative project based at the San Diego Supercomputer Center (SDSC); open to scientists and biological resources worldwide.
The EOL project has three goals:
- Putative functional and 3-D structure assignment through the largest computation ever attempted in biology
- True API level integration with key biological resources
- A focus for future collaborative developments via the EOL Notebook
Community works to improve individual protein sequence analysis tools
Diagram labels: 1 genome’s sequences; Tool 1, Tool 2, Tool 3, Tool 4; DATABASE
Features:
- new tools for sequence annotation
- new tools for structure analysis
- new tools for structure prediction
Limitations:
- annotation one genome at a time
- single user runs
- single program runs
EOL: Basic Strategy
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
- Integration with Existing Local and International Data Resources
- Present to the International Community through a Vibrant and Creative Interface
How Will EOL Use Grid Resources?
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Integration with Existing Local and International Data Resources
  - Massive amounts of data to move and use in analyses/simulations
- Present to the International Community through a Vibrant and Creative Interface
  - Portals for individual operators
  - Scientific Napster (collaboration)
Where are we now?
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
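
"Embarrassingly parallel" here means that each sequence can be annotated without reference to any other, so the work can be split across processors in any order. A minimal sketch of that idea follows. The `annotate` function is a hypothetical stand-in for one pipeline run on one sequence, not part of the EOL software, and a local thread pool stands in for grid resources.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(sequence: str) -> dict:
    """Hypothetical stand-in for one pipeline run on one protein sequence."""
    return {"length": len(sequence), "has_met_start": sequence.startswith("M")}


def annotate_all(sequences: list[str], workers: int = 4) -> list[dict]:
    # No task depends on another, so tasks can run in any order on any worker.
    # map() still returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, sequences))


if __name__ == "__main__":
    for result in annotate_all(["MKTAYIAK", "GSHMLE"]):
        print(result)
```

On a real grid each task would more likely be a separate batch job than a thread, but the independence between tasks is the same.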
Current Genomic Pipeline
Input: genome protein sequences
Steps:
- Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low complexity regions (SEG)
- Create PSI-BLAST profiles for protein sequences
- Structural assignment of domains by PSI-BLAST on FOLDLIB
- Structural assignment of domains by 123D on FOLDLIB (only sequences w/out A-prediction)
- Functional assignment by PFAM, NR, PSIPred (only sequences w/out A-prediction)
- Domain location prediction by sequence
Data resources: DATABASE; FOLDLIB; NR, PFAM; SCOP, PDB; structure info
Building FOLDLIB:
- PDB chains
- SCOP domains
- PDP domains
- CE matches PDB vs. SCOP
- 90% sequence non-identical
- minimum size 25 aa
- coverage (90%, gaps <30, ends <30)
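
The "only sequences w/out A-prediction" gating can be read as a cascade: each method runs only on the sequences that earlier methods left unassigned. The sketch below illustrates that reading under stated assumptions. The assigner functions are toy placeholders, not calls to PSI-BLAST or 123D, and the names `cascade`, `fast_method` and `slow_method` are hypothetical.

```python
from typing import Callable, Optional

# An assigner takes a sequence and returns a fold label, or None if it finds nothing.
Assigner = Callable[[str], Optional[str]]


def cascade(sequences: list[str], assigners: list[tuple[str, Assigner]]):
    """Run assigners in order; each one sees only the still-unassigned sequences."""
    assigned: dict[str, tuple[str, str]] = {}
    remaining = list(sequences)
    for name, assign in assigners:
        still_unassigned = []
        for seq in remaining:
            fold = assign(seq)
            if fold is None:
                still_unassigned.append(seq)
            else:
                assigned[seq] = (name, fold)
        remaining = still_unassigned
    return assigned, remaining


# Toy placeholder methods (assumptions for illustration only).
def fast_method(seq: str) -> Optional[str]:
    return "fold-A" if seq.startswith("M") else None


def slow_method(seq: str) -> Optional[str]:
    return "fold-B" if len(seq) > 5 else None


if __name__ == "__main__":
    done, left = cascade(
        ["MKT", "GSHMLEQ", "GS"],
        [("fast_method", fast_method), ("slow_method", slow_method)],
    )
    print(done)
    print(left)
```

Ordering the methods this way means a later method never spends time on a sequence that an earlier one has already assigned.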
Where are we now?
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Current pipeline rate: about 1 cpu hour/sequence
- ~800 genomes (and growing) = ~10^7 ORFs (and growing)
- ~10^7 cpu hours* (> 1000 cpu years, and growing)
- Allocated Blue Horizon (BH) cpu hours in an NRAC year: 4 x 10^6
*for one pass through the pipeline!
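
The arithmetic behind these figures can be checked in a few lines. The inputs are the rounded numbers from the slide; the variable names are only for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours in a non-leap year

orfs = 10**7                    # ~10^7 ORFs across ~800 genomes
cpu_hours_per_sequence = 1      # current pipeline rate
allocated_cpu_hours = 4 * 10**6 # Blue Horizon allocation for one NRAC year

total_cpu_hours = orfs * cpu_hours_per_sequence
cpu_years = total_cpu_hours / HOURS_PER_YEAR
allocation_ratio = total_cpu_hours / allocated_cpu_hours

print(f"one pass: {total_cpu_hours:.0e} cpu hours, about {cpu_years:.0f} cpu years")
print(f"one pass needs {allocation_ratio} times the yearly allocation")
```

With these inputs, a single pass comes to a little over 1100 cpu years, or 2.5 times the yearly allocation.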
Annotation Strategy
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Show value in the annotation pipeline in a manual 1 genome run
One Plant Genome Processed as a Prototype
arabidopsis.sdsc.edu
Annotation Strategy
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Show value in the annotation pipeline in a manual 1 genome run
- Port the pipeline to local and partner resources
Announcing the first EOL genome annotated by an international partner:
Annotation of the puffer fish genome (Takifugu rubripes) was completed recently. The team was led by Larry Ang and Atif Shahab at The BioInformatics Institute (Singapore), using the iGAP pipeline.
Data link:
More genomes are currently being processed at BII.
WHY Puffer fish?
- The human genome sequence requires three billion base pairs to encode all genes.
- The puffer fish Fugu rubripes has only 350 million base pairs (nearly 10-fold fewer) to encode a very similar gene complement to humans, and most of the junk DNA in the human genome is absent.
- "It's almost like the human genome written in shorthand."
Annotation Strategy
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Show value in the annotation pipeline in a manual 1 genome run
- Port the pipeline to local and partner resources
- Run the pipeline remotely on distributed local resources
  - APST: Globus, Condor friendly; but also Globus, Condor independent
  - Running on EOL Cluster, Sun Ultra, 4 Sun E10’s; Demo to follow
- Run the pipeline remotely on partner resources using APST
Annotation Strategy
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Run the pipeline remotely on partner resources using APST
  - In production: SDSC
  - In principle: BII, Singapore; Teragrid, USA; PRAGMA, multi-national; U. Wisconsin condor flock, USA; IPICyT, Mexico
  - In discussion: Belfast E-Science center, Ireland; TITEC, Japan; UFCG, Brazil
EOL Annotation – Lessons Learned
- So far, the biggest hurdle is establishing resource access.
- We contribute to grid development as users by pushing the specifications and interfaces of the tool developers.
How Will EOL Use Grid Resources?
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Integration with Existing Local and International Data Resources
  - Massive amounts of data to move and use in analyses/simulations
Some Technical Details Mapped to the Topology
[Figure: architecture diagram. Labels: genomic sequencing data; Ensembl; Global Grid Partners; Ported applications; Structure assignment by PSI-BLAST; Structure assignment by 123D; Domain location prediction; Pipeline data; Extraction, Transformation, Loading; Data Warehouse; OLAP; MySQL DataMart(s); Application Server; Web/SOAP Server; Retrieve Web pages & Invoke SOAP methods]
How Will EOL Use Grid Resources?
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Integration with Existing Local and International Data Resources
  - Massive amounts of data to move and use in analyses/simulations
- Present to the International Community through a Vibrant and Creative Interface
  - Portals for individual operators
  - Scientific Napster (collaboration)
Encyclopedia of Life
[Figure: architecture diagram. Labels: Data warehouse; Pipeline data; OLAP; MySQL DataMart(s); Structure assignment by PSI-BLAST; Structure assignment by 123D; Domain location prediction; Application server; SOAP/Web Server; UDDI directory; Publish Web Services & API; Automated data downloads to mirrors and researchers; WWW; Data incorporated into third party web pages; Web pages served via JSP; EOL Notebook]
Encyclopedia of Life
[Figure: EOL Notebook diagram. Labels: EOL Notebook; Metadata sharing; Virtual community messaging; EOL SOAP Queries; SOAP Server; EOL DataMart; Structure assignment by PSI-BLAST; Structure assignment by 123D; Domain location prediction; XML/RDF store; BLAST Data; Keyword data; Stored queries; Annotations; Session info; Scheduler; BLAST; Keyword queries; Invoke]
EOL Notebook
- Provides a consistent, advanced, cross-platform GUI to view data returned from queries to the EOL database via Web Services
- Provides persistence of both queries and returned data via a local XML database
- Provides a mechanism to enable unattended, scheduled, periodic queries
- Provides a means to annotate data and results and share those with others, in effect a scientific Napster
- Provides a means to create virtual communities
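
As an illustration of what persisting a query together with its returned data in local XML might look like, here is a minimal sketch using Python's standard library. The element names (`stored_query`, `query`, `results`, `hit`), the function names and the sample values are all assumptions made for this example; they are not the EOL Notebook's actual schema or API.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_query(path: str, name: str, query_text: str, results: list[str]) -> None:
    """Write one query and its returned hits to a local XML file."""
    root = ET.Element("stored_query", {"name": name})
    ET.SubElement(root, "query").text = query_text
    results_el = ET.SubElement(root, "results")
    for hit in results:
        ET.SubElement(results_el, "hit").text = hit
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_query(path: str) -> tuple[str, str, list[str]]:
    """Read back the query name, query text and hits from the XML file."""
    root = ET.parse(path).getroot()
    hits = [hit.text for hit in root.find("results")]
    return root.get("name"), root.findtext("query"), hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "query.xml")
        save_query(path, "kinases", "keyword:kinase", ["hit-1", "hit-2"])
        print(load_query(path))
```

A stored query in this form could be re-run later by a scheduler and its new results compared with the saved ones.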
Portal Applications
- CE (Combinatorial Extension) is a structural similarity search algorithm developed by I.N. Shindyalov.
- Beta version available via secure HTTP.
- Access to IBM Blue Horizon (1024 processors).
- NPACI (National Partnership for Advanced Computational Infrastructure) users get access by quota.
- Anonymous usage available in limited fashion.
PAT Interface:
The goal of EOL is to incorporate the best sequence analysis tools in an automated annotation process, and to use web tools to increase impact and serve the results to the community.
Features:
- annotation of all genomes by automated program portfolio
- all runs stored in federated database
- federation of local and public databases at API level
- results served via SOAP server
- interface facilitates novel queries
- interface facilitates data management and exchange
Diagram labels: ALL genome sequences; Annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4); DATABASE; SOAP Services
EOL creates a high throughput environment and delivers content
What We Want From the Grid
- Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
  - High cpu requirements
  - So far embarrassingly parallel
- Integration with Existing Local and International Data Resources
  - Massive amounts of data to move and use in analyses/simulations
- Present to the International Community through a Vibrant and Creative Interface
  - Portals for individual operators
  - Scientific Napster (collaboration)
- Access to distributed resources of many types
- The ability to store, move and access data in a high performance modality
- The ability to use the above in an interactive web interface
Acknowledgements
- SDSC-IBS: Philip E. Bourne, Ilya N. Shindyalov, Greg Quinn, Wilfred Li, Coleman Mosley, Dmitry Pekurovsky, Kim Baldridge, Jerry Rowley, Neil Cotofana, Vicente Reyes, Robert Byrnes, Celine Amoreira, Yohan Potier
- SDSC-GRAIL: Henri Casanova, Jim Hayes, Adam Birnbaum
- Ceres Inc.: Nickolai Alexandrov, Richard Flavell
- BII, Singapore: Larry Ang, Atif Shahab, Kishore Sakharkar
- EBI: Gareth Stockwell
- CE portal