Published by Marion Parsons. Modified over 6 years ago.
1
Overview of the Encyclopedia of Life (EOL) Project
2
Background
Biology has become a data-driven science
We have the blueprint (genomes) of over 800 organisms
This number will increase rapidly, to the point in 5-10 years where your blueprint becomes a tool in your medical diagnosis
First we must understand the buildings (proteins) that control life’s processes
EOL strives to be the 21st century “Britannica” that everyone will turn to
3
EOL Project Description
The Encyclopedia of Life is a joint development of the San Diego Supercomputer Center (SDSC) and scientists and biological resources worldwide
EOL involves SDSC staff from HPC, DAKS, Grids and Clusters, and Visualization
EOL has three parts:
1. Putative functional and 3-D structure assignment through the largest computation ever attempted
2. True API-level integration with key biological resources
3. A focus for future collaborative developments via the EOL Notebook
4
Type of Questions to be Addressed by EOL
If a knockout gene in Arabidopsis leads to an average phenotypic response of 10% increased growth, will the same likely happen in rice?
Is protein X found in anthrax?
Is protein X a drug target, that is, does it exist predominantly in pathogenic bacteria or is it found in eukaryotes also?
Has caspase-1, a protein involved in cell death and aging, been identified in any plants? If so, in what species, and do the proposed protein structures look similar?
Give me all available information on caspase-1
5
EOL Basic Topology
Diagram labels: Genomic Data; Putative Functional and 3D Assignment; Integration with Other Resources; Public and Private Databases; To Serve Thousands Worldwide
6
Some Technical Detail Mapped to the Topology
Diagram labels:
TeraGrid
Sequence data from genomic sequencing projects
Ported applications
Pipeline data
Load/update scripts
Data warehouse (normalized DB2 schema)
MySQL DataMart(s): structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction
Application Server
Web/SOAP Server
Retrieve Web pages & invoke SOAP methods
7
One Plant Genome Processed as a Prototype
8
Current Genomic Pipeline
Inputs: Arabidopsis protein sequences; sequence info (NR, PFAM); structure info (SCOP, PDB)
Prediction of:
  signal peptides (SignalP, PSORT)
  transmembrane (TMHMM, PSORT)
  coiled coils (COILS)
  low complexity regions (SEG)
Building FOLDLIB:
  PDB chains
  SCOP domains
  PDP domains
  CE matches PDB vs. SCOP
  90% sequence non-identical
  minimum size 25 aa
  coverage (90%, gaps <30, ends <30)
Create PSI-BLAST profiles for protein sequences
Structural assignment of domains by PSI-BLAST on FOLDLIB
  Only sequences w/out A-prediction
Structural assignment of domains by 123D on FOLDLIB
  Only sequences w/out A-prediction
Functional assignment by PFAM, NR, PSIPred assignments
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
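The two structural assignment steps above read as a cascade: PSI-BLAST runs first, and 123D is applied afterwards to a restricted set of sequences. A minimal sketch of that pattern follows. It assumes that "Only sequences w/out A-prediction" means sequences that did not already receive a confident assignment, which the slide does not spell out. The function names and the toy stand-in methods are hypothetical and do not call the real PSI-BLAST or 123D programs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signature: a method takes a protein sequence and returns a
# fold-library identifier, or None when it finds no confident match.
AssignFn = Callable[[str], Optional[str]]


def assign_structures(
    sequences: Dict[str, str],
    methods: List[Tuple[str, AssignFn]],
) -> Dict[str, Tuple[str, str]]:
    """Try each method in order; later methods only see unassigned sequences."""
    assignments: Dict[str, Tuple[str, str]] = {}
    remaining = dict(sequences)
    for method_name, method in methods:
        for seq_id, seq in list(remaining.items()):
            fold = method(seq)
            if fold is not None:
                assignments[seq_id] = (method_name, fold)
                del remaining[seq_id]  # do not pass on to later methods
    return assignments


# Toy stand-ins for the real tools (assumptions for illustration only).
def toy_psiblast(seq: str) -> Optional[str]:
    return "fold_A" if seq.startswith("M") else None


def toy_123d(seq: str) -> Optional[str]:
    return "fold_B" if len(seq) > 5 else None


result = assign_structures(
    {"p1": "MKV", "p2": "AAAAAAA", "p3": "AA"},
    [("PSI-BLAST", toy_psiblast), ("123D", toy_123d)],
)
print(result)
```

Ordering the methods in a list keeps the more expensive or less specific method from re-processing sequences that an earlier step already handled.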
9
Scale of Multi-genome Analysis
~800 genomes, 10k-20k protein sequences per genome = ~10^7 ORFs
Inputs: protein sequences; sequence info (NR, PFAM); structure info (SCOP, PDB)
Prediction of:
  signal peptides (SignalP, PSORT)
  transmembrane (TMHMM, PSORT)
  coiled coils (COILS)
  low complexity regions (SEG)
4 CPU years
Building FOLDLIB:
  PDB chains
  SCOP domains
  PDP domains
  CE matches PDB vs. SCOP
  90% sequence non-identical
  minimum size 25 aa
  coverage (90%, gaps <30, ends <30)
10^4 entries
228 CPU years
Create PSI-BLAST profiles for protein sequences
Structural assignment of domains by PSI-BLAST on FOLDLIB
3 CPU years
  Only sequences w/out A-prediction
Structural assignment of domains by 123D on FOLDLIB
9 CPU years
  Only sequences w/out A-prediction
252 CPU years
Functional assignment by PFAM, NR, PSIPred assignments
3 CPU years
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
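The headline figure on this slide is simple arithmetic: roughly 800 genomes at 10,000 to 20,000 protein sequences each gives on the order of 10^7 sequences to process. The sketch below only reproduces that estimate. The variable names are illustrative, and it makes no attempt to attribute the CPU-year figures to individual pipeline steps.

```python
# Order-of-magnitude estimate of the number of ORFs, from the slide's inputs.
GENOMES = 800
PROTEINS_PER_GENOME = (10_000, 20_000)  # low and high estimate per genome

low = GENOMES * PROTEINS_PER_GENOME[0]
high = GENOMES * PROTEINS_PER_GENOME[1]

print(f"{low:.1e} to {high:.1e} sequences")  # 8.0e+06 to 1.6e+07 sequences
```

Both bounds sit within a factor of two of 10^7, which matches the rounded value on the slide.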
10
TeraGrid application
Technical aspects:
Excellent charter application for the TeraGrid project!
Good demonstration of producing practical output from TeraGrid computing: scientific papers and an extensive web site and services will be produced
Software pipeline now a proven technique and a sure bet
Can be implemented in the fastest possible time; project already initialized
11
EOL Data Services
Diagram labels:
Pipeline data
Load/update scripts
Data warehouse
MySQL DataMart(s): structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction
Application server
SOAP/Web Server
Publish Web Services & API
UDDI directory
Web pages served via JSP
EOL Notebook
Data incorporated into third party web pages
Automated data downloads to mirrors and researchers
Encyclopedia of Life
WWW
12
Basic Web Interface
Encyclopedia of Life
Microsoft Windows: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Apple Macintosh: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Linux: Netscape 4.7/6.1, Mozilla v1.0, Opera
Win-CE and pen-based devices: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
13
Local Data Mirrors
Diagram labels:
SDSC: MySQL DataMart(s) (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction); SDSC SOAP Server
Request for bulk data streams
Mirror: Mirror Manager; Data Management Layer; MySQL DataMart(s) (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction); BLAST server; SOAP Server; Web Interface
14
Local Data Mirrors
Support for server platforms, i.e. Sparc Solaris, IRIX, Linux
Based on MySQL + Apache because of availability
Automated mirror registration and listing
User-friendly admin for mirror maintenance
Means of metering data usage per species data stream, to generate revenue from industry
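The metering requirement above could be met with a per-stream byte counter. The sketch below is one hypothetical way to do that in memory. The class, method names, and mirror and species identifiers are all assumptions for illustration, and the slides do not describe how EOL mirrors actually record usage.

```python
from collections import defaultdict
from typing import Dict, Tuple


class UsageMeter:
    """Accumulates bytes served, keyed by (mirror, species data stream)."""

    def __init__(self) -> None:
        self._bytes: Dict[Tuple[str, str], int] = defaultdict(int)

    def record(self, mirror: str, species: str, n_bytes: int) -> None:
        if n_bytes < 0:
            raise ValueError("n_bytes must be non-negative")
        self._bytes[(mirror, species)] += n_bytes

    def total_for_species(self, species: str) -> int:
        # Sum over all mirrors that served this species stream.
        return sum(n for (_, s), n in self._bytes.items() if s == species)

    def total_for_mirror(self, mirror: str) -> int:
        # Sum over all species streams served to this mirror.
        return sum(n for (m, _), n in self._bytes.items() if m == mirror)


meter = UsageMeter()
meter.record("mirror-1", "arabidopsis", 1_000)
meter.record("mirror-2", "arabidopsis", 500)
meter.record("mirror-1", "rice", 250)
print(meter.total_for_species("arabidopsis"))  # 1500
print(meter.total_for_mirror("mirror-1"))      # 1250
```

Keying on the (mirror, species) pair allows billing by either dimension without storing individual requests.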
15
EOL Notebook
Diagram labels:
Encyclopedia of Life: EOL DataMart (structure assignment by PSI-BLAST, structure assignment by 123D, domain location prediction); SOAP Server
EOL SOAP Queries (invoke)
Virtual community messaging
Metadata sharing
Scheduler
XML/RDF store: BLAST data, keyword data, stored queries (BLAST, keyword), annotations, session info
16
EOL Notebook
Provides a consistent, advanced, cross-platform GUI to view data returned from queries to the EOL database via Web Services
Provides persistence of both queries and returned data via a local XML database
Provides a mechanism to enable unattended, scheduled, periodic queries
Provides a means to annotate data and results and share those with others, in effect a scientific Napster
Provides a means to create virtual communities
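The stored, periodically repeated queries described above can be illustrated with a small scheduling sketch. Everything here is hypothetical: the `StoredQuery` fields, the `run_due_queries` helper, and the `execute` callback stand in for whatever the Notebook actually uses, and no real SOAP call or XML store is involved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class StoredQuery:
    """A saved query that should be re-run every `interval`."""
    name: str
    text: str
    interval: timedelta
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # Never-run queries are always due.
        return self.last_run is None or now - self.last_run >= self.interval


def run_due_queries(
    queries: List[StoredQuery],
    now: datetime,
    execute: Callable[[str], object],
) -> List[str]:
    """Execute every due query, record its run time, return the names run."""
    ran = []
    for query in queries:
        if query.is_due(now):
            execute(query.text)  # placeholder for a remote query call
            query.last_run = now
            ran.append(query.name)
    return ran
```

An unattended scheduler would call `run_due_queries` on a timer and pass a callback that sends the query and stores the returned data locally.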
17
Summary
1. EOL is a large-scale data analysis project, one of the largest biological computations attempted, whose results will be eagerly awaited by an enormous number of biologists
2. Core scientific analysis techniques are well proven in the existing Arabidopsis project
3. It’s a perfect choice as a charter application for the TeraGrid
   Very large scale computation
   Pipeline-type computations well suited to the Grid platform
   High visibility and very practical use of TeraGrid results
   TeraGrid name will become associated with high quality data analysis