Current challenges and opportunities in Biogrids Dr. Craig A. Stewart Director, Research and Academic Computing, University Information.

Current challenges and opportunities in Biogrids Dr. Craig A. Stewart stewart@iu.edu Director, Research and Academic Computing, University Information Technology Services Director, Information Technology Core, Indiana Genomics Initiative Visiting Scientist, Höchstleistungsrechenzentrum Universität Stuttgart 6th Metacomputing Symposium 22 May 2003

License terms Please cite as: Stewart, C.A. Current challenges and opportunities in Biogrids. 2003. Presentation. Presented at: 6th Metacomputing Symposium (High Performance Computing Center, Universitaet Stuttgart, Stuttgart, Germany, 22 May 2003). Available from: http://hdl.handle.net/2022/15217 Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Outline Background about grids and biology Biodata grids Biocomputation grids Some comments and suggestions regarding the challenges and opportunities for the computing community and the biology community NB: –Likely more questions than answers! –“Grids” will be defined loosely, and not necessarily consistently –Similar lack of precision will be employed with the various flavors of “–omics.” Ultimately it’s all computational biology.

Why do subject-specific grids exist? 1 In general: –Practical issues –Communities of practice and trust –Existence of specific problems that appear to call for grid-based approaches (e.g. GriPhyN) In biology: –Rudimentary “grid” projects predate the Web. Example: Flybase via Gopher. [Flybase dates to 1993] –Fractionated communities –Many independent data sources suggest a grid approach 1 These views may be peculiar to the US or to the speaker

The revolution in biology Automated, high-throughput sequencing has revolutionized biology. Computing has been a part of this revolution in three ways so far: –Computing has been essential to the assembly of genomes –There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers –Networking and the Web have made biological data generally and publicly available In the future, computing should be critical for: –Automated data analysis –Simulation and prediction

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Biodata Grids

So how big is big? GenBank has grown exponentially, but the total is still only ~30B base pairs All of the data and programs from NCBI could fit on one reasonably large supercomputer Even BIRN, the most ambitious of planned bio data grid projects, has a data set that will grow 10s to 100s of TBs per year ‘large dataset’ in the biological sciences ≠ ‘large dataset’ in the physical sciences Complexity of linkages within the data, however…
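A back-of-the-envelope check of the "~30B base pairs" figure may help put it in scale. The sketch below is only an illustration: the two encodings (2 bits per base for a packed A/C/G/T alphabet, 8 bits per base for plain text) are assumptions, and the result ignores annotations, indexes, and everything else stored alongside the raw sequence.

```python
def sequence_storage_gb(base_pairs: int, bits_per_base: int) -> float:
    """Raw storage in gigabytes (10^9 bytes) for a sequence collection."""
    return base_pairs * bits_per_base / 8 / 1e9


# ~30B base pairs is the figure quoted on the slide.
GENBANK_BASE_PAIRS = 30_000_000_000

# Assumed encodings: 2 bits/base (packed) and 8 bits/base (one text character).
packed_gb = sequence_storage_gb(GENBANK_BASE_PAIRS, 2)
text_gb = sequence_storage_gb(GENBANK_BASE_PAIRS, 8)

print(f"packed: {packed_gb} GB, plain text: {text_gb} GB")
```

Under these assumptions the raw sequence is a few tens of gigabytes at most, which is consistent with the slide's point that the difficulty lies in the linkages within the data rather than in its volume.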

How many data sources? DNA/Chromosomes –GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov –European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes –DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp Proteins –ExPASy http://www.expasy.org/ –Protein Data Bank (PDB) http://www.rcsb.org/pdb/ Biochemistry & Enzymes –PathDB http://www.ncgr.org/software/version_2_0.html –KEGG, WIT http://wit.mcs.anl.gov/WIT2/ Not to mention the organism-specific databases

The needs and opportunities in Biodata grids Many disparate subcommunities, many funding sources, lots of history NCBI, DDBJ, EMBL contain essentially the same data; they complement/compete in terms of features and functions. Web clicking is not a suitable way to do large-scale computing! Private companies may need to be very private http://www.ncbi.nlm.nih.gov/

Data integration and management Person-intensive downloads Avaki (http://www.avaki.com/) Lion Biosciences (http://www.lionbioscience.com/) IBM – DB2 Information Integrator and DiscoveryLink (www.ibm.com/) Various XML-based efforts

IU Centralized Life Science Database (CLSD) Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges Based on use of IBM DiscoveryLink (TM) and DB2 Information Integrator (TM) Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized. Lab data and programs like BLAST are included via DL’s wrappers. Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences IU contributed writing of data parsers

Biocomputation Grids

Orders of magnitude in biology Slide source: Rick Stevens, Argonne National Laboratory; information source DOE Genomes to Life ©

Example large-scale computational biology grid projects Department of Energy “Genomes to Life” http://doegenomestolife.org/ Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/ Asia Pacific BioGrid (http://www.apbionet.org/) Encyclopedia of Life (http://eol.sdsc.edu/)

integrated Genomic Annotation Pipeline (iGAP) [diagram]. Input: deduced protein sequences; ~800 genomes @ 10k-20k per = ~10^7 ORFs. Diagram boxes: Prediction of signal peptides (SignalP, PSORT), transmembrane (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG); Structural assignment of domains by PSI-BLAST on FOLDLIB; Structural assignment of domains by 123D on FOLDLIB (only sequences w/out A-prediction); Create PSI-BLAST profiles for protein sequences; Store assigned regions in the DB; Functional assignment by PFAM, NR, PSIPred assignments; Domain location prediction by sequence. Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30). Data sources: structure info (SCOP, PDB), sequence info (NR, PFAM). Other labels in the diagram: 4 CPU years, 228 CPU years, 3 CPU years, 570 CPU years, 252 CPU years, 3 CPU years; 10^4 entries. Slide source: San Diego Supercomputing Center ©
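The numbers on the iGAP slide can be checked with a little arithmetic. The sketch below reproduces the ORF estimate and adds up the six CPU-year labels; treating those labels as per-stage costs that simply sum is an assumption, as is the hypothetical 1,000-CPU pool used for the wall-clock estimate.

```python
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)  # "10k-20k per" genome, from the slide

# Range of total ORFs: the slide rounds this to ~10^7.
orfs_low = GENOMES * ORFS_PER_GENOME[0]
orfs_high = GENOMES * ORFS_PER_GENOME[1]

# CPU-year labels as they appear in the diagram (assumed to be additive).
stage_cpu_years = [4, 228, 3, 570, 252, 3]
total_cpu_years = sum(stage_cpu_years)


def wall_clock_days(cpu_years: float, cpus: int) -> float:
    """Idealized elapsed days if the work divides perfectly across `cpus`."""
    return cpu_years * 365 / cpus


print(f"ORFs: {orfs_low:,} to {orfs_high:,}")
print(f"total: {total_cpu_years} CPU years")
# 1,000 CPUs is a hypothetical pool size, not a figure from the talk.
print(f"on 1,000 CPUs: {wall_clock_days(total_cpu_years, 1000):.0f} days")
```

Even with perfect scaling, the work would occupy a large pool for roughly a year under these assumptions, which is why such pipelines are candidates for grid-style high-throughput execution.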

One example: Building Phylogenetic Trees Goal: an objective means by which phylogenetic trees can be estimated The number of bifurcating unrooted trees for n taxa is $\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$ Solution: heuristic search Trees built incrementally. Trees are optimized in steps, and best tree(s) are then kept for next round of additions High compute/communication ratio
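To see why exhaustive search is ruled out and a heuristic is needed, it helps to evaluate the tree-count formula from this slide, read as (2n-5)! / ((n-3)! · 2^(n-3)). The function name below is chosen for illustration only; the sketch uses exact integer arithmetic so that large values do not lose precision.

```python
from math import factorial


def unrooted_tree_count(n_taxa: int) -> int:
    """Number of bifurcating unrooted trees for n_taxa labeled taxa.

    Implements (2n-5)! / ((n-3)! * 2^(n-3)), valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("the formula is defined for at least 3 taxa")
    return factorial(2 * n_taxa - 5) // (
        factorial(n_taxa - 3) * 2 ** (n_taxa - 3)
    )


for n in (4, 5, 10, 20, 50):
    print(n, unrooted_tree_count(n))
```

The count grows faster than exponentially (3 trees for 4 taxa, 15 for 5, over two million for 10), so incremental construction with local optimization is the practical route for realistic numbers of taxa.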

fastDNAml performance on an international Grid From iGrid ’98 at SC98

fastDNAml Performance on IBM SP From Stewart et al., SC2001

fastDNAml and Biogrid Computing IU-created library called SMBL (Simple Message Brokering Library) permits use of Condor flocks as “worker” processes fastDNAml has a very high compute/communicate ratio fastDNAml is one example of a general phenomenon in biogrid computation: How much of it is really capability computing, and how much of it would be high-throughput computing if the applications were really well written?
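The master/worker pattern described on this slide can be sketched in a few lines. This is not SMBL, Condor, or fastDNAml code: the scoring function is a toy stand-in for an expensive likelihood evaluation, the candidate tuples are hypothetical, and a local thread pool stands in for remote worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidate(candidate: tuple) -> float:
    """Toy stand-in for an expensive, independent evaluation (hypothetical)."""
    return -sum((x - 3) ** 2 for x in candidate)


def best_candidates(candidates, workers: int = 4, keep: int = 1) -> list:
    """Master: farm out evaluations to workers, keep the top-scoring results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each evaluation is independent, so workers never talk to each other.
        scores = list(pool.map(score_candidate, candidates))
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [candidate for _, candidate in ranked[:keep]]


candidates = [(1, 2), (3, 3), (5, 0), (2, 4)]
print(best_candidates(candidates, keep=2))
```

Because each task carries a small input, returns a single score, and needs no communication with its peers, the work can be spread over loosely coupled machines. That property is what makes an application a fit for high-throughput rather than capability computing.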

Some thoughts about the future

Current challenge areas Problem | High Throughput | Grid | Capability; Protein modeling: X; Genome annotation, alignment, phylogenetics: X X x* x*; Drug Target Screening: X X (corporate grids) X; Systems biology: X X; Medical practice support: X X *Only a few large scale problems merit ‘capability’ status

What is the killer application for biocomputation grids? Systems biology – latest buzzword, but… (see special issues in Nature and Science) Goal: multiscale modeling from cell chemistry up to multiple populations Current software tools still inadequate Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications The structure of the problems matches the structure of grids Current challenge examples: actin fiber creation, heart attack modeling Opportunity for predictive biology?

Opportunities in Computational Biology and Biomedical Research Bioinformatics and related areas offer tremendous new possibilities Computer-oriented biomedical researchers must utilize the detailed knowledge held by “traditional” researchers There are tremendous opportunities for computer scientists and computational scientists to find and solve interesting and important problems! From www.sciencemag.org/ feature/data/mosquito/mtm/index.html Source Library: Centers for Disease Control Photo Credit: Jim Gathany

Some thoughts about the future of Grids and biocomputing Biodata problems are largely solvable now without use of sophisticated grid technology. This will change! Biocomputation grids must be developed with appropriate technology choices. Enhancement of software must happen simultaneously! Until the grid software becomes substantially simpler for the end user, grid projects will likely continue to be based on communities of common interest. There are many biodata grid and biocomputation grid opportunities that are a good match for grid architectures. There are natural similarities between the structure of grids and the likely structure of significant grand challenge problems in computational biology, biomedicine, etc.

Acknowledgments This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc. This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF). Particular thanks to Dr. Michael Resch, Director, HLRS, for inviting me to visit HLRS, and to Dr. Matthias Müller and Peggy Lindner for inviting me to speak here today.

Acknowledgments, continued UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock UITS Senior Management: Associate Vice President and Dean Christopher Peebles, RAC(Data) Director Gerry Bernbom Assistance with this presentation: John Herrin, Malinda Lingwall

Additional Information Further information is available at –http://www.indiana.edu/~uits/rac/ –http://www.indiana.edu/~rac/staff_papers.html –http://www.casc.org A recommended German bioinformatics site: –http://www.bioinformatik.de/
