Published by Cori Turner. Modified over 9 years ago.
1
Computational Biology: Data, computation, and visualization Dr. Craig A. Stewart stewart@iu.edu & Dr. Eric Wernert ewernert@indiana.edu 7 August 2003
2
License terms Please cite as: Stewart, C.A. and E. Wernert. Computational Biology: Data, computation, and visualization. 2003. Presentation. Presented at: Visualization Workshop (Arctic Region Supercomputer Center, University of Alaska Fairbanks, 7 Aug 2003). Available from: http://hdl.handle.net/2022/15219 Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
3
Outline A bit about biomedical data Computation and visualization The revolution in biology & IU’s response –the Indiana Genomics Initiative Hardware Some thoughts about dealing with biological and biomedical researchers in general
4
The revolution in biology Automated, high-throughput sequencing has revolutionized biology. Computing has been a part of this revolution in three ways so far: –Computing has been essential to the assembly of genomes –There is now so much biological data available that it is impossible to use it effectively without the aid of computers –Networking and the Web have made biological data generally and publicly available http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
6
FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
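A FASTA record like the one on this slide is a header line starting with ">" followed by one or more sequence lines. The sketch below shows one minimal way to read such records in Python. The function name parse_fasta and the dictionary layout are illustrative assumptions, not part of the presentation, and real pipelines typically use an established library instead.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its concatenated sequence.

    Hypothetical helper for illustration; it assumes headers are unique.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
    return records


# First two sequence lines of the record shown on the slide.
example = [
    ">gi|532319|pir|TVFV2E|TVFV2E envelope protein",
    "ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT",
    "QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC",
]
records = parse_fasta(example)
```

Here records has a single key, the header text, whose value is the two sequence lines joined into one string.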
7
Some of the issues about this exponential growth in data stores WO/RN (write once, read never) Comparability/replicability problems with certain types of data HIPAA – how do you de-identify patient data?
8
Indiana Genomics Initiative (INGEN) Created by a $105M grant from the Lilly Endowment, Inc. and launched December 2000 Build on traditional strengths and add new areas of research for IU Perform the research that will generate new treatments for human disease in the post-genomic era Improve human health generally and in the State of Indiana particularly Enhance economic growth in Indiana
9
Challenges for UITS and the INGEN IT Core Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing) Assist bioinformatics researchers in use of advanced computing facilities Questions we are asked: –Why wouldn't it be better just to buy me a newer PC? Questions we asked: –What do you do now with computers that you would like to do faster? –What would you do if computer resources were not a constraint?
10
So, why is this better than just buying me a new PC? Unique facilities provided by IT Core –Redundant data storage –HPC – better uniprocessor performance; trivially parallel programming, parallel programming –Visualization in the research laboratories Hardcopy document – INGEN's advanced IT facilities: The least you need to know Outreach efforts Demonstration projects
11
Example projects Data integration fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html) PViN – software to visualize human family trees 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/ Protein Family Annotator – collaborative development with IBM, Inc.
12
Data Integration Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges IU has more than 1 TB of biomedical data stored in its massive data storage system There are many public data sources Different labs were independently downloading, subsetting, and formatting data Solution: IBM DiscoveryLink, DB/2 Information Integrator
13
A life sciences data example - Centralized Life Science Database Based on use of IBM DiscoveryLink (TM) and DB/2 Information Integrator (TM) Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized. Lab data and programs like BLAST are included via DL’s wrappers. Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences IU contributed writing of data parsers
16
Dot Plots Simple way to get a feel for how sequences compare to each other. Used with both DNA and protein sequences http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/ "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995)
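The idea behind a dot plot can be sketched in a few lines of Python: place one sequence along each axis and mark every cell where the two sequences agree. The sketch below uses exact matching of short windows, which is a simplification chosen for illustration. It is not the scoring or dynamic-threshold scheme of the Dotter program cited on the slide, and the names dot_plot and render are hypothetical.

```python
from typing import List


def dot_plot(a: str, b: str, window: int = 1) -> List[List[bool]]:
    """Return m where m[i][j] is True when a[i:i+window] == b[j:j+window].

    Exact-match dot plot; larger windows suppress random single-letter hits.
    """
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    return [
        [a[i:i + window] == b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix: List[List[bool]]) -> str:
    """Draw the matrix as text, one row per line ('*' = match, '.' = no match)."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Comparing a sequence with itself puts an unbroken line on the main diagonal.
matrix = dot_plot("GATTACA", "GATTACA", window=2)
picture = render(matrix)
```

Similar regions appear as diagonal runs, repeats as parallel diagonals, and insertions or deletions as breaks or shifts in the diagonal.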
17
http://www.dkfz-heidelberg.de/tbi/bioinfo/Pairwise/DotPlots/index.html
18
Protein Family Annotator New project Designed to allow federation and searching of protein family data ‘Visualizing’ the effect of variation in proteins a real challenge for the biologists
19
Phylogenetic Inference Determine likely evolutionary relationships among different taxa NP-hard Very large search space Heuristic search required Problems: –Searches that are clearly going nowhere –Comparison of different trees
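To give a sense of how large the search space is: the number of distinct unrooted, fully bifurcating tree topologies for n labeled taxa is the double factorial (2n-5)!!, a standard combinatorial result. The short sketch below evaluates it; the function name is an illustrative choice, not something from the presentation.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies for n labeled taxa: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5 (the loop is empty for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_trees(n))
```

Ten taxa already allow about two million topologies, and the count grows faster than exponentially from there, which is why exhaustive evaluation is impractical and heuristic search is used.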
21
PViN
22
Gamma Knife Used to treat inoperable tumors Treatment methods currently use a standardized head model UITS is working with IU School of Medicine to adapt Penelope code to work with detailed model of an individual patient’s head
23
Tomography Key issue is processing of images Weeks to days Days to minutes Visualization techniques applied do not need to be fancy to be useful Starting with some simple visualizations and then moving to some very sophisticated visualizations would be tremendous
24
Some information about the Indiana University high performance computing environment
25
Networking: I-light Network jointly owned by Indiana University and Purdue University 36 fibers between Bloomington and Indianapolis (IU’s main campuses) 24 fibers between Indianapolis and West Lafayette (Purdue’s main campus) Co-location with Abilene GigaPOP Expansion to other universities recently funded
27
Massive Data Storage System Based on HPSS (High Performance Storage System) First HPSS installation with distributed movers; STK 9310 Silos in Bloomington and Indianapolis Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable. 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data. Common File System (CFS) – disk storage ‘for the masses’ Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer
28
AVIDD (Analysis and Visualization of Instrument-Driven Data) Hardware components: –Distributed Linux cluster Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk Tuned, configured, and optimized for handling real-time data streams –A suite of distributed visualization environments –Massive data storage Usage components: –Research by application scientists –Research by computer scientists –Education
29
Goals for AVIDD Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving) Focused on modern instruments that produce data in digital format at high rates. Example instruments: –Advanced Photon Source, Advanced Light Source –Atmospheric science instruments in forest –Gene sequencers, expression chip readers
30
Goals for AVIDD, cont’d Performance goals: –Two researchers should be able to analyze 1 TB data sets simultaneously (along with other smaller jobs running) –The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization –It should be possible to move 1 TB of data from HPSS disk cache into the cluster in ~2 hours Science goals: –The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data. –Ability to manage large data sets should no longer be an obstacle to scientific research –AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
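The goal of moving 1 TB in about 2 hours implies a sustained transfer rate, and the arithmetic can be checked with a short Python sketch. It assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify, and the function name is hypothetical.

```python
def required_throughput_mb_s(terabytes: float, hours: float) -> float:
    """Sustained rate in MB/s needed to move `terabytes` in `hours`.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    total_bytes = terabytes * 1e12
    seconds = hours * 3600
    return total_bytes / seconds / 1e6


rate_mb_s = required_throughput_mb_s(1, 2)
rate_gbit_s = rate_mb_s * 8 / 1000  # convert MB/s to Gbit/s
print(f"{rate_mb_s:.1f} MB/s, about {rate_gbit_s:.2f} Gbit/s")
```

Under that assumption the target works out to roughly 139 MB/s sustained, a little over 1 Gbit/s, before any protocol overhead.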
31
John-E-Box Invented by John N. Huffman, John C. Huffman, and Eric Wernert
32
Thoughts about visualization and collaboration in bioinformatics Do you want a one-off, or sustained improvement in productivity of scientists? Collaboration tools can be highly sophisticated, or pretty darn ugly Sometimes they must be sophisticated The key for collaborative technology is that the collaboration has to solve a problem (other than ‘what are we going to do in the booth this year’) and has to feel natural to the application scientist Many problems are as much about the theory and practice of interacting with the information Placing facilities in the lab is tremendously beneficial We should encourage researchers not to be too cost-sensitive Grand challenge problems are great, but there have to be facilities that facilitate a learning curve and increases in sophistication over time for the application scientist. This creates a feeder system for the high end systems!
33
HPC Challenge “Arthropods evolving all over the world” (sort of) computational steering Big problem: how do you summarize the views of LOTS of different trees?
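One common way to summarize many candidate trees is to count how often each grouping of taxa (clade) appears across them and keep only the groupings found in a majority of trees, the idea behind a majority-rule consensus. The sketch below is an illustration of that idea, not the method used in the HPC Challenge entry. It assumes each tree has already been reduced to a set of clades, and all names in it are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

Clade = FrozenSet[str]


def clade_support(trees: List[Set[Clade]]) -> Dict[Clade, float]:
    """Fraction of trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def majority_clades(trees: List[Set[Clade]], threshold: float = 0.5) -> Set[Clade]:
    """Clades present in more than `threshold` of the trees."""
    return {c for c, s in clade_support(trees).items() if s > threshold}


# Three toy rooted trees on taxa A, B, C, each written as its set of clades.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("BC"), frozenset("ABC")},
]
support = clade_support(trees)
consensus = majority_clades(trees)
```

In this toy input the clade {A, B} appears in two of three trees and survives into the consensus, while {B, C} appears in only one and is dropped. The support values could in turn drive a visualization, for example by line weight or color.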
34
What are some really important challenges in visualization today? Expression chip data Trees Multi-scale problems
35
Thoughts about working with biologists
36
Bioinformatics and Biomedical Research Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done. Traditional biomedical researchers must take advantage of new possibilities Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers So why do you want to interrupt the work of my paper mill? Anopheles gambiae From www.sciencemag.org/feature/data/mosquito/mtm/index.html Source Library: Centers for Disease Control Photo Credit: Jim Gathany
37
So how do you find biologists with whom to collaborate? Chicken and egg problem? Or more like fishing? Or bank robbery?
38
Bank robbery Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.”* Cultivating collaborations with biologists in the short run will require: –Active outreach –Different expectations than we might have when working with an aerospace design firm –Patience There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships with biologists and biomedical researchers. To do this, we’ll all have to spend a bit of time “going where the biologists are.” *Unfortunately this is an urban legend; Sutton never said this
39
Acknowledgments This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc. This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
40
Acknowledgements, cont’d UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock Assistance with this presentation: John Herrin, Malinda Lingwall Thanks to Dr. M. Resch, Director, HLRS, for inviting me to visit HLRS Thanks to Dr. H. Bungartz for his hospitality, help, and for including Einführung in die Bioinformatik as an elective Thanks to Dr. S. Zimmer for help throughout the semester
41
Further information is available at –ingen.iu.edu –http://www.indiana.edu/~uits/rac/ –http://www.ncsc.org/casc/paper.html –http://www.indiana.edu/~rac/staff_papers.html