Computational Biology: Data, computation, and visualization Dr. Craig A. Stewart & Dr. Eric Wernert 7 August 2003.

License terms
Please cite as: Stewart, C.A. and E. Wernert. Computational Biology: Data, computation, and visualization. Presentation. Presented at: Visualization Workshop (Arctic Region Supercomputer Center, University of Alaska Fairbanks, 7 Aug 2003). Available from:
Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license. This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Outline
A bit about biomedical data
Computation and visualization
The revolution in biology & IU’s response – the Indiana Genomics Initiative
Hardware
Some thoughts about dealing with biological and biomedical researchers in general

The revolution in biology
Automated, high-throughput sequencing has revolutionized biology.
Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available
genbankstats.html

FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
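As a minimal sketch of how a record like the one above can be read, the following Python assumes the usual FASTA layout: a header line starting with ">" followed by one or more sequence lines. The function name `parse_fasta` and the shortened two-line example are illustrative assumptions, not part of the original slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# First two sequence lines of the slide's example record (shortened for illustration).
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    fields = header.split("|")  # pipe-delimited identifier fields in the header
    print(fields[1], len(sequence))
```

Keeping the whole header as the dictionary key avoids guessing at the meaning of the individual pipe-delimited fields.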

Some of the issues about this exponential growth in data stores
WO/RN (write once, read never)
Comparability/replicability problems with certain types of data
HIPAA – how do you de-identify patient data?

Indiana Genomics Initiative (INGEN)
Created by a $105M grant from the Lilly Endowment, Inc. and launched December 2000
Build on traditional strengths and add new areas of research for IU
Perform the research that will generate new treatments for human disease in the post-genomic era
Improve human health generally and in the State of Indiana particularly
Enhance economic growth in Indiana

Challenges for UITS and the INGEN IT Core
Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing)
Assist bioinformatics researchers in use of advanced computing facilities
Questions we are asked:
  – Why wouldn't it be better just to buy me a newer PC?
Questions we asked:
  – What do you do now with computers that you would like to do faster?
  – What would you do if computer resources were not a constraint?

So, why is this better than just buying me a new PC?
Unique facilities provided by IT Core
  – Redundant data storage
  – HPC – better uniprocessor performance; trivially parallel programming, parallel programming
  – Visualization in the research laboratories
Hardcopy document – INGEN's advanced IT facilities: The least you need to know
Outreach efforts
Demonstration projects

Example projects
Data integration
fastDNAml – maximum likelihood phylogenies
PiVN – Software to visualize human family trees
3-DIVE (3D Interactive Volume Explorer)
Protein Family Annotator – collaborative development with IBM, Inc.

Data Integration
Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges
IU has more than 1 TB of biomedical data stored in its massive data storage system
There are many public data sources
Different labs were independently downloading, subsetting, and formatting data
Solution: IBM DiscoveryLink, DB2 Information Integrator

A life sciences data example – Centralized Life Science Database
Based on use of IBM DiscoveryLink (TM) and DB2 Information Integrator (TM)
Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.
Lab data and programs like BLAST are included via DiscoveryLink’s wrappers.
Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences
IU contributed writing of data parsers
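The idea of one query spanning a central public-data store and a lab's own data can be illustrated with a small sketch. This does not use DiscoveryLink or its wrappers; SQLite's `ATTACH DATABASE` is swapped in purely as a stand-in, and every table name, column name, and value below is hypothetical.

```python
import sqlite3

# Stand-in for the centralized store of parsed public data (assumption).
conn = sqlite3.connect(":memory:")
# Stand-in for a lab's internal data source, attached under its own schema name.
conn.execute("ATTACH DATABASE ':memory:' AS lab")

conn.execute(
    "CREATE TABLE main.public_genes (gene_id TEXT PRIMARY KEY, description TEXT)"
)
conn.execute(
    "CREATE TABLE lab.expression (gene_id TEXT, sample TEXT, level REAL)"
)

# Hypothetical rows, invented only to make the query runnable.
conn.executemany(
    "INSERT INTO main.public_genes VALUES (?, ?)",
    [("g1", "hypothetical kinase"), ("g2", "hypothetical transporter")],
)
conn.executemany(
    "INSERT INTO lab.expression VALUES (?, ?, ?)",
    [("g1", "s1", 2.5), ("g1", "s2", 3.5), ("g2", "s1", 0.5)],
)

# One query that joins the public source and the lab source.
rows = conn.execute(
    """
    SELECT p.gene_id, p.description, AVG(e.level) AS mean_level
    FROM main.public_genes AS p
    JOIN lab.expression AS e ON e.gene_id = p.gene_id
    GROUP BY p.gene_id, p.description
    ORDER BY p.gene_id
    """
).fetchall()

for row in rows:
    print(row)
```

The point of the sketch is only that the researcher writes a single query; where each table physically lives is hidden behind the schema names.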

Dot Plots
Simple way to get a feel for how sequences compare to each other.
Used with both DNA and protein sequences
ter.html/
"A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis." Erik L.L. Sonnhammer and Richard Durbin. Gene 167(2):GC1-10 (1995)
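The basic dot-plot idea can be sketched in a few lines of Python. This is a simplified illustration that marks a dot wherever a fixed-length window of one sequence matches the other by plain character identity; it is not the dynamic-threshold program cited above. The names `dot_plot`, `render`, `window`, and `min_matches` are hypothetical.

```python
from typing import List


def dot_plot(seq_a: str, seq_b: str, window: int = 3, min_matches: int = 3) -> List[List[bool]]:
    """Mark cell (i, j) when the windows starting at seq_a[i] and seq_b[j]
    agree in at least min_matches positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            row.append(matches >= min_matches)
        plot.append(row)
    return plot


def render(plot: List[List[bool]]) -> str:
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)


# Comparing a sequence with itself always produces the main diagonal.
print(render(dot_plot("ELRLRYCAPAGF", "ELRLRYCAPAGF")))
```

Lowering `min_matches` below `window` tolerates mismatches inside a window, at the cost of more background noise in the plot.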

Protein Family Annotator
New project
Designed to allow federation and searching of protein family data
‘Visualizing’ the effect of variation in proteins is a real challenge for biologists

Phylogenetic Inference
Determine likely evolutionary relationships among different taxa
NP-hard
Very large search space
Heuristic search required
Problems:
  – Searches that are clearly going nowhere
  – Comparison of different trees
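To give a sense of how large the search space is, the short Python sketch below counts the distinct unrooted, fully bifurcating tree topologies for n labeled taxa, using the standard combinatorial result (2n-5)!!. The formula is not stated on the slide, and the function name `count_unrooted_trees` is illustrative.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies for n labeled taxa: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    count = 1
    # Double factorial: product of the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 5 + 1, 2):
        count *= k
    return count


for n in (4, 10, 20, 50):
    print(n, count_unrooted_trees(n))
```

Even at a few dozen taxa the count is far beyond exhaustive enumeration, which is why a heuristic search is needed.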

PViN

Gamma Knife
Used to treat inoperable tumors
Treatment methods currently use a standardized head model
UITS is working with IU School of Medicine to adapt the Penelope code to work with a detailed model of an individual patient’s head

Tomography
Key issue is processing of images
Weeks to days
Days to minutes
Visualization techniques applied do not need to be fancy to be useful
Starting with some simple visualizations and then moving to some very sophisticated visualizations would be tremendous

Some information about the Indiana University high performance computing environment

Networking: I-light
Network jointly owned by Indiana University and Purdue University
36 fibers between Bloomington and Indianapolis (IU’s main campuses)
24 fibers between Indianapolis and West Lafayette (Purdue’s main campus)
Co-location with Abilene GigaPOP
Expansion to other universities recently funded

Massive Data Storage System
Based on HPSS (High Performance Storage System)
First HPSS installation with distributed movers; STK 9310 Silos in Bloomington and Indianapolis
Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable.
180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data.
Common File System (CFS) – disk storage ‘for the masses’
Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer

AVIDD (Analysis and Visualization of Instrument-Driven Data)
Hardware components:
  – Distributed Linux cluster
      Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington
      TFLOPS, 0.5 TB RAM, 10 TB Disk
      Tuned, configured, and optimized for handling real-time data streams
  – A suite of distributed visualization environments
  – Massive data storage
Usage components:
  – Research by application scientists
  – Research by computer scientists
  – Education

Goals for AVIDD
Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving)
Focused on modern instruments that produce data in digital format at high rates. Example instruments:
  – Advanced Photon Source, Advanced Light Source
  – Atmospheric science instruments in forest
  – Gene sequencers, expression chip readers

Goals for AVIDD, cont’d
Performance goals:
  – Two researchers should be able to analyze 1 TB data sets simultaneously (along with other smaller jobs running)
  – The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization
  – It should be possible to move 1 TB of data from HPSS disk cache into the cluster in ~2 hours
Science goals:
  – The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data.
  – Ability to manage large data sets should no longer be an obstacle to scientific research
  – AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
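The goal of moving 1 TB in about 2 hours implies a sustained transfer rate that is easy to work out. The sketch below assumes decimal units (1 TB = 10^12 bytes) and exactly 2 hours; both are assumptions, since the slide gives only approximate figures. The function name `required_throughput` is illustrative.

```python
TB = 10**12  # bytes, decimal terabyte (assumption)


def required_throughput(total_bytes: float, seconds: float) -> float:
    """Sustained rate in bytes per second needed to move total_bytes in the given time."""
    return total_bytes / seconds


rate = required_throughput(1 * TB, 2 * 3600)
print(f"{rate / 1e6:.1f} MB/s sustained, or {rate * 8 / 1e9:.2f} Gbit/s")
```

Under these assumptions the target works out to roughly 139 MB/s, a little over 1 Gbit/s sustained.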

John-E-Box Invented by John N. Huffman, John C. Huffman, and Eric Wernert

Thoughts about visualization and collaboration in bioinformatics
Do you want a one-off, or sustained improvement in productivity of scientists?
Collaboration tools can be highly sophisticated, or pretty darn ugly
Sometimes they must be sophisticated
The key for collaborative technology is that the collaboration has to solve a problem (other than ‘what are we going to do in the booth this year’) and has to feel natural to the application scientist
Many problems are as much about the theory and practice of interacting with the information
Placing facilities in the lab is tremendously beneficial
We should encourage researchers not to be too cost-sensitive
Grand challenge problems are great, but there have to be facilities that facilitate a learning curve and increases in sophistication over time for the application scientist. This creates a feeder system for the high end systems!

HPC Challenge “Arthropods evolving all over the world” (sort of) computational steering Big problem: how do you summarize the views of LOTS of different trees?

What are some really important challenges in visualization today?
Expression chip data
Trees
Multi-scale problems

Thoughts about working with biologists

Bioinformatics and Biomedical Research
Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.
Traditional biomedical researchers must take advantage of new possibilities
Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers
So why do you want to interrupt the work of my paper mill?
Photo: Anopheles gambiae. From data/mosquito/mtm/index.html. Source Library: Centers for Disease Control. Photo Credit: Jim Gathany

So how do you find biologists with whom to collaborate?
Chicken and egg problem?
Or more like fishing?
Or bank robbery?

Bank robbery
Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.”*
Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might have when working with an aerospace design firm
  – Patience
There are lots of opportunities open for HPC centers willing to make the effort to cultivate relationships with biologists and biomedical researchers. To do this, we’ll all have to spend a bit of time “going where the biologists are.”
*Unfortunately this is an urban legend; Sutton never said this

Acknowledgments
This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
This material is based upon work supported by the National Science Foundation under Grant No and Grant No. CDA
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Acknowledgments, cont’d
UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
Assistance with this presentation: John Herrin, Malinda Lingwall
Thanks to Dr. M. Resch, Director, HLRS, for inviting me to visit HLRS
Thanks to Dr. H. Bungartz for his hospitality, help, and for including Einführung in die Bioinformatik as an elective
Thanks to Dr. S. Zimmer for help throughout the semester

Further information is available at
  – ingen.iu.edu