DOE Genomics: GTL Program IT Infrastructure Needs for Systems Biology David G. Thomassen Office of Biological and Environmental Research DOE Office of.

Presentation transcript:

DOE Genomics: GTL Program IT Infrastructure Needs for Systems Biology David G. Thomassen Office of Biological and Environmental Research DOE Office of Science March 22, 2004

GTL Program Goals
Using DNA sequence and high-throughput technologies:
- Goal 1: Identify and characterize the molecular machines of life
- Goal 2: Characterize gene regulatory networks
- Goal 3: Characterize the functional repertoire of complex microbial communities in their natural environments at the molecular level
- Goal 4: Develop the computational capabilities to advance understanding of complex biological systems and predict their behavior
Systems Biology: Gain a comprehensive and predictive understanding of the dynamic, interconnected processes underlying living systems

Experimental:
- Complete datasets
- Quantitative measurements
- Comprehensive physical characterization:
  - Protein expression and interactions
  - Spatial distributions
  - Process kinetics
Computational:
- Automated data analysis and validation
- Automated integration of diverse data sets
- Human and computer-accessible databases
- Molecular, pathway and cell-level simulations
The goals require a new synergy between computing and biology.
Ultimate Goal is to Provide Predictive Models of Microbes
This goal drives data collection and computing strategy.

GTL Experiment Template Generating Petascale Data Sets
While this example does not account for data processing and compression, it illustrates how even simple raw data storage will quickly become a bottleneck for biologists.

GTL Science will Require High Performance Computing for Both Capacity and Capability Problems
- Capacity: e.g., high-throughput protein structure predictions, data analysis, sequence comparison
  [Figure: sequences (ATCGTAGCAATCGACCGT..., CGGCTATAGCCGTTACCG..., TTATGCTATCCATAATCGA..., GGCTTAATCGCATACGAC...) are threaded onto templates to find the best match]
- Capability: e.g., large-scale biophysical simulations, stochastic regulatory simulations:
  - Large size and timescale classical simulations
  - Highly accurate quantum mechanical simulations

Petascale Capacity Problems in Biology
Microbial and Community Genome Annotation
Now:
- Analyze and annotate 20 microbial genomes (720,000 processor hours)
In 5 years:
- Assemble, analyze and annotate community of 200 microbes and phage (10,000,000 processor hours)
- Compare genome sequences (200 megabases) to previous genomes (4 gigabases) (5,000,000 processor hours)
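To put the processor-hour figures above in perspective, the following minimal sketch converts them to idealized wall-clock time. The cluster size of 1,000 processors and the assumption of perfect parallel scaling are illustrative choices, not figures from the presentation.

```python
def wall_clock_days(processor_hours: float, processors: int) -> float:
    """Idealized wall-clock days, assuming perfect parallel scaling."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return processor_hours / processors / 24.0


# Processor-hour estimates quoted on the slide.
WORKLOADS = {
    "annotate 20 microbial genomes": 720_000,
    "annotate community of 200 microbes and phage": 10_000_000,
    "compare 200 megabases against 4 gigabases": 5_000_000,
}

# Hypothetical cluster size, chosen only for illustration.
PROCESSORS = 1_000

for name, hours in WORKLOADS.items():
    print(f"{name}: {wall_clock_days(hours, PROCESSORS):.1f} days")
```

Real workloads rarely scale perfectly, so these values are best read as lower bounds on elapsed time.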

Petascale Capability Problems in Biology
Membrane channel simulation
Now:
- Simulate non-flexible protein ion channel K+ flow using quantum methods (2,200,000 processor hours for 4 second simulation)
In 5 years:
- Simulate flexible protein ion pump for producing ATP from K+ gradient (15,000,000 processor hours for 200 nanosecond simulation)

Computing Capabilities for GTL Facilities and Projects
1. LIMS & Workflow Management
2. Data Capture and Archiving
3. Data Analysis / Reduction
4. Modeling and Simulation
5. The Community Data Resource
6. Infrastructure
[Diagram labels: Collaborative Projects; Facilities]

High-Performance Computing Roadmap for the Genomics: GTL Program
[Figure: computing requirement (1 TF*, 10 TF, 100 TF, 1000 TF) plotted against biological complexity, with a marker for current U.S. computing. Labeled applications: Comparative Genomics; Constraint-Based Flexible Docking; Constrained rigid docking; Genome-scale protein threading; Community metabolic regulatory, signaling simulations; Molecular machine classical simulation; Protein machine Interactions; Cell, pathway, and network simulation; Molecule-based cell simulation]
*Teraflops

Genomics: GTL – A Vision of Systems Biology Research
In years we would like to be able to start with a microbe or microbial community of interest and in a matter of days or weeks:
- Generate an annotated DNA sequence
- Produce proteins and molecular tags for most/all proteins
- Identify the majority of multi-protein complexes
- Generate a working regulatory network model
- Identify the biochemical capabilities
- Design reengineering or control strategies in silico

1. LIMS and Workflow Management
Capabilities Needed:
- Map experimental strategies to distributed resources and instrument protocols
- Coordinate experimental process management across cyber collaboratories
- Track the process: sample tracking metadata
- Dynamically optimize experiment workflow
- Process and controls documentation / QA
- Localize problems with data production quality
- Share process data across facilities or projects
Make production-scale collaborative science possible
Track and capture metadata
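The sample-tracking and QA capabilities listed above could be represented along the following lines. This is a minimal sketch, not a design from the presentation: the class names, fields, and facility labels are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProcessStep:
    """One recorded step in a sample's processing history (hypothetical schema)."""
    name: str
    facility: str
    timestamp: datetime
    passed_qc: bool


@dataclass
class Sample:
    """A tracked sample together with its process metadata."""
    sample_id: str
    organism: str
    history: List[ProcessStep] = field(default_factory=list)

    def record(self, name: str, facility: str, passed_qc: bool = True) -> None:
        """Append a timestamped step to the sample's history."""
        step = ProcessStep(name, facility, datetime.now(timezone.utc), passed_qc)
        self.history.append(step)

    def first_failure(self) -> Optional[ProcessStep]:
        """Return the earliest step that failed QC, or None if all passed."""
        for step in self.history:
            if not step.passed_qc:
                return step
        return None


sample = Sample("S-001", "example microbe")
sample.record("protein expression", "facility-A")
sample.record("mass spectrometry", "facility-B", passed_qc=False)
sample.record("data reduction", "facility-B")

failed = sample.first_failure()
if failed is not None:
    print(f"QC problem localized to '{failed.name}' at {failed.facility}")
```

Keeping the full step history with each sample is one way to localize data quality problems and to share process data across facilities.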

1. LIMS and Workflow Management
R & D Challenges and Technologies
- Approaches to coordinated process design, optimization, protocol mapping for a large distributed enterprise
- Explore LIMS and workflow management systems technology including commercial systems – modify?
- Explore approaches to process documentation and control, QA/QC, and process metadata representation – make data reproducible
- Develop collaborative tools, electronic notebooks, web servers for shared access to laboratory data

2. Data Capture and Archiving
Capabilities Needed:
- Capture bulk data and metadata from many different measurements and instruments in shared large-scale data archives
- Represent complex non-standard data types: mass spectrometry, light microscopy, cryo-EM, expression, biophysical & biochemical characterization data…
- Capture and represent data quality, statistical reliability measures, process metadata
- Support deposition, access, transfer and retrieval for archives of multi-petabyte size
[Slide graphics: Raw data sets; "Swimming in Data"]

2. Data Capture and Archiving
R & D Challenges and Technologies
- Developing representations and models for data and metadata from many different measurements and assays: confocal images, video, mass spec, 3D cryo-EM, ...
- Developing data exchange and format standards for facilities and the community
- Hardware infrastructure for rapid and flexible access to very large (petabyte) data volumes. Research new data storage technologies.
- Research approaches to design, query and retrieval efficiency in large datasets and with non-standard data types
[Slide graphic: Raw data sets]

3. Data Analysis and Reduction
Capabilities Needed:
Process data from instruments such as mass spectrometers, microscopes, NMR, etc., to reduce and analyze data, e.g.:
- Automatically identify interacting protein events in FRET confocal microscopy
- Identify peptides, proteins, PTMs of interest in mass spectrometry data
- Quantitate changes in / cluster expression data from arrays or mass spectrometry
- Compare metabolite levels under different cell conditions
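As one small illustration of quantitating changes in expression or metabolite levels between conditions, the sketch below computes log2 fold changes. The pseudocount, the threshold, and the gene names and counts are assumptions made for the example, not values from the presentation.

```python
import math
from typing import Dict, List


def log2_fold_changes(
    control: Dict[str, float],
    treated: Dict[str, float],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """log2(treated / control) for items measured under both conditions.

    The pseudocount avoids division by zero for unexpressed items.
    """
    shared = sorted(control.keys() & treated.keys())
    return {
        name: math.log2((treated[name] + pseudocount) / (control[name] + pseudocount))
        for name in shared
    }


def changed(fold_changes: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """Names whose absolute log2 fold change meets the threshold."""
    return [name for name, fc in fold_changes.items() if abs(fc) >= threshold]


# Hypothetical measurements under two cell conditions.
control = {"geneA": 7.0, "geneB": 3.0, "geneC": 15.0}
treated = {"geneA": 31.0, "geneB": 3.0, "geneC": 3.0}

fc = log2_fold_changes(control, treated)
print(fc)
print("changed:", changed(fc))
```

Production pipelines would add normalization and statistical testing across replicates, which this sketch omits.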

3. Data Analysis and Reduction
R & D Challenges and Technologies
- Many types of data, each with algorithm research and development challenges for analyzing data; basic algorithm research needed! E.g.:
  - Automated processing of images and video about protein cell localization to achieve high-throughput analysis
  - New mass spectrometry algorithms to identify post-translational modifications, cross-linked peptides, and new proteins (de novo MS), and to automate quantitation
  - Analysis of NMR, scattering, AFM, ...
- Analysis throughput likely to be an issue; research on Grid analysis approaches and codes for large clusters and MPP environments
- Approaches to tools libraries and repositories
- Develop and adopt software engineering principles and practices for GTL software development: modular, open source

4. Modeling and Simulation
Capabilities Needed:
Build models of biology that capture our knowledge, based on a combination of experimental data types, validate these models, and use them to predict, e.g.:
- Build regulatory network topology from observations of protein expression based on conditions
- Build a protein-protein interaction network from protein interaction data of several types
- Build a model for the organization of a protein complex from homology modeling, geometry constraints from mass spec, and cryo-EM images
- Build cell models that combine regulation, metabolism and protein interactions
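Building a protein-protein interaction network from interaction data of several types can be sketched as follows. The rule used here, keeping an edge only when at least two independent evidence types support it, is an assumption chosen for illustration, as are the protein and evidence labels.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Observation = Tuple[str, str, str]  # (protein_a, protein_b, evidence_type)


def build_network(
    observations: Iterable[Observation],
    min_evidence_types: int = 2,
) -> Dict[str, Set[str]]:
    """Undirected adjacency map of interactions with enough distinct evidence."""
    evidence: Dict[frozenset, Set[str]] = defaultdict(set)
    for a, b, method in observations:
        if a == b:
            continue  # ignore self-interactions in this sketch
        # frozenset makes (a, b) and (b, a) the same undirected pair
        evidence[frozenset((a, b))].add(method)

    network: Dict[str, Set[str]] = defaultdict(set)
    for pair, methods in evidence.items():
        if len(methods) >= min_evidence_types:
            a, b = tuple(pair)
            network[a].add(b)
            network[b].add(a)
    return dict(network)


# Hypothetical observations from three kinds of data.
observations = [
    ("P1", "P2", "mass-spec"),
    ("P2", "P1", "co-expression"),
    ("P2", "P3", "mass-spec"),
    ("P3", "P4", "cryo-EM"),
    ("P4", "P3", "mass-spec"),
]

print(build_network(observations))
```

A weighted scheme that scores each evidence type by its reliability would be a natural refinement of the simple count used here.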

4. Modeling and Simulation
R & D Challenges & Technologies
- Synthesis: how to infer or reconstruct systems from data – build "optimal" model
  - Metabolic pathways from metabolic data & genome
  - Regulatory networks from expression data
  - Protein interaction networks from binary interaction data
- How to integrate different types of data into models
  - Integration of different imaging modalities
  - Integration of metabolism, regulation, and protein interactions into cell models
  - How to derive best interaction networks from raw binary interaction data, cell interaction images, predicted interactions, and co-expression data
Capture human modes of integration to automate

4. Modeling and Simulation
R & D Challenges (cont'd)
- How to mathematically represent biology: pathways, networks, communities
  - What is the right calculus to describe regulation / metabolism / protein interaction networks / signaling that allows quantitative prediction?
  - Differential equations? Stochastic or deterministic? Control theory or ad hoc mathematical networks? Binary or discrete value networks? Chaos theory? "Need for new abstractions"
  - In what regimes do they work and where do they fail?
- How do we deal with missing data, incomplete knowledge, or errors?
- Are there organizing principles or theory that could make us successful with incomplete knowledge?
- How to get to longer compute times for physics-based simulations (milliseconds and beyond): steer and sample
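The "stochastic or deterministic?" question can be made concrete with a toy model. The sketch below simulates a single molecular species produced at constant rate k and degraded at rate g per molecule, once as a differential equation (forward Euler) and once with a Gillespie-style stochastic simulation. The model, the rate values, and the function names are illustrative assumptions, not material from the presentation.

```python
import random


def deterministic(k: float, g: float, n0: float, t_end: float, dt: float = 0.01) -> float:
    """Integrate dn/dt = k - g*n with forward Euler; returns n at t_end."""
    n, t = float(n0), 0.0
    while t < t_end:
        n += (k - g * n) * dt
        t += dt
    return n


def gillespie(k: float, g: float, n0: int, t_end: float, rng: random.Random) -> int:
    """Stochastic simulation of the same birth-death process (requires k > 0)."""
    n, t = n0, 0.0
    while True:
        birth, death = k, g * n
        total = birth + death
        t += rng.expovariate(total)  # waiting time to the next reaction
        if t >= t_end:
            return n
        if rng.random() * total < birth:
            n += 1  # production event
        else:
            n -= 1  # degradation event (only possible when n > 0)


K, G = 10.0, 0.1  # hypothetical rates; deterministic steady state is K / G
print("deterministic:", round(deterministic(K, G, 0, 200.0), 2))

rng = random.Random(0)
runs = [gillespie(K, G, 0, 200.0, rng) for _ in range(5)]
print("stochastic runs:", runs)
```

The deterministic result settles at K / G, while individual stochastic runs scatter around it. That scatter matters most when molecule counts are small, which is one regime where differential equations can fail.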

5. Community Data Resource: "The GTL Knowledge Base"
Capabilities Needed:
Provide community access to data, models, simulations, and protocols for GTL. Allow users to query and visualize data, use models, run simulations.
- Community resources for multiple types of data: machines, interactions, process models, expression, regulation, genome annotation, metabolism, …
- Access to:
  - data
  - protocols and methods
  - analysis tools and user environments
  - models and simulations
- Access to multiple levels of data: raw data, processed results, dynamic models
- Integrated view of the biology represented
- Guide experimental design strategy for next microbe

5. Community Data Resource
R & D Challenges and Technologies
- Design and integration of the major databases
  - Huge data volumes, great schema complexity: need for new types of databases (hardware and software)
  - Database technologies – object-relational, graph DBs, …
  - Data standards, representations, ontologies for very complex objects
- User access
  - Systems for browsing, query, visualization, and to run analysis or simulations
  - Supporting simulation from DBs: how to allow users to utilize models and run simulations; how to link simulations to underlying data
- Integration
  - Provide integrated view of the biology
  - With data from other community sources
- Community access to compute power to run long timescale simulations
- IP issues and reward system
- How to represent incomplete, sparse, conflicting data
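One possible way to represent incomplete, sparse, and conflicting data is to store every assertion with its source and confidence instead of overwriting a single value. The sketch below assumes that approach; the class names, fields, and example entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Assertion(NamedTuple):
    value: Optional[str]  # None records "assayed, but no result obtained"
    source: str
    confidence: float


class KnowledgeStore:
    """Keeps all assertions about a (subject, attribute) pair side by side."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], List[Assertion]] = defaultdict(list)

    def add(self, subject: str, attribute: str, assertion: Assertion) -> None:
        self._facts[(subject, attribute)].append(assertion)

    def lookup(self, subject: str, attribute: str) -> List[Assertion]:
        """All assertions on record; an empty list means nothing is known."""
        return list(self._facts.get((subject, attribute), []))

    def is_conflicting(self, subject: str, attribute: str) -> bool:
        """True when sources disagree on a non-missing value."""
        values = {a.value for a in self.lookup(subject, attribute) if a.value is not None}
        return len(values) > 1

    def best(self, subject: str, attribute: str) -> Optional[Assertion]:
        """Highest-confidence assertion with a value, or None."""
        known = [a for a in self.lookup(subject, attribute) if a.value is not None]
        return max(known, key=lambda a: a.confidence, default=None)


store = KnowledgeStore()
store.add("proteinX", "localization", Assertion("membrane", "imaging", 0.9))
store.add("proteinX", "localization", Assertion("cytoplasm", "prediction", 0.4))
store.add("proteinX", "binding partner", Assertion(None, "mass-spec", 0.0))

print(store.is_conflicting("proteinX", "localization"))
print(store.best("proteinX", "localization"))
print(store.lookup("proteinY", "localization"))
```

Distinguishing "never measured" (no record) from "measured without result" (a record whose value is None) keeps sparse data from being mistaken for negative results.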

6. Infrastructure
Objective: Provide hardware and software environments to support analysis, data storage, modeling and simulation activities required in GTL
Examples of Infrastructure:
- Hardware, network and operating system environments for petascale data storage and retrieval
- Grid computing environments to support distributed large-scale data analysis operations
- Massively parallel architectures for systems simulation
- Discrete mathematics libraries