Pacific Northwest National Laboratory U.S. Department of Energy DOE Data Workshop View from Information-intensive Applications H. Steven Wiley Biomolecular.

Presentation transcript:

Pacific Northwest National Laboratory U.S. Department of Energy DOE Data Workshop View from Information-intensive Applications H. Steven Wiley Biomolecular Systems Initiative Pacific Northwest National Laboratory (

2 Information Intensive Science. Goals of IIS: understanding systems versus individual phenomena; strengthening and automating links between different types of data from different scales. Examples: Biology (cell signaling), Biology (BIRN), Chemistry (CMCS), Homeland Defense. Complexity of systems is becoming pervasive. Challenges: efficient federation and graph-based queries; continuous data correlation; managing complex experiments and data provenance using multiple independent data and analysis resources. Priorities: high-performance federation, data mining, and semantic query capabilities (software, hardware architecture); knowledge environments (lightweight, evolvable, powerful, …); organization and visualization of large-scale, complex information.

3 Combustion is a Multi-scale Chemical Science Challenge. A systems-science approach to address complex problems: new knowledge is assimilated from different data, tools, and disciplines at each scale; real-time bi-directional information flow; deep analysis across scales; multiple applications for the same information. Challenges: data, provenance, and annotation publication; syntactic and semantic federation; standardization versus innovation. Examples: IUPAC (update of radical thermochemistry reference values by a global expert group); PrIMe (community-developed optimized reaction mechanisms), guiding experimental plans across scales and providing community resources for applied research.

4 Homeland Security: Pulling insight out of information overload. Immigration, financial, sensors, shipping, communications. Is there a domestic terrorist plot? Can we detect and prevent a terrorist attack BEFORE it happens? Challenges: volume of data, orders of magnitude larger and at different levels of abstraction; complexity of information spaces in very high dimensions, with 200 the norm; information often out of context, incomplete, fuzzy; deception; information in all media types (text, imagery, video, voice, web, sensor data); time and temporal dynamics fundamentally change the approach; spatial, yet non-spatial abstract data; multiple ontologies, languages, cultures; privacy issues. For homeland security and science we now turn to data-intensive visual analytics.

5

6 Systems Biology of Cells. Molecular parameters: protein levels / states / locations / interactions / activities. Cell function: death, proliferation, differentiation, migration, ... Ultimate aim: understanding and prediction of effects of component properties.

7

8

9 What, Where, Quantity, Quality? To successfully model a complex biological system, one must minimally know the following information. What parts are being made? (identity) How is the regulatory network structured? (interactions) Where are the proteins located in the cell? (location) What are their levels? (quantity) How do they interact with their partners? (activity): as a function of covalent modification; contribution of steric restrictions; forward and reverse rate constants.

10 Cells as Input-Output Systems. Biologists look at their experiments as input-output systems. We start with a “defined” system to which we apply a stimulus (input: independent variable). We then look for a specific response (output: dependent variable). The relationship between the input and output provides insight into the workings of the system. [Diagram labels: Input, System, Output, Unknown context.] So unless we control the experimental context, we cannot interpret our experiments.

11 The Two Greatest Challenges of Systems Biology: 1. Working with indeterminate systems. 2. Understanding context: what it is and how to control and capture it.

12 Defining the composition of living systems is driving analytical technologies: genomics, proteomics, metabonomics, expression profiling, imaging, etc. All of these technologies seek to rigorously define the composition of living systems.

13 Global simultaneous quantitative proteome measurements. Proteins identified and quantified using accurate mass and time (AMT) tags. [Figure: 2-D display of detected peptides. Dimension one: separation time (LC elution time, min). Dimension two: accurate mass (m/z).]
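The two-dimensional identification idea on this slide can be illustrated with a minimal sketch. It assumes that a detected feature is assigned to a tag when both its mass and its elution time fall within a tolerance of the tag's values. The `AmtTag` class, the `match_feature` function, the tolerance defaults, and the peptide names and numbers are all hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmtTag:
    """One reference entry: a peptide with its accurate mass and elution time."""
    peptide: str
    mass_da: float
    elution_min: float


def match_feature(mass_da, elution_min, tags, ppm_tol=5.0, time_tol_min=0.5):
    """Return the tags that agree with a feature in BOTH dimensions.

    The tolerance defaults are illustrative assumptions, not published values.
    """
    matches = []
    for tag in tags:
        ppm_error = abs(mass_da - tag.mass_da) / tag.mass_da * 1e6
        time_error = abs(elution_min - tag.elution_min)
        if ppm_error <= ppm_tol and time_error <= time_tol_min:
            matches.append(tag)
    return matches


# Hypothetical reference tags: A and B are nearly identical in mass,
# so only the elution time can tell them apart.
TAGS = [
    AmtTag("PEPTIDE_A", 1000.5000, 30.0),
    AmtTag("PEPTIDE_B", 1000.5040, 45.0),
    AmtTag("PEPTIDE_C", 1500.7500, 30.1),
]

hits = match_feature(1000.5010, 30.2, TAGS)
print([tag.peptide for tag in hits])  # ['PEPTIDE_A']
```

In this toy data the mass dimension alone would accept both PEPTIDE_A and PEPTIDE_B; the separation-time dimension removes the ambiguity.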

14 High Throughput Proteomics. Tesla High Throughput Mass Spectrometer: 1 experiment per hour; 5000 spectra per experiment; 4 MByte per spectrum. Per instrument: 20 GBytes per hour, 480 GBytes per day. These are based on today's technologies. Time to analyze offsite: 1 week. Time to analyze onsite: 48 hours. Time to analyze onsite with smart storage: 2 hours.
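The per-instrument data rates quoted on this slide follow directly from the three input figures. The short check below reproduces them; the only assumption is decimal units (1 GByte = 1000 MByte), and the function names are illustrative.

```python
# Input figures quoted on the slide.
EXPERIMENTS_PER_HOUR = 1
SPECTRA_PER_EXPERIMENT = 5000
MBYTE_PER_SPECTRUM = 4

# Assumption: decimal units, 1 GByte = 1000 MByte.
MBYTE_PER_GBYTE = 1000


def gbyte_per_hour():
    """Data produced by one instrument in one hour."""
    mbyte = EXPERIMENTS_PER_HOUR * SPECTRA_PER_EXPERIMENT * MBYTE_PER_SPECTRUM
    return mbyte / MBYTE_PER_GBYTE


def gbyte_per_day():
    """Data produced by one instrument running around the clock."""
    return gbyte_per_hour() * 24


print(gbyte_per_hour())  # 20.0
print(gbyte_per_day())   # 480.0
```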

15 Integrated, High-throughput Experiments will Generate Enormous Amounts of Data

16

17 The Molecular Interaction Scaffold is Huge (Trey Ideker)

18 Cell Imaging New multispectral, multidimensional imaging techniques can generate enormous amounts of data

19 Cell Imaging Workflow Complex set of metadata collected here

20 How Much Data From Imaging? Currently, a high-quality image of a single cell field is 4 MB per image, obtained at 4 fps (16 MB/s). Following a cell through one cell cycle (24 h) yields approximately 1.4 TB. New hyperspectral microscopes analyzing only 10 wavelengths would generate 7 TB/day. Characterizing the dynamics of the most abundant set of genes (4000) would require 5.5 PB. This is for a single instrument and a single experiment using today’s technology.
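The imaging volumes on this slide can be reproduced with a few lines of arithmetic. The sketch assumes decimal units and assumes that the 4000-gene figure means one full 24-hour cell-cycle recording per gene, which is the reading that matches the quoted total. The hyperspectral figure is left out because the slide does not state the assumptions behind it.

```python
# Input figures quoted on the slide.
MB_PER_IMAGE = 4
FRAMES_PER_SECOND = 4
HOURS_PER_CELL_CYCLE = 24
GENES_TO_CHARACTERIZE = 4000


def mb_per_second():
    """Sustained acquisition rate of one microscope."""
    return MB_PER_IMAGE * FRAMES_PER_SECOND


def tb_per_cell_cycle():
    """Data from following one cell field for a full cell cycle (1 TB = 1e6 MB)."""
    seconds = HOURS_PER_CELL_CYCLE * 3600
    return mb_per_second() * seconds / 1_000_000


def pb_for_gene_set():
    """Assumption: one cell-cycle recording per gene (1 PB = 1000 TB)."""
    return tb_per_cell_cycle() * GENES_TO_CHARACTERIZE / 1000


print(mb_per_second())                # 16
print(round(tb_per_cell_cycle(), 1))  # 1.4
print(round(pb_for_gene_set(), 1))    # 5.5
```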

21 Understanding the influence of cell context is driving experimental and computational biology: cell signaling, developmental biology, cancer and growth control, host-pathogen interactions, dynamics of microbial communities, cellular responses to stress.

22 Computational Modeling Approaches: Diverse Spectrum. [Figure: modeling approaches arranged along an axis from SPECIFIED to ABSTRACTED. Approaches shown: differential equations, Markov chains, Boolean models, Bayesian networks, statistical mining. Labels: mechanisms, influences, relationships (* including structure).]
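To make one point on this spectrum concrete, here is a minimal sketch of a Boolean model. The network is entirely hypothetical (a ligand activates a receptor, the receptor activates a kinase unless an inhibitor is present, and the kinase drives a response) and is not taken from the presentation. Each node is simply on or off, and all nodes are updated synchronously from the previous state.

```python
def step(state):
    """Synchronous update: every node reads the PREVIOUS state."""
    return {
        "ligand": state["ligand"],        # external input, held fixed
        "inhibitor": state["inhibitor"],  # external input, held fixed
        "receptor": state["ligand"],
        "kinase": state["receptor"] and not state["inhibitor"],
        "response": state["kinase"],
    }


def simulate(initial, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [initial]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory


def initial_state(ligand, inhibitor):
    """All internal nodes start off; only the two inputs are set."""
    return {
        "ligand": ligand,
        "inhibitor": inhibitor,
        "receptor": False,
        "kinase": False,
        "response": False,
    }


stimulated = simulate(initial_state(ligand=True, inhibitor=False), 3)
blocked = simulate(initial_state(ligand=True, inhibitor=True), 3)

print([s["response"] for s in stimulated])  # [False, False, False, True]
print(blocked[-1]["response"])              # False
```

The model captures which components influence which, and in what order the signal propagates, but it carries no rate constants or concentrations; a differential-equation model of the same pathway would need those.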

23 Computer Models Allow Reconstruction of Processes Across Different Scales. [Figure: model database schema. Levels shown: Organ 1 … Organ N, Tissue, Cell, Species 1 … Species N, with Model 1 … Data Set N. Model record fields: Unique ID, Model Name, Model Descr., Default Par., Default Comp., Timestamp, Security. Linked tables: Solution Par., Chemical Par., Geometric Par., Compute Par., Initial Conditions, Equation Docs., Parameter Docs. Table fields shown: Input_par ID, Input_fld ID, Value_par, React. Rates, Concen. Val., Symbolic, Source, References, Limits.]

24

25

26

27 Obstacles preventing scientists from utilizing available data: Data is distributed across many repositories with various ontologies and data formats. Analysis tools do not address integration of heterogeneous data sets. Few informatics-based analysis tools support a systems biology approach. Collaboration capabilities are too primitive to support shared knowledge among researchers.

28 The Challenge for Data Handling is Two-fold: 1. Managing the massive amounts of compositional data necessary to define all of the relevant experimental systems. 2. Capturing all of the data on the relationships between context, composition and response. Integration of the analytical and experimental methodologies into a single system is necessary to link all of the data in a useful way.

29 END

30 Understanding Living Cells. Cell responses are multiphasic. Different classes of stimulants (information) are processed at characteristic time scales. Processing nodes within cells are spatially segregated. Each cell responds independently depending on its specific context. A response generally induces a reprogramming of the cell machinery. To create cell simulations, we must “abstract” this information to create a reference model which can then be modified.