The Network Inference Problem and the SEBINI Platform


The Network Inference Problem and the SEBINI Platform Ronald Taylor, Ph.D. Computational Biology & Bioinformatics Group Computational Sciences & Mathematics Division Pacific Northwest National Laboratory (PNNL) Richland, Washington Email: ronald.taylor@pnl.gov Systems Biology at PNNL: http://www.sysbio.org/ 11/20/2018

DOE’s Genomics:GTL program Follow-up to the Human Genome Project. (DOE launched the HGP in 1986. GenBank started at a DOE lab.) Goal: to comprehensively understand cellular processes in a realistic context, i.e., systems biology. To be accomplished using high-throughput advanced technologies and computation (petabyte scale databases, integrated knowledgebases, network modeling) Under the direction of the DOE Office of Science and its suboffices, the Office of Biological and Environmental Research (OBER) and the Office of Advanced Scientific Computing Research (OASCR). Focused on microbial organisms. (Genomes sequenced in the DOE Microbial Genome Program.)

DOE Genomics: GTL Program Goals - The Role of Living Systems in Energy Production, Environmental Remediation, and Carbon Cycling and Sequestration Molecular: Proteins and multicomponent molecular machines that perform most of the cell's work Cellular: Gene regulatory networks and pathways that control cellular processes Community: Microbial communities in which groups of cells carry out complex processes in nature

Methods used to provide various data for inferring regulatory networks (I) Prediction of transcription factor (TF) binding sites – e.g., Dr. Lee Ann McCue’s work at PNNL, also MotifMogul software at ISB. Public and commercial databases (for example: TRANSFAC, SCPD), gradually collecting wet lab experiments that identify TFs and their binding sites, one by one. Tiling arrays (expensive). Projects to find protein-protein interactions – e.g., PNNL/ORNL GTL project (expensive). Computational algorithms that infer regulatory edges based (primarily) on correlations in state – the algorithms used in SEBINI. Require a large amount of array or protein expression data. More powerful algorithms/models are needed to infer specific gene-to-gene connections than the typical statistical techniques used for clustering.

Methods used to provide various data for inferring regulatory networks (II) There are drawbacks to all methods. TFBS prediction based on sequence and phylogenetic comparison is very hard. Tiling arrays are promising, but new (expensive - test one putative source TF at a time). Drawbacks to both: dependence on nearness of TFBS to gene to infer target. In eukaryotes, ~70% of TFs bind far from their targets (A. Aderem). Also: neither yields regulation type (activator / inhibitor), just that there is binding. Also: it is common in bacteria for a TF to lie between genes transcribed in different directions. May regulate one or both - which choice is unknown from tiling and TFBS prediction. As for determining interaction networks using mass spec: not yet high-throughput, in terms of results.

Conclusion Computational algorithms that are based on correlations in state will continue to be used, remaining a standard approach for many years to come. Bonus: gathering the large amount of array data required for their use provides the raw data for investigation of state functions – topic for future research.

Software Environment for BIological Network Inference (SEBINI) - Introduction SEBINI has been created to provide an interactive environment for the evaluation and deployment of algorithms used in the reconstruction of the structure of biological regulatory networks. SEBINI compares and trains network inference methods on artificial networks and simulated gene expression perturbation data. It also allows the analysis within the same framework of experimental high-throughput expression data using the suite of (trained) inference methods. Hence SEBINI should be useful both to software developers wishing to evaluate, compare, refine, or combine inference techniques, and to bioinformaticians (or biologists) analyzing experimental data. SEBINI provides a platform that aids in more accurate reconstruction of regulatory and interaction networks, with much less effort, in less time.

SEBINI consists of a suite of programs operating on a centralized relational database. The user interface is web-based, operated by Java servlets. A collection of inference algorithms is provided (mutual information, Bayesian network structure learning, etc.), and an API has been created to allow addition of other inference methods in a well-defined manner. Briefly, methods using high-throughput data rely on searching for patterns of partial correlation or conditional probabilities that indicate causal influence. Such patterns of partial correlations found in the high-throughput data, possibly combined with other supplemental data on the genes in the proposed networks or other information on the organism, are the basis upon which the algorithms in SEBINI’s toolkit infer regulatory networks.
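As an illustration of the "mutual information" family of methods named above, the following is a minimal Python sketch that scores the dependence between two binned expression profiles. It is not SEBINI's own code (SEBINI is written in Java and its implementation is not shown in these slides); the function name `mutual_information` and the use of bits as the unit are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length discrete sequences.

    x and y are binned (discretized) expression profiles for two genes,
    one value per experiment. Hypothetical helper, not part of SEBINI.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)           # marginal counts for gene X
    count_y = Counter(y)           # marginal counts for gene Y
    count_xy = Counter(zip(x, y))  # joint counts over experiments
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi
```

A pair of genes whose binned states always agree scores the full entropy of either profile, while statistically independent profiles score zero; an inference method of this kind would keep only the pairs whose score exceeds some cutoff.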

Thus, SEBINI allows: Direct comparison of network inference methods on common data sets. Artificial data sets (topologies, perturbations, node input functions) that can be dynamically altered and stored. Inference results that can be stored, further analysed, and visually displayed. Dynamic, step-wise refinement of inference methods, based on results. Well-defined addition of new inference algorithms through an API. Supervised or unsupervised training of inference methods, with supervised inference results scored against the known network topologies. Analysis of experimental data within the same framework. Storage of supplemental information, allowing such information to be made available to an inference method (e.g., which of the transcription factors being produced will bind to which promoter sites). SEBINI can thus show the quantitative effect of background information – for example, knowledge of M% of the promoter sites increases the number of regulatory edges that can be deduced by N%.

Software Environment for BIological Network Inference (SEBINI) PNNL’s Bioinformatics Resource Manager (BRM) Input Module High-throughput experimental data Builder Module Simulated high-throughput expression data for artificial networks Text files (flat files) PNNL’s PRISM database system Visualization of inferred networks via Cytoscape Human-readable reports on inferred networks SEBINI Central relational database (PostgreSQL) User interface – web site operated by Java servlets Machine-readable network structure files for dynamic modeling programs Topological statistics, network annotation, post-inference processing; scoring & error analysis (on artificial data sets) Collection of network inference algorithms. User selects algorithm and data set, runs alg to infer a network (a set of edges). Mutual information-based and Bayesian network structure learning algorithms provided for learning regulatory networks. Also: PNNL/ORNL algorithm for learning protein-protein interaction networks from PNNL/ORNL bait-prey experiment mass spec data sets. Inferred networks permanently stored back into database.

SEBINI architecture & implementation (I) ~100 Java programs (classes) and growing rapidly. All inter-servlet communication is routed through a CentralControl class. Algorithm handlers are called directly from the Java servlets for the corresponding web pages. This environment is NOT a spiderweb – there is a control chokepoint. ~30 PostgreSQL database tables. Slowly growing; at present quite stable. One major database change coming. Data security – project based. Upon login, the user is assigned a 32-digit hexadecimal JSessionID, which is checked before display of every web page.
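The project-based session check described on this slide can be sketched roughly as follows. In SEBINI the session ID comes from the Java servlet environment; the Python class below is only an analogue, and the names `SessionStore`, `login`, and `check` are hypothetical.

```python
import secrets


class SessionStore:
    """In-memory map from session ID to (user, project). Illustrative only."""

    def __init__(self):
        self._sessions = {}

    def login(self, user, project):
        # 16 random bytes rendered as hex give a 32-digit hexadecimal ID.
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = (user, project)
        return session_id

    def check(self, session_id, project):
        """Return True only if the session exists and belongs to the project.

        A web layer would call this before rendering every page.
        """
        entry = self._sessions.get(session_id)
        return entry is not None and entry[1] == project
```

Routing every page request through a single check like this mirrors the "control chokepoint" idea: there is one place where access to a project's data is granted or refused.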

SEBINI architecture & implementation (II) While SEBINI was originally designed to infer directed (regulatory) networks, the code now allows undirected networks, so algorithms that infer interaction networks can be used and such networks (e.g., protein-protein interaction networks) permanently stored and analyzed. Design issues: interface for user navigation among huge data sets, database design to map inferred networks and inferred edges back to original network and expression data – IDs must be carried forward. expression data → one-to-many via binning alg choice → binned exp data → one-to-many via inference alg choice → inferred_network
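The lineage chain on this slide (expression data, then binned data, then inferred network, each step one-to-many) can be sketched with plain records that carry their parent's ID forward. The class and field names below are hypothetical and do not reflect SEBINI's actual PostgreSQL schema, which is not given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionSet:
    id: int
    name: str


@dataclass(frozen=True)
class BinnedSet:
    id: int
    expression_set_id: int  # parent ID carried forward
    binning_alg: str


@dataclass(frozen=True)
class InferredNetwork:
    id: int
    binned_set_id: int      # parent ID carried forward
    inference_alg: str
    directed: bool          # False for interaction (undirected) networks


def trace(network, binned_by_id, expr_by_id):
    """Walk an inferred network back to its binned set and raw expression set."""
    binned = binned_by_id[network.binned_set_id]
    return expr_by_id[binned.expression_set_id], binned


# One expression set, two binnings of it, two networks inferred from one binning.
expr = {1: ExpressionSet(1, "array_data")}
binned = {
    10: BinnedSet(10, 1, "equal_width"),
    11: BinnedSet(11, 1, "quantile"),
}
networks = [
    InferredNetwork(100, 10, "mutual_information", directed=True),
    InferredNetwork(101, 10, "pairwise_interaction", directed=False),
]
```

Because each record stores only its immediate parent's ID, any inferred edge can be mapped back to the exact binning choice and raw data set that produced it.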

SEBINI architecture & implementation (III) Design issues (continued): multi-threaded; job monitoring; web pages to view all data points, working towards transparency of all data in all tables (security access permitting) in the database via display through the web site. Permanent storage of binned/pre-processed data sets. Novel. Important for efficiency, transparency, speed of response, analysis of results. Job times recorded to the millisecond. Algorithms can be compared on efficiency vs relative power. A Java handler class is created for each new algorithm, to “wrap it”, i.e., to handle communication with the database and web site.
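The handler idea, one wrapper per algorithm that runs it, times the job, and returns a record ready for storage, might look like the sketch below. SEBINI's handlers are Java classes whose interface is not shown in the slides; `AlgorithmHandler`, `run`, and the toy `identical_profiles` algorithm are hypothetical names used only for illustration.

```python
import time


class AlgorithmHandler:
    """Wraps an inference function and records how long each job takes."""

    def __init__(self, name, infer_fn):
        self.name = name
        self._infer_fn = infer_fn

    def run(self, binned_data, **params):
        start = time.perf_counter()
        edges = self._infer_fn(binned_data, **params)
        # Elapsed wall-clock time in milliseconds.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # A real handler would write this record to the database.
        return {
            "algorithm": self.name,
            "params": params,
            "edges": sorted(edges),
            "elapsed_ms": elapsed_ms,
        }


def identical_profiles(binned_data):
    """Toy algorithm: link every pair of genes with identical binned profiles."""
    genes = sorted(binned_data)
    return {
        (a, b)
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
        if binned_data[a] == binned_data[b]
    }


handler = AlgorithmHandler("identical_profiles", identical_profiles)
result = handler.run({"g1": [0, 1], "g2": [0, 1], "g3": [1, 0]})
```

Keeping timing and storage in the wrapper means the algorithm itself stays a pure function of its input data, which is what makes algorithms comparable on efficiency as well as accuracy.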

SEBINI architecture & implementation (IV) SEBINI was initially implemented on a Dell desktop running Red Hat Linux, using Java ver. 1.4, PostgreSQL ver. 7.4, and Tomcat 4.1. Cytoscape is used for network visualization, invoked through Java Web Start. SEBINI has also been installed on a Windows web server that will soon be accessible from outside the PNNL, using Java ver. 1.5, Tomcat ver. 5, and PostgreSQL 8.1. Jakarta Commons Java libraries are used for data file uploads [Jakarta]. Machine-specific parameters are stored in an easily changed properties text file. development site: http://asimov.emsl.pnl.gov:8080/NIT/NIT.html public demo site: https://www.emsl.pnl.gov/NIT/NIT.html

Possible sources of inference algorithms Probabilistic graphical models (structure learning Bayesian networks, among others) Information theory (mutual information based, and CMI) Classical statistics, analysis of correlation - e.g., Pearson correlation Machine learning – decision trees (C4.5, ID3), supervised and unsupervised Data mining – association rule mining (really want to try this) Pattern classification Deductive reasoning Neural networks Fuzzy logic?
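Of the sources listed above, the classical-statistics route is the simplest to sketch: compute the Pearson correlation for every gene pair and keep the pairs above a cutoff. The function names and the default threshold of 0.9 are assumptions for this example, not values taken from SEBINI.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(profiles, threshold=0.9):
    """Return undirected candidate edges (gene_a, gene_b, r) with |r| >= threshold."""
    genes = sorted(profiles)
    edges = []
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if abs(r) >= threshold:
                edges.append((a, b, r))
    return edges
```

Note that correlation alone yields undirected edges and says nothing about which gene regulates which, which is one reason the more powerful models in the list are of interest.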


SEBINI flow of control, for experimental data Log into a project, or create a new project Create a network set (container) for the one experimental network Upload the experimental expression data file Select a binning algorithm and bin the data Select an inference algorithm, select the alg parameters, and infer a network. Visualize the inferred network, now in the database, using Cytoscape View topological statistics View node and edge annotations Generate a human-readable report Export the topology in a format suitable for input into dynamic simulations
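The binning step in this workflow turns continuous expression values into a small number of discrete states before inference. The slides do not say which binning algorithms SEBINI offers, so the equal-width scheme below is only one plausible choice, with a hypothetical function name.

```python
def equal_width_bins(values, n_bins=3):
    """Discretize numeric values into n_bins equal-width bins labelled 0..n_bins-1."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant profile: everything in one bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the top bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `equal_width_bins([0.0, 5.0, 10.0], n_bins=2)` places the first value in bin 0 and the other two in bin 1.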

SEBINI flow of control, for synthetic data Log into a project, or create a new project Select a topology build algorithm, enter param values, and create a synthetic network set Select an expression set build algorithm, enter param values, and create synthetic expression sets Select a binning algorithm and bin the data Select an inference algorithm, enter param values, and infer network Visualize and compare the real and inferred network(s), using Cytoscape View topological statistics View precision, recall, F-measure statistics for precise measure of how well the inference alg performed against the “gold standard”, the known synthetic network.
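The scoring step at the end of the synthetic workflow can be sketched as a comparison of edge sets. The formulas for precision, recall, and F-measure are standard; the function name `score_edges` and the `directed` switch are assumptions made for this example.

```python
def score_edges(inferred, gold, directed=True):
    """Return (precision, recall, f_measure) of inferred edges against gold edges.

    Edges are (source, target) pairs. With directed=False, (a, b) and (b, a)
    are treated as the same edge.
    """
    def normalize(edges):
        if directed:
            return {tuple(e) for e in edges}
        return {tuple(sorted(e)) for e in edges}

    inferred_set = normalize(inferred)
    gold_set = normalize(gold)
    true_positives = len(inferred_set & gold_set)
    precision = true_positives / len(inferred_set) if inferred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Scoring the same result both as directed and as undirected shows how much of an algorithm's error comes from getting the edge direction wrong rather than from missing the connection entirely.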

Some goals for SEBINI Make network inference a starting point, not an end point (currently an end point that is usually not even reached) Simple deployment of state-of-the-art algs not previously available to a biology lab, available over the web or via local SEBINI install. Advance the field by improving the algorithms. Possibility of combining alg output (as done in GRAIL), now that alg results are stored in same database. Develop expertise on how much data is needed, appropriate cutoffs, species-specific post-processing, the weaknesses of a given method, what background information on a genome is most useful to supplement the primary expression data. “Network biology is only in its infancy” (2004, Barabasi). Nobody knows what inference algorithm(s) will perform best - theoretical guidance is lacking. But SEBINI will position us to empirically test new algorithms, easily modify or combine algs - possibly with species specific information.

DOE Science Undergraduate Laboratory Internship (SULI) program DOE Science Undergraduate Lab Internship (SULI) program. Year-round. Duration: 10 wks summer, 12-16 wks in fall or spring. $400/week + housing; http://science-ed.pnl.gov/undergrad/erulf.stm. Manager: Karen Wieda – kj.wieda@pnl.gov, (509) 375-3811. Of ~200 applications specifying PNNL as 1st round choice, 50 offers made last year. Summer term is the most competitive. DOE pays most of cost. PNNL Science & Engineering Education - Fellowship Services. Undergrad, grad, visiting scientists, sabbaticals. $400-750/week, 5 months max. http://science-ed.pnl.gov/studentops.stm Manager: Rebecca Janosky – rebecca.janosky@pnl.gov, (509) 375-2302. About 900-1000 applications/yr, of which 250 get an offer. Completed applications posted to central site. Mentor/host pays the cost, plus 18% overhead.

Take-home SULI information Apply by early January 2007 for summer 2007. PNNL sees students who selected it as first choice on Feb 1. SULI manager: Ms. Karen Wieda. karen.wieda@pnl.gov Phone: (509) 375-3811 Ronald Taylor (for the SEBINI project): ronald.taylor@pnl.gov PNNL science education web site: http://science-ed.pnl.gov/students/ Other background web sites: www.pnl.gov, www.sysbio.org, genomicsgtl.energy.gov/compbio/

Acknowledgements For work on SEBINI: Anuj Shah (for work on Cytoscape, MI alg translation, statistics), Meridith Blevins (SULI), Charles Treatman (SULI) Funding: from the US Dept of Energy, through the PNNL Biomolecular Systems Initiative, the EMSL Membrane Grand Challenge at PNNL, and the joint PNNL/ORNL protein-protein interaction network mapping GTL project