The Network Inference Problem and the SEBINI Platform

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group, Computational Sciences & Mathematics Division
Pacific Northwest National Laboratory (PNNL), Richland, Washington
Systems Biology at PNNL: 11/20/2018

DOE’s Genomics:GTL program
Follow-up to the Human Genome Project. (DOE launched the HGP; GenBank started at a DOE lab.)
Goal: to comprehensively understand cellular processes in a realistic context, i.e., systems biology.
To be accomplished using high-throughput advanced technologies and computation (petabyte-scale databases, integrated knowledgebases, network modeling).
Under the direction of the DOE Office of Science and its suboffices, the Office of Biological and Environmental Research (OBER) and the Office of Advanced Scientific Computing Research (OASCR).
Focused on microbial organisms. (Genomes sequenced in the DOE Microbial Genome Program.)

DOE Genomics:GTL Program Goals: The Role of Living Systems in Energy Production, Environmental Remediation, and Carbon Cycling and Sequestration
Molecular: Proteins and multicomponent molecular machines that perform most of the cell's work.
Cellular: Gene regulatory networks and pathways that control cellular processes.
Community: Microbial communities in which groups of cells carry out complex processes in nature.

Methods used to provide various data for inferring regulatory networks (I)
Prediction of transcription factor (TF) binding sites, e.g., Dr. Lee Ann McCue’s work at PNNL and the MotifMogul software at ISB.
Public and commercial databases (for example, TRANSFAC and SCPD) that gradually collect wet lab experiments identifying TFs and their binding sites, one by one.
Tiling arrays (expensive).
Projects to find protein-protein interactions, e.g., the PNNL/ORNL GTL project (expensive).
Computational algorithms that infer regulatory edges based (primarily) on correlations in state: the algorithms used in SEBINI. These require a large amount of array or protein expression data. Inferring specific gene-to-gene connections takes more powerful algorithms and models than the typical statistical techniques used for clustering.

Methods used to provide various data for inferring regulatory networks (II)
There are drawbacks to all methods.
TFBS prediction based on sequence and phylogenetic comparison is very hard.
Tiling arrays are promising but new (and expensive: they test one putative source TF at a time).
Drawbacks to both: they depend on the TFBS being near a gene to infer the target. In eukaryotes, ~70% of TFs bind far from their targets (A. Aderem).
Also: neither yields the regulation type (activator / inhibitor), only that there is binding.
Also: it is common in bacteria for a TF binding site to lie between genes transcribed in different directions. The TF may regulate one or both; tiling and TFBS prediction cannot tell which.
As for determining interaction networks using mass spec: not yet high-throughput in terms of results.

Conclusion
Computational algorithms based on correlations in state will continue to be used, remaining a standard approach for many years to come.
Bonus: gathering the large amount of array data they require provides the raw data for investigating state functions, a topic for future research.

Software Environment for BIological Network Inference (SEBINI) - Introduction
SEBINI has been created to provide an interactive environment for the evaluation and deployment of algorithms used in the reconstruction of the structure of biological regulatory networks. SEBINI compares and trains network inference methods on artificial networks and simulated gene expression perturbation data. It also allows experimental high-throughput expression data to be analyzed within the same framework, using the suite of (trained) inference methods. Hence SEBINI should be useful both to software developers wishing to evaluate, compare, refine, or combine inference techniques, and to bioinformaticians (or biologists) analyzing experimental data. SEBINI provides a platform that aids in more accurate reconstruction of regulatory and interaction networks, with much less effort, in less time.

SEBINI consists of a suite of programs operating on a centralized relational database. The user interface is web-based, operated by Java servlets. A collection of inference algorithms is provided (mutual information, Bayesian network structure learning, etc.), and an API has been created to allow addition of other inference methods in a well-defined manner. Briefly, methods using high-throughput data rely on searching for patterns of partial correlation or conditional probabilities that indicate causal influence. Such patterns of partial correlations found in the high-throughput data, possibly combined with other supplemental data on the genes in the proposed networks or other information on the organism, are the basis upon which the algorithms in SEBINI’s toolkit infer regulatory networks.
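As a rough illustration of the mutual-information family of methods mentioned above, the sketch below scores every gene pair by the mutual information of their binned expression profiles and keeps pairs above a cutoff. It is a minimal Python sketch, not SEBINI's implementation (SEBINI is written in Java); the function names, the toy data, and the 0.5-bit cutoff are all assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def infer_edges(binned, threshold):
    """Return undirected gene pairs whose mutual information is >= threshold."""
    genes = sorted(binned)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if mutual_information(binned[g], binned[h]) >= threshold:
                edges.append((g, h))
    return edges


# Hypothetical binned expression profiles (one value per experiment).
binned = {
    "geneA": [0, 0, 1, 1, 0, 1, 0, 1],
    "geneB": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to geneA
    "geneC": [0, 1, 0, 1, 1, 0, 0, 1],  # statistically independent of geneA
}
edges = infer_edges(binned, threshold=0.5)
```

Mutual information is symmetric, so a score like this suggests an association but not its direction; assigning direction or regulation type needs additional information.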

Thus, SEBINI allows
Direct comparison of network inference methods on common data sets.
Artificial data sets (topologies, perturbations, node input functions) that can be dynamically altered and stored.
Inference results that can be stored, further analysed, and visually displayed.
Dynamic, step-wise refinement of inference methods, based on results.
Well-defined addition of new inference algorithms through an API.
Supervised or unsupervised training of inference methods, with supervised inference results scored against the known network topologies.
Analysis of experimental data within the same framework.
Storage of supplemental information, allowing such information to be made available to an inference method (e.g., which of the transcription factors being produced will bind to which promoter sites).
SEBINI can thus show the quantitative effect of background information: for example, knowledge of M% of the promoter sites increases the number of regulatory edges that can be deduced by N%.

Software Environment for BIological Network Inference (SEBINI)
[Architecture diagram; components listed below]
External systems and files: PNNL’s Bioinformatics Resource Manager (BRM); text files (flat files); PNNL’s PRISM database system.
Input Module: high-throughput experimental data.
Builder Module: simulated high-throughput expression data for artificial networks.
SEBINI core: central relational database (PostgreSQL); user interface is a web site operated by Java servlets.
Inference: collection of network inference algorithms. The user selects an algorithm and a data set, then runs the algorithm to infer a network (a set of edges). Mutual information-based and Bayesian network structure learning algorithms are provided for learning regulatory networks. Also: a PNNL/ORNL algorithm for learning protein-protein interaction networks from PNNL/ORNL bait-prey experiment mass spec data sets. Inferred networks are permanently stored back into the database.
Analysis: topological statistics, network annotation, post-inference processing; scoring & error analysis (on artificial data sets).
Outputs: visualization of inferred networks via Cytoscape; human-readable reports on inferred networks; machine-readable network structure files for dynamic modeling programs.

SEBINI architecture & implementation (I)
~100 Java programs (classes) and growing rapidly.
All inter-servlet communication is routed through a CentralControl class. Algorithm handlers are called directly from the Java servlets for the corresponding web pages. This environment is NOT a spiderweb: there is a control chokepoint.
~30 PostgreSQL database tables. Slowly growing; at present quite stable. One major database change coming.
Data security is project-based. Upon login, the user is assigned a 32-digit hexadecimal JSessionID, which is checked before display of every web page.
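The per-page session check described above can be pictured roughly as follows. This is a hypothetical Python sketch of the idea only: in SEBINI the session ID comes from the Java servlet container, and the `SessionRegistry` class, its method names, and the project names here are invented for illustration.

```python
import re
import secrets

# A well-formed ID in this sketch is exactly 32 lowercase hex digits.
_SESSION_ID = re.compile(r"[0-9a-f]{32}")


class SessionRegistry:
    """Hypothetical in-memory map from session ID to the project it may access."""

    def __init__(self):
        self._project_by_session = {}

    def login(self, project):
        session_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex digits
        self._project_by_session[session_id] = project
        return session_id

    def may_view(self, session_id, project):
        """Check to run before rendering any page that belongs to `project`."""
        if not isinstance(session_id, str) or not _SESSION_ID.fullmatch(session_id):
            return False
        return self._project_by_session.get(session_id) == project


registry = SessionRegistry()
sid = registry.login("demo_project")
```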

SEBINI architecture & implementation (II)
While SEBINI was originally designed to infer directed (regulatory) networks, the code now allows undirected networks, so algorithms that infer interaction networks can be used, and such networks (e.g., protein-protein interaction networks) can be permanently stored and analyzed.
Design issues: interface for user navigation among huge data sets; database design to map inferred networks and inferred edges back to the original network and expression data (IDs must be carried forward).
expression data -> (one-to-many via binning alg choice) -> binned exp data -> (one-to-many via inference alg choice) -> inferred_network
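One way to picture the "IDs must be carried forward" requirement is a chain of records in which each derived data set stores the ID of its parent. The Python sketch below is an assumption-laden illustration: the record types and field names are hypothetical and do not reflect SEBINI's actual PostgreSQL schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExpressionSet:
    expression_id: int
    name: str


@dataclass(frozen=True)
class BinnedSet:
    binned_id: int
    expression_id: int  # carried forward from the parent ExpressionSet
    binning_alg: str


@dataclass(frozen=True)
class InferredNetwork:
    network_id: int
    binned_id: int  # carried forward from the parent BinnedSet
    inference_alg: str
    edges: Tuple[Tuple[str, str], ...]


def trace_back(network, binned_sets, expression_sets):
    """Follow the stored IDs from an inferred network back to its source data."""
    binned = next(b for b in binned_sets if b.binned_id == network.binned_id)
    expr = next(e for e in expression_sets if e.expression_id == binned.expression_id)
    return binned, expr


# One expression set, binned two ways; one network inferred from the second binning.
expression_sets = [ExpressionSet(1, "heat_shock_arrays")]
binned_sets = [BinnedSet(10, 1, "equal_width_3"), BinnedSet(11, 1, "equal_width_5")]
network = InferredNetwork(100, 11, "mutual_information", (("geneA", "geneB"),))
binned, expr = trace_back(network, binned_sets, expression_sets)
```

Because each level is one-to-many, the same expression set can sit behind many binned sets and many inferred networks without being duplicated.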

SEBINI architecture & implementation (III)
Design issues (continued): multi-threaded; job monitoring; web pages to view all data points, working towards transparency of all data in all tables (security access permitting) in the database via display through the web site.
Permanent storage of binned/pre-processed data sets. Novel. Important for efficiency, transparency, speed of response, and analysis of results.
Job times are recorded to the millisecond. Algorithms can be compared on efficiency vs. relative power.
A Java handler class is created for each new algorithm to “wrap” it, i.e., to handle communication with the database and web site.
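The handler idea (one wrapper per algorithm that handles storage and records the job time) can be sketched as below. SEBINI's real handlers are Java classes whose API is not shown in these slides, so the Python base class, method names, record fields, and the toy algorithm are all hypothetical.

```python
import time
from abc import ABC, abstractmethod


class AlgorithmHandler(ABC):
    """Hypothetical wrapper: runs one inference algorithm and stores the result."""

    name = "unnamed"

    @abstractmethod
    def infer(self, binned, params):
        """Return a list of edges inferred from binned expression data."""

    def run(self, binned, params, store):
        start = time.perf_counter()
        edges = self.infer(binned, params)
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        record = {
            "algorithm": self.name,
            "params": dict(params),
            "edges": list(edges),
            "elapsed_ms": elapsed_ms,  # job time, to the millisecond
        }
        store.append(record)  # stands in for a database insert
        return record


class CoExpressionHandler(AlgorithmHandler):
    """Toy algorithm: link genes whose binned profiles agree in enough samples."""

    name = "toy-coexpression"

    def infer(self, binned, params):
        genes = sorted(binned)
        edges = []
        for i, g in enumerate(genes):
            for h in genes[i + 1:]:
                same = sum(a == b for a, b in zip(binned[g], binned[h]))
                if same / len(binned[g]) >= params["min_agreement"]:
                    edges.append((g, h))
        return edges


store = []
data = {"g1": [0, 1, 1, 0], "g2": [0, 1, 1, 0], "g3": [1, 0, 0, 1]}
record = CoExpressionHandler().run(data, {"min_agreement": 0.75}, store)
```

Keeping timing and storage in the base class means every algorithm is measured the same way, which is what makes efficiency comparisons across algorithms meaningful.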

SEBINI architecture & implementation (IV)
SEBINI was initially implemented on a Dell desktop running Red Hat Linux, using Java ver. 1.4, PostgreSQL ver. 7.4, and Tomcat ver. 4.1. Cytoscape is used for network visualization, invoked through Java Web Start. SEBINI has also been installed on a Windows web server that will soon be accessible from outside PNNL, using Java ver. 1.5, Tomcat ver. 5, and PostgreSQL ver. 8.1. Jakarta Commons Java libraries are used for data file uploads [Jakarta]. Machine-specific parameters are stored in an easily changed properties text file.
Development site:
Public demo site:

Possible sources of inference algorithms
Probabilistic graphical models (structure learning for Bayesian networks, among others)
Information theory (mutual information based, and conditional mutual information, CMI)
Classical statistics, analysis of correlation, e.g., Pearson correlation
Machine learning: decision trees (C4.5, ID3), supervised and unsupervised
Data mining: association rule mining (really want to try this)
Pattern classification
Deductive reasoning
Neural networks
Fuzzy logic?
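To make the classical-statistics entry concrete, here is a minimal Python sketch of correlation-based edge inference: compute the Pearson correlation for each gene pair and keep pairs whose absolute correlation clears a cutoff. The function names, toy profiles, and the 0.9 cutoff are assumptions for illustration, not part of SEBINI.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(vx * vy)


def correlation_edges(expr, cutoff):
    """Return (gene, gene, r) for every pair with |r| >= cutoff."""
    genes = sorted(expr)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expr[g], expr[h])
            if abs(r) >= cutoff:
                edges.append((g, h, round(r, 3)))
    return edges


# Hypothetical expression profiles across four experiments.
expr = {
    "tf1": [1.0, 2.0, 3.0, 4.0],
    "target": [2.0, 4.0, 6.0, 8.0],      # rises with tf1
    "unrelated": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with tf1
}
edges = correlation_edges(expr, cutoff=0.9)
```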


SEBINI flow of control, for experimental data
1. Log into a project, or create a new project.
2. Create a network set (container) for the one experimental network.
3. Upload the experimental expression data file.
4. Select a binning algorithm and bin the data.
5. Select an inference algorithm, select the alg parameters, and infer a network.
6. Visualize the inferred network, now in the database, using Cytoscape.
7. View topological statistics.
8. View node and edge annotations.
9. Generate a human-readable report.
10. Export the topology in a format suitable for input into dynamic simulations.
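The binning step in the workflow above turns continuous expression values into a small number of discrete states before inference. The slides do not say which binning algorithms SEBINI offers, so the Python sketch below assumes the simplest option, equal-width bins; the function name and data are illustrative only.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant profile falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the top bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical expression values for one gene across six experiments.
profile = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]
binned_profile = equal_width_bins(profile, n_bins=3)
```

The choice of binning method and bin count changes what the inference algorithm sees, which is one reason to store each binned data set alongside the raw data.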

SEBINI flow of control, for synthetic data
1. Log into a project, or create a new project.
2. Select a topology build algorithm, enter param values, and create a synthetic network set.
3. Select an expression set build algorithm, enter param values, and create synthetic expression sets.
4. Select a binning algorithm and bin the data.
5. Select an inference algorithm, enter param values, and infer a network.
6. Visualize and compare the real and inferred network(s), using Cytoscape.
7. View topological statistics.
8. View precision, recall, and F-measure statistics for a precise measure of how well the inference alg performed against the “gold standard”, the known synthetic network.
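The scoring step against the known synthetic network can be sketched as follows: treat both networks as sets of edges and compute precision, recall, and the F-measure from their overlap. This is a minimal Python sketch under that assumption; the function name and the `directed` switch are invented for the example, and SEBINI's own scoring may differ in detail.

```python
def score_edges(true_edges, inferred_edges, directed=True):
    """Return (precision, recall, F-measure) of inferred edges vs. the known network."""

    def norm(edge):
        # For undirected scoring, (A, B) and (B, A) count as the same edge.
        return tuple(edge) if directed else tuple(sorted(edge))

    truth = {norm(e) for e in true_edges}
    found = {norm(e) for e in inferred_edges}
    true_positives = len(truth & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold-standard network and an inferred network.
gold = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
inferred = [("A", "B"), ("B", "C"), ("D", "C"), ("A", "E")]
directed_scores = score_edges(gold, inferred, directed=True)
undirected_scores = score_edges(gold, inferred, directed=False)
```

In this toy case the edge ("D", "C") is wrong as a directed edge but correct as an undirected one, so the undirected scores come out higher than the directed ones.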

Some goals for SEBINI
Make network inference a starting point, not an end point (currently an end point that is usually not even reached).
Simple deployment of state-of-the-art algs not previously available to a biology lab, available over the web or via a local SEBINI install.
Advance the field by improving the algorithms. Possibility of combining alg output (as done in GRAIL), now that alg results are stored in the same database.
Develop expertise on how much data is needed, appropriate cutoffs, species-specific post-processing, the weaknesses of a given method, and what background information on a genome is most useful to supplement the primary expression data.
“Network biology is only in its infancy” (2004, Barabasi).
Nobody knows what inference algorithm(s) will perform best; theoretical guidance is lacking. But SEBINI will position us to empirically test new algorithms and easily modify or combine algs, possibly with species-specific information.
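Once several algorithms' results sit in the same database, one simple way to combine them is a vote: keep an edge only if enough algorithms inferred it. The Python sketch below assumes that voting scheme purely for illustration; the slides do not describe how GRAIL or SEBINI would combine outputs, and the algorithm names and edges are made up.

```python
from collections import Counter


def consensus_edges(results, min_votes):
    """Return edges inferred by at least `min_votes` of the algorithms."""
    votes = Counter()
    for edges in results.values():
        votes.update(set(edges))  # each algorithm votes at most once per edge
    return sorted(edge for edge, count in votes.items() if count >= min_votes)


# Hypothetical stored results, keyed by algorithm name.
results = {
    "mutual_information": [("A", "B"), ("B", "C")],
    "bayesian_network": [("A", "B"), ("C", "D")],
    "pearson": [("A", "B"), ("B", "C"), ("A", "D")],
}
majority = consensus_edges(results, min_votes=2)
unanimous = consensus_edges(results, min_votes=3)
```

Raising `min_votes` trades recall for precision, so the threshold itself is something to tune empirically on synthetic networks with known topology.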

DOE Science Undergraduate Laboratory Internship (SULI) program
Year-round. Duration: 10 wks summer, wks in fall or spring. $400/week + housing. Manager: Karen Wieda – (509)
Of ~200 applications specifying PNNL as 1st round choice, 50 offers made last year. Summer term is the most competitive. DOE pays most of cost.
PNNL Science & Engineering Education - Fellowship Services. Undergrad, grad, visiting scientists, sabbaticals. $ /week, 5 months max. Manager: Rebecca Janosky – (509)
About applications/yr, of which 250 get an offer. Completed applications posted to central site. Mentor/host pays the cost, plus 18% overhead.

Take-home SULI information
Apply by early January 2007 for summer.
PNNL sees students who selected it as first choice on Feb 1.
SULI manager: Ms. Karen Wieda. Phone: (509)
Ronald Taylor (for the SEBINI project):
PNNL science education web site: students/
Other background web sites: genomicsgtl.energy.gov/compbio/

Acknowledgements
For work on SEBINI: Anuj Shah (for work on Cytoscape, MI alg translation, statistics), Meridith Blevins (SULI), Charles Treatman (SULI).
Funding: from the US Dept of Energy, through the PNNL Biomolecular Systems Initiative, the EMSL Membrane Grand Challenge at PNNL, and the joint PNNL/ORNL protein-protein interaction network mapping GTL project.
