Systems Biology

Existing and future genome sequencing projects, together with the follow-on structural and functional analysis of complete genomes, will produce an unprecedented collection of molecular and functional information for a wide range of model organisms. The collected output of these efforts will provide the basis for assembling a detailed description of the cell. From that description we may begin to build models that simulate intracellular molecular and biochemical processes, in order to understand and predict the dynamic behavior of living cells.

Previous work in biochemical network, genomic, and cell simulation has led to models of well-characterized biochemical pathways. These projects, however, have fallen short of developing a scalable, hierarchical, integrative model of the cell that incorporates gene regulation, metabolism, signaling, and transport in a spatial modeling framework designed to scale to petaflops computer platforms and beyond.

We are developing an integrated environment that will enable the construction of multilevel computational metabolic models for prokaryotic organisms and microbial communities, and that will allow researchers to perform multilevel comparative and evolutionary analysis of biological data. This environment will contain the data and computational tools required for all steps of in silico metabolic modeling of microbial strains.

We believe that the development of a comprehensive model of biosystems and biosimulation requires the following:
1. Model development coupled with experimental groups developing new cell and molecular biology analysis and assay methods;
2. Multiple levels of abstraction;
3. Interface definitions for sharing model components and descriptions; and
4. Model integration, with the various pieces coming from disparate labs and multiple disciplines.

The goal of this effort is to gain a comprehensive understanding of the microbial metabolism of single organisms and of microbial communities.
Development of such models is based on close interaction and extensive data exchange with the experimental component of the project.
[Figure: analysis workflow diagram. Boxes: Sequence Analysis Module; Whole Genome Analysis and Architecture Module; Networks Analysis Module; Metabolic Simulation; Phenotypes Module; Experimentation; Proteomics; Metabolic Engineering; Visualization. Data products shown: gene function assignments; conjectures about gene functions; gene annotations; annotated data sets; genome features; annotated genome maps; genome comparisons; metabolic reconstructions (annotated stoichiometric matrices); operons, regulons, and networks; predictions of regulation; predictions of new pathways; functions of hypotheticals; network comparisons; conserved chromosomal gene clusters.]
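The workflow lists metabolic reconstructions as annotated stoichiometric matrices. As a minimal sketch of that representation, the Python below builds a stoichiometric matrix from a reaction list and checks whether a flux vector leaves every metabolite balanced. The four-reaction network, the function names, and the flux values are all hypothetical and chosen only for illustration; they are not taken from the project.

```python
# Hypothetical toy network (an assumption, not from the source):
# A is taken up, converted to B, then to C, and C is secreted.
# Each reaction maps metabolite -> stoichiometric coefficient
# (negative = consumed, positive = produced).
REACTIONS = {
    "uptake_A": {"A": 1},
    "A_to_B": {"A": -1, "B": 1},
    "B_to_C": {"B": -1, "C": 1},
    "secrete_C": {"C": -1},
}


def build_stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_names, matrix) with one row per
    metabolite and one column per reaction."""
    metabolites = sorted({m for coeffs in reactions.values() for m in coeffs})
    names = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix


def net_production(matrix, fluxes):
    """Compute S * v: the net production rate of each metabolite."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in matrix]


metabolites, names, S = build_stoichiometric_matrix(REACTIONS)

# Equal flux through every step leaves all metabolites balanced.
balanced = net_production(S, [2, 2, 2, 2])

# Taking up more A than is converted makes A accumulate.
unbalanced = net_production(S, [2, 1, 1, 1])

print(metabolites, balanced, unbalanced)
```

A flux vector for which every entry of the product is zero corresponds to a steady state of the toy network. A real reconstruction would also carry the annotations (genes, enzymes, evidence) attached to each column.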
GADU Framework: Solution Implementation

GADU: an Automated Pipeline to Support Analysis of the Genomes in GWiz

[Figure labels: Data Acquisition Module; Data Analysis Module; Data Storage Module.]

The integrated environment for high-throughput analysis of genomes and reconstruction of genetic networks from sequence data includes:
1. A supporting database containing:
   a. data obtained from various electronic data sources;
   b. computational models and the results of computational and (in the future) experimental analyses of the genomes and gene networks.
2. A high-throughput genetic sequence analysis module consisting of:
   a. a computational infrastructure, tools, and algorithms for high-throughput assignment of functions to the genes in sequenced genomes (SVM- and HMM-based methods, voting algorithms, etc.);
   b. the results of automated and interactive genetic sequence analysis of ~80 prokaryotic genomes.
3. A metabolic and regulatory network reconstruction module containing:
   a. tools and algorithms for reconstruction, representation, navigation, and analysis of metabolic and regulatory networks;
   b. reconstructions of metabolic and regulatory networks for at least 50 organisms.
4. A library of hypotheses for experimental validation concerning the functions of hypothetical proteins and the architecture of metabolic and regulatory networks.
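The sequence analysis module mentions voting algorithms for combining function assignments. The source does not describe the voting scheme itself, so the following is only a minimal sketch of one common variant, plurality voting with a minimum-agreement threshold. The function name, the `min_votes` parameter, the method names, and the example labels are assumptions made for illustration.

```python
from collections import Counter


def vote_on_function(predictions, min_votes=2):
    """Combine per-method function predictions by plurality vote.

    predictions: dict mapping a method name (e.g. "svm", "hmm") to the
    function label it predicts, or None if the method made no call.
    Returns the winning label, or None when fewer than `min_votes`
    methods agree or when the top labels are tied.
    """
    votes = Counter(label for label in predictions.values() if label is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count < min_votes or tied:
        return None
    return top_label


# Illustrative calls with made-up predictions for a single gene.
agreed = vote_on_function(
    {"svm": "EC 2.6.1.1", "hmm": "EC 2.6.1.1", "similarity": "EC 2.6.1.2"}
)
undecided = vote_on_function({"svm": "EC 2.6.1.1", "hmm": "EC 2.6.1.2"})
print(agreed, undecided)
```

Returning None instead of guessing keeps ambiguous genes available for interactive analysis, which matches the mix of automated and interactive annotation described above. Weighted votes would be a natural extension but are not shown here.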
Tools and Algorithms for Genetic Sequence Analysis

[Figure labels: Sources of Protein Families: Knowledge Base (ANL); Subunits DB (ANL); COGs; BLAST; Hobacgen; PhyloBLOCKS (ANL).]

SVM-Based Classification
Motivation: The resolution of current tools such as BLOCKs and Pfam is too low to discriminate between closely related homologous sequences.
Output:
1. A library of SVM models for identification of certain enzymatic functions.
2. Computational tools for prediction of protein functions based on these models.
Applications:
1. Identification of conserved amino acid residues responsible for the functionality of a protein sequence.
2. Automated classification and prediction of protein function.

Characterization of Protein Families (ANL/ORNL)
Motivation: To develop a library of BLOCKs HMM profiles corresponding to particular enzymatic functions or evolutionary versions of enzymes.
Output: Refined BLOCKs specific for particular enzymatic functions.
Applications:
1. Identification, classification, and characterization of proteins using refined BLOCKs.
2. Phylogenetic analysis using the distribution of BLOCKs to identify convergent or divergent evolution.
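One of the listed applications is the identification of conserved amino acid residues. As a rough sketch of the simplest version of that idea, the code below scans an ungapped block of aligned sequences and reports the columns where one residue reaches a chosen frequency. The function name, the `threshold` parameter, and the example sequences are hypothetical, and this is not the SVM- or HMM-based method the project describes.

```python
from collections import Counter


def conserved_columns(block, threshold=1.0):
    """Return (position, residue) pairs for columns of an ungapped
    alignment block where the most common residue has a frequency of at
    least `threshold` (1.0 means strictly invariant columns)."""
    if not block:
        return []
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in a block must have equal length")
    conserved = []
    for i in range(length):
        column = [seq[i] for seq in block]
        residue, count = Counter(column).most_common(1)[0]
        if count / len(block) >= threshold:
            conserved.append((i, residue))
    return conserved


# Made-up block of four aligned 5-residue segments.
BLOCK = ["GKTAL", "GKSAL", "GKTAV", "GKTCL"]

invariant = conserved_columns(BLOCK)
mostly_conserved = conserved_columns(BLOCK, threshold=0.75)
print(invariant)
print(mostly_conserved)
```

Positions are zero-based. Conservation alone does not show that a residue is functionally important; it only nominates candidates for closer study.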
The availability of phylogenetically diverse sequence data and the development of comprehensive bioinformatics methods now allow thorough investigation of the evolutionary origins of metabolic pathways. A number of evolutionary mechanisms participate in establishing enzymatic functions. These include divergent evolution (enzyme recruitment), convergent evolution (non-homologous replacement), horizontal transfer of genes from one organism to another, and inheritance of a biological pathway from an ancestor. Uncovering and understanding the evolutionary history of metabolic pathways could provide information about the past and present metabolic and evolutionary potential of a species. It can also help to guide engineering and the discovery of new metabolic activities.

Challenges:
1. Incompleteness of the domain and motif libraries. Currently available domain and motif libraries (e.g. InterPro, BLOCKs) contain a wealth of information for the characterization and identification of proteins, but they are still incomplete and do not yet cover a large number of enzymatic functions.
2. Low resolution of some sequence profiles. Some of the sequence profiles from the domain libraries can identify large protein families (e.g. aminotransferases) but are unable to discern specific enzymatic functions.
Hierarchical Simulation

[Figure labels: Molecular Machines; Gene & Chemical Networks; Earth's Macro Cycles; Whole Cells; Cell-Cell Interactions; Communities.]

The simulation software will represent models at multiple scales and will provide toolkits for building the corresponding simulations. There will be a systems interconnect to allow reuse of model components, reuse of simulation components, and workflows spanning bioinformatics (GWiz), simulation, analysis, and visualization.
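To make the idea of reusable components across scales concrete, here is a minimal sketch in which leaf models and composite models share one small interface, so that the same component can appear inside a cell and a cell inside a community. The class names, the `step` and `total` methods, and the first-order decay model are assumptions made for illustration; the source does not specify the simulation software's interfaces.

```python
class Decay:
    """Leaf model: first-order decay of one quantity, advanced with an
    explicit Euler step (a deliberately simple stand-in for a real
    molecular or network model)."""

    def __init__(self, amount, rate):
        self.amount = amount
        self.rate = rate

    def step(self, dt):
        self.amount -= self.rate * self.amount * dt

    def total(self):
        return self.amount


class Composite:
    """Higher-level model built from child components. Because it has
    the same step/total interface as its children, composites can be
    nested to any depth."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def step(self, dt):
        for child in self.children:
            child.step(dt)

    def total(self):
        return sum(child.total() for child in self.children)


# Two hypothetical cells assembled from the same leaf component type,
# then grouped into a community.
cell_a = Composite("cell_a", [Decay(10.0, 0.5), Decay(4.0, 0.0)])
cell_b = Composite("cell_b", [Decay(6.0, 0.5)])
community = Composite("community", [cell_a, cell_b])

community.step(1.0)
print(cell_a.total(), cell_b.total(), community.total())
```

The shared interface is what allows reuse: any object that provides `step` and `total` can be dropped into a higher level without changes to the composite. A production framework would also need coupling between components and per-level time steps, neither of which is shown.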