Improving Gene Function Prediction Using Gene Neighborhoods Kwangmin Choi Bioinformatics Program School of Informatics Indiana University, Bloomington,


Improving Gene Function Prediction Using Gene Neighborhoods Kwangmin Choi Bioinformatics Program School of Informatics Indiana University, Bloomington, IN

Introduction: PLATCOM (A Platform for Computational Comparative Genomics)
PLATCOM is a system for the comparative analysis of multiple genomes. It consists of three components:
  Databases of biological entities (e.g. fna, faa, ptt, gbk…)
  Databases of relationships among entities (e.g. genome-genome and protein-protein pairwise comparisons)
  Mining tools over the databases
The web interface of the PLATCOM system is located at

PLATCOM Web Interface Frontpage of Genome Plot

Background: What is an operon?
The operon structure was discovered in 1960 by two French biologists. Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356.
An operon is a group of genes that encode functionally linked proteins. Its component genes are:
  Adjacent ( nt)
  On the same strand (+ or -)
  Co-expressed from one promoter

Background: How to identify or predict operon structure?
When a promoter and terminator are known: gene clusters = transcription units (the classical concept of an operon).
When a promoter is not known: gene clusters = directrons (hypothetical operon candidates), defined by shared direction and a proper intergenic distance ( nt).
Computational methods have been developed to find gene clusters in bacterial genomes.
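The directron idea (same direction, bounded intergenic distance) can be sketched in a few lines of Python. This is only an illustration, not the PLATCOM implementation: the `Gene` record, the function name `find_directrons`, and the default `max_gap=300` are assumptions. The default is borrowed from the "Intergenic Distance = 300" setting shown on later slides, since the distance range on this slide did not survive in the transcript.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical minimal gene record; field names are assumptions.
    name: str
    start: int   # first coordinate of the gene
    end: int     # last coordinate of the gene (inclusive)
    strand: str  # "+" or "-"


def find_directrons(genes, max_gap=300):
    """Group genes into runs that share a strand and whose
    intergenic gaps are at most max_gap nucleotides."""
    clusters = []
    current = []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative if genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            clusters.append(current)
        current = [gene]
    if current:
        clusters.append(current)
    # A single gene is not a cluster; keep runs of two or more.
    return [c for c in clusters if len(c) >= 2]


if __name__ == "__main__":
    demo = [
        Gene("a", 100, 400, "+"),
        Gene("b", 450, 900, "+"),
        Gene("c", 1000, 1500, "-"),
        Gene("d", 1550, 2000, "-"),
        Gene("e", 3000, 3500, "-"),
    ]
    for cluster in find_directrons(demo):
        print([g.name for g in cluster])
```

Candidates found this way are only hypothetical operons. Without promoter and terminator evidence they cannot be called transcription units.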

PCBBH and PCH
R. Overbeek et al. PNAS, 1999, Vol. 96, pp
  PCBBH: Pair of Close Bidirectional Best Hits
  BBH: Bidirectional Best Hits
  PCH: Pair of Close Homologs
  COG: Clusters of Orthologous Groups
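As a rough illustration of the BBH and PCBBH definitions, the sketch below takes precomputed best hits in both directions and looks for pairs of genes that are close in one genome and whose bidirectional best hits are also close in the other. The function names, the use of gene-order indices, and the `max_sep` threshold are assumptions made for this example. The strand check used in the published method is omitted for brevity.

```python
from itertools import combinations


def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return {gene_in_A: gene_in_B} where each is the other's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


def close_bbh_pairs(bbh, pos_a, pos_b, max_sep=1):
    """Return pairs (x, y) of genome-A genes that are within max_sep
    positions of each other and whose BBH partners in genome B are
    also within max_sep positions of each other."""
    pairs = []
    for x, y in combinations(sorted(bbh, key=pos_a.get), 2):
        close_in_a = abs(pos_a[x] - pos_a[y]) <= max_sep
        close_in_b = abs(pos_b[bbh[x]] - pos_b[bbh[y]]) <= max_sep
        if close_in_a and close_in_b:
            pairs.append((x, y))
    return pairs


if __name__ == "__main__":
    # Toy best-hit tables; gene names are invented for the example.
    a_to_b = {"a1": "b1", "a2": "b2", "a3": "b9"}
    b_to_a = {"b1": "a1", "b2": "a2", "b9": "a7"}
    bbh = bidirectional_best_hits(a_to_b, b_to_a)
    pos_a = {"a1": 0, "a2": 1, "a3": 2}
    pos_b = {"b1": 5, "b2": 6, "b9": 40}
    print(bbh)
    print(close_bbh_pairs(bbh, pos_a, pos_b))
```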

Background: Über-operon
P. Bork et al. Trends Biochem. Sci., Vol. 25, pp
Über-operon: a set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements. This concept focuses on the functional themes of operons, not on specific genes or gene order.

Background: Why are gene clusters conserved?
Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes. These gene clusters might have been conserved since the last universal common ancestor. Why?
Selfish-operon hypothesis: horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes because co-expression and co-regulation are preserved.

Background: Problems in Operon Prediction
Over 150 genomes have been fully sequenced to date, but:
  The biological functions of some genes are still unknown.
  There are only a few promoter detection algorithms, and they are not fully satisfactory.
  In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).
  Operons tend to undergo multiple rearrangements during evolution. As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine synthesis).

Background: Problems in Computational Algorithms to Predict Operons
Direct signal finding
  Transcription promoters (5'-end) and terminators (3'-end) are searched.
  Only effective for species whose transcription signals are well known, e.g. E. coli.
Experiment-based approach
  Combines gene expression data, functional annotation and other experimental data.
Literature-based approach
  Primarily applicable to well-studied genomes such as E. coli, because data files are incomplete for other genomes.
  In many cases, genomic data files do not provide full information about genes and their products (e.g. gene name, COG, PID).

Procedure
As part of the PLATCOM project, an integrated whole-genome analysis system was built on the BIOKDD server. A web interface for the all-to-all pairwise comparison DB and tools is also provided.
Several tools for multiple-genome analysis were written in Perl, and gene neighborhoods were then reconstructed from the clustering data.
My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach.
Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.

Materials/ Tools Raw Data 22 genomes were chosen for this study. (14 groups) Protein-Protein Pairwise Comparison Data e.g. PTT files from NCBI site e.g. Data Generated by Web Tools Gene Clustering Data (based on sequence homology) e.g. Gene Clusters generated from PTT file (given intergenic distance) e.g. E. coli database for reality check
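Since PTT files are one of the inputs, a small reader may help make the later steps concrete. The sketch below assumes the tab-delimited layout commonly seen in NCBI PTT files: a few preamble lines followed by a header row starting with `Location`, with `Strand`, `Gene` and `COG` columns and `-` for missing values. That layout, the function name `parse_ptt`, and the sample rows are assumptions for illustration, not a description of the tools used in this study, which were written in Perl.

```python
import csv


def parse_ptt(text):
    """Parse PTT-style text into a list of gene dictionaries.

    Assumes a tab-delimited table whose header row starts with
    'Location' and whose Location values look like '190..255'.
    """
    lines = text.splitlines()
    header_idx = next(i for i, line in enumerate(lines)
                      if line.startswith("Location"))
    reader = csv.DictReader(lines[header_idx:], delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    genes = []
    for row in reader:
        start, end = row["Location"].split("..")
        genes.append({
            "start": int(start),
            "end": int(end),
            "strand": row["Strand"],
            "gene": None if row["Gene"] == "-" else row["Gene"],
            "cog": None if row["COG"] == "-" else row["COG"],
        })
    return genes


# Invented sample content, for illustration only.
SAMPLE = (
    "Example genome, complete genome - 1..5000\n"
    "2 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "190..255\t+\t21\t111\tgenA\tx0001\t-\t-\thypothetical protein\n"
    "337..2799\t+\t820\t112\tgenB\tx0002\t-\tCOG0524G\tribokinase\n"
)

if __name__ == "__main__":
    for gene in parse_ptt(SAMPLE):
        print(gene)
```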

Genomes

Procedure: My Approach to Reconstructing Genomic Neighborhoods
The idea underlying this study is that:
  Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods.
  By generating a "Tiling Path", the entire neighborhood can be reconstructed.
The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework. Start by looking at this framework, then search for a group of similar gene neighborhoods in the target genomes.
"Genomic context" means the pattern of a series of COGs. If the COG is not given, we can predict the function of an unknown gene based on my gene clustering data.
We can also identify some "Hitchhikers": inserted genes that originate from different contexts or themes.
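The tiling idea can be sketched as a greedy merge: start from the framework neighborhood and keep adding fragments from other genomes whenever they overlap what has been assembled so far. This is a simplified illustration under stated assumptions, not the method used in the study: neighborhoods are plain lists of COG identifiers, `min_overlap` is a hypothetical parameter, new COGs are appended rather than placed in a reconciled order, and `COG_X` is an invented placeholder.

```python
def tile_neighborhood(framework, fragments, min_overlap=1):
    """Greedily extend a framework neighborhood with overlapping fragments.

    framework: list of COG ids from the reference genome.
    fragments: {genome_name: list of COG ids} from target genomes.
    Returns (reconstructed COG list, names of genomes that contributed).
    """
    neighborhood = list(framework)
    used = []
    remaining = dict(fragments)
    extended = True
    while extended:
        extended = False
        for genome, cogs in list(remaining.items()):
            shared = set(cogs) & set(neighborhood)
            if len(shared) >= min_overlap:
                for cog in cogs:
                    if cog not in neighborhood:
                        neighborhood.append(cog)
                used.append(genome)
                del remaining[genome]
                extended = True
    return neighborhood, used


if __name__ == "__main__":
    framework = ["COG0779", "COG0195", "COG0532"]
    fragments = {
        "genome_B": ["COG0532", "COG0858", "COG0184"],
        "genome_C": ["COG0184", "COG0130"],
        "genome_D": ["COG_X"],  # shares nothing, so it is never tiled in
    }
    print(tile_neighborhood(framework, fragments))
```

A fragment such as `genome_C` is reachable only through `genome_B`, which is why the loop repeats until no further fragment can be attached.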

Tiling Path V.Koonin et al. Nucleic Acids Research, 2002, Vol.30, No.10, pp

Gene Neighborhoods

Results
Case 1: Relationship between gene order and phylogenetic distance
Case 2: One theme, a typical operon (rbs operon)
  Reconstruct gene neighborhoods
  Find missing components from the reconstructed gene clusters
Case 3: Two or more themes (functional coupling?)
  Find genomic hitchhikers
  Predict the function of uncharacterized proteins
  Predict functional coupling

Case 1: Gene Order and Phylogenetic Distance
If the gene order of two genomes is well conserved, the homologs should appear as a line on the genome comparison diagonal plot.
What is the relationship between phylogenetic distance and the conservation of gene order?
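To make the "line on the diagonal plot" idea concrete, the sketch below turns homolog pairs into plot coordinates and then extracts runs of consecutive genes whose homologs are also consecutive in the other genome. The function names, the one-to-one homolog mapping, and the `min_len` parameter are assumptions for this example. The Z-score filtering used for the actual plots is not modeled here.

```python
def diagonal_points(order_a, order_b, homologs):
    """Return (i, j) points: gene i of genome A is homologous to gene j of B."""
    index_b = {gene: j for j, gene in enumerate(order_b)}
    return [
        (i, index_b[homologs[gene]])
        for i, gene in enumerate(order_a)
        if gene in homologs and homologs[gene] in index_b
    ]


def collinear_runs(points, min_len=2):
    """Group points into runs where both coordinates move by one step.

    A step of -1 in genome B is allowed, so inverted segments count too.
    """
    runs, current = [], []
    for p in sorted(points):
        if (current
                and p[0] - current[-1][0] == 1
                and abs(p[1] - current[-1][1]) == 1):
            current.append(p)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [p]
    if len(current) >= min_len:
        runs.append(current)
    return runs


if __name__ == "__main__":
    # Toy gene orders; names are invented for the example.
    order_a = ["a1", "a2", "a3", "a4", "a5"]
    order_b = ["b3", "b4", "b9", "b1", "b2"]
    homologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
    pts = diagonal_points(order_a, order_b, homologs)
    print(pts)
    print(collinear_runs(pts))
```

Short runs like these correspond to the small line segments or clusters on the plot that can be treated as putative operons.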

Phylogenetic Tree V.Daubin et al. Genome Research, Vol 12, Issue 7,

Genome Comparison Diagonal Plot : Phylogenetically-Distant Species (Z-score = over 500)

Genome Comparison Diagonal Plot : Phylogenetically-Close Species (Z-score > 1000)

Fragmented Gene Clusters

Case 1: Conclusion
Gene order in phylogenetically distant species is poorly conserved. But this observation does not mean that gene order is conserved very well among phylogenetically close species.
Even in the case of very close species (e.g. E. coli vs. H. influenzae), gene order is completely scattered.
In most cases, only a small number of genes are observed as a short line or cluster, and we may consider such a cluster a putative operon. In the next step, this possibility is investigated in depth.

Case 2: Rbs Operon (Typical Operon)
Theme: Ribose transport across membrane
  COG1869  D-ribose high-affinity transport system; membrane-associated protein
  COG1129  ATP-binding component of D-ribose high-affinity transport system
  COG1172  D-ribose high-affinity transport system
  COG1879  D-ribose periplasmic binding protein
  COG0524  ribokinase
  COG1609  regulator for rbs operon

Case 2 : Rbs Operon Z-score = over 750, Intergenic Distance = 300

Case 2: Conclusion
All components are involved in ribose transport across the bacterial cell membrane.
In Rbs operon system, gene order pattern is out of 22 genomes have this operon system.
Except in some cases, this gene order pattern is conserved very well. So it is possible that there exists a kind of "General Contextual Framework" of gene order.

Case 3: Functional Coupling of 2 or More Themes
Theme 1: Transcription
  COG0779  Uncharacterized conserved protein
  COG0195  Transcription elongation factor
  COG2740  Predicted nucleic-acid-binding protein (transcription termination?)
Theme 2: Translation
  COG1358  Ribosomal protein S17E
  COG0532  Translation initiation factor 2 (GTPase)
  COG1550  Uncharacterized conserved protein
  COG0858  Ribosome-binding factor A
  COG0184  Ribosomal protein S15P/S13E
  COG0130  tRNA pseudouridine synthase
Hitchhiker?
  COG0196  FAD synthase (hitchhiker?)

Case 3 : Functional Coupling Z-score = over 750, Intergenic Distance = 300

Case 3: Conclusion
Functional coupling: In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious. COG0779 (uncharacterized) is almost inseparable from COG0195 (transcription elongation factor), so it is likely to be a functional partner of COG0195.
Hitchhiker: The association of COG0196 (FAD synthase) is not as tight as the connections between the genes belonging to the themes.
Gene function prediction: The functions of 3 genes in AE genomes can be predicted by reading genomic context.
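One simple way to express "almost inseparable" versus "not as tight" is a co-occurrence fraction: among the neighborhoods that contain an anchor COG, how often does another COG appear with it? The sketch below is an illustration under stated assumptions, not the analysis behind these slides. The function name `cooccurrence` and the four toy neighborhoods are invented and do not reproduce the study's data.

```python
from collections import Counter


def cooccurrence(neighborhoods, anchor):
    """For each COG, return the fraction of anchor-containing
    neighborhoods in which that COG also appears."""
    with_anchor = [set(cogs) for cogs in neighborhoods.values()
                   if anchor in cogs]
    if not with_anchor:
        return {}
    counts = Counter(cog for cogs in with_anchor
                     for cog in cogs if cog != anchor)
    return {cog: n / len(with_anchor) for cog, n in counts.items()}


if __name__ == "__main__":
    # Invented toy neighborhoods, for illustration only.
    toy = {
        "g1": ["COG0779", "COG0195", "COG0532", "COG0196"],
        "g2": ["COG0779", "COG0195", "COG0532"],
        "g3": ["COG0779", "COG0195", "COG0858"],
        "g4": ["COG0195", "COG0779", "COG0532"],
    }
    scores = cooccurrence(toy, "COG0195")
    for cog, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(cog, score)
```

In this toy input a COG that always accompanies the anchor scores 1.0 and would be a candidate functional partner, while a COG that appears only once scores low and would be a candidate hitchhiker. Any cutoff between the two would have to be chosen and justified separately.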

Conclusion
The Genome Comparison Diagonal Plot visualizes the sequence comparison of 2 genomes. It is a simple tool, but it gives strong intuition about genome structure.
Conserved gene neighborhoods reconstructed from many genomes by the Tiling Path method can be used to predict the functions of uncharacterized genes and functional coupling between well-characterized genes in those genomes.
Ultimately, we can use this method to reconstruct metabolic and functional subsystems.

Acknowledgements
Haifeng Zhao: Genome Pairwise Comparison DB
Scott Martin: Server Management and Technical Support
Dr. Sun Kim: Graduate Advisor and P.I.