I. Prolinks: a database of protein functional linkage derived from coevolution II. STRING: known and predicted protein-protein associations, integrated and transferred across organisms Hoyoung Jeong

2 Table Of Contents
- Introduction
- Genomic Inference Method
  - Phylogenetic profile method
  - Gene cluster method
  - Gene neighbor method
  - Rosetta Stone method
- TextLinks
- Comparative benchmarking database
  - Prolinks
  - STRING
- System
  - Proteome Navigator
  - STRING
- Conclusion

3 Introduction(1/2)
- Genome sequencing has allowed scientists to identify most of the genes encoded in each organism
- The function of many translated proteins, typically about 50%, can be inferred by sequence comparison with previously characterized sequences
- Assigning function by homology gives only a partial understanding of a protein's role within a cell
- A more complete understanding of a protein's function requires identifying its interacting partners

4 Introduction(2/2)
- Functional linkage
  - Requires non-homology-based methods
  - Two proteins are functionally linked when they are components of the same molecular complex or metabolic pathway
- Genomic inference methods
  - Phylogenetic profile method
  - Gene neighbor method
  - Rosetta Stone method
  - Gene cluster method
- These methods infer functional linkage between proteins by identifying pairs of nonhomologous proteins that co-evolve

5 Phylogenetic profile method(1/3)
- Uses the co-occurrence or co-absence of pairs of nonhomologous genes across genomes to infer functional relatedness
- A BLAST search determines whether a homolog of the query protein is present in each secondary genome
- N genomes yield an N-dimensional vector of ones and zeroes for the query protein: its phylogenetic profile
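As a minimal sketch of the profile construction described on this slide, the snippet below turns a set of genomes in which a homolog was found into a vector of ones and zeroes. The genome names, protein names, and the `hits` dictionary are hypothetical stand-ins for real BLAST results, which the slide does not specify.

```python
def phylogenetic_profile(genomes_with_homolog, all_genomes):
    """Return one 0/1 flag per genome, in the order given by all_genomes."""
    return tuple(1 if g in genomes_with_homolog else 0 for g in all_genomes)


# Hypothetical genome list (N = 4) and hypothetical homolog hits per protein.
genomes = ["genome_1", "genome_2", "genome_3", "genome_4"]
hits = {
    "protA": {"genome_1", "genome_3"},
    "protB": {"genome_1", "genome_3", "genome_4"},
}

# One N-dimensional profile per query protein.
profiles = {name: phylogenetic_profile(found, genomes) for name, found in hits.items()}
for name, profile in profiles.items():
    print(name, profile)
```

Proteins whose profiles overlap more than chance would allow are candidates for functional linkage.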

6 Phylogenetic profile method(2/3)

7 Phylogenetic profile method(3/3)
- Using this approach, we can compute the phylogenetic profile of each protein encoded in a genome of interest
- We then need the probability that two proteins have co-evolved
- First compute the probability that the two profiles overlap by chance, which follows the hypergeometric distribution:

  $$P(k' \mid n, m, N) = \frac{\binom{n}{k'}\binom{N-n}{m-k'}}{\binom{N}{m}}$$

  - N is the total number of genomes analyzed
  - n is the number of homologs for protein A
  - m is the number of homologs for protein B
  - k' is the number of genomes that contain homologs of both A and B
- Because P represents the probability that the proteins do not co-evolve, 1 - P(k > k') is the probability that they co-evolve
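The hypergeometric score on this slide can be sketched in a few lines of Python. The function names are illustrative and not taken from Prolinks; the code simply evaluates P(k' | n, m, N) and the quantity 1 - P(k > k') as defined above.

```python
from math import comb


def hypergeom_pmf(k, n, m, N):
    """P(k | n, m, N): chance that exactly k genomes contain homologs of both proteins."""
    if k < max(0, n + m - N) or k > min(n, m):
        return 0.0  # outside the support of the distribution
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)


def prob_more_than(k_obs, n, m, N):
    """P(k > k'): chance of seeing more shared genomes than were observed."""
    return sum(hypergeom_pmf(k, n, m, N) for k in range(k_obs + 1, min(n, m) + 1))


def coevolution_score(k_obs, n, m, N):
    """1 - P(k > k'), the quantity the slide reads as the probability of co-evolution."""
    return 1.0 - prob_more_than(k_obs, n, m, N)


# Toy numbers: 4 genomes, each protein has homologs in 2 of them, 1 genome is shared.
print(coevolution_score(1, 2, 2, 4))
```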

8 Gene cluster method(1/2)
- In bacteria, proteins of closely related function are often transcribed from a single functional unit known as an operon
- Operons contain two or more closely spaced genes located on the same DNA strand
- Our approach to identifying operons assumes that gene start positions can be modeled by a Poisson distribution
- Unlike the other co-evolution methods, it can identify potential functions for proteins that have no homologs in other genomes

9 Gene cluster method(2/2)
- $P(\text{start}) = m e^{-m}$
- $P(\text{N\_positions\_without\_starts}) = m e^{-Nm}$
- Here m is the total number of genes divided by the number of intergenic nucleotides
- The probability that the separation between two genes is less than x positions:

  $$P(\text{separation} < x) = \int_0^x m e^{-mN}\,dN = 1 - e^{-mx}$$

- The probability that two adjacent genes coded on the same strand are part of an operon is 1 - P
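A short sketch of the operon score, assuming the closed form 1 - e^{-mx} for the probability that two gene starts are separated by fewer than x intergenic nucleotides. The function names and the gene and nucleotide counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import exp


def start_density(num_genes, num_intergenic_nt):
    """m: total number of genes divided by the number of intergenic nucleotides."""
    return num_genes / num_intergenic_nt


def prob_separation_below(x, m):
    """P(separation < x) = 1 - exp(-m * x)."""
    return 1.0 - exp(-m * x)


def operon_probability(x, m):
    """1 - P: probability that two adjacent same-strand genes x nucleotides apart share an operon."""
    return 1.0 - prob_separation_below(x, m)


# Hypothetical genome: 4,000 genes and 400,000 intergenic nucleotides, so m = 0.01.
m = start_density(4000, 400000)
for gap in (10, 100, 500):
    print(gap, round(operon_probability(gap, m), 4))
```

Under this model, closely spaced genes score near 1 and the score decays exponentially as the intergenic gap grows.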

10 Gene neighbor method(1/2)
- Some of the operons in a particular organism may be conserved across other organisms
- Such conservation provides additional evidence that the genes within the operon are functionally coupled
- They may be components of the same molecular complex or metabolic pathway

11 Gene neighbor method(2/2)
- Our approach first computes the probability that two genes are separated by fewer than d genes:

  $$P(\le d) = \frac{2d}{N-1}$$

  where N is the total number of genes in the genome
- The per-genome probabilities are then combined across organisms:

  $$P_m(\le X) = 1 - P_m(> X) \approx X \sum_{k=0}^{m-1} \frac{(-\ln X)^k}{k!}$$

  where $X = \prod_{i=1}^{m} P_i(\le d_i)$ and m is the number of organisms that contain homologs of the two genes
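The two formulas on this slide can be evaluated as follows. This is a sketch under the definitions given above; the function names are illustrative, and capping the single-genome probability at 1 is an added assumption for large d.

```python
from math import factorial, log, prod


def prob_within(d, N):
    """P(<= d) = 2d / (N - 1) for a genome with N genes (capped at 1 as an assumption)."""
    return min(1.0, 2 * d / (N - 1))


def combined_prob(per_genome_probs):
    """P_m(<= X) ~ X * sum_{k=0}^{m-1} (-ln X)^k / k!, with X the product of the P_i."""
    m = len(per_genome_probs)
    X = prod(per_genome_probs)
    if X == 0:
        return 0.0  # avoid log(0); a zero product means a zero combined probability
    return X * sum((-log(X)) ** k / factorial(k) for k in range(m))


# Toy case: the two genes lie within 5 genes of each other in two genomes of 101 genes each.
p = prob_within(5, 101)
print(p, combined_prob([p, p]))
```

A small combined value means the observed proximity across genomes is unlikely to arise by chance.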

12 Rosetta Stone method(1/2)
- Occasionally, two proteins expressed separately in one organism are found as a single chain in the same or a second genome
- Such gene fusion or division events are a clue to the functional relatedness of the two proteins
- The proteins may carry out consecutive metabolic steps or be components of a molecular complex
- To detect gene-fusion events, we first align all protein-coding sequences from a genome against the database using BLAST

13 Rosetta Stone method(2/2)
- We identify cases where two nonhomologous proteins both align over at least 70% of their sequence to different portions of a third protein
- To screen out confounding fusions, we compute the probability that the two proteins are linked by chance, using the same hypergeometric form as for phylogenetic profiles:

  $$P(k' \mid n, m, N) = \frac{\binom{n}{k'}\binom{N-n}{m-k'}}{\binom{N}{m}}$$

  where k' is the number of Rosetta Stone sequences
- The probability that the two proteins have fused is therefore 1 - P(k > k')
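The 70% coverage rule can be illustrated with a small filter over alignment records. This is only a sketch: the `Hit` record, its field names, and the reading of "different portions" as non-overlapping target intervals are assumptions made for illustration, not the Prolinks implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one alignment of a query protein to a target protein."""
    query: str
    query_len: int    # length of the query protein in residues
    aligned_len: int  # number of query residues covered by the alignment
    target: str
    t_start: int      # first aligned position on the target
    t_end: int        # last aligned position on the target


def covers(hit, min_cov=0.7):
    """True if the alignment spans at least min_cov of the query sequence."""
    return hit.aligned_len / hit.query_len >= min_cov


def is_rosetta_candidate(a, b, min_cov=0.7):
    """Two different queries hitting disjoint regions of the same target, each with enough coverage."""
    same_target = a.target == b.target and a.query != b.query
    disjoint = a.t_end < b.t_start or b.t_end < a.t_start
    return same_target and disjoint and covers(a, min_cov) and covers(b, min_cov)


a = Hit("protA", 100, 80, "fusion", 1, 80)
b = Hit("protB", 120, 100, "fusion", 90, 190)
print(is_rosetta_candidate(a, b))
```

Candidate pairs found this way would still need the chance-fusion screen described on the slide.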

14 TextLinks(1/2)
- Unlike the methods above, TextLinks is not a gene context analysis method
- It uses the co-occurrence of gene names and symbols within the scientific literature
- For this analysis, we used the PubMed database, which contains 14 million abstracts and citations
- As with the phylogenetic profile method, abstracts and individual gene names are used to build a binary vector
- The result is an N-dimensional vector of ones and zeroes, where N is the total number of abstracts
  - One: the protein name is found within a given abstract or citation
  - Zero: the protein name is not found within a given abstract or citation

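15 TextLinks(2/2)
- To guard against co-occurrence by chance, apply the same statistic as in the phylogenetic profile method:

  $$P(k' \mid n, m, N) = \frac{\binom{n}{k'}\binom{N-n}{m-k'}}{\binom{N}{m}}$$

- The score of a link is 1 - P(k > k')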
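A toy sketch of the TextLinks scoring, assuming plain case-insensitive substring matching for gene names (real name recognition is harder and is not described on the slides). The four abstracts below are invented strings used only to exercise the code, and the function names are illustrative.

```python
from math import comb


def mention_vector(name, abstracts):
    """One 0/1 flag per abstract: 1 if the name occurs in it (naive substring match)."""
    return [1 if name.lower() in text.lower() else 0 for text in abstracts]


def cooccurrence_counts(vec_a, vec_b):
    """Return (k', n, m, N) for two binary mention vectors of equal length."""
    n, m = sum(vec_a), sum(vec_b)
    k = sum(a & b for a, b in zip(vec_a, vec_b))
    return k, n, m, len(vec_a)


def textlink_score(k_obs, n, m, N):
    """1 - P(k > k') under the hypergeometric model used for phylogenetic profiles."""
    tail = sum(comb(n, k) * comb(N - n, m - k) / comb(N, m)
               for k in range(k_obs + 1, min(n, m) + 1))
    return 1.0 - tail


# Invented toy abstracts (N = 4).
abstracts = [
    "recA and lexA regulate the SOS response",
    "lexA repressor binding",
    "recA filament structure",
    "unrelated text on ribosomes",
]
counts = cooccurrence_counts(mention_vector("recA", abstracts),
                             mention_vector("lexA", abstracts))
print(counts, textlink_score(*counts))
```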

16 Comparative benchmarking database(1/3)
- Database size
  - Prolinks (2004): 83 genomes, 18,077,293 links between proteins
  - STRING (2005): 730,000 proteins
- Genomic inference methods
  - Prolinks
    - Phylogenetic profile, gene neighbor, Rosetta Stone, and gene cluster methods
    - TextLinks
  - STRING
    - Phylogenetic profile, gene neighbor, and Rosetta Stone methods
    - TextLinks, experiments, databases, text mining

17 Comparative benchmarking database(2/3)
- Confidence metric
  - Prolinks: COG (Clusters of Orthologous Groups) pathways
  - STRING: KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways

18 Comparative benchmarking database(3/3)
- We downloaded all the functional links for E. coli from each database (comparison performed by the Prolinks authors, 2004)
- Number of links
  - Prolinks: 515,892 links
  - STRING: 407,520 links
- Confidence
  - Prolinks: 20% of the links were between proteins assigned to a COG pathway
  - STRING: 17% of the annotated links were between proteins in the same pathway

19 Proteome Navigator

36 Conclusion
- Over the past few years, significant progress has been made in the study of protein interactions
- Despite abundant data, coverage of organisms is still limited
- The majority of protein interactions have been measured within a single organism
- Computational methods may help biologists extend this coverage