Predicting interactions between genes based on genome Sequence comparisons The “genomic context” component of STRING Bioinformatics seminar series 5-10-2004.

1 Predicting interactions between genes based on genome sequence comparisons. The “genomic context” component of STRING. Bioinformatics seminar series, 5-10-2004. Berend Snel

2 To do
Seminar (today); please ask questions
Article: “A gene co-expression network for global discovery of conserved genetic modules”
– Make schedule for article discussion (today)
– Read article (next couple of days)
– 5 minute discussion per person of the article (preferably Monday 11 October)

3 http://string.embl.de

4 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

5 Complete genomes, now what?
Post-genomic era = we have the parts list (complete genomes)
To understand the cell we need to know the functions of the genes

6 For most genes in any genome we need function prediction
E. coli, the most intensively studied organism: only 1924 genes (~43%) have been (partially) experimentally characterized.

7 Predicting protein function
What is function? Various levels of description:
Sequence similarity/homology has the largest relevance for “Molecular Function”. This aspect of protein function is best conserved.
Molecular function can often be predicted from similarities between protein sequences (BLAST), or structures.

8 BLAST

9 “Beyond” homology and molecular function
Homology-based function prediction works very well, but …
… a large fraction of genes are poorly described (no homologs, uncharacterized homologs; this holds for ~60% of the human genes)
… there are other aspects of function: functional associations, e.g. the target of a protein kinase or a transcriptional regulator
Thus: predicting these associations

10 Genome sequences: allowing us to interpret the function of proteins within the context in which they occur
Reverse this process: predict the function of a protein from the context in which it tends to occur → prediction of protein function/pathways from genome sequences
Use the genome sequences (through comparative genome analysis) for interaction prediction: genomic context methods
Genomic context methods have been shown to be reliable indicators for functional associations

11 There are many types of functional associations (AKA functional interactions, interactions, functional links, functional relations) in molecular biology
[Figure panels: transcription regulation, signalling pathways, protein complexes, metabolic pathways, cellular process]

12 Types of functional associations
Metabolic pathways: filling gaps

13 Types of functional associations
[Figure panels: transcription regulation, signalling pathways]

14 Types of functional associations
[Figure panels: cellular process, protein complexes]

15 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

16 Genomic context is a tool to predict functional associations between genes
Use the genome sequences (through comparative genome analysis) for interaction prediction: genomic context methods
Genomic context methods have been shown to be reliable indicators for functional interaction
Genomic context is also known as in silico interaction prediction, or genomic associations

17 Genomic context methods detect evolutionary traces in genomes of functionally associated proteins
[Figure labels: trpA, trpB]

18

19 Three different genomic context methods in STRING
Gene fusion, Rosetta stone method
Conserved gene order between divergent genomes
Co-occurrence of genes across genomes, phylogenetic profiles

20 All genomic context methods use orthologs: corresponding genes between genomes
Orthologs, not just homologs; related by speciation
Orthologs are very likely to have the same function
orthologs : genomes = alignment : sequence
[Figure labels: gene duplication, speciation]

21 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

22 Gene fusion
i.e. the orthologs of two genes in another organism are fused into one polypeptide
A very reliable indicator for functional interaction, partly because it is a relatively infrequent evolutionary event: 3470 distinct fusions when surveying 179 genomes
[Figure label: Fusion]
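
The fusion idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how STRING itself detects fusions: the `Hit` record, the `fusion_links` function, the `max_overlap` threshold and all gene names are hypothetical, and alignments are used as a stand-in for the orthology assignment the slide calls for.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment of a query protein onto a region of a target protein (hypothetical record)."""
    query: str   # protein in the genome of interest
    target: str  # protein in another genome
    start: int   # first aligned residue on the target (1-based, inclusive)
    end: int     # last aligned residue on the target (inclusive)


def overlap(a: Hit, b: Hit) -> int:
    """Number of target residues covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def fusion_links(hits: list[Hit], max_overlap: int = 20) -> set[tuple[str, str, str]]:
    """Return (query_a, query_b, target) triples where two different queries
    align to largely separate regions of the same target protein."""
    by_target: dict[str, list[Hit]] = {}
    for h in hits:
        by_target.setdefault(h.target, []).append(h)

    links = set()
    for target, group in by_target.items():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                # Assumed threshold: a small overlap is tolerated, a large one
                # suggests the two queries are homologs of each other instead.
                if a.query != b.query and overlap(a, b) <= max_overlap:
                    qa, qb = sorted((a.query, b.query))
                    links.add((qa, qb, target))
    return links


# Made-up example data.
hits = [
    Hit("geneA", "fusedAB", 1, 250),
    Hit("geneB", "fusedAB", 260, 640),
    Hit("geneC", "other", 1, 300),
    Hit("geneD", "other", 10, 290),  # covers the same region as geneC
]
print(fusion_links(hits))
```

Only the geneA/geneB pair is reported, because geneC and geneD cover the same stretch of their target rather than adjacent parts of it.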

23 Gene fusion: an example

24 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Fusion
– Gene order
– Presence / absence of genes across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

25 Gene order evolves rapidly But …

26 Differential retention of divergent / convergent gene pairs suggests that conservation implies a functional association

27 Comparison to pathways: conservation implies a functional association

28 Conserved gene order
i.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene cluster
Contributes by far the most predictions
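
A rough sketch of the idea in Python, assuming each genome has already been reduced to an ordered list of orthologous-group identifiers. The function names, the `window` and `min_genomes` parameters and the example genomes are all hypothetical. Note the simplification: the sketch merely counts genomes in which two groups are neighbours, whereas the slide requires conservation over ‘sufficiently large’ evolutionary distances, which would need a measure of how divergent the genomes are.

```python
from collections import Counter


def neighbour_pairs(genome: list[str], window: int = 1) -> set[tuple[str, str]]:
    """Unordered pairs of orthologous groups lying within `window` positions of each other."""
    pairs = set()
    for i, g in enumerate(genome):
        for h in genome[i + 1 : i + 1 + window]:
            if g != h:
                pairs.add(tuple(sorted((g, h))))
    return pairs


def conserved_pairs(genomes: dict[str, list[str]], min_genomes: int = 2) -> dict[tuple[str, str], int]:
    """Count, for every neighbouring pair, the number of genomes in which it is adjacent,
    and keep the pairs seen in at least `min_genomes` genomes."""
    counts: Counter = Counter()
    for order in genomes.values():
        counts.update(neighbour_pairs(order))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Made-up gene orders; "og" stands for orthologous group.
genomes = {
    "genome1": ["og1", "og2", "og3", "og4"],
    "genome2": ["og7", "og2", "og3", "og9"],
    "genome3": ["og3", "og2", "og5", "og1"],
}
print(conserved_pairs(genomes))  # {('og2', 'og3'): 3}
```

Sorting each pair makes the count independent of gene orientation, which is a deliberate simplification; strand and transcription direction are ignored here.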

29 Conserved gene order
NB1: predicting operons is not trivial; in fact conserved gene order or functional association is a major clue
NB2: using ‘only’ operons without requiring conservation results in much less reliable function prediction

30 Conserved gene order: an example from metabolism of propionyl-CoA
[Figure labels: “query”, “target”]

31 Biochemical assays confirm the function of members of COG0346 as a DL-methylmalonyl-CoA racemase

32 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

33 Presence / absence of genes
Gene content → co-evolution. (The easy case, few genomes.)
Genomes share genes for phenotypes they have in common
Differences between gene content reflect differences in phenotypic potentialities

34 Presence / absence of genes L. innocua (non-pathogen) L. monocytogenes (pathogen)

35 Presence / absence of genes L. innocua (non-pathogenic) L. monocytogenes (pathogenic) Genes involved in pathogenicity

36 Generalization: phylogenetic profiles / co-occurrence
          species 1  species 2  species 3  species 4  species 5  ...
Gene 1:       1          0          1          1          0       1
Gene 2:       1          1          0          0          1       0
Gene 3:       0          1          0          0          1       0
....

37 … but phylogenetic signal in gene content!
[Figure labels: Escherichia coli, Haemophilus influenzae]
       sp1   sp2   sp3   sp4   …
sp1     1    0.2   0.4   0.2   …
sp2           1    0.9   0.1   …
sp3                 1    0.3   …
sp4                       1    …
…

38 Co-occurrence of genes across genomes
i.e. two genes have the same presence/absence pattern over multiple genomes: they have ‘co-evolved’
AKA phylogenetic profiles
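
As a minimal sketch, the three example profiles from slide 36 can be compared in Python by counting the genomes in which two genes differ (Hamming distance). The distance measure, the `max_distance` cut-off and the function names are assumptions made for illustration; the slides do not say which similarity measure STRING uses, and a plain Hamming distance ignores the phylogenetic signal pointed out on slide 37.

```python
from itertools import combinations

# Presence (1) / absence (0) of each gene across six genomes, as on slide 36.
profiles = {
    "Gene 1": (1, 0, 1, 1, 0, 1),
    "Gene 2": (1, 1, 0, 0, 1, 0),
    "Gene 3": (0, 1, 0, 0, 1, 0),
}


def hamming(a, b):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


def co_occurring(profiles, max_distance=1):
    """Gene pairs whose profiles differ in at most `max_distance` genomes."""
    result = []
    for g, h in combinations(sorted(profiles), 2):
        d = hamming(profiles[g], profiles[h])
        if d <= max_distance:
            result.append((g, h, d))
    return result


print(co_occurring(profiles))  # [('Gene 2', 'Gene 3', 1)]
```

Gene 2 and Gene 3 differ in a single genome, so they are reported as co-occurring; Gene 1 differs from Gene 3 in all six genomes, which is the complementary pattern discussed on slide 43.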

39 Predicting function of a disease gene protein with unknown function, frataxin, using co-occurrence of genes across genomes
Friedreich’s ataxia
No (homolog with) known function

40 Frataxin has co-evolved with hscA and hscB, indicating that it plays a role in iron-sulfur cluster assembly
[Figure: presence/absence of genes across species. Species: A. aeolicus, Synechocystis, B. subtilis, M. genitalium, M. tuberculosis, D. radiodurans, R. prowazekii, C. crescentus, M. loti, N. meningitidis, X. fastidiosa, P. aeruginosa, Buchnera, V. cholerae, H. influenzae, P. multocida, E. coli, A. pernix, M. jannaschii, A. thaliana, S. cerevisiae, C. jejuni, C. albicans, S. pombe, H. sapiens, C. elegans, H. pylori, D. melan. Genes: cyaY, Yfh1, hscB, Jac1, hscA, ssq1, Nfu1, iscA, Isa1-2, fdx, Yah1, Arh1, RnaM, IscR, Hyp, iscS, Nfs1, iscU, Isu1-2, Atm1]

41 Iron-Sulfur (2Fe-2S) cluster in the Rieske protein

42 Prediction: Confirmation:

43 The opposite of co-occurrence: anti-correlation / complementary patterns: predicting analogous enzymes
Genes with complementary phylogenetic profiles tend to have a similar biochemical function.
[Figure labels: A, B]
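
A complementary pattern can be checked with a very small Python function. The profiles for genes A and B below are made up for illustration, and the `max_mismatch` tolerance is an assumption, not something stated on the slide.

```python
def is_complementary(a, b, max_mismatch=0):
    """True if gene b is present (almost) exactly in the genomes where gene a is absent.

    A mismatch is a genome in which both genes are present or both are absent.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    mismatches = sum(x == y for x, y in zip(a, b))
    return mismatches <= max_mismatch


# Hypothetical presence/absence profiles over six genomes.
gene_a = (1, 0, 1, 1, 0, 1)
gene_b = (0, 1, 0, 0, 1, 0)

print(is_complementary(gene_a, gene_b))  # True
print(is_complementary(gene_a, gene_a))  # False
```

In practice one would probably also require that each gene is present in several genomes, since a gene found everywhere or nowhere is trivially complementary to its inverse.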

44 Complementary patterns in thiamin biosynthesis predict analogous enzymes

45 Prediction of analogous enzymes is confirmed

46 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

47 Benchmark and integration: KEGG maps

48 Integrating genomic context scores into one single score
[Figure: axes Score (0 to 1) and Fraction same KEGG map (0 to 1); series: Fusion, Gene Order, Co-occurrence]
Compare each individual method against an independent benchmark (KEGG), and find “equivalency”
Multiply the chances that two proteins are not interacting and subtract from 1; naive Bayesian, i.e. assuming independence
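
The combination rule on this slide is short enough to write out. The sketch below assumes each per-method score has already been calibrated against the benchmark so that it can be read as a probability between 0 and 1; the function name and the example scores are made up.

```python
from math import prod


def combine_scores(scores):
    """Combine per-method probabilities under the independence assumption:
    1 minus the product of the probabilities that the proteins do NOT interact."""
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score {s} is not a probability")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical calibrated scores for fusion, gene order and co-occurrence.
combined = combine_scores([0.5, 0.6, 0.2])
print(f"{combined:.2f}")  # 0.84
```

Here the chance of no interaction is 0.5 × 0.4 × 0.8 = 0.16, so the integrated score is 0.84. The combined score is never lower than the best single score, which is reasonable only if the methods really are independent.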

49 Benchmark
[Figure: accuracy (fraction of confirmed predictions, i.e. same KEGG map; 0.5 to 1.0) versus coverage (number of predicted links between orthologous groups; 10 to 100000); series: Fusion (norm.), Fusion (abs.), Gene Order (norm.), Gene Order (abs.), Cooccurrence, Integrated]

50 Performance of genomic context compared to high-throughput interaction data
[Figure: accuracy (fraction of data confirmed by reference set) versus coverage (fraction of reference set covered by data); data sets: purified complexes (TAP, HMS-PCI), yeast two-hybrid, mRNA co-expression, genomic context, synthetic lethality, combined evidence (two methods, three methods); annotations: raw data, filtered data, parameter choices]
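
The two axis definitions on this slide translate directly into code. The sketch below is illustrative only: the function name and the example pairs are made up, and it uses the slide 50 definition of coverage (a fraction of the reference set) rather than the slide 49 one (a number of predicted links).

```python
def accuracy_and_coverage(predicted, reference):
    """Return (accuracy, coverage) for a set of predicted pairs against a reference set.

    accuracy: fraction of predicted pairs that are confirmed by the reference set
    coverage: fraction of the reference set that is covered by the predictions
    Pairs are treated as unordered, so ("a", "b") equals ("b", "a").
    """
    predicted = {frozenset(p) for p in predicted}
    reference = {frozenset(p) for p in reference}
    confirmed = predicted & reference
    accuracy = len(confirmed) / len(predicted) if predicted else 0.0
    coverage = len(confirmed) / len(reference) if reference else 0.0
    return accuracy, coverage


# Made-up example: four predictions, five reference pairs, two in common.
predicted = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
reference = [("b", "a"), ("c", "d"), ("e", "f"), ("f", "g"), ("g", "h")]
print(accuracy_and_coverage(predicted, reference))  # (0.5, 0.4)
```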

51 Genomic context: biochemistry by other means
Despite the high performance of genomic context methods, as a tool for function prediction it is not a button-press method.
It is more like biochemistry by other means. Often quite a lot of manual input and expert knowledge from the researcher is needed to distill associations into a concrete function prediction.
Small-scale bioinformatics?

52 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Fusion
– Gene order
– Co-occurrence across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

53 STRING allows a network view, e.g. see not only which genes the query gene is associated with, but also what the relations are among these other genes

54 STRING Network output (depth=1)
Assigning to a network around
[Figure labels: archaeal flagellins; archaeal flagellin biosynth. ATPase; uncharacterized archaeal proteins]

55 STRING Network (depth=2)
Connecting associated cellular processes
[Figure labels: archaeal flagellins; chemotaxis-related; type IV secretion pathway; archaeal flagella components]

56 STRING Network(depth=3) Zooming out to other cellular processes
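
One plausible reading of the depth setting is a breadth-first expansion from the query gene, where depth=1 gives its direct partners and each further step adds the partners of the nodes found so far. The Python sketch below follows that reading; whether STRING computes its network views exactly this way is an assumption, and the graph, node names and function name are made up.

```python
from collections import deque


def neighbourhood(graph, query, depth):
    """Return {node: distance} for all nodes within `depth` steps of `query`."""
    dist = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # do not expand beyond the requested depth
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


# Hypothetical undirected association network, stored as adjacency sets.
graph = {
    "q": {"a", "b"},
    "a": {"q", "c"},
    "b": {"q"},
    "c": {"a", "d"},
    "d": {"c"},
}
print(sorted(neighbourhood(graph, "q", 1)))  # ['a', 'b', 'q']
print(sorted(neighbourhood(graph, "q", 2)))  # ['a', 'b', 'c', 'q']
print(sorted(neighbourhood(graph, "q", 3)))  # ['a', 'b', 'c', 'd', 'q']
```

The edges among the returned nodes can then be drawn to show how the query's partners relate to each other, which is the point made on slide 53.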

57 Using the local network to detect multi-functional proteins

58 Contents
Predicting functional interactions between proteins
Genomic context methods
– General
– Fusion
– Gene order
– Co-occurrence across genomes
Integration and benchmarking of predictions
Interaction networks
In addition to genomic context: functional genomics data

59 STRING currently in addition includes:
Functional association data from large scale / high-throughput biochemical experiments (functional genomics data)
– protein complex purification
– yeast-2-hybrid
– ChIP-on-chip
– micro-array gene expression
“Known” functional relations, so called “legacy data”, as present in PubMed abstracts and databases like MIPS or KEGG.

60

