Published by Charles Russell. Modified over 9 years ago.
1
The STRING database. Michael Kuhn, EMBL Heidelberg
2
protein interactions
9
example: tryptophan synthase beta chain, E. coli K12
13
many sources
14
genomic context
15
curated knowledge
16
experimental evidence
17
literature
18
Jensen et al., Drug Discovery Today: Targets, 2004
19
373 genomes (only completely sequenced genomes)
20
1.5 million genes (not proteins)
21
Genome Reviews
22
RefSeq
23
Ensembl
24
model organism databases
25
data integration
26
genomic context methods
27
gene fusion
28
gene neighborhood
29
phylogenetic profiles
33
Cell Cellulosomes Cellulose
34
automatic inference of interactions
35
correct interactions
36
wrong associations
37
gene fusion score: sequence similarity
38
gene neighborhood score: sum of intergenic distances
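A minimal sketch of the raw neighborhood score named on this slide, assuming it is the sum, over genomes, of the gap in base pairs between the two genes. The slide gives only the phrase "sum of intergenic distances", so the coordinate format, the handling of overlapping genes, and the function names are illustrative assumptions, not STRING's actual implementation.

```python
def intergenic_distance(gene_a, gene_b):
    """Gap in base pairs between two genes given as (start, end); 0 if they overlap."""
    (_, end_first), (start_second, _) = sorted([gene_a, gene_b])
    return max(0, start_second - end_first)


def neighborhood_raw_score(positions_by_genome):
    """Sum the intergenic distance of one gene pair over all genomes listing both genes."""
    return sum(intergenic_distance(a, b) for a, b in positions_by_genome.values())


# Hypothetical coordinates of the same gene pair in three genomes.
pair_positions = {
    "genome_1": ((100, 900), (950, 1800)),     # gap of 50 bp
    "genome_2": ((5000, 5600), (4200, 4980)),  # listed in reverse order, gap of 20 bp
    "genome_3": ((300, 1200), (1150, 2000)),   # overlapping genes, counted as 0
}
raw_score = neighborhood_raw_score(pair_positions)
print(raw_score)
```

Like the other raw scores in the talk, this number has its own unit (base pairs) and cannot be compared directly with a sequence similarity or a Euclidean distance.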
39
phylogenetic profiles
40
SVD singular value decomposition (removes redundancy)
41
score: Euclidean distance
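A minimal sketch of the Euclidean distance score for phylogenetic profiles. It assumes each gene is described by a presence/absence vector over the same ordered list of genomes, and it skips the SVD step from the previous slide for brevity. The gene names and profiles are made up for illustration.

```python
import math


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two phylogenetic profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical profiles: 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],  # same pattern as geneA
    "geneC": [0, 0, 1, 0, 1],  # complementary pattern
}
d_ab = profile_distance(profiles["geneA"], profiles["geneB"])
d_ac = profile_distance(profiles["geneA"], profiles["geneC"])
print(d_ab, d_ac)
```

A small distance means the two genes tend to be present and absent in the same genomes, which is the signal this method looks for.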
42
all scores are “raw scores”
43
not comparable: sequence similarity, sum of intergenic distances, Euclidean distance
44
benchmarking: calibrate against “gold standard” (KEGG)
46
raw scores
47
probabilistic scores, e.g. “70% chance for an association”
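One way to picture the step from raw to probabilistic scores is to bin the raw scores and, for each bin, count how many pairs the gold standard confirms. The sketch below is only that: the binning, the toy pairs, and the assumption that a higher raw score is better are illustrative choices, and the slides do not say how STRING fits its calibration.

```python
def calibrate(scored_pairs, gold_standard, bin_edges):
    """Per raw-score bin, return the fraction of pairs confirmed by the gold standard."""
    counts = [[0, 0] for _ in range(len(bin_edges) - 1)]  # [confirmed, total] per bin
    for pair, raw in scored_pairs:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= raw < bin_edges[i + 1]:
                counts[i][1] += 1
                if frozenset(pair) in gold_standard:
                    counts[i][0] += 1
                break
    # Bins without any pairs get None instead of a made-up probability.
    return [confirmed / total if total else None for confirmed, total in counts]


# Hypothetical gold standard: pairs annotated to the same pathway.
gold = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]}
# Hypothetical raw scores for predicted pairs.
scored = [
    (("a", "b"), 0.90), (("c", "d"), 0.80), (("g", "h"), 0.95), (("a", "c"), 0.85),
    (("e", "f"), 0.30), (("a", "e"), 0.20), (("b", "f"), 0.10), (("c", "f"), 0.40),
]
probabilities = calibrate(scored, gold, bin_edges=[0.0, 0.5, 1.0])
print(probabilities)
```

Each entry can then be read as "pairs with a raw score in this range have roughly this chance of being a true association", which is what makes scores from different methods comparable.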
48
curated knowledge
49
KEGG Kyoto Encyclopedia of Genes and Genomes
50
Reactome
51
MIPS Munich Information Center for Protein Sequences
52
STKE Signal Transduction Knowledge Environment
53
GO Gene Ontology
54
primary experimental data
55
many sources
56
many parsers
57
physical protein interactions
58
BIND Biomolecular Interaction Network Database
59
GRID General Repository for Interaction Datasets
60
MINT Molecular Interactions Database
61
DIP Database of Interacting Proteins
62
HPRD Human Protein Reference Database
63
large sets are scored separately
64
co-expression microarray data
65
GEO Gene Expression Omnibus
66
correlation coefficient
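A minimal sketch of a co-expression score, assuming the correlation coefficient meant here is Pearson's (the slide does not say which one is used). The expression values are invented and stand for measurements of each gene across the same four microarray experiments.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels across four experiments.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]  # rises together with gene_1
gene_3 = [4.0, 3.0, 2.0, 1.0]  # falls as gene_1 rises
r_12 = pearson(gene_1, gene_2)
r_13 = pearson(gene_1, gene_3)
print(r_12, r_13)
```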
67
literature mining
68
different gene identifiers
69
synonyms list
70
Medline
71
SGD Saccharomyces Genome Database
72
The Interactive Fly
73
OMIM Online Mendelian Inheritance in Man
74
simple scheme
75
co-mentioning
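The simple co-mentioning scheme can be sketched as counting how often two genes appear in the same abstract after mapping names through a synonyms list. The tokenizer, the tiny synonym table, and the second and third abstracts below are illustrative assumptions; a real pipeline would run over Medline with a far larger dictionary.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative synonyms list: every known name maps to one canonical identifier.
synonyms = {"CYC1": "CYC1", "CYC7": "CYC7", "HAP1": "HAP1", "CYP1": "HAP1"}


def genes_in(text, synonyms):
    """Return the set of canonical gene identifiers mentioned in a text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {synonyms[token] for token in tokens if token in synonyms}


def co_mention_counts(abstracts, synonyms):
    """Count, for each gene pair, the number of abstracts mentioning both genes."""
    counts = Counter()
    for text in abstracts:
        for pair in combinations(sorted(genes_in(text, synonyms)), 2):
            counts[pair] += 1
    return counts


abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1",
    "CYP1 regulates CYC1 under aerobic conditions",  # made-up sentence using an alias
    "CYC7 is a minor isoform",                       # made-up sentence, single gene
]
counts = co_mention_counts(abstracts, synonyms)
print(counts)
```

Counts like these are again raw scores, so they would need the same calibration against a gold standard as the other evidence types.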
76
more advanced
77
NLP Natural Language Processing
78
Gene and protein names; cue words for entity recognition; verbs for relation extraction. Tagged examples: [nxgene The GAL4 gene]; [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
79
Gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1
80
calibrate against gold standard
82
combine all evidence
83
Bayesian scoring scheme
84
e.g.: two scores of 0.7; combined probability: ?
85
e.g.: two scores of 0.7; combined probability: $1 - (1 - 0.7)^2 = 0.91$
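The combination on this slide generalizes to any number of evidence scores as one minus the product of the individual "not associated" probabilities. The sketch below assumes the evidence types are independent, which is what the slide's arithmetic implies; the function name is made up for illustration.

```python
def combine_scores(scores):
    """Combine independent probabilistic scores: 1 - product of (1 - s_i)."""
    remaining = 1.0  # probability that no evidence type reflects a true association
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
        remaining *= 1.0 - score
    return 1.0 - remaining


combined = combine_scores([0.7, 0.7])
print(round(combined, 2))  # 0.91, matching the slide
```

Two moderately confident pieces of evidence therefore yield a combined score higher than either one alone, and adding further evidence can only raise it.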
86
evidence transfer
87
evidence spread over many species
88
transfer by orthology (or “fuzzy orthology”)
89
von Mering et al., Nucleic Acids Research, 2005
91
two modes
94
COG mode
95
von Mering et al., Nucleic Acids Research, 2005
96
higher coverage; lower specificity; includes all available evidence; some orthologous groups are too large to be meaningful
97
proteins mode
98
von Mering et al., Nucleic Acids Research, 2005
99
maximum specificity; lower coverage; information will be relevant for selected species
100
Demo
102
outlook
105
take home message: STRING integrates information and predicts interactions. You can always go to the sources. Proteins mode: specific species. COG mode: more coverage, especially for prokaryotic genes.
106
Acknowledgements. The STRING team: Lars Jensen, Peer Bork, Christian von Mering & group in Zurich, Berend Snel, Martijn Huynen
107
Thank you for your attention
109
Exercises: tinyurl.com/36twzq (or via course wiki). Alternative server: xi.embl.de
111
Bork et al., Current Opinion in Structural Biology, 2004