The STRING database
Michael Kuhn, EMBL Heidelberg
protein interactions
example: tryptophan synthase beta chain, E. coli K12
many sources
genomic context
curated knowledge
experimental evidence
literature
Jensen et al., Drug Discovery Today: Targets, 2004
373 genomes (only completely sequenced genomes)
1.5 million genes (not proteins)
Genome Reviews
RefSeq
Ensembl
model organism databases
data integration
genomic context methods
gene fusion
gene neighborhood
phylogenetic profiles
[Figure labels: cell, cellulosomes, cellulose]
automatic inference of interactions
correct interactions
wrong associations
gene fusion score: sequence similarity
gene neighborhood score: sum of intergenic distances
phylogenetic profiles
SVD singular value decomposition (removes redundancy)
score: Euclidean distance
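A minimal sketch of the phylogenetic-profile score, assuming a binary presence/absence matrix (genes in rows, genomes in columns) and NumPy. The function name, the toy profiles, and the number of retained components are illustrative assumptions, not STRING's actual parameters.

```python
import numpy as np

def profile_distances(profiles, n_components):
    """Reduce a genes x genomes matrix with SVD and return pairwise
    Euclidean distances between genes in the reduced space."""
    profiles = np.asarray(profiles, dtype=float)
    # Thin SVD: profiles = u @ diag(s) @ vt
    u, s, _ = np.linalg.svd(profiles, full_matrices=False)
    # Keep only the leading components (drops redundant dimensions)
    reduced = u[:, :n_components] * s[:n_components]
    # Pairwise Euclidean distances between the reduced gene profiles
    diff = reduced[:, None, :] - reduced[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data (hypothetical): 1 = gene present in that genome
toy_profiles = [
    [1, 1, 0, 0, 1, 1],  # gene A
    [1, 1, 0, 0, 1, 1],  # gene B, same pattern as A
    [0, 0, 1, 1, 0, 0],  # gene C, complementary pattern
]
dist = profile_distances(toy_profiles, n_components=2)
```

A small distance (genes A and B here) means similar profiles, so for this raw score lower values suggest an association.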
all scores are “raw scores”
not comparable: sequence similarity, sum of intergenic distances, Euclidean distance
benchmarking calibrate against “gold standard” (KEGG)
raw scores
probabilistic scores e.g. “70% chance for an association”
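One way to picture the calibration step, as a sketch only: bin the raw scores and, for each bin, take the fraction of predicted pairs that the gold standard confirms. The binning approach, function names, and toy numbers are assumptions for illustration; the slides do not say how STRING fits the calibration.

```python
from bisect import bisect_right

def calibrate(raw_scores, is_true, bin_edges):
    """Per bin, return the fraction of predictions confirmed by the gold standard."""
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for score, ok in zip(raw_scores, is_true):
        b = bisect_right(bin_edges, score)  # index of the bin holding this score
        totals[b] += 1
        hits[b] += int(ok)
    # Empty bins get None rather than a made-up probability
    return [h / t if t else None for h, t in zip(hits, totals)]

def to_probability(raw_score, bin_edges, bin_probs):
    """Look up the calibrated probability for a new raw score."""
    return bin_probs[bisect_right(bin_edges, raw_score)]

# Toy benchmark (hypothetical): here a higher raw score means stronger evidence
raw = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 1.2, 1.5, 1.8]
in_gold_standard = [False, False, False, True, False, True, False, True, True, True]
edges = [0.5, 1.0]
bin_probs = calibrate(raw, in_gold_standard, edges)
```

After this step, scores from different evidence types live on the same probability scale and can be compared.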
curated knowledge
KEGG Kyoto Encyclopedia of Genes and Genomes
Reactome
MIPS Munich Information center for Protein Sequences
STKE Signal Transduction Knowledge Environment
GO Gene Ontology
primary experimental data
many sources
many parsers
physical protein interactions
BIND Biomolecular Interaction Network Database
GRID General Repository for Interaction Datasets
MINT Molecular Interactions Database
DIP Database of Interacting Proteins
HPRD Human Protein Reference Database
large sets are scored separately
co-expression microarray data
GEO Gene Expression Omnibus
correlation coefficient
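A sketch of a co-expression raw score, assuming the Pearson correlation coefficient over matched expression measurements. The slide says only "correlation coefficient", so the choice of Pearson and the toy expression values are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy expression levels across four arrays (hypothetical)
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # rises with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]  # falls as gene_a rises
```

Here `pearson(gene_a, gene_b)` is close to 1 and `pearson(gene_a, gene_c)` is close to -1.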
literature mining
different gene identifiers
synonyms list
Medline
SGD Saccharomyces Genome Database
The Interactive Fly
OMIM Online Mendelian Inheritance in Man
simple scheme
co-mentioning
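The co-mentioning scheme can be sketched as follows: map every name in a text to a canonical gene via the synonyms list, then count the texts in which two genes appear together. The tokenizer, function name, and toy synonym list are illustrative assumptions, and only single-word names are handled.

```python
import re
from collections import Counter
from itertools import combinations

def comention_counts(abstracts, synonyms):
    """Count, per gene pair, the abstracts that mention both genes.

    synonyms maps a canonical gene name to the names used for it.
    """
    lookup = {name.lower(): gene
              for gene, names in synonyms.items()
              for name in names}
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.lower()))
        genes = sorted({lookup[t] for t in tokens if t in lookup})
        for pair in combinations(genes, 2):
            counts[pair] += 1
    return counts

# Toy synonym list and abstracts (hypothetical)
toy_synonyms = {"HAP1": ["HAP1", "CYP1"], "CYC1": ["CYC1"], "CYC7": ["CYC7"]}
toy_abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1",
    "CYP1 binds upstream of CYC1",
    "CYC7 is a yeast gene",
]
counts = comention_counts(toy_abstracts, toy_synonyms)
```

The resulting counts are raw scores; like the other evidence types, they still need calibration against a gold standard.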
more advanced
NLP Natural Language Processing
Gene and protein names, cue words for entity recognition, verbs for relation extraction:
The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
calibrate against gold standard
combine all evidence
Bayesian scoring scheme
e.g.: two scores of 0.7 combined probability: ?
e.g.: two scores of 0.7 combined probability: $1 - (1 - 0.7)^2 = 0.91$
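The worked example (two scores of 0.7 giving 0.91) corresponds to treating the evidence types as independent: the combined probability is one minus the chance that every source is wrong. A minimal sketch under that independence assumption; the function name is illustrative, and the full STRING scheme may apply further corrections not shown on the slide.

```python
from math import prod

def combine(scores):
    """Combine independent probabilistic scores: 1 - product of (1 - s_i)."""
    return 1.0 - prod(1.0 - s for s in scores)

combined = combine([0.7, 0.7])  # approximately 0.91
```

Adding further evidence can only raise the combined score, and a single score is returned unchanged.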
evidence transfer
evidence spread over many species
transfer by orthology (or “fuzzy orthology”)
von Mering et al., Nucleic Acids Research, 2005
two modes
COG mode
higher coverage
lower specificity
includes all available evidence
some orthologous groups are too large to be meaningful
proteins mode
maximum specificity
lower coverage
information will be relevant for selected species
Demo
outlook
take home message
STRING integrates information and predicts interactions
You can always go to the sources
Proteins mode: specific species
COG mode: more coverage, especially for prokaryotic genes
Acknowledgements
The STRING team: Lars Jensen, Peer Bork, Christian von Mering & group in Zurich, Berend Snel, Martijn Huynen
Thank you for your attention
Exercises: tinyurl.com/36twzq (or via course wiki)
Alternative server: xi.embl.de
Bork et al., Current Opinion in Structural Biology, 2004