The STRING Database: What it does and how it interfaces to other resources
Christian von Mering, University of Zurich & SIB
bigDATA Workshop
STRING
- viewers for all types of evidence
- focus on usability and speed
- integrated scoring scheme
- information transfer between species
Evidence types: Genomic Neighborhood, Genes/Species Co-occurrence, Gene Fusions, Database Imports, Exp. Interaction Data, Co-expression, Literature Co-occurrence
Numbers:
- 630 organisms
- 2.6 million proteins
- 88 million interactions
- server footprint: 320 GB
Interaction prediction from genome information ("genomic context")
- Phylogenetic Profiles
- Conserved Neighborhood
- Gene Fusions
quantify … integrate … networks
Other Interaction Sources
- Interaction Databases
- Pathway Databases (Reactome)
- Automated Textmining
- Interolog Transfer
The scoring system
Final interaction score (protein A, protein B): between 0 and 1; a pseudoprobability, the "likelihood of functional association".
final score = 1 - (1 - nscore) * (1 - fscore) * (1 - pscore) * (1 - cscore) * (1 - escore) * (1 - tscore)
where nscore = neighborhood, fscore = fusion, pscore = co-occurrence, cscore = co-expression, escore = experimental, tscore = textmining.
Evidence transfer between species: nscore = 1 - (1 - nscore[query species]) * (1 - nscore[transf.]); analogous for cscore, escore, tscore, ...
Information transfer between species: either via orthologs (COG database) or via homology.
Raw score: each predictor has its own raw-score regime.
Example - Neighborhood: gene A ... gene B, intergenic distances 100 bp, 6 bp, 20 bp; raw score: sum of intergenic distances.
Benchmarking: [plot: raw score against KEGG performance (fraction on same map)]
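The combination rule on this slide can be sketched in a few lines of Python. This is only a minimal sketch of the formula as written here, not STRING's actual code; the function names `combine_scores` and `channel_score` and all numeric values are assumptions chosen for illustration.

```python
def combine_scores(scores):
    """Combine evidence scores as 1 - product(1 - s_i), as on the slide."""
    remaining = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
        remaining *= 1.0 - s
    return 1.0 - remaining


def channel_score(query_species_score, transferred_score):
    """Merge a channel's direct score with the score transferred from other species."""
    return combine_scores([query_species_score, transferred_score])


# Hypothetical values: neighborhood evidence of 0.5 in the query species
# plus 0.2 transferred from other species.
nscore = channel_score(0.5, 0.2)

# Hypothetical per-channel scores in the slide's order:
# neighborhood, fusion, co-occurrence, co-expression, experimental, textmining.
final_score = combine_scores([nscore, 0.0, 0.0, 0.4, 0.0, 0.0])
print(round(nscore, 3), round(final_score, 3))
```

Because each factor (1 - s) is at most 1, adding a further evidence channel can only raise the combined score or leave it unchanged; it never lowers it.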
The raw score regimes
- Neighborhood: gene A ... gene B, intergenic distances 100 bp, 6 bp, 20 bp. Raw score: sum of intergenic distances.
- Phylogenetic profiles: "similarity profiles"; singular value decomposition. Raw score: Euclidean distance. Filter: down-weight scores for homologous pairs.
- Fusion: raw score: constant (0.99).
- Experimental interactions: two-hybrid, TAP, annotated complexes, … Topology-based analysis: who with whom, how many other partners? Raw score: various (usually 'uniqueness' of interaction).
- Co-expression: download all microarray datasets for a given species; data normalization (spatial correction). Raw score: pairwise Pearson correlation coefficient.
- Textmining: download all PubMed abstracts; identify proteins in the abstracts; search for co-mentioned pairs. Raw score: log-odds score.
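Three of these raw scores are simple enough to sketch in Python. The function names and inputs below are hypothetical. The slide does not define the textmining log-odds, so the definition used here (observed co-mention frequency over the frequency expected if the two proteins were mentioned independently) is an assumption, not necessarily what STRING computes.

```python
import math


def neighborhood_raw_score(intergenic_distances_bp):
    """Neighborhood raw score: sum of intergenic distances (in bp)."""
    return sum(intergenic_distances_bp)


def pearson(x, y):
    """Co-expression raw score: Pearson correlation of two expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def comention_log_odds(n_both, n_a, n_b, n_total):
    """ASSUMED definition: log(observed co-mention rate / rate expected under independence)."""
    if min(n_both, n_a, n_b, n_total) <= 0:
        raise ValueError("all counts must be positive")
    observed = n_both / n_total
    expected = (n_a / n_total) * (n_b / n_total)
    return math.log(observed / expected)


# The distances from the slide's neighborhood example.
print(neighborhood_raw_score([100, 6, 20]))
# Hypothetical expression vectors and abstract counts.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(comention_log_odds(n_both=10, n_a=100, n_b=50, n_total=10000))
```

The three values live on different scales (base pairs, a correlation in [-1, 1], an unbounded log ratio), which is why the slides benchmark each raw score separately before combining the evidence.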
User-Experience: Aiming to be Visual and Intuitive
- 1’000 visits / day
- 800 users / day
- 9’000 pageviews / day
- > 10’000 DB-queries / day
Citations
- 2000, NAR, Snel et al.: 80 citations
- NAR, von Mering et al.: 215 citations
- NAR, von Mering et al.: 183 citations
- NAR, von Mering et al.: 189 citations
- NAR, Jensen et al.: 47 citations
total: 714 citations
Cross-links
- SMART: protein domain information
- GENECARDS: info and products on human genes
- SWISS-MODEL-REPOSITORY: homology models
- CYTOSCAPE: access via plug-in architecture
- SWISSPROT / UNIPROT: expert protein annotation
Cross-link example: launch SwissModel
Reciprocal View popup: launch STRING
Example #1: A missing chaperone for Cytochrome C oxidase. Question: who inserts the copper atom into CcO?
Example #1: The missing chaperone for Cytochrome C oxidase. Initial observation:
Example #1: The missing chaperone for Cytochrome C oxidase. Gene expressed; structure solved; it binds copper! Likely function: copper delivery.
Example #2: Simplify discovery in genome-wide association screens?
Christian von Mering – UZH MolBio – SIB
In-House Use of STRING
a) download data in relational database schema
b) download data as compact flat-files
c) in-house installation of webserver
d) cross-link to server (version controlled; to network, protein, link, ...)
e) PSI-MI export
f) [ SOAP / webservices ]
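For option b), a downloaded flat file could be filtered in-house along the following lines. This is a hedged sketch only: the slide does not describe the file layout, so the three whitespace-separated columns (protein A, protein B, score between 0 and 1), the `#` comment convention, and the name `read_links` are all assumptions to be checked against the real files.

```python
def read_links(lines, min_score=0.7):
    """Parse 'protein_a protein_b score' lines (ASSUMED layout) and keep high scores."""
    links = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/header lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        protein_a, protein_b, score_text = line.split()
        score = float(score_text)
        if score >= min_score:
            links.append((protein_a, protein_b, score))
    return links


# Hypothetical in-memory example; a real file object opened with open(...)
# can be passed in exactly the same way.
example_lines = [
    "# protein_a protein_b score",
    "P1 P2 0.95",
    "P1 P3 0.40",
    "",
    "P2 P3 0.70",
]
high_confidence = read_links(example_lines)
print(high_confidence)
```

Reading line by line keeps memory use flat, which matters for a file holding tens of millions of interactions.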
Version 9.0: exceeding 1000 genomes
Core organisms:
- include all model organisms (annotated knowledge)
- non-redundant, each genus is covered
- include organisms with functional genomics data
Irrelevant Organisms [future category]
More details & new features
“Payload Display” - Your Own STRING Server => “branding” STRING via remote-control: a call-back API
Acknowledgements
The STRING team: Samuel Chaffron, Manuel Weiss, Michael Kuhn, Lars Juhl Jensen, Sean Hooper, Berend Snel, Martijn Huynen, Peer Bork
The STRING institutions: SIB – Swiss Institute of Bioinformatics; University of Zurich; TU-Dresden; University of Copenhagen; European Molecular Biology Laboratory
“MySTRING”
- users can register / login, using OpenID or similar for authentication
- persistence of search results (“history”)
- store lists / items of interest (“bag of genes”)
- users can customize the interface
- generate revenue (?)
Feature #2 (Finding Relevant Texts)
Example #2: The missing enzymes for uric acid degradation. Question: why can’t humans degrade uric acid?
Example #2: The missing enzymes for uric acid degradation [figure; two entries marked “?”]
Example #2: The missing enzymes for uric acid degradation. Initial observation:
Example #2: The missing enzymes for uric acid degradation. Genes cloned, expressed; enzymatic activity demonstrated; candidate short-term therapeutics!