The STRING database
Michael Kuhn, EMBL Heidelberg

protein interactions

example: tryptophan synthase beta chain, E. coli K12

many sources

genomic context

curated knowledge

experimental evidence

literature

Jensen et al., Drug Discovery Today: Targets, 2004

373 genomes (only completely sequenced genomes)

1.5 million genes (not proteins)

Genome Reviews

RefSeq

Ensembl

model organism databases

data integration

genomic context methods

gene fusion

gene neighborhood

phylogenetic profiles

figure: cell, cellulosomes, cellulose

automatic inference of interactions

correct interactions

wrong associations

gene fusion score: sequence similarity

gene neighborhood score: sum of intergenic distances

phylogenetic profiles

SVD singular value decomposition (removes redundancy)

score: Euclidean distance
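
The distance-based score can be illustrated with a minimal sketch. It assumes a phylogenetic profile is a presence/absence vector over genomes; the gene names and profiles are hypothetical toy data, and the SVD reduction named on the previous slide is left out for brevity (it could be applied first, for example with numpy.linalg.svd). A smaller distance means more similar profiles.

```python
import math

def profile_distance(profile_a, profile_b):
    """Euclidean distance between two phylogenetic profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

# Hypothetical toy profiles: 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 1],
    "geneB": [1, 1, 0, 0, 1, 0],  # differs from geneA in one genome
    "geneC": [0, 0, 1, 1, 0, 0],  # complementary to geneA
}

d_ab = profile_distance(profiles["geneA"], profiles["geneB"])
d_ac = profile_distance(profiles["geneA"], profiles["geneC"])
```

In this toy case geneA and geneB come out much closer than geneA and geneC, so they would be the stronger candidate pair.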

all scores are “raw scores”

not comparable:
sequence similarity
sum of intergenic distances
Euclidean distance

benchmarking: calibrate against “gold standard” (KEGG)

raw scores

probabilistic scores, e.g. “70% chance for an association”
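
One way to picture the calibration step is the following sketch. It is an assumption, not the STRING implementation: raw scores are binned, and each bin's probabilistic score is the fraction of its gene pairs that the gold standard confirms. The function name, bin edges, and data are hypothetical, and the toy data assumes a higher raw score means stronger evidence.

```python
import bisect

def calibrate(raw_scores, in_gold_standard, bin_edges):
    """Fraction of gold-standard-confirmed pairs in each raw-score bin."""
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for score, confirmed in zip(raw_scores, in_gold_standard):
        if not bin_edges[0] <= score <= bin_edges[-1]:
            continue  # ignore scores outside the binned range
        # Index of the bin containing this score; the top edge joins the last bin.
        i = min(bisect.bisect_right(bin_edges, score) - 1, n_bins - 1)
        totals[i] += 1
        hits[i] += int(confirmed)
    # Empty bins get None rather than a made-up probability.
    return [h / t if t else None for h, t in zip(hits, totals)]

# Hypothetical raw scores and gold-standard labels (True = same pathway).
raw = [5, 12, 18, 19, 25, 31, 38, 39, 44, 52, 58, 59]
gold = [False, False, False, True, False, True, False, True,
        True, True, True, False]
calibrated = calibrate(raw, gold, bin_edges=[0, 20, 40, 60])
```

Once every evidence type is mapped onto the same probability scale in this way, the scores become comparable.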

curated knowledge

KEGG Kyoto Encyclopedia of Genes and Genomes

Reactome

MIPS Munich Information center for Protein Sequences

STKE Signal Transduction Knowledge Environment

GO Gene Ontology

primary experimental data

many sources

many parsers

physical protein interactions

BIND Biomolecular Interaction Network Database

GRID General Repository for Interaction Datasets

MINT Molecular Interactions Database

DIP Database of Interacting Proteins

HPRD Human Protein Reference Database

large sets are scored separately

co-expression microarray data

GEO Gene Expression Omnibus

correlation coefficient
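
As a rough illustration of a co-expression raw score, the sketch below computes the Pearson correlation coefficient between two expression profiles. The slide does not say which correlation coefficient is used, so Pearson is an assumption; the gene names and values are hypothetical toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression values across four microarray experiments.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [2.0, 4.0, 6.0, 8.0]  # rises together with gene1
gene3 = [4.0, 3.0, 2.0, 1.0]  # falls as gene1 rises

r12 = pearson(gene1, gene2)
r13 = pearson(gene1, gene3)
```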

literature mining

different gene identifiers

synonyms list

Medline

SGD Saccharomyces Genome Database

The Interactive Fly

OMIM Online Mendelian Inheritance in Man

simple scheme

co-mentioning
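
A minimal sketch of co-mentioning, under stated assumptions: each abstract is tokenized, tokens are mapped to a canonical gene name through a synonyms list, and every pair of genes found in the same abstract gets one count. The function name, the synonyms table, and the second and third abstracts are hypothetical toy data.

```python
import re
from collections import Counter
from itertools import combinations

def comention_counts(abstracts, synonyms):
    """Count, per gene pair, the abstracts that mention both genes."""
    # Map every lowercased alias to its canonical gene name.
    lookup = {alias.lower(): gene
              for gene, aliases in synonyms.items()
              for alias in aliases}
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Each gene counts once per abstract, however often it is named.
        genes = sorted({lookup[t] for t in tokens if t in lookup})
        for pair in combinations(genes, 2):
            counts[pair] += 1
    return counts

# Hypothetical synonyms list and abstracts.
synonyms = {"CYC1": ["CYC1"], "CYC7": ["CYC7"], "HAP1": ["HAP1", "Hap1p"]}
abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1",
    "Hap1p binds upstream of CYC1",
    "An abstract that mentions only CYC7",
]
counts = comention_counts(abstracts, synonyms)
```

The raw counts would still need calibration against a gold standard before they can be compared with other evidence.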

more advanced

NLP Natural Language Processing

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1

calibrate against gold standard

combine all evidence

Bayesian scoring scheme

e.g.: two scores of 0.7 combined probability: ?

e.g.: two scores of 0.7, combined probability: 1 - (1 - 0.7)^2 = 0.91
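
The arithmetic for two scores of 0.7 can be sketched as follows. It assumes the evidence types are independent, so the combined probability is one minus the probability that every individual line of evidence is wrong. The function name is hypothetical, and any further corrections a production scoring scheme might apply are not modeled here.

```python
def combine_scores(scores):
    """Combine independent probabilistic scores: 1 - product of (1 - score)."""
    all_wrong = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("each score must be a probability in [0, 1]")
        all_wrong *= 1.0 - score  # chance that this evidence is also wrong
    return 1.0 - all_wrong

combined = combine_scores([0.7, 0.7])  # 1 - 0.3 * 0.3
```

Adding evidence can only raise the combined score, and a single score is returned unchanged.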

evidence transfer

evidence spread over many species

transfer by orthology (or “fuzzy orthology”)

von Mering et al., Nucleic Acids Research, 2005

two modes

COG mode

von Mering et al., Nucleic Acids Research, 2005

higher coverage
lower specificity
includes all available evidence
some orthologous groups are too large to be meaningful

proteins mode

von Mering et al., Nucleic Acids Research, 2005

maximum specificity
lower coverage
information will be relevant for selected species

Demo

outlook

take home message
STRING integrates information and predicts interactions
You can always go to the sources
Proteins mode: specific species
COG mode: more coverage, especially for prokaryotic genes

Acknowledgements
The STRING team
Lars Jensen
Peer Bork
Christian von Mering & group in Zurich
Berend Snel
Martijn Huynen

Thank you for your attention

Exercises: tinyurl.com/36twzq (or via course wiki)
Alternative server: xi.embl.de

Bork et al., Current Opinion in Structural Biology, 2004