The STRING database Michael Kuhn EMBL Heidelberg.

Presentation on theme: "The STRING database Michael Kuhn EMBL Heidelberg."— Presentation transcript:

1 The STRING database Michael Kuhn EMBL Heidelberg

2 protein interactions

3

4

5

6

7

8

9 example: tryptophan synthase beta chain, E. coli K12

10

11

12

13 many sources

14 genomic context

15 curated knowledge

16 experimental evidence

17 literature

18 Jensen et al., Drug Discovery Today: Targets, 2004

19 373 genomes (only completely sequenced genomes)

20 1.5 million genes (not proteins)

21 Genome Reviews

22 RefSeq

23 Ensembl

24 model organism databases

25 data integration

26 genomic context methods

27 gene fusion

28 gene neighborhood

29 phylogenetic profiles

30

31

32

33 [figure labels: cell, cellulosomes, cellulose]

34 automatic inference of interactions

35 correct interactions

36 wrong associations

37 gene fusion score: sequence similarity

38 gene neighborhood score: sum of intergenic distances
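
The slide names the raw neighborhood score only as a "sum of intergenic distances". A minimal sketch of one way to read that, assuming genes are given as (name, start, end) tuples on a single replicon and that the score adds up the gaps between consecutive genes lying between the two genes of interest. The function name, the data layout, and the coordinates are hypothetical and are not taken from STRING.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int]  # (name, start, end); hypothetical layout


def intergenic_distance_sum(genes: List[Gene], a: str, b: str) -> int:
    """Sum the gaps between consecutive genes on the path from gene a to gene b."""
    ordered = sorted(genes, key=lambda g: g[1])  # order genes by start coordinate
    names = [g[0] for g in ordered]
    i, j = sorted((names.index(a), names.index(b)))
    total = 0
    for left, right in zip(ordered[i:j], ordered[i + 1:j + 1]):
        # gap = start of the next gene minus end of the previous one;
        # overlapping genes contribute 0 rather than a negative value
        total += max(0, right[1] - left[2])
    return total


# Made-up coordinates, for illustration only
genes = [("g1", 100, 900), ("g2", 910, 2100), ("g3", 2150, 3500)]
print(intergenic_distance_sum(genes, "g1", "g3"))  # 10 + 50 = 60
```

A small sum means the genes sit close together; as the later slides note, such a raw value is not directly comparable to the raw scores of other methods.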

39 phylogenetic profiles

40 SVD singular value decomposition (removes redundancy)

41 score: Euclidean distance
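
To illustrate the distance step, here is a minimal sketch that compares presence/absence profiles directly with the Euclidean distance. It deliberately skips the SVD step from the previous slide, so the vectors here are raw profiles rather than reduced ones. The gene names and profile values are invented for the example.

```python
import math
from typing import Dict, List


def euclidean_distance(p: List[float], q: List[float]) -> float:
    """Euclidean distance between two profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# One column per genome: 1 = gene present, 0 = absent (made-up values)
profiles: Dict[str, List[float]] = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],
    "geneC": [0, 0, 1, 0, 1],
}

print(euclidean_distance(profiles["geneA"], profiles["geneB"]))  # identical profiles
print(euclidean_distance(profiles["geneA"], profiles["geneC"]))  # complementary profiles
```

Genes with similar profiles get a small distance, which is the raw signal for a functional association in this method.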

42 all scores are “raw scores”

43 not comparable: sequence similarity, sum of intergenic distances, Euclidean distance

44 benchmarking calibrate against “gold standard” (KEGG)

45

46 raw scores

47 probabilistic scores, e.g. “70% chance for an association”
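
The step from raw to probabilistic scores can be sketched as a simple binning exercise: group predicted pairs by raw score and, within each bin, measure the fraction confirmed by the gold standard. This is only an illustration under that assumption; the slides do not say how the calibration is actually carried out, and the function name, bin edges, and data below are hypothetical.

```python
from typing import List, Tuple


def calibrate(scored_pairs: List[Tuple[float, bool]], edges: List[float]) -> List[float]:
    """Per bin [edges[k], edges[k+1]), return the fraction of pairs in the gold standard."""
    n_bins = len(edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for raw, in_gold in scored_pairs:
        for k in range(n_bins):
            if edges[k] <= raw < edges[k + 1]:
                totals[k] += 1
                hits[k] += int(in_gold)
                break
    # bins without any pairs have no estimate
    return [h / t if t else float("nan") for h, t in zip(hits, totals)]


# (raw score, pair found in the gold standard?) with made-up values
pairs = [(0.1, False), (0.2, False), (0.3, True), (0.4, False),
         (0.6, True), (0.7, True), (0.8, False), (0.9, True)]
print(calibrate(pairs, [0.0, 0.5, 1.0]))  # [0.25, 0.75]
```

Read this way, a raw score that falls into the second bin would be reported as roughly a 75% chance of an association, which makes scores from different methods comparable.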

48 curated knowledge

49 KEGG Kyoto Encyclopedia of Genes and Genomes

50 Reactome

51 MIPS Munich Information Center for Protein Sequences

52 STKE Signal Transduction Knowledge Environment

53 GO Gene Ontology

54 primary experimental data

55 many sources

56 many parsers

57 physical protein interactions

58 BIND Biomolecular Interaction Network Database

59 GRID General Repository for Interaction Datasets

60 MINT Molecular INTeraction database

61 DIP Database of Interacting Proteins

62 HPRD Human Protein Reference Database

63 large sets are scored separately

64 co-expression microarray data

65 GEO Gene Expression Omnibus

66 correlation coefficient
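
The slide does not say which correlation coefficient is used. Assuming the Pearson coefficient, a minimal sketch for two expression vectors measured over the same microarray experiments looks like this; the expression values are invented.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (norm_x * norm_y)


# Made-up expression levels across four experiments
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises
print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```

A coefficient near 1 indicates co-expression; like the other raw scores, it would still need calibration before being combined with other evidence.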

67 literature mining

68 different gene identifiers

69 synonyms list

70 Medline

71 SGD Saccharomyces Genome Database

72 The Interactive Fly

73 OMIM Online Mendelian Inheritance in Man

74 simple scheme

75 co-mentioning
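
Co-mentioning can be sketched as counting how often two known gene names appear in the same text. The sketch below assumes exact, case-sensitive matching against a set of names and ignores synonyms; the function name is hypothetical, the first text is the example sentence from the NLP slides, and the second is a made-up toy sentence.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable, Set


def co_mentions(abstracts: Iterable[str], gene_names: Set[str]) -> Counter:
    """Count, for every pair of gene names, the texts that mention both."""
    counts: Counter = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        found = sorted(tokens & gene_names)      # names present in this text
        counts.update(combinations(found, 2))    # each pair counted once per text
    return counts


abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1",
    "HAP1 activates CYC1 transcription",  # toy sentence for illustration
]
counts = co_mentions(abstracts, {"CYC1", "CYC7", "HAP1"})
print(counts[("CYC1", "HAP1")])  # 2
print(counts[("CYC1", "CYC7")])  # 1
```

This simple scheme cannot tell which gene controls which; that is what the more advanced NLP approach on the following slides addresses.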

76 more advanced

77 NLP Natural Language Processing

78 Gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: [nxgene The GAL4 gene]; [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

79 Gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1

80 calibrate against gold standard

81

82 combine all evidence

83 Bayesian scoring scheme

84 e.g.: two scores of 0.7 combined probability: ?

85 e.g.: two scores of 0.7, combined probability: $1 - (1 - 0.7)^2 = 0.91$
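
The arithmetic on this slide generalizes to any number of evidence scores: the combined probability is one minus the probability that every piece of evidence is wrong. The sketch below assumes the scores are independent probabilities, as the slide's example implies; any further corrections in the published scoring scheme are not shown on the slides and are not modeled here. The function name is hypothetical.

```python
from typing import Iterable


def combine_scores(scores: Iterable[float]) -> float:
    """Combine independent probabilistic scores as 1 - prod(1 - s_i)."""
    p_all_wrong = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
        p_all_wrong *= 1.0 - s
    return 1.0 - p_all_wrong


print(round(combine_scores([0.7, 0.7]), 2))  # 0.91, as on the slide
```

Adding evidence can only raise the combined score, and a single score is returned unchanged.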

86 evidence transfer

87 evidence spread over many species

88 transfer by orthology (or “fuzzy orthology”)

89 von Mering et al., Nucleic Acids Research, 2005

90

91 two modes

92

93

94 COG mode

95 von Mering et al., Nucleic Acids Research, 2005

96 higher coverage; lower specificity; includes all available evidence; some orthologous groups are too large to be meaningful

97 proteins mode

98 von Mering et al., Nucleic Acids Research, 2005

99 maximum specificity; lower coverage; information will be relevant for selected species

100 Demo

101

102 outlook

103

104

105 take home message: STRING integrates information and predicts interactions; you can always go to the sources; proteins mode: specific species; COG mode: more coverage, especially for prokaryotic genes

106 Acknowledgements. The STRING team: Lars Jensen, Peer Bork, Christian von Mering & group in Zurich, Berend Snel, Martijn Huynen

107 Thank you for your attention

109 Exercises: tinyurl.com/36twzq (or via course wiki). Alternative server: xi.embl.de

110

111 Bork et al., Current Opinion in Structural Biology, 2004

