
1 Testing sequence comparison methods with structure similarity @ Organon, Oss 2006-02-07 Tim Hulsen

2 Introduction
– Main goal: transfer function of proteins in model organisms to proteins in humans
– Make use of “orthology”: proteins evolved from a common ancestor in different species (very similar function!)
– Several ortholog identification methods, relying on:
  – Sequence comparisons
  – (Phylogenies)

3 Introduction
Quality of ortholog identification depends on:
1.) Quality of the sequence comparison algorithm:
  – Smith-Waterman vs. BLAST, FASTA, etc.
  – Z-value vs. E-value
2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.)
2 -> previous research; 1 -> this presentation

4 Previous research
– Comparison of several ortholog identification methods
– Orthologs should have similar function
– Functional data of orthologs should behave similarly:
  – Gene expression data
  – Protein interaction data
  – InterPro IDs
  – Gene order

5 Orthology method comparison
Compared methods:
– BBH, Best Bidirectional Hit
– INP, InParanoid
– KOG, euKaryotic Orthologous Groups
– MCL, OrthoMCL
– PGT, PhyloGenetic Tree
– Z1H, Z-value > 1 Hundred

6 Orthology method comparison
– e.g. correlation in expression profiles
– Affymetrix human and mouse expression data, using SNOMED tissue classification
– Check if the expression profile of a protein is similar to the expression profile of its ortholog
(diagram: expression profiles of orthologs in Hs and Mm)

7 Orthology method comparison

8 e.g. conservation of protein interaction
– DIP (Database of Interacting Proteins)
– Check if the orthologs of two interacting proteins are still interacting in the other species -> calculate fraction
(diagram: interacting protein pair in Hs and its ortholog pair in Mm)

9 Orthology method comparison

10 (figure only)

11 Trade-off between sensitivity and selectivity
– BBH and INP are most sensitive but also most selective
– Results can differ depending on what sequence comparison algorithm is used:
  – BLAST, FASTA, Smith-Waterman?
  – E-value or Z-value?

12 E-value or Z-value?
– Smith-Waterman with Z-value statistics: 100 randomized shuffles to test the significance of the SW score
– Example: original MFTGQEYHSV; shuffle 1: GQHMSVFTEY; shuffle 2: YMSHQFTVGE; etc.
(figure: histogram of number of sequences vs. SW score for the randomized sequences; the original score lies 5*SD above the random mean -> Z = 5)
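A minimal sketch of this shuffling statistic in Python (an illustration, not the Biofacet implementation: the toy sw_score below uses a simple match/mismatch scheme instead of BLOSUM62, and only the query is shuffled, whereas the full protocol on the next slide uses 2x100 randomizations):

```python
import random
import statistics

def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Toy Smith-Waterman local alignment score (linear gap penalty)."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

def z_value(query, subject, n_shuffles=100):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores."""
    original = sw_score(query, subject)
    scores = []
    for _ in range(n_shuffles):
        residues = list(query)
        random.shuffle(residues)  # randomize order, keep amino acid composition
        scores.append(sw_score("".join(residues), subject))
    return (original - statistics.mean(scores)) / statistics.pstdev(scores)

print(z_value("MFTGQEYHSV", "MFTGQEYHSVMFTGQEYHSV"))
```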

13 E-value or Z-value?
– Z-value calculation takes much time (2x100 randomizations)
– Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value
– Advantage of Z-value has never been proven by experimental results

14 How to compare?
– Structural comparison is better than sequence comparison
– ASTRAL SCOP: Structural Classification of Proteins
– e.g. a.2.1.3, c.1.2.4; same number ~ same structure
– Use structural classification as benchmark for sequence comparison methods

15 ASTRAL SCOP statistics

max. % identity  members  families  avg. fam. size  max. fam. size  families =1  families >1
10%              3631     2250      1.614           25              1655         595
20%              3968     2297      1.727           29              1605         692
25%              4357     2313      1.884           32              1530         783
30%              4821     2320      2.078           39              1435         885
35%              5301     2322      2.283           46              1333         989
40%              5674     2322      2.444           47              1269         1053
50%              6442     2324      2.772           50              1178         1146
70%              7551     2325      3.248           127             1087         1238
90%              8759     2326      3.766           405             1023         1303
95%              9498     2326      4.083           479             977          1349

16 Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
– Paracel with e-value (PC E): SW implementation of Paracel
– Biofacet with z-value (BF Z): SW implementation of Gene-IT
– ParAlign with e-value (PA E): SW implementation of Sencel
– SSEARCH with e-value (SS E): SW implementation of FASTA (see next page)

17 Methods (2)
Heuristic algorithms:
– FASTA (FA E): Pearson & Lipman, 1988
  – Heuristic approximation; performs better than BLAST with strongly diverged proteins
– BLAST (BL E): Altschul et al., 1990
  – Heuristic approximation; stretches local alignments (HSPs) to a global alignment
  – Should be faster than FASTA

18 Method parameters
– all:
  – matrix: BLOSUM62
  – gap open penalty: 12
  – gap extension penalty: 1
– Biofacet with z-value: 100 randomizations

19 Receiver Operating Characteristic
– R.O.C.: statistical measure, mostly used in clinical medicine
– Proposed by Gribskov & Robinson (1996) to be used for sequence comparison analysis

20 ROC 50 Example
Query: d1c75a_ (SCOP family a.3.1.1). Top 25 of the 100 best hits:

hit #  domain    SCOP family  e-value
1      d1gcya1   b.71.1.1     0.31
2      d1h32b_   a.3.1.1      0.4
3      d1gks__   a.3.1.1      0.52
4      d1a56__   a.3.1.1      0.52
5      d1kx2a_   a.3.1.1      0.67
6      d1etpa1   a.3.1.4      0.67
7      d1zpda3   c.36.1.9     0.87
8      d1eu1a2   c.81.1.1     0.87
9      d451c__   a.3.1.1      1.1
10     d1flca2   c.23.10.2    1.1
11     d1mdwa_   d.3.1.3      1.1
12     d2dvh__   a.3.1.1      1.5
13     d1shsa_   b.15.1.1     1.5
14     d1mg2d_   a.3.1.1      1.5
15     d1c53__   a.3.1.1      2.4
16     d3c2c__   a.3.1.1      2.4
17     d1bvsa1   a.5.1.1      6.8
18     d1dvva_   a.3.1.1      6.8
19     d1cyi__   a.3.1.1      6.8
20     d1dw0a_   a.3.1.1      6.8
21     d1h0ba_   b.29.1.11    6.8
22     d3pfk__   c.89.1.1     6.8
23     d1kful3   d.3.1.3      6.8
24     d1ixrc1   a.4.5.11     14
25     d1ixsb1   a.4.5.11     14

– Take 100 best hits
– True positives: in same SCOP family; false positives: not in same family
– For each of the first 50 false positives: count the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12 for the false positives visible above)
– Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (size of family - 1) = ROC 50 (0.167)
– Take the average of the ROC 50 scores for all entries
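A sketch of this per-query ROC 50 recipe (helper names are illustrative; hits is a ranked list of (domain, SCOP family) pairs; the truncated demo list cannot reproduce the 0.167, which needs the full 100-hit list, and the family_size used below is illustrative):

```python
def roc50(hits, query_family, family_size, n_fp=50):
    """ROC 50 for one query: for each of the first n_fp false positives,
    count the true positives ranked above it; normalize the sum by
    n_fp * (family_size - 1)."""
    tp_above, fp_seen, total = 0, 0, 0
    for _, family in hits:
        if family == query_family:
            tp_above += 1
        else:
            fp_seen += 1
            total += tp_above
            if fp_seen == n_fp:
                break
    # Assumption: if the list runs out before n_fp false positives, each
    # remaining false positive counts all true positives found so far.
    total += (n_fp - fp_seen) * tp_above
    return total / (n_fp * (family_size - 1))

# First rows of the slide's example (query d1c75a_, family a.3.1.1):
hits = [("d1gcya1", "b.71.1.1"), ("d1h32b_", "a.3.1.1"),
        ("d1gks__", "a.3.1.1"), ("d1a56__", "a.3.1.1"),
        ("d1kx2a_", "a.3.1.1"), ("d1etpa1", "a.3.1.4")]
print(roc50(hits, "a.3.1.1", family_size=66))
```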

21 ROC 50 results

22 Coverage vs. Error
– C.V.E. = Coverage vs. Error (Brenner et al., 1998)
– E.P.Q. = errors-per-query, a selectivity indicator (how many false positives?)
– Coverage = sensitivity indicator (how many true positives out of the total?)

23 CVE Example
(Same example hit list as in slide 20: query d1c75a_, family a.3.1.1.)
– Vary the threshold above which a hit is seen as a positive: e.g. e=10, e=1, e=0.1, e=0.01
– True positives: in same SCOP family; false positives: not in same family
– For each threshold, calculate the coverage: number of true positives divided by the total number of possible true positives
– For each threshold, calculate the errors-per-query: number of false positives divided by the number of queries
– Plot coverage on the x-axis and errors-per-query on the y-axis; bottom-right is best
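A sketch of one coverage-vs-error point under these definitions (names are illustrative; scored_hits pools (e-value, is-true-positive) pairs over all queries):

```python
def cve_point(scored_hits, threshold, n_possible_tp, n_queries):
    """One CVE point: hits with e-value <= threshold count as positives.
    Returns (coverage, errors-per-query)."""
    tp = sum(1 for e, is_tp in scored_hits if e <= threshold and is_tp)
    fp = sum(1 for e, is_tp in scored_hits if e <= threshold and not is_tp)
    return tp / n_possible_tp, fp / n_queries

# Sweep thresholds to trace the curve (toy data: two queries, 4 possible TPs):
scored_hits = [(0.31, False), (0.4, True), (0.52, True), (6.8, False), (14, False)]
for t in (10, 1, 0.1, 0.01):
    print(t, cve_point(scored_hits, t, n_possible_tp=4, n_queries=2))
```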

24 CVE results (only PDB095)

25 Mean Average Precision
– A.P.: borrowed from information retrieval (Salton, 1991)
– Recall: true positives divided by the number of homologs
– Precision: true positives divided by the number of hits
– A.P. = approximate integral to calculate the area under the recall-precision curve

26 Mean AP Example
(Same example hit list as in slide 20: query d1c75a_, family a.3.1.1.)
– Take 100 best hits
– True positives: in same SCOP family; false positives: not in same family
– For each of the true positives: divide the true-positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the overall hit rank (2,3,4,5,9,12,14,15,16,18,19,20)
– Divide the sum of all of these numbers, over all 100 hits, by the total number of hits (100) = AP (0.140)
– Take the average of the AP scores for all entries = mean AP
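A sketch of this AP recipe (note it divides by the number of hits rather than by the number of true positives, so it is not the textbook average precision; data format as in the ROC 50 sketch above):

```python
def average_precision(hits, query_family, n_hits=100):
    """Slide's AP: at each true positive, add (true positives so far / rank);
    divide the sum by the total number of hits considered. Mean AP is the
    average of this value over all query entries."""
    tp, total = 0, 0.0
    for rank, (_, family) in enumerate(hits[:n_hits], start=1):
        if family == query_family:
            tp += 1
            total += tp / rank  # precision at this true positive
    return total / n_hits

# Tiny demo on the first three hits of the slide's example:
hits = [("d1gcya1", "b.71.1.1"), ("d1h32b_", "a.3.1.1"), ("d1gks__", "a.3.1.1")]
print(average_precision(hits, "a.3.1.1", n_hits=len(hits)))  # (1/2 + 2/3) / 3
```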

27 Mean AP results

28 Time consumption
PDB095 all-against-all comparison:
– Biofacet: multiple days (z-value calc.!)
– BLAST: 2d 4h 16m
– SSEARCH: 5h 49m
– ParAlign: 47m
– FASTA: 40m

29 Preliminary conclusions
– SSEARCH gives the best results
– When time is important, FASTA is a good alternative
– Z-value seems to have no advantage over E-value

30 Problems
– Bias in PDB?
  – Sequence length
  – Amino acid composition
– Difference in matrices?
– Difference in SW implementations?

31 Bias in PDB sequence length? -> Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets

32 Bias in PDB aa distribution? -> No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets

33 Difference in matrices?

34 Difference in SW implementations?

35 Conclusions
– E-value better than Z-value!
– SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all
– Larger structural comparison database needed for better analysis

36 Credits
NV Organon:
– Peter Groenen
– Wilco Fleuren
Wageningen UR:
– Jack Leunissen

