Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27.


1 Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

2 Introduction
Sequence comparison: important for finding similar proteins (homologs) for a protein with unknown function
Algorithms: BLAST, FASTA, Smith-Waterman
Statistical scores: E-value (standard), Z-value

3 E-value or Z-value?
Smith-Waterman sequence comparison with Z-value statistics: 100 randomized shuffles to test the significance of the SW score
Example: original MFTGQEYHSV; shuffle 1 GQHMSVFTEY; shuffle 2 YMSHQFTVGE; etc.
[Histogram: number of sequences vs. SW score for the shuffled (rnd) and original (ori) sequences; the original score lies 5*SD above the random mean, so Z = 5]
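The shuffling procedure on this slide can be sketched in Python. This is a minimal sketch: `toy_identity` is a stand-in scorer, not a real Smith-Waterman implementation such as Biofacet or SSEARCH, and both function names are hypothetical:

```python
import random

def toy_identity(a, b):
    """Stand-in scorer (NOT Smith-Waterman): counts identical positions."""
    return sum(x == y for x, y in zip(a, b))

def z_value(score_fn, query, subject, n_shuffles=100, seed=0):
    """Z-value of score_fn(query, subject) against shuffled queries.

    Z = (original score - mean of shuffled scores) / SD of shuffled scores,
    so Z = 5 means the original score lies 5 standard deviations above
    the random background.
    """
    rng = random.Random(seed)
    original = score_fn(query, subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(query)
        rng.shuffle(letters)
        shuffled_scores.append(score_fn("".join(letters), subject))
    mean = sum(shuffled_scores) / n_shuffles
    sd = (sum((s - mean) ** 2 for s in shuffled_scores) / n_shuffles) ** 0.5
    return float("inf") if sd == 0 else (original - mean) / sd
```

Biofacet shuffles both sequences (hence the 2 x 100 randomizations mentioned on slide 4); this sketch shuffles only the query for brevity.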

4 E-value or Z-value?
Z-value calculation takes much time (2 x 100 randomizations)
Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value
BUT: the advantage of the Z-value has never been demonstrated by experimental results

5 How to compare?
Structural comparison is a better measure of protein relatedness than sequence comparison
ASTRAL SCOP: Structural Classification of Proteins; e.g. a.2.1.3, c.1.2.4; same numbers ~ same structure
Use the structural classification as a benchmark for the sequence comparison methods

6 ASTRAL SCOP statistics

max. % identity  members  families  avg. fam. size  max. fam. size  families = 1  families > 1
10%              3631     2250      1.614           25              1655          595
20%              3968     2297      1.727           29              1605          692
25%              4357     2313      1.884           32              1530          783
30%              4821     2320      2.078           39              1435          885
35%              5301     2322      2.283           46              1333          989
40%              5674     2322      2.444           47              1269          1053
50%              6442     2324      2.772           50              1178          1146
70%              7551     2325      3.248           127             1087          1238
90%              8759     2326      3.766           405             1023          1303
95%              9498     2326      4.083           479             977           1349

7 Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
- Paracel with e-value (PA E): SW implementation of Paracel
- Biofacet with z-value (BF Z): SW implementation of Gene-IT
- ParAlign with e-value (PA E): SW implementation of Sencel
- SSEARCH with e-value (SS E): SW implementation of FASTA (see next page)

8 Methods (2)
Heuristic algorithms:
- FASTA (FA E): Pearson & Lipman, 1988; heuristic approximation; performs better than BLAST with strongly diverged proteins
- BLAST (BL E): Altschul et al., 1990; heuristic approximation; stretches local alignments (HSPs) to a global alignment; should be faster than FASTA

9 Method parameters
all:
- matrix: BLOSUM62
- gap open penalty: 12
- gap extension penalty: 1
Biofacet with z-value: 100 randomizations

10 Receiver Operating Characteristic
R.O.C.: a statistical measure, mostly used in clinical medicine
Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis

11 ROC 50 Example
Query: d1c75a_ (SCOP family a.3.1.1)

hit #  domain    SCOP family  e-value
1      d1gcya1   b.71.1.1     0.31
2      d1h32b_   a.3.1.1      0.4
3      d1gks__   a.3.1.1      0.52
4      d1a56__   a.3.1.1      0.52
5      d1kx2a_   a.3.1.1      0.67
6      d1etpa1   a.3.1.4      0.67
7      d1zpda3   c.36.1.9     0.87
8      d1eu1a2   c.81.1.1     0.87
9      d451c__   a.3.1.1      1.1
10     d1flca2   c.23.10.2    1.1
11     d1mdwa_   d.3.1.3      1.1
12     d2dvh__   a.3.1.1      1.5
13     d1shsa_   b.15.1.1     1.5
14     d1mg2d_   a.3.1.1      1.5
15     d1c53__   a.3.1.1      2.4
16     d3c2c__   a.3.1.1      2.4
17     d1bvsa1   a.5.1.1      6.8
18     d1dvva_   a.3.1.1      6.8
19     d1cyi__   a.3.1.1      6.8
20     d1dw0a_   a.3.1.1      6.8
21     d1h0ba_   b.29.1.11    6.8
22     d3pfk__   c.89.1.1     6.8
23     d1kful3   d.3.1.3      6.8
24     d1ixrc1   a.4.5.11     14
25     d1ixsb1   a.4.5.11     14

- Take the 100 best hits
- True positives: hits in the same SCOP family as the query; false positives: hits in another family
- For each of the first 50 false positives: count the true positives ranked higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12)
- Divide the sum of these counts by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC 50 (0.167)
- Take the average of the ROC 50 scores over all entries
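The recipe above can be sketched as a small Python function. `is_tp` is the ordered hit list as booleans (True = same SCOP family as the query), and the denominator `n_possible_tp` (family size minus 1) must be supplied by the caller; both names are hypothetical:

```python
def roc50(is_tp, n_possible_tp, n_fp=50):
    """ROC50: for each of the first n_fp false positives, count the true
    positives ranked above it; normalize the sum by n_fp and by the number
    of possible true positives."""
    tp_seen = 0
    fp_seen = 0
    tp_above_fp = 0
    for hit_is_tp in is_tp:
        if hit_is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            tp_above_fp += tp_seen
            if fp_seen == n_fp:
                break
    return tp_above_fp / (n_fp * n_possible_tp)
```

With the 25 hits on this slide (true positives at ranks 2-5, 9, 12, 14-16, 18-20), the per-false-positive counts are 0,4,4,4,5,5,6,9,12,12,12,12,12, summing to 97.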

12 ROC 50 results

13 Coverage vs. Error
C.V.E. = Coverage vs. Error (Brenner et al., 1998)
E.P.Q. = errors per query: selectivity indicator (how many false positives?)
Coverage = sensitivity indicator (how many true positives out of the total possible?)

14 CVE Example
Query: d1c75a_ (a.3.1.1), with the same example hit list as slide 11
- Vary the threshold above which a hit is counted as a positive, e.g. e=10, e=1, e=0.1, e=0.01
- True positives: hits in the same SCOP family as the query; false positives: hits in another family
- For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries
- Plot coverage on the x-axis and errors-per-query on the y-axis; bottom-right is best
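The threshold sweep can be sketched as follows, assuming the hits are given as (e-value, is_true_positive) pairs; the function and parameter names are hypothetical:

```python
def coverage_vs_epq(hits, thresholds, n_possible_tp, n_queries):
    """For each e-value threshold, return a (coverage, errors-per-query) pair:
    coverage = true positives at or below the threshold / possible true positives,
    EPQ      = false positives at or below the threshold / number of queries."""
    curve = []
    for t in thresholds:
        tp = sum(1 for e, is_tp in hits if e <= t and is_tp)
        fp = sum(1 for e, is_tp in hits if e <= t and not is_tp)
        curve.append((tp / n_possible_tp, fp / n_queries))
    return curve
```

Plotting the resulting pairs with coverage on the x-axis and EPQ on the y-axis gives the CVE curve; a method whose curve lies further toward the bottom-right is both more sensitive and more selective.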

15 CVE results (for PDB010)

16 Mean Average Precision
A.P. = Average Precision: borrowed from information retrieval (Salton, 1991)
Recall: true positives divided by the number of homologs
Precision: true positives divided by the number of hits
A.P. = approximate integral calculating the area under the recall-precision curve

17 Mean AP Example
Query: d1c75a_ (a.3.1.1), with the same example hit list as slide 11
- Take the 100 best hits
- True positives: hits in the same SCOP family as the query; false positives: hits in another family
- For each true positive: divide its rank among the true positives (1,2,3,4,5,6,7,8,9,10,11,12) by its rank in the full hit list (2,3,4,5,9,12,14,15,16,18,19,20)
- Divide the sum of these ratios by the total number of hits (100) = AP (0.140)
- Take the average of the AP scores over all entries = mean AP
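The per-query computation can be sketched as below. Note that this slide normalizes by the total number of hits, whereas classical information-retrieval AP divides by the number of relevant items, so the divisor is exposed as a parameter; all names are hypothetical:

```python
def average_precision(is_tp, divisor=None):
    """Sum, over the true positives, of
    (rank among true positives) / (rank in the full hit list),
    divided by `divisor` (defaults to the length of the hit list)."""
    if divisor is None:
        divisor = len(is_tp)
    tp_seen = 0
    total = 0.0
    for rank, hit_is_tp in enumerate(is_tp, start=1):
        if hit_is_tp:
            tp_seen += 1
            total += tp_seen / rank
    return total / divisor
```

The mean AP is then the average of this value over all query entries.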

18 Mean AP results

19 Time consumption
PDB095 all-against-all comparison:
- Biofacet: multiple days (Z-value calculation!)
- SSEARCH: 5h49m
- ParAlign: 47m
- FASTA: 40m
- BLAST: 15m

20 Conclusions
- E-value better than Z-value(!)
- The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all
- Use FASTA/BLAST only when time is important
- A larger structural comparison database is needed for better analysis

21 Credits Peter Groenen Wilco Fleuren Jack Leunissen

