1
Testing statistical significance scores of sequence comparison methods with structure similarity
Tim Hulsen
NCMLS PhD Two-Day Conference, 2006-04-27
2
Introduction
- Sequence comparison: important for finding similar proteins (homologs) for a protein with unknown function
- Algorithms: BLAST, FASTA, Smith-Waterman
- Statistical scores: E-value (standard), Z-value
3
E-value or Z-value?
- Smith-Waterman sequence comparison with Z-value statistics: 100 randomized shuffles to test the significance of the SW score (see the sketch below)
- Example: original sequence MFTGQEYHSV; shuffle 1: GQHMSVFTEY; shuffle 2: YMSHQFTVGE; etc.
- [Figure: histogram of SW scores (# seqs vs. SW score) of the randomized (rnd) sequences; the original (ori) score lies 5*SD above the random mean, so Z = 5]
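Below is a minimal Python sketch of the shuffling procedure described on this slide. It assumes a caller-supplied `sw_score` function standing in for whatever Smith-Waterman implementation is used (e.g. Biofacet); for simplicity only the query is shuffled here, whereas the 2 x 100 randomizations mentioned on the next slide shuffle both sequences.

```python
import random
import statistics

def z_value(query: str, subject: str, sw_score, n_shuffles: int = 100) -> float:
    """Z = (original SW score - mean of random scores) / SD of random scores."""
    original = sw_score(query, subject)          # score of the real alignment
    random_scores = []
    for _ in range(n_shuffles):
        residues = list(query)
        random.shuffle(residues)                 # randomize residue order, keep composition
        random_scores.append(sw_score("".join(residues), subject))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    return (original - mean) / sd

# In the slide's example, the original score lies 5 standard deviations above
# the mean of the shuffled scores, giving Z = 5.
```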
4
E-value or Z-value?
- Z-value calculation is time-consuming (2 x 100 randomizations per comparison)
- Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value
- BUT: the advantage of the Z-value has never been demonstrated by experimental results
5
How to compare?
- Structural comparison is better than sequence comparison
- ASTRAL SCOP: Structural Classification Of Proteins; e.g. a.2.1.3, c.1.2.4 (class.fold.superfamily.family); same number ~ same structure
- Use the structural classification as a benchmark for the sequence comparison methods (see the sketch below)
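As an illustration of how the SCOP classification serves as the benchmark, the short Python sketch below treats two domains as true homologs when their full classification strings (class.fold.superfamily.family) are identical; the example strings are the ones mentioned on this slide.

```python
def same_scop_family(sccs_a: str, sccs_b: str) -> bool:
    """True when two SCOP classification strings denote the same family."""
    return sccs_a.split(".")[:4] == sccs_b.split(".")[:4]

print(same_scop_family("a.2.1.3", "a.2.1.3"))   # True: same family -> counted as homologs
print(same_scop_family("a.2.1.3", "c.1.2.4"))   # False: different classification -> not homologs
```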
6
ASTRAL SCOP statistics

max. % identity  members  families  avg. fam. size  max. fam. size  families = 1  families > 1
10%              3631     2250      1.614           25              1655          595
20%              3968     2297      1.727           29              1605          692
25%              4357     2313      1.884           32              1530          783
30%              4821     2320      2.078           39              1435          885
35%              5301     2322      2.283           46              1333          989
40%              5674     2322      2.444           47              1269          1053
50%              6442     2324      2.772           50              1178          1146
70%              7551     2325      3.248           127             1087          1238
90%              8759     2326      3.766           405             1023          1303
95%              9498     2326      4.083           479             977           1349
7
Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
- Paracel with e-value (PA E): SW implementation of Paracel
- Biofacet with z-value (BF Z): SW implementation of Gene-IT
- ParAlign with e-value (PA E): SW implementation of Sencel
- SSEARCH with e-value (SS E): SW implementation of FASTA (see next slide)
8
Methods (2)
Heuristic algorithms:
- FASTA (FA E): Pearson & Lipman, 1988; heuristic approximation; performs better than BLAST with strongly diverged proteins
- BLAST (BL E): Altschul et al., 1990; heuristic approximation; stretches local alignments (HSPs) into a global alignment; should be faster than FASTA
9
Method parameters
All methods:
- matrix: BLOSUM62
- gap open penalty: 12
- gap extension penalty: 1
Biofacet with z-value: 100 randomizations
10
Receiver Operating Characteristic
- R.O.C.: statistical measure, mostly used in clinical medicine
- Proposed by Gribskov & Robinson (1996) for the analysis of sequence comparison methods
11
ROC 50 Example

Query: d1c75a_ (SCOP family a.3.1.1)

hit #  domain    SCOP classification  e-value
1      d1gcya1   b.71.1.1              0.31
2      d1h32b_   a.3.1.1               0.4
3      d1gks__   a.3.1.1               0.52
4      d1a56__   a.3.1.1               0.52
5      d1kx2a_   a.3.1.1               0.67
6      d1etpa1   a.3.1.4               0.67
7      d1zpda3   c.36.1.9              0.87
8      d1eu1a2   c.81.1.1              0.87
9      d451c__   a.3.1.1               1.1
10     d1flca2   c.23.10.2             1.1
11     d1mdwa_   d.3.1.3               1.1
12     d2dvh__   a.3.1.1               1.5
13     d1shsa_   b.15.1.1              1.5
14     d1mg2d_   a.3.1.1               1.5
15     d1c53__   a.3.1.1               2.4
16     d3c2c__   a.3.1.1               2.4
17     d1bvsa1   a.5.1.1               6.8
18     d1dvva_   a.3.1.1               6.8
19     d1cyi__   a.3.1.1               6.8
20     d1dw0a_   a.3.1.1               6.8
21     d1h0ba_   b.29.1.11             6.8
22     d3pfk__   c.89.1.1              6.8
23     d1kful3   d.3.1.3               6.8
24     d1ixrc1   a.4.5.11              14
25     d1ixsb1   a.4.5.11              14

- Take the 100 best hits
- True positives: in the same SCOP family as the query; false positives: not in the same family
- For each of the first 50 false positives, count the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12)
- Divide the sum of these counts by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC 50 (0.167)
- Take the average of the ROC 50 scores over all entries (see the sketch below)
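A minimal Python sketch of the ROC 50 score as described above. It assumes the hit list is already sorted by E-value and that each hit is labelled true (same SCOP family as the query) or false; the example labels reproduce the 25 hits shown for query d1c75a_. The per-query scores are then averaged over all entries.

```python
def roc50(hit_is_true: list[bool], family_size: int, max_fp: int = 50) -> float:
    """Sum, over the first 50 false positives, of the number of true positives
    ranked above each of them, divided by 50 and by (family size - 1)."""
    true_seen = 0
    false_seen = 0
    total = 0
    for is_true in hit_is_true:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            total += true_seen                   # true positives higher in the list
            if false_seen == max_fp:
                break
    return total / (max_fp * (family_size - 1))

# Labels for the 25 hits shown above (True = same family a.3.1.1 as the query):
labels = [False, True, True, True, True, False, False, False, True, False,
          False, True, False, True, True, True, False, True, True, True,
          False, False, False, False, False]
```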
12
ROC 50 results
13
Coverage vs. Error
- C.V.E. = Coverage Versus Error (Brenner et al., 1998)
- E.P.Q. = errors per query: selectivity indicator (how many false positives?)
- Coverage: sensitivity indicator (how many of all possible true positives are found?)
14
CVE Example

Query: d1c75a_ (a.3.1.1), same hit list as in the ROC 50 example above

- Vary the threshold above which a hit is counted as a positive, e.g. e = 10, e = 1, e = 0.1, e = 0.01
- True positives: in the same SCOP family as the query; false positives: not in the same family
- For each threshold, calculate the coverage: number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors per query: number of false positives divided by the number of queries
- Plot coverage on the x-axis and errors per query on the y-axis; bottom right is best (see the sketch below)
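A minimal Python sketch of the coverage-versus-error calculation described above. It assumes `results` is a list of (e-value, is_true_positive) pairs pooled over all queries; the threshold values are the ones given on the slide.

```python
def cve_points(results, n_queries, n_possible_true, thresholds=(10, 1, 0.1, 0.01)):
    """Return one (coverage, errors-per-query) point per E-value threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for e, is_true in results if e <= t and is_true)
        fp = sum(1 for e, is_true in results if e <= t and not is_true)
        coverage = tp / n_possible_true          # sensitivity: fraction of all true positives found
        epq = fp / n_queries                     # selectivity: false positives per query
        points.append((coverage, epq))
    return points
```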
15
CVE results (for PDB010)
16
Mean Average Precision
- A.P.: borrowed from information retrieval (Salton, 1991)
- Recall: true positives divided by the number of homologs
- Precision: true positives divided by the number of hits
- A.P.: approximate integral that measures the area under the recall-precision curve
17
Mean AP Example

Query: d1c75a_ (a.3.1.1), same hit list as in the ROC 50 example above

- Take the 100 best hits
- True positives: in the same SCOP family as the query; false positives: not in the same family
- For each true positive, divide its rank among the true positives (1,2,3,4,5,6,7,8,9,10,11,12) by its rank in the full hit list (2,3,4,5,9,12,14,15,16,18,19,20)
- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)
- Take the average of the AP scores over all entries = mean AP (see the sketch below)
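A minimal Python sketch of average precision as it is described on this slide: each true positive contributes its rank among the true positives divided by its rank in the full hit list, and the sum is normalised by the number of hits, following the slide's wording (classical information-retrieval AP normalises by the number of true positives instead). The ranks are those of the example query.

```python
def average_precision(true_positive_ranks: list[int], n_hits: int = 100) -> float:
    """Sum of i / (rank of the i-th true positive), divided by the number of hits."""
    total = sum(i / rank for i, rank in enumerate(true_positive_ranks, start=1))
    return total / n_hits

# True-positive ranks of the example query d1c75a_ (hits 2,3,4,5,9,12,14,15,16,18,19,20):
ap = average_precision([2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20])
# The mean AP is the average of these per-query AP values over all entries.
```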
18
Mean AP results
19
Time consumption
PDB095 all-against-all comparison:
- Biofacet: multiple days (Z-value calculation!)
- SSEARCH: 5h49m
- ParAlign: 47m
- FASTA: 40m
- BLAST: 15m
20
Conclusions
- E-value better than Z-value(!)
- The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all
- Use FASTA/BLAST only when time is important
- A larger structural comparison database is needed for better analysis
21
Credits
Peter Groenen
Wilco Fleuren
Jack Leunissen