Testing sequence comparison methods with structure
Tim Hulsen, NV Organon, Oss
Introduction
Main goal: transfer the function of proteins in model organisms to proteins in humans
Make use of “orthology”: proteins in different species that evolved from a common ancestor (very similar function!)
Several ortholog identification methods, relying on:
–Sequence comparisons
–(Phylogenies)
Introduction
Quality of ortholog identification depends on:
1.) Quality of the sequence comparison algorithm:
–Smith-Waterman vs. BLAST, FASTA, etc.
–Z-value vs. E-value
2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.)
2 -> previous research; 1 -> this presentation
Previous research
Comparison of several ortholog identification methods
Orthologs should have similar function, so functional data of orthologs should behave similarly:
–Gene expression data
–Protein interaction data
–InterPro IDs
–Gene order
Orthology method comparison
Compared methods:
–BBH: Best Bidirectional Hit
–INP: InParanoid
–KOG: euKaryotic Orthologous Groups
–MCL: OrthoMCL
–PGT: PhyloGenetic Tree
–Z1H: Z-value > 1 Hundred
Orthology method comparison
e.g. correlation in expression profiles
Affymetrix human (Hs) and mouse (Mm) expression data, using the SNOMED tissue classification
Check if the expression profile of a protein is similar to the expression profile of its ortholog
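A minimal sketch of this similarity check, assuming both profiles are measured over the same ordered set of tissues; the vectors and the 0.6 cutoff below are illustrative assumptions, not the study's actual data or threshold:

```python
import numpy as np

def profiles_correlate(hs_profile, mm_profile, cutoff=0.6):
    """Pearson correlation between the expression profile of a human
    protein and that of its mouse ortholog, over matched tissues."""
    r = np.corrcoef(hs_profile, mm_profile)[0, 1]
    return r, r >= cutoff

# Hypothetical expression values over five matched tissues:
hs = np.array([12.1, 3.4, 8.8, 0.9, 5.5])
mm = np.array([11.0, 2.9, 9.1, 1.2, 6.0])
print(profiles_correlate(hs, mm))
```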
Orthology method comparison
[figure]
e.g. conservation of protein interactions
DIP (Database of Interacting Proteins)
Check if the orthologs (Mm) of two interacting proteins (Hs) are still interacting in the other species -> calculate the fraction of conserved interactions
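A minimal sketch of this fraction, assuming the ortholog mapping and both species' interaction lists are already available; all names below are hypothetical:

```python
def conserved_fraction(hs_interactions, ortholog_of, mm_interactions):
    """Fraction of human (Hs) protein interactions whose orthologs
    also interact in mouse (Mm); interactions are unordered pairs."""
    mm_pairs = {frozenset(p) for p in mm_interactions}
    testable = conserved = 0
    for a, b in hs_interactions:
        if a in ortholog_of and b in ortholog_of:  # both partners have orthologs
            testable += 1
            if frozenset((ortholog_of[a], ortholog_of[b])) in mm_pairs:
                conserved += 1
    return conserved / testable if testable else 0.0
```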
Orthology method comparison
[figure]
Trade-off between sensitivity and selectivity: BBH and INP are the most sensitive, but also the most selective
Results can differ depending on which sequence comparison algorithm is used:
–BLAST, FASTA or Smith-Waterman?
–E-value or Z-value?
E-value or Z-value?
Smith-Waterman with Z-value statistics: 100 randomized shuffles to test the significance of the SW score
O. MFTGQEYHSV
shuffle 1. GQHMSVFTEY
2. YMSHQFTVGE
etc.
The SW scores of the shuffled sequences form a random distribution; if the original score lies e.g. 5 standard deviations above the mean of that distribution, then Z = 5
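A minimal sketch of this statistic, assuming a Smith-Waterman scoring function `sw_score(a, b)` is available (an assumption here, not implemented):

```python
import random
import statistics

def z_value(query, subject, sw_score, n_shuffles=100):
    """Z = (SW(query, subject) - mean(random scores)) / SD(random scores),
    where the random scores come from aligning shuffled copies of the
    query. (Biofacet shuffles both sequences, 2 x 100 randomizations;
    shuffling one side is enough to show the idea.)"""
    original = sw_score(query, subject)
    rnd = []
    for _ in range(n_shuffles):
        shuffled = list(query)
        random.shuffle(shuffled)  # preserves length and amino acid composition
        rnd.append(sw_score("".join(shuffled), subject))
    return (original - statistics.mean(rnd)) / statistics.stdev(rnd)
```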
E-value or Z-value?
Z-value calculation takes much time (2 x 100 randomizations)
Comet et al. (1999) and Bastien et al. (2004): the Z-value should theoretically be more sensitive and more selective than the E-value
But this advantage of the Z-value has never been proven by experimental results
How to compare?
Structural comparison is better than sequence comparison
ASTRAL SCOP: Structural Classification Of Proteins
e.g. a.2.1.3, c.1.2.4; same number ~ same structure
Use structural classification as a benchmark for sequence comparison methods
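In this benchmark two domains count as "same structure" when their full SCOP codes match down to the family level. A minimal sketch, assuming a dict from domain IDs to SCOP sccs codes (the example entries are illustrative):

```python
def same_scop_family(dom_a, dom_b, sccs):
    """True if two ASTRAL SCOP domains share the full
    class.fold.superfamily.family code, e.g. a.2.1.3 == a.2.1.3."""
    return sccs[dom_a] == sccs[dom_b]

# Hypothetical sccs assignments:
sccs = {"d1c75a_": "a.3.1.1", "d1gcya1": "a.3.1.1", "d3pfk__": "c.89.1.1"}
print(same_scop_family("d1c75a_", "d1gcya1", sccs))  # True: same family
```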
ASTRAL SCOP statistics
[table: for each subset from max. 10% to max. 95% sequence identity: members, families, avg. family size, max. family size, # families = 1, # families > 1]
Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
–Paracel with e-value (PC E): SW implementation of Paracel
–Biofacet with z-value (BF Z): SW implementation of Gene-IT
–ParAlign with e-value (PA E): SW implementation of Sencel
–SSEARCH with e-value (SS E): SW implementation of FASTA (see next slide)
Methods (2)
Heuristic algorithms:
–FASTA (FA E): Pearson & Lipman, 1988
Heuristic approximation; performs better than BLAST with strongly diverged proteins
–BLAST (BL E): Altschul et al., 1990
Heuristic approximation; stretches local alignments (HSPs) to a global alignment
Should be faster than FASTA
Method parameters
–all:
matrix: BLOSUM62
gap open penalty: 12
gap extension penalty: 1
–Biofacet with z-value: 100 randomizations
Receiver Operating Characteristic
R.O.C.: a statistical measure, mostly used in clinical medicine
Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
ROC50 example
[table: the 100 best hits for query d1c75a_, each with SCOP domain ID, % identity and e-value]
Take the 100 best hits
–True positives: in the same SCOP family; false positives: not in the same family
–For each of the first 50 false positives: calculate the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12,...)
–Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (size of family - 1) = ROC50 (0.167)
–Take the average of the ROC50 scores for all entries
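A sketch of the ROC50 computation described above, assuming the hit list is already ranked by score and each hit is flagged as being in the query's SCOP family or not:

```python
def roc50(ranked_is_positive, family_size, n_fp=50):
    """ROC50 for one query: for each of the first 50 false positives,
    count the true positives ranked above it, then divide the sum of
    these counts by 50 and by the number of possible true positives
    (family size - 1)."""
    tp_counts, tp, fp = [], 0, 0
    for is_pos in ranked_is_positive:
        if is_pos:
            tp += 1
        else:
            fp += 1
            tp_counts.append(tp)  # true positives above this false positive
            if fp == n_fp:
                break
    # Assumption: if the list holds fewer than 50 false positives, the
    # remaining ones are taken to sit below all retrieved true positives.
    tp_counts += [tp] * (n_fp - fp)
    return sum(tp_counts) / (n_fp * (family_size - 1))
```

The overall score is then the average of `roc50` over all query entries.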
ROC50 results
[figure]
Coverage vs. Error
C.V.E. = Coverage vs. Error (Brenner et al., 1998)
E.P.Q. = errors per query, a selectivity indicator (how many false positives?)
Coverage = a sensitivity indicator (how many true positives out of the total?)
CVE example
[table: the same 100-hit list for query d1c75a_ as in the ROC50 example]
Vary the threshold above which a hit is seen as a positive: e.g. e = 10, e = 1, e = 0.1, e = ...
–True positives: in the same SCOP family; false positives: not in the same family
–For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives
–For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries
–Plot coverage on the x-axis and errors-per-query on the y-axis; bottom right is best
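A sketch of the CVE points, assuming all hits of all queries are pooled as (e-value, true-positive flag) pairs; the default thresholds mirror the ones named on the slide:

```python
def cve_points(hits, n_queries, n_possible_tp, thresholds=(10, 1, 0.1)):
    """One (coverage, errors-per-query) point per e-value threshold:
    coverage = true positives / all possible true positives,
    EPQ      = false positives / number of queries.
    `hits` pools (e_value, is_true_positive) pairs over all queries."""
    points = []
    for t in thresholds:
        tp = sum(1 for e, pos in hits if e <= t and pos)
        fp = sum(1 for e, pos in hits if e <= t and not pos)
        points.append((tp / n_possible_tp, fp / n_queries))
    return points
```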
CVE results (only PDB095)
[figure]
Mean Average Precision
A.P.: borrowed from information retrieval (Salton, 1991)
Recall: true positives divided by the number of homologs
Precision: true positives divided by the number of hits
A.P.: an approximate integral that calculates the area under the recall-precision curve
Mean AP example
[table: the same 100-hit list for query d1c75a_ as in the ROC50 example]
Take the 100 best hits
–True positives: in the same SCOP family; false positives: not in the same family
–For each of the true positives: divide the true-positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the overall rank (2,3,4,5,9,12,14,15,16,18,19,20)
–Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)
–Take the average of the AP scores for all entries = mean AP
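A sketch of the AP computation as the slide describes it (note that the slide normalizes by the total number of hits); the mean AP is then the average of this value over all queries:

```python
def average_precision(ranked_is_positive):
    """AP per the slide: at each true positive, take the precision at
    that point (true-positive rank / overall rank); divide the sum of
    these precisions by the total number of hits in the list."""
    tp = 0
    precision_sum = 0.0
    for rank, is_pos in enumerate(ranked_is_positive, start=1):
        if is_pos:
            tp += 1
            precision_sum += tp / rank
    return precision_sum / len(ranked_is_positive)
```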
Mean AP results
[figure]
Time consumption
PDB095 all-against-all comparison:
–Biofacet: multiple days (Z-value calculation!)
–BLAST: 2d 4h 16m
–SSEARCH: 5h 49m
–ParAlign: 47m
–FASTA: 40m
Preliminary conclusions
SSEARCH gives the best results
When time is important, FASTA is a good alternative
The Z-value seems to have no advantage over the E-value
Problems
Bias in PDB?
–Sequence length
–Amino acid composition
Difference in matrices?
Difference in SW implementations?
Bias in PDB sequence length?
Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets
Bias in PDB aa distribution?
No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets
Difference in matrices?
[figure]
Difference in SW implementations?
[figure]
Conclusions
E-value better than Z-value!
SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all
A larger structural comparison database is needed for better analysis
Credits
NV Organon:
–Peter Groenen
–Wilco Fleuren
Wageningen UR:
–Jack Leunissen