1
Testing sequence comparison methods with structure similarity @ Organon, Oss 2006-02-07 Tim Hulsen
2
Introduction
Main goal: transfer function of proteins in model organisms to proteins in humans
Make use of "orthology": proteins evolved from a common ancestor in different species (very similar function!)
Several ortholog identification methods, relying on:
– Sequence comparisons
– (Phylogenies)
3
Introduction
Quality of ortholog identification depends on:
1.) Quality of the sequence comparison algorithm:
- Smith-Waterman vs. BLAST, FASTA, etc.
- Z-value vs. E-value
2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.)
Point 2 was the subject of previous research; point 1 is the subject of this presentation
4
Previous research
Comparison of several ortholog identification methods
Orthologs should have similar function
Functional data of orthologs should behave similarly:
– Gene expression data
– Protein interaction data
– InterPro IDs
– Gene order
5
Orthology method comparison
Compared methods:
– BBH, Best Bidirectional Hit
– INP, InParanoid
– KOG, euKaryotic Orthologous Groups
– MCL, OrthoMCL
– PGT, PhyloGenetic Tree
– Z1H, Z-value > 1 Hundred
6
Orthology method comparison
e.g. correlation in expression profiles
Affymetrix human and mouse expression data, using the SNOMED tissue classification
Check if the expression profile of a protein is similar to the expression profile of its ortholog (human, Hs, vs. mouse, Mm)
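As a small illustration of this check, a minimal sketch (with made-up expression values; `statistics.correlation` needs Python 3.10+) correlating the tissue profile of a human protein with that of its mouse ortholog:

```python
# Hypothetical expression values over the same ordered set of SNOMED tissues;
# a high Pearson correlation suggests the ortholog pair behaves similarly.
from statistics import correlation

human_profile = [5.2, 8.1, 0.4, 3.3, 7.9]   # made-up values, human (Hs) protein
mouse_profile = [4.8, 7.5, 0.9, 2.9, 8.2]   # made-up values, mouse (Mm) ortholog

r = correlation(human_profile, mouse_profile)
print(f"expression correlation Hs vs. Mm: {r:.2f}")
```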
7
Orthology method comparison
8
e.g. conservation of protein interaction
DIP (Database of Interacting Proteins)
Check if the orthologs of two interacting proteins are still interacting in the other species (human, Hs, vs. mouse, Mm) -> calculate the fraction of conserved interactions
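A minimal sketch of this fraction, assuming hypothetical inputs (interaction sets as pairs of protein IDs, plus an ortholog mapping); this is not the actual pipeline used in the previous research:

```python
# interactions_hs / interactions_mm: sets of frozenset pairs of interacting proteins (e.g. from DIP);
# ortholog: maps each human protein ID to its mouse ortholog ID.
def conserved_fraction(interactions_hs, interactions_mm, ortholog):
    mappable = [pair for pair in interactions_hs if all(p in ortholog for p in pair)]
    conserved = [pair for pair in mappable
                 if frozenset(ortholog[p] for p in pair) in interactions_mm]
    return len(conserved) / len(mappable) if mappable else 0.0

hs = {frozenset({"HsA", "HsB"}), frozenset({"HsA", "HsC"})}   # toy data
mm = {frozenset({"MmA", "MmB"})}
orth = {"HsA": "MmA", "HsB": "MmB", "HsC": "MmC"}
print(conserved_fraction(hs, mm, orth))                       # 0.5: one of two mappable interactions conserved
```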
9
Orthology method comparison
11
Trade-off between sensitivity and selectivity
BBH and INP are most sensitive but also most selective
Results can differ depending on which sequence comparison algorithm is used:
- BLAST, FASTA, Smith-Waterman?
- E-value or Z-value?
12
E-value or Z-value?
Smith-Waterman with Z-value statistics: 100 randomized shuffles to test the significance of the SW score
Original: MFTGQEYHSV
Shuffle 1: GQHMSVFTEY
Shuffle 2: YMSHQFTVGE
etc.
(figure: histogram of SW scores, number of sequences vs. score, for the randomized sequences; the original score lies 5 standard deviations above the random mean, so Z = 5)
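A minimal Python sketch of the Z-value idea, using Biopython's PairwiseAligner as a stand-in for the SW implementations discussed later; the sequences are made up, and only the query is shuffled here, whereas the Biofacet procedure uses 2x100 randomizations:

```python
import random
from statistics import mean, stdev
from Bio.Align import PairwiseAligner, substitution_matrices

aligner = PairwiseAligner()
aligner.mode = "local"                                        # Smith-Waterman-style local alignment
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -12                                  # gap open penalty 12
aligner.extend_gap_score = -1                                 # gap extension penalty 1

def z_value(query, target, n_shuffles=100):
    """Z = (original SW score - mean of shuffled scores) / SD of shuffled scores."""
    original = aligner.score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(query)
        random.shuffle(letters)                               # randomize order, keep composition
        shuffled_scores.append(aligner.score("".join(letters), target))
    return (original - mean(shuffled_scores)) / stdev(shuffled_scores)

print(z_value("MFTGQEYHSV", "MFTGQDYHSVAK"))                  # toy example
```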
13
E-value or Z-value?
Z-value calculation takes a lot of time (2x100 randomizations)
Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value?
The advantage of the Z-value has never been proven by experimental results
14
How to compare?
Structural comparison is better than sequence comparison
ASTRAL SCOP: Structural Classification Of Proteins
e.g. a.2.1.3, c.1.2.4; same number ~ same structure
Use structural classification as benchmark for sequence comparison methods
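As a small illustration of how this benchmark works, the sketch below compares two SCOP classification strings (class.fold.superfamily.family, e.g. a.3.1.1) at the family and superfamily level; the helper names are illustrative, not part of ASTRAL SCOP:

```python
# Two domains are treated as true positives when they share the same SCOP family,
# i.e. all four fields of the classification string are equal.
def same_scop_family(a: str, b: str) -> bool:
    return a.split(".")[:4] == b.split(".")[:4]

def same_scop_superfamily(a: str, b: str) -> bool:
    return a.split(".")[:3] == b.split(".")[:3]

assert same_scop_family("a.3.1.1", "a.3.1.1")
assert not same_scop_family("a.3.1.1", "a.3.1.10")        # different family
assert same_scop_superfamily("a.3.1.1", "a.3.1.10")       # same superfamily a.3.1
```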
15
ASTRAL SCOP statistics

max. % identity | members | families | avg. fam. size | max. fam. size | families = 1 | families > 1
10%             |    3631 |     2250 |          1.614 |             25 |         1655 |          595
20%             |    3968 |     2297 |          1.727 |             29 |         1605 |          692
25%             |    4357 |     2313 |          1.884 |             32 |         1530 |          783
30%             |    4821 |     2320 |          2.078 |             39 |         1435 |          885
35%             |    5301 |     2322 |          2.283 |             46 |         1333 |          989
40%             |    5674 |     2322 |          2.444 |             47 |         1269 |         1053
50%             |    6442 |     2324 |          2.772 |             50 |         1178 |         1146
70%             |    7551 |     2325 |          3.248 |            127 |         1087 |         1238
90%             |    8759 |     2326 |          3.766 |            405 |         1023 |         1303
95%             |    9498 |     2326 |          4.083 |            479 |          977 |         1349
16
Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
– Paracel with e-value (PA E): SW implementation of Paracel
– Biofacet with z-value (BF Z): SW implementation of Gene-IT
– ParAlign with e-value (PA E): SW implementation of Sencel
– SSEARCH with e-value (SS E): SW implementation of FASTA (see next page)
17
Methods (2)
Heuristic algorithms:
– FASTA (FA E): Pearson & Lipman, 1988; heuristic approximation; performs better than BLAST with strongly diverged proteins
– BLAST (BL E): Altschul et al., 1990; heuristic approximation; stretches local alignments (HSPs) to a global alignment; should be faster than FASTA
18
Method parameters
- all:
  - matrix: BLOSUM62
  - gap open penalty: 12
  - gap extension penalty: 1
- Biofacet with z-value: 100 randomizations
19
Receiver Operating Characteristic
R.O.C.: statistical measure, mostly used in clinical medicine
Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
20
ROC 50 Example
Query: d1c75a_ (SCOP family a.3.1.1); 25 best hits shown, with SCOP classification and e-value:

hit #  domain   SCOP class  e-value
 1     d1gcya1  b.71.1.1    0.31
 2     d1h32b_  a.3.1.1     0.4
 3     d1gks__  a.3.1.1     0.52
 4     d1a56__  a.3.1.1     0.52
 5     d1kx2a_  a.3.1.1     0.67
 6     d1etpa1  a.3.1.4     0.67
 7     d1zpda3  c.36.1.9    0.87
 8     d1eu1a2  c.81.1.1    0.87
 9     d451c__  a.3.1.1     1.1
10     d1flca2  c.23.10.2   1.1
11     d1mdwa_  d.3.1.3     1.1
12     d2dvh__  a.3.1.1     1.5
13     d1shsa_  b.15.1.1    1.5
14     d1mg2d_  a.3.1.1     1.5
15     d1c53__  a.3.1.1     2.4
16     d3c2c__  a.3.1.1     2.4
17     d1bvsa1  a.5.1.1     6.8
18     d1dvva_  a.3.1.1     6.8
19     d1cyi__  a.3.1.1     6.8
20     d1dw0a_  a.3.1.1     6.8
21     d1h0ba_  b.29.1.11   6.8
22     d3pfk__  c.89.1.1    6.8
23     d1kful3  d.3.1.3     6.8
24     d1ixrc1  a.4.5.11    14
25     d1ixsb1  a.4.5.11    14

- Take the 100 best hits
- True positives: in the same SCOP family; false positives: not in the same family
- For each of the first 50 false positives: count the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12)
- Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC 50 (0.167)
- Take the average of the ROC 50 scores for all entries
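A small Python sketch of this ROC 50 procedure, assuming the hit list has already been reduced to SCOP family strings and that the query's family size is known (names are illustrative, not from the software used for the results):

```python
def roc50(query_family, hit_families, family_size, n_fp=50):
    tp_seen = 0
    tp_above_fp = []                      # TPs ranked above each of the first 50 FPs
    for family in hit_families[:100]:     # take the 100 best hits
        if family == query_family:
            tp_seen += 1                  # true positive: same SCOP family as the query
        elif len(tp_above_fp) < n_fp:
            tp_above_fp.append(tp_seen)   # false positive: record TPs seen so far
    # divide by the number of false positives (50) and by the number of
    # possible true positives (family size - 1)
    return sum(tp_above_fp) / (n_fp * (family_size - 1))

# The reported ROC 50 score is the average of this value over all query entries.
```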
21
ROC 50 results
22
Coverage vs. Error
C.V.E. = Coverage vs. Error (Brenner et al., 1998)
E.P.Q. = errors per query, a selectivity indicator (how many false positives?)
Coverage = sensitivity indicator (how many true positives out of the total?)
23
CVE Example
Query: d1c75a_ (SCOP family a.3.1.1); same hit list as in the ROC 50 example (25 best hits with SCOP classification and e-value)

- Vary the threshold above which a hit is seen as a positive: e.g. e=10, e=1, e=0.1, e=0.01
- True positives: in the same SCOP family; false positives: not in the same family
- For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries
- Plot coverage on the x-axis and errors-per-query on the y-axis; bottom right is best
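A sketch of the coverage-versus-error calculation under the same assumptions (hypothetical input structures, not the code behind the presented results):

```python
def coverage_vs_error(hits_per_query, total_tp, thresholds=(10, 1, 0.1, 0.01)):
    # hits_per_query: query -> list of (e_value, same_family) pairs;
    # total_tp: total number of possible true positive pairs in the benchmark.
    points = []
    for t in thresholds:                       # vary the positive/negative threshold
        tp = fp = 0
        for hits in hits_per_query.values():
            for e_value, same_family in hits:
                if e_value <= t:
                    if same_family:
                        tp += 1                # true positive: same SCOP family
                    else:
                        fp += 1                # false positive: different family
        coverage = tp / total_tp               # sensitivity: fraction of homologs found
        epq = fp / len(hits_per_query)         # errors per query: selectivity
        points.append((coverage, epq))
    return points                              # plot coverage (x) vs. EPQ (y)
```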
24
CVE results (only PDB095)
25
Mean Average Precision
A.P.: borrowed from information retrieval (Salton, 1991)
Recall: true positives divided by the number of homologs
Precision: true positives divided by the number of hits
A.P. = approximate integral to calculate the area under the recall-precision curve
26
Mean AP Example
Query: d1c75a_ (SCOP family a.3.1.1); same hit list as in the ROC 50 example (25 best hits with SCOP classification and e-value)

- Take the 100 best hits
- True positives: in the same SCOP family; false positives: not in the same family
- For each of the true positives: divide the true positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the overall hit rank (2,3,4,5,9,12,14,15,16,18,19,20)
- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)
- Take the average of the AP scores for all entries = mean AP
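A sketch of the average precision calculation as described above (hypothetical input: the ordered true/false flags of one query's 100 best hits):

```python
def average_precision(is_tp):
    tp_rank = 0
    ratios = []
    for rank, hit_is_tp in enumerate(is_tp, start=1):
        if hit_is_tp:
            tp_rank += 1
            ratios.append(tp_rank / rank)   # true positive rank divided by overall rank
    return sum(ratios) / len(is_tp)         # divided by the total number of hits (100)

# The mean AP is the average of these AP scores over all query entries.
```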
27
Mean AP results
28
Time consumption
PDB095 all-against-all comparison:
– Biofacet: multiple days (Z-value calculation!)
– BLAST: 2d 4h 16m
– SSEARCH: 5h 49m
– ParAlign: 47m
– FASTA: 40m
29
Preliminary conclusions
SSEARCH gives best results
When time is important, FASTA is a good alternative
Z-value seems to have no advantage over E-value
30
Problems
Bias in PDB?
– Sequence length
– Amino acid composition
Difference in matrices?
Difference in SW implementations?
31
Bias in PDB sequence length? Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets
32
Bias in PDB aa distribution? No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets
33
Difference in matrices?
34
Difference in SW implementations?
35
Conclusions
E-value better than Z-value!
SW implementations are (more or less) the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all
A larger structural comparison database is needed for better analysis
36
Credits
NV Organon:
– Peter Groenen
– Wilco Fleuren
Wageningen UR:
– Jack Leunissen