1
Testing sequence comparison methods with structure similarity @ Organon, Oss 2006-02-07 Tim Hulsen
2
Introduction
Main goal: transfer function of proteins in model organisms to proteins in humans
Make use of "orthology": proteins evolved from a common ancestor in different species (very similar function!)
Several ortholog identification methods, relying on:
– Sequence comparisons
– (Phylogenies)
3
Introduction
Quality of ortholog identification depends on:
1.) Quality of the sequence comparison algorithm:
- Smith-Waterman vs. BLAST, FASTA, etc.
- Z-value vs. E-value
2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.)
Point 2 was the subject of previous research; point 1 is the subject of this presentation
4
Previous research
Comparison of several ortholog identification methods
Orthologs should have similar function
Functional data of orthologs should behave similarly:
– Gene expression data
– Protein interaction data
– InterPro IDs
– Gene order
5
Orthology method comparison
Compared methods:
– BBH, Best Bidirectional Hit
– INP, InParanoid
– KOG, euKaryotic Orthologous Groups
– MCL, OrthoMCL
– PGT, PhyloGenetic Tree
– Z1H, Z-value > 1 Hundred
6
Orthology method comparison
e.g. correlation in expression profiles
Affymetrix human and mouse expression data, using the SNOMED tissue classification
Check if the expression profile of a protein is similar to the expression profile of its ortholog (human, Hs, vs. mouse, Mm)
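As a small illustration of this check, a minimal sketch (with made-up expression values; `statistics.correlation` needs Python 3.10+) correlating the tissue profile of a human protein with that of its mouse ortholog:

```python
# Hypothetical expression values over the same ordered set of SNOMED tissues;
# a high Pearson correlation suggests the ortholog pair behaves similarly.
from statistics import correlation

human_profile = [5.2, 8.1, 0.4, 3.3, 7.9]   # made-up values, human (Hs) protein
mouse_profile = [4.8, 7.5, 0.9, 2.9, 8.2]   # made-up values, mouse (Mm) ortholog

r = correlation(human_profile, mouse_profile)
print(f"expression correlation Hs vs. Mm: {r:.2f}")
```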
7
Orthology method comparison
8
e.g. conservation of protein interaction
DIP (Database of Interacting Proteins)
Check if the orthologs of two interacting proteins are still interacting in the other species (human, Hs, vs. mouse, Mm) -> calculate the fraction of conserved interactions
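A minimal sketch of this fraction, assuming hypothetical inputs (interaction sets as pairs of protein IDs, plus an ortholog mapping); this is not the actual pipeline used in the previous research:

```python
# interactions_hs / interactions_mm: sets of frozenset pairs of interacting proteins (e.g. from DIP);
# ortholog: maps each human protein ID to its mouse ortholog ID.
def conserved_fraction(interactions_hs, interactions_mm, ortholog):
    mappable = [pair for pair in interactions_hs if all(p in ortholog for p in pair)]
    conserved = [pair for pair in mappable
                 if frozenset(ortholog[p] for p in pair) in interactions_mm]
    return len(conserved) / len(mappable) if mappable else 0.0

hs = {frozenset({"HsA", "HsB"}), frozenset({"HsA", "HsC"})}   # toy data
mm = {frozenset({"MmA", "MmB"})}
orth = {"HsA": "MmA", "HsB": "MmB", "HsC": "MmC"}
print(conserved_fraction(hs, mm, orth))                       # 0.5: one of two mappable interactions conserved
```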
9
Orthology method comparison
11
Trade-off between sensitivity and selectivity
BBH and INP are most sensitive but also most selective
Results can differ depending on which sequence comparison algorithm is used:
- BLAST, FASTA, Smith-Waterman?
- E-value or Z-value?
12
E-value or Z-value?
Smith-Waterman with Z-value statistics: 100 randomized shuffles to test the significance of the SW score
Original: MFTGQEYHSV
Shuffle 1: GQHMSVFTEY
Shuffle 2: YMSHQFTVGE
etc.
(figure: histogram of SW scores, number of sequences vs. score, for the randomized sequences; the original score lies 5 standard deviations above the random mean, so Z = 5)
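A minimal Python sketch of the Z-value idea, using Biopython's PairwiseAligner as a stand-in for the SW implementations discussed later; the sequences are made up, and only the query is shuffled here, whereas the Biofacet procedure uses 2x100 randomizations:

```python
import random
from statistics import mean, stdev
from Bio.Align import PairwiseAligner, substitution_matrices

aligner = PairwiseAligner()
aligner.mode = "local"                                        # Smith-Waterman-style local alignment
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -12                                  # gap open penalty 12
aligner.extend_gap_score = -1                                 # gap extension penalty 1

def z_value(query, target, n_shuffles=100):
    """Z = (original SW score - mean of shuffled scores) / SD of shuffled scores."""
    original = aligner.score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(query)
        random.shuffle(letters)                               # randomize order, keep composition
        shuffled_scores.append(aligner.score("".join(letters), target))
    return (original - mean(shuffled_scores)) / stdev(shuffled_scores)

print(z_value("MFTGQEYHSV", "MFTGQDYHSVAK"))                  # toy example
```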
13
E-value or Z-value?
Z-value calculation takes a lot of time (2x100 randomizations)
Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value?
The advantage of the Z-value has never been proven by experimental results
14
How to compare?
Structural comparison is better than sequence comparison
ASTRAL SCOP: Structural Classification Of Proteins
e.g. a.2.1.3, c.1.2.4; same number ~ same structure
Use structural classification as benchmark for sequence comparison methods
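As a small illustration of how this benchmark works, the sketch below compares two SCOP classification strings (class.fold.superfamily.family, e.g. a.3.1.1) at the family and superfamily level; the helper names are illustrative, not part of ASTRAL SCOP:

```python
# Two domains are treated as true positives when they share the same SCOP family,
# i.e. all four fields of the classification string are equal.
def same_scop_family(a: str, b: str) -> bool:
    return a.split(".")[:4] == b.split(".")[:4]

def same_scop_superfamily(a: str, b: str) -> bool:
    return a.split(".")[:3] == b.split(".")[:3]

assert same_scop_family("a.3.1.1", "a.3.1.1")
assert not same_scop_family("a.3.1.1", "a.3.1.10")        # different family
assert same_scop_superfamily("a.3.1.1", "a.3.1.10")       # same superfamily a.3.1
```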
15
ASTRAL SCOP statistics

max. % identity | members | families | avg. fam. size | max. fam. size | families = 1 | families > 1
10%             |    3631 |     2250 |          1.614 |             25 |         1655 |          595
20%             |    3968 |     2297 |          1.727 |             29 |         1605 |          692
25%             |    4357 |     2313 |          1.884 |             32 |         1530 |          783
30%             |    4821 |     2320 |          2.078 |             39 |         1435 |          885
35%             |    5301 |     2322 |          2.283 |             46 |         1333 |          989
40%             |    5674 |     2322 |          2.444 |             47 |         1269 |         1053
50%             |    6442 |     2324 |          2.772 |             50 |         1178 |         1146
70%             |    7551 |     2325 |          3.248 |            127 |         1087 |         1238
90%             |    8759 |     2326 |          3.766 |            405 |         1023 |         1303
95%             |    9498 |     2326 |          4.083 |            479 |          977 |         1349
16
Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
– Paracel with e-value (PA E): SW implementation of Paracel
– Biofacet with z-value (BF Z): SW implementation of Gene-IT
– ParAlign with e-value (PA E): SW implementation of Sencel
– SSEARCH with e-value (SS E): SW implementation of FASTA (see next page)
17
Methods (2)
Heuristic algorithms:
– FASTA (FA E): Pearson & Lipman, 1988; heuristic approximation; performs better than BLAST with strongly diverged proteins
– BLAST (BL E): Altschul et al., 1990; heuristic approximation; stretches local alignments (HSPs) to a global alignment; should be faster than FASTA
18
Method parameters
- all:
  - matrix: BLOSUM62
  - gap open penalty: 12
  - gap extension penalty: 1
- Biofacet with z-value: 100 randomizations
19
Receiver Operating Characteristic
R.O.C.: statistical measure, mostly used in clinical medicine
Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
20
ROC 50 Example
Query: d1c75a_ (SCOP family a.3.1.1); 25 best hits shown, with SCOP classification and e-value:

hit #  domain   SCOP class  e-value
 1     d1gcya1  b.71.1.1    0.31
 2     d1h32b_  a.3.1.1     0.4
 3     d1gks__  a.3.1.1     0.52
 4     d1a56__  a.3.1.1     0.52
 5     d1kx2a_  a.3.1.1     0.67
 6     d1etpa1  a.3.1.4     0.67
 7     d1zpda3  c.36.1.9    0.87
 8     d1eu1a2  c.81.1.1    0.87
 9     d451c__  a.3.1.1     1.1
10     d1flca2  c.23.10.2   1.1
11     d1mdwa_  d.3.1.3     1.1
12     d2dvh__  a.3.1.1     1.5
13     d1shsa_  b.15.1.1    1.5
14     d1mg2d_  a.3.1.1     1.5
15     d1c53__  a.3.1.1     2.4
16     d3c2c__  a.3.1.1     2.4
17     d1bvsa1  a.5.1.1     6.8
18     d1dvva_  a.3.1.1     6.8
19     d1cyi__  a.3.1.1     6.8
20     d1dw0a_  a.3.1.1     6.8
21     d1h0ba_  b.29.1.11   6.8
22     d3pfk__  c.89.1.1    6.8
23     d1kful3  d.3.1.3     6.8
24     d1ixrc1  a.4.5.11    14
25     d1ixsb1  a.4.5.11    14

- Take the 100 best hits
- True positives: in the same SCOP family; false positives: not in the same family
- For each of the first 50 false positives: count the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12)
- Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC 50 (0.167)
- Take the average of the ROC 50 scores for all entries
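A small Python sketch of this ROC 50 procedure, assuming the hit list has already been reduced to SCOP family strings and that the query's family size is known (names are illustrative, not from the software used for the results):

```python
def roc50(query_family, hit_families, family_size, n_fp=50):
    tp_seen = 0
    tp_above_fp = []                      # TPs ranked above each of the first 50 FPs
    for family in hit_families[:100]:     # take the 100 best hits
        if family == query_family:
            tp_seen += 1                  # true positive: same SCOP family as the query
        elif len(tp_above_fp) < n_fp:
            tp_above_fp.append(tp_seen)   # false positive: record TPs seen so far
    # divide by the number of false positives (50) and by the number of
    # possible true positives (family size - 1)
    return sum(tp_above_fp) / (n_fp * (family_size - 1))

# The reported ROC 50 score is the average of this value over all query entries.
```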
21
ROC 50 results
22
Coverage vs. Error
C.V.E. = Coverage vs. Error (Brenner et al., 1998)
E.P.Q. = errors per query, a selectivity indicator (how many false positives?)
Coverage = sensitivity indicator (how many true positives out of the total?)
23
CVE Example
Query: d1c75a_ (SCOP family a.3.1.1); same hit list as in the ROC 50 example (25 best hits with SCOP classification and e-value)

- Vary the threshold above which a hit is seen as a positive: e.g. e=10, e=1, e=0.1, e=0.01
- True positives: in the same SCOP family; false positives: not in the same family
- For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries
- Plot coverage on the x-axis and errors-per-query on the y-axis; bottom right is best
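A sketch of the coverage-versus-error calculation under the same assumptions (hypothetical input structures, not the code behind the presented results):

```python
def coverage_vs_error(hits_per_query, total_tp, thresholds=(10, 1, 0.1, 0.01)):
    # hits_per_query: query -> list of (e_value, same_family) pairs;
    # total_tp: total number of possible true positive pairs in the benchmark.
    points = []
    for t in thresholds:                       # vary the positive/negative threshold
        tp = fp = 0
        for hits in hits_per_query.values():
            for e_value, same_family in hits:
                if e_value <= t:
                    if same_family:
                        tp += 1                # true positive: same SCOP family
                    else:
                        fp += 1                # false positive: different family
        coverage = tp / total_tp               # sensitivity: fraction of homologs found
        epq = fp / len(hits_per_query)         # errors per query: selectivity
        points.append((coverage, epq))
    return points                              # plot coverage (x) vs. EPQ (y)
```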
24
CVE results (only PDB095)
25
Mean Average Precision
A.P.: borrowed from information retrieval (Salton, 1991)
Recall: true positives divided by the number of homologs
Precision: true positives divided by the number of hits
A.P. = approximate integral to calculate the area under the recall-precision curve
26
Mean AP Example
Query: d1c75a_ (SCOP family a.3.1.1); same hit list as in the ROC 50 example (25 best hits with SCOP classification and e-value)

- Take the 100 best hits
- True positives: in the same SCOP family; false positives: not in the same family
- For each of the true positives: divide the true positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the overall hit rank (2,3,4,5,9,12,14,15,16,18,19,20)
- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)
- Take the average of the AP scores for all entries = mean AP
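A sketch of the average precision calculation as described above (hypothetical input: the ordered true/false flags of one query's 100 best hits):

```python
def average_precision(is_tp):
    tp_rank = 0
    ratios = []
    for rank, hit_is_tp in enumerate(is_tp, start=1):
        if hit_is_tp:
            tp_rank += 1
            ratios.append(tp_rank / rank)   # true positive rank divided by overall rank
    return sum(ratios) / len(is_tp)         # divided by the total number of hits (100)

# The mean AP is the average of these AP scores over all query entries.
```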
27
Mean AP results
28
Time consumption
PDB095 all-against-all comparison:
– Biofacet: multiple days (Z-value calculation!)
– BLAST: 2d 4h 16m
– SSEARCH: 5h 49m
– ParAlign: 47m
– FASTA: 40m
29
Preliminary conclusions
SSEARCH gives best results
When time is important, FASTA is a good alternative
Z-value seems to have no advantage over E-value
30
Problems
Bias in PDB?
– Sequence length
– Amino acid composition
Difference in matrices?
Difference in SW implementations?
31
Bias in PDB sequence length? Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets
32
Bias in PDB aa distribution? No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets
33
Difference in matrices?
34
Difference in SW implementations?
35
Conclusions
E-value better than Z-value!
SW implementations are (more or less) the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all
A larger structural comparison database is needed for better analysis
36
Credits
NV Organon:
– Peter Groenen
– Wilco Fleuren
Wageningen UR:
– Jack Leunissen