Testing sequence comparison methods with structure Organon, Oss 2006-02-07 Tim Hulsen.


Introduction
- Main goal: transfer function of proteins in model organisms to proteins in humans
- Make use of “orthology”: proteins evolved from common ancestor in different species (very similar function!)
- Several ortholog identification methods, relying on:
  – Sequence comparisons
  – (Phylogenies)

Introduction
Quality of ortholog identification depends on:
1.) Quality of sequence comparison algorithm:
  - Smith-Waterman vs. BLAST, FASTA, etc.
  - Z-value vs. E-value
2.) Quality of ortholog identification itself (phylogenies, clustering, etc.)
2 -> previous research
1 -> this presentation

Previous research
- Comparison of several ortholog identification methods
- Orthologs should have similar function
- Functional data of orthologs should behave similarly:
  – Gene expression data
  – Protein interaction data
  – InterPro IDs
  – Gene order

Orthology method comparison
Compared methods:
  – BBH, Best Bidirectional Hit
  – INP, InParanoid
  – KOG, euKaryotic Orthologous Groups
  – MCL, OrthoMCL
  – PGT, PhyloGenetic Tree
  – Z1H, Z-value > 1 Hundred

Orthology method comparison
e.g. correlation in expression profiles
- Affymetrix human and mouse expression data, using SNOMED tissue classification
- Check if the expression profile of a protein is similar to the expression profile of its ortholog
[Figure: human (Hs) and mouse (Mm) expression profiles]

Orthology method comparison

e.g. conservation of protein interaction
- DIP (Database of Interacting Proteins)
- Check if the orthologs of two interacting proteins are still interacting in the other species -> calculate fraction
[Figure: interacting protein pairs in human (Hs) and mouse (Mm)]
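The "calculate fraction" step can be sketched as follows. This is only an illustration: the function name, the toy protein identifiers, and the choice to skip pairs that lack an ortholog for either partner are assumptions, not details given on the slide.

```python
def conserved_interaction_fraction(interactions_a, interactions_b, orthologs):
    """Fraction of species-A interactions whose ortholog pair also interacts in species B."""
    # Store species-B interactions as unordered pairs so (x, y) matches (y, x).
    known_b = {frozenset(pair) for pair in interactions_b}
    mapped = 0
    conserved = 0
    for p, q in interactions_a:
        # Assumption: pairs without an ortholog for both partners are skipped.
        if p not in orthologs or q not in orthologs:
            continue
        mapped += 1
        if frozenset((orthologs[p], orthologs[q])) in known_b:
            conserved += 1
    return conserved / mapped if mapped else 0.0


# Hypothetical toy data: three human interactions, two mouse interactions.
human = [("hsA", "hsB"), ("hsA", "hsC"), ("hsC", "hsD")]
mouse = [("mmB", "mmA"), ("mmC", "mmD")]
orth = {"hsA": "mmA", "hsB": "mmB", "hsC": "mmC"}
fraction = conserved_interaction_fraction(human, mouse, orth)
```

In this toy case two human pairs can be mapped to mouse and one of them is found among the mouse interactions.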

Orthology method comparison

- Trade-off between sensitivity and selectivity
- BBH and INP are most sensitive but also most selective
- Results can differ depending on what sequence comparison algorithm is used:
  - BLAST, FASTA, Smith-Waterman?
  - E-value or Z-value?

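E-value or Z-value?
- Smith-Waterman with Z-value statistics: 100 randomized shuffles to test significance of SW score
  O. MFTGQEYHSV
  shuffle:
  1. GQHMSVFTEY
  2. YMSHQFTVGE
  etc.
[Figure: histogram of SW scores (x-axis: SW score, y-axis: # seqs) for the randomized sequences (rnd); the original score (ori) lies 5*SD above the mean of the randomized scores, so Z = 5]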
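A minimal sketch of the shuffle-based Z-value follows. It is not the Biofacet implementation: the toy scoring scheme (match 2, mismatch -1, linear gap -2) replaces BLOSUM62 with affine gaps, only one of the two sequences is shuffled, and the function names are illustrative.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a toy scoring scheme (assumption)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            val = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(val)
            best = max(best, val)
        prev = cur
    return best


def z_value(query, target, n_shuffles=100, seed=0):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    original = sw_score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(target)
        rng.shuffle(residues)  # keeps length and composition, destroys order
        shuffled_scores.append(sw_score(query, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        return float("inf") if original > mean else 0.0
    return (original - mean) / sd


z = z_value("MFTGQEYHSV", "MFTGQEYHSV")
```

Each Z-value costs one alignment per shuffle on top of the original alignment, which is why the calculation is slow compared with an analytical E-value.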

E-value or Z-value?
- Z-value calculation is time-consuming (2x100 randomizations)
- Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value?
- Advantage of Z-value has never been demonstrated by experimental results

How to compare?
- Structural comparison is better than sequence comparison
- ASTRAL SCOP: Structural Classification Of Proteins
- e.g. a.2.1.3, c.1.2.4; same number ~ same structure
- Use structural classification as benchmark for sequence comparison methods
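The "same number ~ same structure" rule can be sketched as a comparison of SCOP classification strings, whose four dot-separated fields are class, fold, superfamily and family. The function names below are illustrative, not part of any SCOP tool.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def shared_scop_level(sccs_a, sccs_b):
    """Return the deepest SCOP level two classification strings share, or None."""
    deepest = None
    for level, a, b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break
        deepest = level
    return deepest


def same_family(sccs_a, sccs_b):
    """True when both domains carry an identical class.fold.superfamily.family."""
    return shared_scop_level(sccs_a, sccs_b) == "family"
```

For the benchmark, a hit would count as a true positive when `same_family` holds for the query and the hit.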

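ASTRAL SCOP statistics
[Table: ten rows, one per ASTRAL SCOP set (first row: max. 10% identity); columns: max. % identity, members, families, avg. fam. size, max. fam. size, families = 1, families > 1; numeric values not preserved]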

Methods (1)
Smith-Waterman algorithms: dynamic programming; computationally intensive
  – Paracel with e-value (PC E): SW implementation of Paracel
  – Biofacet with z-value (BF Z): SW implementation of Gene-IT
  – ParAlign with e-value (PA E): SW implementation of Sencel
  – SSEARCH with e-value (SS E): SW implementation in the FASTA package (see next page)

Methods (2)
Heuristic algorithms:
  – FASTA (FA E): Pearson & Lipman, 1988
    Heuristic approximation; performs better than BLAST with strongly diverged proteins
  – BLAST (BL E): Altschul et al., 1990
    Heuristic approximation; stretches local alignments (HSPs) to global alignment
    Should be faster than FASTA

Method parameters
- all:
  - matrix: BLOSUM62
  - gap open penalty: 12
  - gap extension penalty: 1
- Biofacet with z-value: 100 randomizations

Receiver Operating Characteristic
- R.O.C.: statistical value, mostly used in clinical medicine
- Proposed by Gribskov & Robinson (1996) to be used for sequence comparison analysis

ROC 50 Example
Query: d1c75a_ (SCOP class a)
[Table: hit #, pc e, domain, SCOP classification; e-values and classification numbers not preserved. Hits by rank, with SCOP class letter:]
  1 d1gcya1 (b), 2 d1h32b_ (a), 3 d1gks__ (a), 4 d1a56__ (a), 5 d1kx2a_ (a),
  6 d1etpa1 (a), 7 d1zpda3 (c), 8 d1eu1a2 (c), 9 d451c__ (a), 10 d1flca2 (c),
  11 d1mdwa_ (d), 12 d2dvh__ (a), 13 d1shsa_ (b), 14 d1mg2d_ (a), 15 d1c53__ (a),
  16 d3c2c__ (a), 17 d1bvsa1 (a), 18 d1dvva_ (a), 19 d1cyi__ (a), 20 d1dw0a_ (a),
  21 d1h0ba_ (b), 22 d3pfk__ (c), 23 d1kful3 (d), 24 d1ixrc1 (a), 25 d1ixsb1 (a)
- Take 100 best hits
- True positives: in same SCOP family, or false positives: not in same family
- For each of first 50 false positives: calculate number of true positives higher in list (0,4,4,4,5,5,6,9,12,12,12,12,12)
- Divide sum of these numbers by number of false positives (50) and by total number of possible true positives (size of family - 1) = ROC 50 (0.167)
- Take average of ROC 50 scores for all entries
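The counting step of the ROC 50 calculation can be sketched as below. The function names are illustrative. The true-positive ranks are the ones quoted for this hit list in the Mean AP example (2,3,4,5,9,12,14,15,16,18,19,20), and only the 25 hits shown are used, so the final score is not computed for the example.

```python
def tp_counts_above_false_positives(is_true_positive, max_fp=50):
    """For each of the first max_fp false positives, count the true positives ranked above it."""
    counts = []
    tp_seen = 0
    for hit_is_tp in is_true_positive:
        if hit_is_tp:
            tp_seen += 1
        else:
            counts.append(tp_seen)
            if len(counts) == max_fp:
                break
    return counts


def roc_n(is_true_positive, n_possible_tp, max_fp=50):
    """Sum of the counts divided by max_fp and by the number of possible true positives."""
    if n_possible_tp == 0:
        return 0.0
    counts = tp_counts_above_false_positives(is_true_positive, max_fp)
    return sum(counts) / (max_fp * n_possible_tp)


# Ranked labels for the 25 hits shown in the example (True = same SCOP family).
tp_ranks = {2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20}
labels = [rank in tp_ranks for rank in range(1, 26)]
counts = tp_counts_above_false_positives(labels)
```

On these 25 hits the counts come out as the 13 numbers listed on the slide; the full ROC 50 score additionally needs the remaining hits and the family size.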

ROC 50 results

Coverage vs. Error
- C.V.E. = Coverage vs. Error (Brenner et al., 1998)
- E.P.Q. (errors per query) = selectivity indicator (how many false positives?)
- Coverage = sensitivity indicator (how many true positives out of the total?)

CVE Example
Query: d1c75a_, same 25-hit list as in the ROC 50 example
- Vary threshold above which a hit is seen as a positive: e.g. e=10, e=1, e=0.1, ...
- True positives: in same SCOP family, or false positives: not in same family
- For each threshold, calculate the coverage: number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors-per-query: number of false positives divided by the number of queries
- Plot coverage on x-axis and errors-per-query on y-axis; bottom right is best
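A minimal sketch of the coverage and errors-per-query calculation follows. It assumes hits from all queries are pooled as (e-value, is_true_positive) pairs and that a hit counts as a positive when its e-value is at or below the threshold; the function name and the toy hits are hypothetical.

```python
def coverage_vs_error(hits, n_possible_tp, n_queries, thresholds):
    """Return one (threshold, coverage, errors_per_query) point per e-value threshold."""
    points = []
    for threshold in thresholds:
        # Assumption: lower e-value is better, so accept hits with e <= threshold.
        accepted = [is_tp for e_value, is_tp in hits if e_value <= threshold]
        tp = sum(accepted)
        fp = len(accepted) - tp
        points.append((threshold, tp / n_possible_tp, fp / n_queries))
    return points


# Hypothetical pooled hits from two queries with four possible true positives.
hits = [(1e-30, True), (1e-5, True), (0.05, False), (0.5, True), (5.0, False)]
curve = coverage_vs_error(hits, n_possible_tp=4, n_queries=2, thresholds=[10, 1, 0.1])
```

Plotting the second field against the third for each point gives the CVE curve; stricter thresholds move the curve toward fewer errors at the cost of coverage.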

CVE results (only PDB095)

Mean Average Precision
- A.P.: borrowed from information retrieval search (Salton, 1991)
- Recall: true positives divided by number of homologs
- Precision: true positives divided by number of hits
- A.P. = approximate integral to calculate area under recall-precision curve

Mean AP Example
Query: d1c75a_, same 25-hit list as in the ROC 50 example
- Take 100 best hits
- True positives: in same SCOP family, or false positives: not in same family
- For each of the true positives: divide its rank among the true positives (1,2,3,4,5,6,7,8,9,10,11,12) by its rank in the hit list (2,3,4,5,9,12,14,15,16,18,19,20)
- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)
- Take average of AP scores for all entries = mean AP
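The AP recipe above can be sketched as follows. The function names are illustrative, and the division by the total number of hits follows the slide rather than the more common division by the number of relevant items. Only the 12 true-positive ranks quoted on the slide are used, so the result (about 0.071) need not reproduce the slide's 0.140.

```python
def average_precision(tp_ranks, n_hits=100):
    """Sum of (rank among true positives / rank in hit list), divided by n_hits."""
    ordered = sorted(tp_ranks)
    total = sum(k / rank for k, rank in enumerate(ordered, start=1))
    return total / n_hits


def mean_average_precision(per_query_tp_ranks, n_hits=100):
    """Average of the AP scores over all queries."""
    scores = [average_precision(ranks, n_hits) for ranks in per_query_tp_ranks]
    return sum(scores) / len(scores) if scores else 0.0


# True-positive ranks quoted for query d1c75a_.
example_ap = average_precision([2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20])
```

A query whose true positives all sit at the top of the list maximizes each ratio, so higher AP means true positives are ranked earlier.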

Mean AP results

Time consumption
PDB095 all-against-all comparison:
  – Biofacet: multiple days (z-value calc.!)
  – BLAST: 2d,4h,16m
  – SSEARCH: 5h49m
  – ParAlign: 47m
  – FASTA: 40m

Preliminary conclusions
- SSEARCH gives best results
- When time is important, FASTA is a good alternative
- Z-value seems to have no advantage over E-value

Problems
- Bias in PDB?
  – Sequence length
  – Amino acid composition
- Difference in matrices?
- Difference in SW implementations?

Bias in PDB sequence length?
- Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets

Bias in PDB aa distribution?
- No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets

Difference in matrices?

Difference in SW implementations?

Conclusions
- E-value better than Z-value!
- SW implementations are (more or less) the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all
- Larger structural comparison database needed for better analysis

Credits
- NV Organon:
  – Peter Groenen
  – Wilco Fleuren
- Wageningen UR:
  – Jack Leunissen