CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BiC BioCentrum-DTU Technical University of Denmark 1/31 Prediction of significant positions in biological sequences Ilka Hoof Ph.D. student Immunological Bioinformatics Center for Biological Sequence Analysis Danmarks Tekniske Universitet
Significant positions? HIV-1 gp120 (PDB: 2NY7): antibody-binding site?
Significant positions? HIV-1 protease (PDB: 2CEN): catalytic efficiency?
Significant positions?
“Which sites in HIV-1 protease contribute significantly to the fitness level of an HIV-1 mutant?”
“Where is the binding site of a specific antibody located on the antigen?”
“Which sites are important for enzymatic activity?”
Given: a multiple sequence alignment and a numerical value associated with each sequence. The values imply a ranking of the sequences.
What we’re interested in: which positions distinguish high-ranking from low-ranking sequences? E.g. binders vs. non-binders, high vs. low fitness, high vs. low enzymatic activity.
The data we have
The output we want... how do we get there?
SigniSite 1.0
SigniSite - the method
Rank-based statistical test: the real-valued data are converted to ranks.
Calculate the mean rank for each residue type.
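The ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SigniSite implementation: rank 1 is given to the lowest value, tied values share their average rank, and the function names and the toy column are invented for the example.

```python
from collections import defaultdict

def average_ranks(values):
    """Rank values from 1..N (1 = smallest); tied values share their average rank.
    The ranking direction is an assumption made for this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions in the block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mean_rank_per_residue(column, values):
    """Mean rank of the sequences carrying each residue type in one alignment column."""
    grouped = defaultdict(list)
    for residue, rank in zip(column, average_ranks(values)):
        grouped[residue].append(rank)
    return {residue: sum(r) / len(r) for residue, r in grouped.items()}

# Toy data (hypothetical): one alignment column and one value per sequence.
column = "EEKKE"
values = [0.9, 0.7, 0.1, 0.3, 0.8]
mean_ranks = mean_rank_per_residue(column, values)
print(mean_ranks)
```

In this toy column the sequences carrying E hold the three highest values, so E ends up with a higher mean rank than K.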
SigniSite - the method
What’s the null hypothesis of our statistical test? The observed mean rank of a residue type does not significantly deviate from the expected mean rank.
What is expected? We assume a random distribution of the amino acids in the column. Given $N$ sequences, the expected mean rank is $\mu_R = \frac{N + 1}{2}$.
Z score determines significance
Given the shape of the distribution, what’s significant? The Z score can be calculated from the mean $\mu_R$ and standard deviation $\sigma_R$ of the mean-rank distribution and the observed mean rank $R_{\mathrm{obs}}$: $Z = \frac{R_{\mathrm{obs}} - \mu_R}{\sigma_R}$.
[Figure: distribution of random mean ranks with the mean, the sd, and the observed mean rank for residue E marked; tail area p < 0.025.]
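As a small illustration of how a Z score translates into a p-value, the sketch below uses the standard normal distribution from Python's standard library. The two-sided convention and the function names are assumptions made for the example; the slides only show a tail area of 0.025.

```python
from statistics import NormalDist

def z_score(observed_mean_rank, expected_mean_rank, sd):
    """Distance of the observed mean rank from its expectation, in units of sd."""
    return (observed_mean_rank - expected_mean_rank) / sd

def two_sided_p(z):
    """Probability of a deviation at least as extreme as |z| under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: observed mean rank 4.0, expected 3.0, sd 0.5.
z = z_score(4.0, 3.0, 0.5)
print(z, two_sided_p(z))
```

With a two-sided test at a total level of 0.05, each tail holds 0.025, which corresponds to |Z| of roughly 1.96.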
Are the random mean ranks normally distributed?
Same mean, but different standard deviation. [Figure: mean rank distributions for different residue frequencies.]
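The effect of residue frequency on the spread can be reproduced with a small simulation. The sketch below is a hedged illustration, not the analysis behind the slides: it assumes 50 sequences without ties, draws random sets of ranks for residues occurring 2, 10 and 25 times, and summarises the resulting mean ranks. All names and numbers are chosen for the example.

```python
import random
from statistics import mean, stdev

def simulate_mean_ranks(n_total, n_residue, n_draws=20000, seed=0):
    """Mean ranks of a residue type placed at random on n_residue of n_total sequences."""
    rng = random.Random(seed)
    all_ranks = range(1, n_total + 1)
    return [sum(rng.sample(all_ranks, n_residue)) / n_residue for _ in range(n_draws)]

# Summarise the simulated distribution for three residue frequencies.
summary = {}
for count in (2, 10, 25):
    draws = simulate_mean_ranks(50, count)
    summary[count] = (mean(draws), stdev(draws))
    print(count, round(summary[count][0], 2), round(summary[count][1], 2))
```

The simulated means should all lie close to (N + 1) / 2, while the standard deviation shrinks as the residue becomes more frequent.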
How to estimate the standard deviation?
Our test resembles the Wilcoxon rank statistic: given two samples of size $n_1$ and $n_2$, with $n_1 + n_2 = N$, let $R$ be the mean rank of sample 1. The distribution of mean ranks $R$ can be approximated by the normal distribution with mean $\mu_R = \frac{N + 1}{2}$ and standard deviation $\sigma_R = \sqrt{\frac{n_2 (N + 1)}{12\, n_1}}$.
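For small data sets the mean and standard deviation of the mean rank can be checked exactly by enumerating every possible placement of sample 1. The sketch below assumes no ties and uses the standard Wilcoxon result, mean (N + 1) / 2 and standard deviation sqrt(n2 (N + 1) / (12 n1)); the function names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

def expected_mean_rank(n_total):
    """Expected mean rank under random assignment: (N + 1) / 2."""
    return (n_total + 1) / 2

def mean_rank_sd(n1, n_total):
    """Standard deviation of the mean rank of a sample of size n1 out of N (no ties)."""
    n2 = n_total - n1
    return sqrt(n2 * (n_total + 1) / (12 * n1))

# Exact check: enumerate all ways to place 3 sequences among 8 ranks.
N, n1 = 8, 3
exact = [sum(subset) / n1 for subset in combinations(range(1, N + 1), n1)]
print(mean(exact), expected_mean_rank(N))
print(pstdev(exact), mean_rank_sd(n1, N))
```

The enumeration only confirms the first two moments; how well the normal shape fits still depends on the sample sizes.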
Coping with ties
Formula as before, but weighted with a tie-correction factor $T$: $\sigma_R = \sqrt{T \cdot \frac{n_2 (N + 1)}{12\, n_1}}$, where $T = 1 - \frac{\sum_{i=1}^{m} (t_i^3 - t_i)}{N^3 - N}$ and $t$ is a vector which contains the counts of ties, i.e. $t_i$ is the number of sequences sharing the $i$-th distinct value; $m$ denotes the number of distinct values in the data set.
Example: all values the same => T = 0; all values different => T = 1.
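The tie-correction factor is easy to compute from the counts of each distinct value. The sketch below assumes the usual rank-statistic form T = 1 - sum(t_i^3 - t_i) / (N^3 - N), which reproduces the two limiting cases stated above; whether SigniSite uses exactly this expression is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def tie_correction(values):
    """T = 1 - sum(t^3 - t) / (N^3 - N), with t the size of each group of tied values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    tie_counts = Counter(values).values()
    return 1.0 - sum(t ** 3 - t for t in tie_counts) / (n ** 3 - n)

def mean_rank_sd_with_ties(n1, values):
    """Standard deviation of the mean rank of n1 sequences, corrected for ties."""
    n = len(values)
    n2 = n - n1
    return sqrt(tie_correction(values) * n2 * (n + 1) / (12 * n1))

print(tie_correction([5, 5, 5, 5]))   # all values the same
print(tie_correction([1, 2, 3, 4]))   # all values different
print(tie_correction([1, 1, 2, 3]))   # one pair of tied values
```

Ties can only reduce the spread of the mean ranks, so T never exceeds 1 and the corrected standard deviation is never larger than the uncorrected one.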
Simple example [Table: example values for category 1 and category 2.]
Simple example: tie correction vs. no tie correction [Table: standard deviation and Z score with and without tie correction.]
Multiple testing problem
We perform a significance test for each amino acid type in each column.
Problem: the more hypotheses we test, the higher the probability of obtaining at least one false positive.
Each test is performed with the same type-I error $\alpha$, e.g. $\alpha = 0.05$. The total significance level $\alpha_{\mathrm{tot}}$ of $m$ (independent) significance tests is then given by $\alpha_{\mathrm{tot}} = 1 - (1 - \alpha)^m$.
Examples: 1 test: $\alpha_{\mathrm{tot}} = 1 - (0.95)^{1} = 0.05$; 2 tests: $\alpha_{\mathrm{tot}} = 1 - (0.95)^{2} \approx 0.10$; 100 tests: $\alpha_{\mathrm{tot}} = 1 - (0.95)^{100} \approx 0.99$.
Correction for multiple testing necessary!
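The growth of the total error rate with the number of tests can be tabulated directly. This minimal sketch assumes independent tests and a per-test level of 0.05; the function name is chosen for the example.

```python
def familywise_error(alpha, m):
    """Probability of at least one false positive among m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 2, 10, 100):
    print(m, round(familywise_error(0.05, m), 4))
```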
How many statistical tests are performed?
One test per amino acid type and column: $m = \sum_{i} w_i$, where $w_i$ is the number of different amino acids in column $i$.
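Counting the tests for a given alignment amounts to summing the number of distinct residue types per column. The sketch below is an illustration with a hypothetical three-sequence alignment; it treats every distinct character, including any gap symbol, as a residue type, which is an assumption rather than documented SigniSite behaviour.

```python
def count_tests(alignment):
    """Total number of tests m: the sum over columns of distinct residue types w_i."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(len(set(column)) for column in zip(*alignment))

# Hypothetical toy alignment: the columns contain 2, 1 and 2 residue types.
alignment = ["ACD", "ACE", "GCE"]
m = count_tests(alignment)
print(m)
```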
Correction for multiple testing
Adjusted p-values using Bonferroni’s single-step method: multiply all unadjusted p-values $p_j$ by the number of tests $m$. Adjusted p-values are given by $\tilde{p}_j = \min(m\, p_j,\ 1)$ for j = 1,..., m.
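Bonferroni's adjustment is a one-liner. The sketch below caps the adjusted values at 1, as is conventional; the function name and the example p-values are made up for illustration.

```python
def bonferroni(p_values):
    """Single-step Bonferroni adjustment: multiply each p-value by m, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```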
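Correction for multiple testing
Adjusted p-values using Holm’s step-down method: let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ be the observed ordered unadjusted p-values. Adjusted p-values are given by $\tilde{p}_{(j)} = \max_{k = 1,\dots,j} \min\{(m - k + 1)\, p_{(k)},\ 1\}$ for j = 1,..., m.
So, nothing more than: multiply the smallest p-value by $m$, the second smallest by $m - 1$, and so on, while never letting an adjusted p-value fall below the one before it.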
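Holm's step-down adjustment can be sketched as follows. This is a minimal illustration of the standard procedure, returning the adjusted p-values in the original input order; the function name and the example p-values are assumptions for the example.

```python
def holm(p_values):
    """Holm step-down adjustment; results are returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_max = 0.0
    for position, index in enumerate(order):
        # The k-th smallest p-value (k = position + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - position) * p_values[index])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted

holm_adjusted = holm([0.01, 0.04, 0.03])
print(holm_adjusted)
```

Because the multiplier shrinks from m down to 1, Holm's adjusted p-values are never larger than Bonferroni's, so the method rejects at least as many hypotheses at the same level.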
Application of SigniSite
Ab-binding affinity to HIV-1 gp120. Alignment length: 569 residues.
SigniSite web service
SigniSite results: 10 significant sites identified (Holm step-down correction, $\alpha = 0.05$). [Figure: heatmap.]
SigniSite results: sequence logos. [Figure: logo displaying the Z score for all amino acid types; logo displaying the Z score only for significant amino acid types; “ordinary” frequency logo.]
SDPpred: Kalinina et al. (2004), Protein Sci 13(2):
SDPpred
Categories instead of continuous values.
Mutual information.
Amino acids with similar physico-chemical properties are weakly penalized.
Statistical test: observed mutual inf. = expected mutual inf.?
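To make the mutual information idea concrete, the sketch below computes the plain mutual information (in bits) between one alignment column and the category labels of the sequences. It is not the SDPpred implementation: it leaves out the weak penalty for physico-chemically similar amino acids and the statistical test against the expected mutual information, and the function name and toy data are invented for the example.

```python
from collections import Counter
from math import log2

def mutual_information(column, groups):
    """Mutual information (bits) between residues in a column and sequence categories."""
    if len(column) != len(groups):
        raise ValueError("column and groups must have the same length")
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    # Sum p(a, g) * log2( p(a, g) / (p(a) * p(g)) ) over observed pairs only.
    return sum(
        (count / n) * log2(count * n / (residue_counts[a] * group_counts[g]))
        for (a, g), count in joint.items()
    )

# Hypothetical toy columns for four sequences in two categories.
print(mutual_information("AAKK", [1, 1, 2, 2]))  # residue fully determines the category
print(mutual_information("AKAK", [1, 1, 2, 2]))  # residue carries no category information
```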
SDPpred - Results
Conclusion
You can use SigniSite and SDPpred to find sites of interest in your biological data.
Logos are a nice and clear way of displaying sequence information.
Whenever you perform statistical tests, remember the multiple testing problem!