CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BiC BioCentrum-DTU Technical University of Denmark 1/31 Prediction of significant positions in biological sequences.

Presentation transcript:

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BiC BioCentrum-DTU Technical University of Denmark 1/31 Prediction of significant positions in biological sequences Ilka Hoof Ph.D. student Immunological Bioinformatics Center for Biological Sequence Analysis Danmarks Tekniske Universitet

2/31 Significant positions? HIV-1 gp120 PDB: 2NY7

3/31 Significant positions? HIV-1 gp120 PDB: 2NY7 Antibody-binding site?

4/31 Significant positions? HIV-1 protease PDB: 2CEN Catalytic efficiency?

5/31 Significant positions?
"Which sites in HIV-1 protease contribute significantly to the fitness level of an HIV-1 mutant?"
"Where is the binding site of a specific antibody located on the antigen?"
"Which sites are important for enzymatic activity?"
Given: a multiple sequence alignment and a numerical value associated with each sequence => the values imply a ranking of the sequences.
What we're interested in: which positions distinguish high- and low-ranking sequences? E.g. binders vs. non-binders, high vs. low fitness, high vs. low enzymatic activity.

6/31 The data we have

7/31 The output we want...how do we get there?

8/31 SigniSite 1.0

9/31 SigniSite - method
Rank-based statistical test: real-valued data => ranks.
Calculate the mean rank for each residue type.

10/31 SigniSite - the method

11/31 SigniSite - the method Calculate the mean rank for each residue type.
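
The ranking step can be sketched roughly as follows. This is only an illustration, not the SigniSite implementation: the function names, the toy column, and the choice to rank from the smallest value (rank 1) upward with tied values sharing their average rank are all assumptions.

```python
from collections import defaultdict

def average_ranks(values):
    """Rank values 1..N (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mean_rank_per_residue(column, values):
    """Mean rank of the sequences carrying each residue type in one column."""
    ranks = average_ranks(values)
    grouped = defaultdict(list)
    for residue, rank in zip(column, ranks):
        grouped[residue].append(rank)
    return {res: sum(r) / len(r) for res, r in grouped.items()}

# Hypothetical alignment column and per-sequence values (e.g. binding affinity).
column = ["E", "E", "K", "K", "E", "D"]
values = [0.9, 0.8, 0.2, 0.1, 0.7, 0.5]
mean_ranks = mean_rank_per_residue(column, values)
print(mean_ranks)
```

In this toy column, sequences with E carry the highest values, so E ends up with a mean rank well above the middle of the ranking.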

12/31 SigniSite - the method
What's the null hypothesis of our statistical test? The observed mean rank of a residue type does not significantly deviate from the expected mean rank.
What is expected? We assume a random distribution of the amino acids in the column. Given N sequences, the expected mean rank is $\bar{R}_{exp} = (N + 1)/2$.

13/31 Z score determines significance
Given the shape of the distribution, what's significant?
[Figure: distribution of mean ranks with the mean, the standard deviation (sd) and the observed rank marked; p < 0.025]
The Z score can be calculated from the mean and standard deviation: $Z = (R_{obs} - \text{mean}) / \text{sd}$
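
A small sketch of the Z-score step, assuming the null distribution is approximately normal. The numbers are made up, and reading "p < 0.025" as one tail of a two-sided test at the 0.05 level is an assumption, not something the slide states.

```python
import math

def z_score(observed, expected, sd):
    """Standardize the observed mean rank against the null mean and sd."""
    return (observed - expected) / sd

def two_sided_p(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical values: observed mean rank 5.0, expected 3.5, sd 0.75.
z = z_score(observed=5.0, expected=3.5, sd=0.75)
p = two_sided_p(z)
print(z, p)
```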

14/31 Z score determines significance observed mean rank for E

15/31 Are the random mean ranks normally distributed?

16/31 Same mean, but different standard deviation Frequencies: Mean rank distributions for different frequencies

17/31 How to estimate the standard deviation?
Our test resembles the Wilcoxon rank statistic: given two samples of size $n_1$ and $n_2$, with $n_1 + n_2 = N$, let R be the mean rank of sample 1. The distribution of mean ranks R can be approximated by the normal distribution with mean $\mu_R = (N + 1)/2$ and standard deviation $\sigma_R = \sqrt{n_2 (N + 1) / (12 \, n_1)}$.
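
As a sanity check, the textbook result for the mean rank of a sample of size n1 drawn without replacement from ranks 1..N (mean (N + 1)/2, standard deviation sqrt(n2 (N + 1) / (12 n1)), no ties) can be compared against an exhaustive enumeration for a tiny N. The helper name and the toy sizes are assumptions chosen for illustration.

```python
import itertools
import math
import statistics

def mean_rank_sd(n1, N):
    """Standard deviation of the mean rank of n1 items among N, without ties."""
    n2 = N - n1
    return math.sqrt(n2 * (N + 1) / (12 * n1))

# Enumerate every way to pick n1 of the ranks 1..N and record the mean rank.
N, n1 = 6, 2
all_means = [sum(c) / n1 for c in itertools.combinations(range(1, N + 1), n1)]
exact_mean = statistics.fmean(all_means)
exact_sd = statistics.pstdev(all_means)
print(exact_mean, exact_sd, mean_rank_sd(n1, N))
```

The mean does not depend on n1, while the standard deviation shrinks as n1 grows, which is why frequent and rare residue types share the same expected mean rank but have differently wide distributions.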

18/31 Coping with ties
Formula as before, but weighted with the tie-correction factor T: $\sigma_R = \sqrt{T \, n_2 (N + 1) / (12 \, n_1)}$, where $T = 1 - \sum_{i=1}^{m} (t_i^3 - t_i) / (N^3 - N)$ and t is a vector which contains the counts of ties, i.e. $t = (t_1, \ldots, t_m)$; m denotes the number of distinct values in the data set.
Example: all values the same => T = 0; all values different => T = 1.
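
A sketch of the tie-correction factor, assuming the standard rank-sum form T = 1 - sum(t_i^3 - t_i) / (N^3 - N), which reproduces both examples on the slide. The function name is hypothetical.

```python
from collections import Counter

def tie_correction(values):
    """Tie-correction factor T: 1 with no ties, 0 when all values are equal."""
    N = len(values)
    if N < 2:
        return 1.0  # a single value cannot be tied with anything
    tie_counts = Counter(values).values()  # t_i for each distinct value
    return 1.0 - sum(t**3 - t for t in tie_counts) / (N**3 - N)

print(tie_correction([7, 7, 7, 7]))  # all values the same
print(tie_correction([1, 2, 3, 4]))  # all values different
print(tie_correction([1, 1, 2, 3]))  # one pair of tied values
```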

19/31 Simple example category 1 category 2

20/31 Simple example Tie correction vs. no tie correction Standard deviation Z score

21/31 Multiple testing problem
We perform a significance test for each amino acid type in each column.
Problem: the more hypotheses we test, the higher the probability of obtaining at least one false positive.
Each test is performed with the same type-I error $\alpha$, e.g. $\alpha = 0.05$. The total significance level $\alpha_{tot}$ of m significance tests is then given by $\alpha_{tot} \le 1 - (1 - \alpha)^m$.
Examples:
1 test: $\alpha_{tot} \le 1 - (1 - 0.05)^1 = 0.05$
2 tests: $\alpha_{tot} \le 1 - (1 - 0.05)^2 = 0.0975$
100 tests: $\alpha_{tot} \le 1 - (1 - 0.05)^{100} = 0.99$
Correction for multiple testing necessary!
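
The growth of the family-wise error can be reproduced in a few lines. The sketch assumes independent tests and a per-test level of 0.05; that level is an assumption here, chosen because it agrees with the 0.99 shown for 100 tests.

```python
def familywise_error(alpha, m):
    """Probability of at least one false positive in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 2, 100):
    print(m, round(familywise_error(0.05, m), 4))
```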

22/31 How many statistical tests are performed?
One test per amino acid type and column: $m = \sum_i w_i$, where $w_i$ is the number of different amino acids in column i.
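
Counting the tests for a given alignment might look like the sketch below. The toy alignment is made up, and skipping the gap character "-" is an assumption; the slide does not say how gaps are treated.

```python
def count_tests(alignment, gap="-"):
    """Sum over columns of the number of distinct residue types (gaps skipped)."""
    length = len(alignment[0])
    total = 0
    for i in range(length):
        residues = {seq[i] for seq in alignment if seq[i] != gap}
        total += len(residues)
    return total

# Hypothetical alignment of three sequences with three columns.
alignment = ["ACD", "ACE", "G-E"]
m = count_tests(alignment)
print(m)
```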

23/31 Correction for multiple testing
Adjusted p-values using Bonferroni's single-step method: multiply all unadjusted p-values by the number of tests m. Adjusted p-values are given by $\tilde{p}_j = \min(m \, p_j, 1)$ for j = 1,..., m.
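
A minimal sketch of the Bonferroni adjustment, with each adjusted p-value capped at 1. The function name and the example p-values are illustrative only.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```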

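24/31 Correction for multiple testing
Adjusted p-values using Holm's step-down method. Observed ordered unadjusted p-values: $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$. Adjusted p-values are given by $\tilde{p}_{(j)} = \max_{k = 1, \ldots, j} \min((m - k + 1) \, p_{(k)}, 1)$ for j = 1,..., m.
So, nothing more than: multiply the smallest p-value by m, the second smallest by m - 1, and so on, never letting an adjusted p-value fall below the previous one.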
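
Holm's step-down adjustment can be sketched as below, using the usual textbook definition: the k-th smallest p-value is multiplied by (m - k + 1), capped at 1, and the adjusted values are kept non-decreasing. The function name and example values are hypothetical.

```python
def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

holm_adjusted = holm([0.01, 0.04, 0.03])
print(holm_adjusted)
```

Compared with Bonferroni, only the smallest p-value receives the full factor m, so Holm's method is never more conservative.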

25/31 Application of SigniSite

26/31 Ab-binding affinity to HIV-1 gp120 Alignment length: 569 residues

27/31 SigniSite web service

28/31 SigniSite results
10 significant sites identified (Holm step-down correction, $\alpha$ = 0.05).
[Figure: heatmap]

29/31 SigniSite results: sequence logos
[Figure: three logos: Z score displayed for all amino acid types; Z score displayed only for significant amino acid types; "ordinary" frequency logo]

30/31 SigniSite results

31/31 SDPpred Kalinina et al. (2004), Protein Sci 13(2):

32/31 SDPpred
Categories instead of continuous values.
Mutual information.
Amino acids with similar physico-chemical properties are weakly penalized.
Statistical test: observed mutual inf. = expected mutual inf.?
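
To make the mutual-information idea concrete, here is a plain sketch of the mutual information (in bits) between the residues of one alignment column and the category labels of the sequences. It is not the SDPpred implementation: it leaves out the weak penalty for physico-chemically similar amino acids and the statistical test, and the function name and toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """Mutual information (bits) between residue types and category labels."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    mi = 0.0
    for (residue, group), count in joint.items():
        p_joint = count / n
        # Ratio p(residue, group) / (p(residue) * p(group)), written with counts.
        ratio = count * n / (residue_counts[residue] * group_counts[group])
        mi += p_joint * math.log2(ratio)
    return mi

# Column that separates the two categories perfectly vs. one that does not.
print(mutual_information(["A", "A", "G", "G"], [1, 1, 2, 2]))
print(mutual_information(["A", "G", "A", "G"], [1, 1, 2, 2]))
```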

33/31 SDPpred - Results

34/31 SDPpred - Results

35/31 SDPpred - Results

36/31 Conclusion
You can use SigniSite and SDPpred to find sites of interest in your biological data.
Logos are a nice and clear way of displaying sequence information.
Whenever you perform statistical tests, remember the multiple testing problem!