Scores and substitution matrices in sequence alignment Sushmita Roy BMI/CS 576 Sep 11th, 2014

Key concepts in today's class
– Probability distributions: discrete and continuous
– Joint, conditional and marginal distributions
– Statistical independence
– Probabilistic interpretation of scores in alignment algorithms
– Different substitution matrices
– Estimating simple substitution matrices
– Assessing significance of scores

RECAP
Four issues in sequence alignment:
– Type of alignment
– Algorithm for alignment
– Scores for alignment
– Significance of alignment scores

Guiding principles of scores in alignments
– We want to know whether the alignment we observe is meaningful or arose by chance
– By meaningful we mean that the sequences share a similar biological function
– To assess whether what we observe could arise by chance, we need some concepts from probability theory

PROBABILITY PRIMER

Definition of probability
– Intuitively, we use "probability" to refer to our degree of confidence in an event of an uncertain nature
– Always a number in the interval [0,1]: 0 means "never occurs", 1 means "always occurs"
– Frequentist interpretation: the fraction of times an event would occur if we repeated the experiment indefinitely
– Bayesian interpretation: degree of belief in an event

Sample spaces
Sample space: a set of possible outcomes for some experiment. Examples:
– Flight to Chicago: {on time, late}
– Lottery: {ticket 1 wins, ticket 2 wins, …, ticket n wins}
– Weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet}
– Roll of a die: {1, 2, 3, 4, 5, 6}
– Coin toss: {heads, tails}

Random variables
Random variable: a variable that represents the outcome of an experiment. A random variable can be:
– Discrete: outcomes take values from a fixed set, e.g. roll of a die, flight to Chicago, weather tomorrow
– Continuous: outcomes take continuous values, e.g. height, weight

Notation
– Uppercase letters and words denote random variables, e.g. X, Y
– Lowercase letters and words denote values, e.g. x, y
– P(X = x) denotes the probability that X takes value x
– We'll also use the shorthand form P(x) for P(X = x)
– For Boolean random variables, we'll use the shorthand x for X = true and ¬x for X = false

Discrete probability distributions
– A probability distribution is a mathematical function that specifies the probability of each possible outcome of a random variable
– We denote this as P(X) for random variable X; it specifies the probability P(X = x) of each possible value x
– Requirements: P(X = x) ≥ 0 for every x, and Σ_x P(X = x) = 1
[Figure: example distribution over the weather outcomes sun, clouds, rain, snow, sleet]

Joint probability distributions
Joint probability distribution: the function given by P(X = x, Y = y), read "X equals x and Y equals y". Example:

  x, y            P(X = x, Y = y)
  sun, on-time         0.20
  rain, on-time        0.20
  snow, on-time        0.05
  sun, late            0.10
  rain, late           0.30
  snow, late           0.15

e.g. P(X = sun, Y = on-time) = 0.20 is the probability that it's sunny and my flight is on time.

Marginal probability distributions
The marginal distribution of X is defined by

  P(X = x) = Σ_y P(X = x, Y = y)

i.e. "the distribution of X ignoring other variables". This definition generalizes to more than two variables, e.g.

  P(X = x) = Σ_y Σ_z P(X = x, Y = y, Z = z)

Marginal distribution example

  joint distribution                    marginal distribution for X
  x, y            P(X = x, Y = y)       x       P(X = x)
  sun, on-time         0.20             sun       0.3
  rain, on-time        0.20             rain      0.5
  snow, on-time        0.05             snow      0.2
  sun, late            0.10
  rain, late           0.30
  snow, late           0.15

Conditional distributions
The conditional distribution of X given Y is defined as:

  P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

or in shorthand, P(x | y) = P(x, y) / P(y). It is the distribution of X given that we know the value of Y. Intuitively: how much does knowing Y tell us about X?

Conditional distribution example

  joint distribution                    conditional distribution for X given Y = on-time
  x, y            P(X = x, Y = y)       x       P(X = x | Y = on-time)
  sun, on-time         0.20             sun     0.20/0.45 ≈ 0.444
  rain, on-time        0.20             rain    0.20/0.45 ≈ 0.444
  snow, on-time        0.05             snow    0.05/0.45 ≈ 0.111
  sun, late            0.10
  rain, late           0.30
  snow, late           0.15
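To make the marginal and conditional computations above concrete, here is a minimal Python sketch; the dictionary encoding of the weather/flight table is my own, only the probabilities come from the slides:

```python
# Joint distribution P(X = weather, Y = flight status) from the example above.
joint = {
    ("sun", "on-time"): 0.20, ("rain", "on-time"): 0.20, ("snow", "on-time"): 0.05,
    ("sun", "late"): 0.10, ("rain", "late"): 0.30, ("snow", "late"): 0.15,
}

# Marginal: P(X = x) = sum over y of P(X = x, Y = y)
marginal_x = {}
for (x, y), prob in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + prob

# Conditional: P(X = x | Y = on-time) = P(x, on-time) / P(Y = on-time)
p_on_time = sum(prob for (x, y), prob in joint.items() if y == "on-time")
cond_x = {x: prob / p_on_time for (x, y), prob in joint.items() if y == "on-time"}

print(marginal_x)  # {'sun': 0.3, 'rain': 0.5, 'snow': 0.2}
print(cond_x)      # {'sun': ~0.444, 'rain': ~0.444, 'snow': ~0.111}
```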

Independence
Two random variables, X and Y, are independent if

  P(X = x, Y = y) = P(X = x) P(Y = y)  for all x and y

Another way to think about this: knowing X does not tell us anything about Y.

Independence example #1

  joint distribution                    marginal distributions
  x, y            P(X = x, Y = y)       x       P(X = x)      y         P(Y = y)
  sun, on-time         0.20             sun       0.3         on-time     0.45
  rain, on-time        0.20             rain      0.5         late        0.55
  snow, on-time        0.05             snow      0.2
  sun, late            0.10
  rain, late           0.30
  snow, late           0.15

Are X and Y independent here? NO: e.g. P(sun, on-time) = 0.20 ≠ P(sun) P(on-time) = 0.3 × 0.45 = 0.135.

Independence example #2

  joint distribution                        marginal distributions
  x, y                 P(X = x, Y = y)      x       P(X = x)      y               P(Y = y)
  sun, fly-United           0.27            sun       0.3         fly-United        0.9
  rain, fly-United          0.45            rain      0.5         fly-Northwest     0.1
  snow, fly-United          0.18            snow      0.2
  sun, fly-Northwest        0.03
  rain, fly-Northwest       0.05
  snow, fly-Northwest       0.02

Are X and Y independent here? YES: every joint entry factors, e.g. P(sun, fly-United) = 0.27 = 0.3 × 0.9.
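The same check can be sketched in code; the helper name and tolerance are mine:

```python
from itertools import product

def is_independent(joint, tol=1e-9):
    """Check P(x, y) == P(x) * P(y) for every (x, y) in a joint distribution."""
    xs = {x for x, y in joint}
    ys = {y for x, y in joint}
    px = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) < tol
               for x, y in product(xs, ys))

joint2 = {("sun", "fly-United"): 0.27, ("rain", "fly-United"): 0.45,
          ("snow", "fly-United"): 0.18, ("sun", "fly-Northwest"): 0.03,
          ("rain", "fly-Northwest"): 0.05, ("snow", "fly-Northwest"): 0.02}
print(is_independent(joint2))  # True
```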

Conditional independence
Two random variables X and Y are conditionally independent given Z if

  P(X | Y, Z) = P(X | Z)

i.e. "once you know the value of Z, knowing Y doesn't tell you anything about X". Alternatively:

  P(X, Y | Z) = P(X | Z) P(Y | Z)

Conditional independence example

  Flu     Fever   Headache      P
  true    true    true        0.04
  true    true    false       0.04
  true    false   true        0.01
  true    false   false       0.01
  false   true    true        0.009
  false   true    false       0.081
  false   false   true        0.081
  false   false   false       0.729

Are Fever and Headache independent? NO: P(Fever = true, Headache = true) = 0.049, but P(Fever = true) P(Headache = true) = 0.17 × 0.14 ≈ 0.024.

Conditional independence example (cont.)
Using the same joint distribution over Flu, Fever, and Headache as above:
Are Fever and Headache conditionally independent given Flu? YES: e.g. P(Fever = true, Headache = true | Flu = true) = 0.04/0.1 = 0.4, and P(Fever = true | Flu = true) P(Headache = true | Flu = true) = 0.8 × 0.5 = 0.4.
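A sketch that verifies both claims numerically, using the table values as reconstructed above (the encoding and helper are mine):

```python
# P(Flu, Fever, Headache), keyed by (flu, fever, headache).
p = {(True, True, True): 0.04,   (True, True, False): 0.04,
     (True, False, True): 0.01,  (True, False, False): 0.01,
     (False, True, True): 0.009, (False, True, False): 0.081,
     (False, False, True): 0.081, (False, False, False): 0.729}

def marg(pred):
    """Sum P over all (flu, fever, headache) outcomes satisfying pred."""
    return sum(v for k, v in p.items() if pred(*k))

# Marginal independence of Fever and Headache fails:
print(marg(lambda f, fe, h: fe and h),
      marg(lambda f, fe, h: fe) * marg(lambda f, fe, h: h))  # 0.049 vs ~0.024

# Conditional independence given Flu = true holds:
pf = marg(lambda f, fe, h: f)
print(marg(lambda f, fe, h: f and fe and h) / pf,
      (marg(lambda f, fe, h: f and fe) / pf) * (marg(lambda f, fe, h: f and h) / pf))
# 0.4 vs 0.4
```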

Chain rule of probability
For two variables:

  P(X, Y) = P(X | Y) P(Y)

For three variables:

  P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)

etc. To see that this is true, note that

  P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)

Example discrete distributions
– Binomial distribution
– Multinomial distribution

The binomial distribution
– Two outcomes per trial of an experiment
– Distribution over the number of successes in a fixed number n of independent trials (with the same probability of success p in each):

  P(X = x) = C(n, x) p^x (1 − p)^(n − x),  where C(n, x) = n! / (x!(n − x)!)

– e.g. the probability of x heads in n coin flips
[Figure: binomial PMFs P(X = x) versus x, for p = 0.5 and p = 0.1]
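A minimal sketch of the binomial PMF using only the standard library (function name is mine):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x): probability of x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(5, 10, 0.5))  # probability of 5 heads in 10 fair coin flips, ~0.246
```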

The multinomial distribution
– k possible outcomes on each trial, with probability p_i for outcome i in each trial
– Distribution over the vector of outcome occurrences (x_1, …, x_k), where x_i is the number of occurrences of outcome i in a fixed number n of independent trials:

  P(x_1, …, x_k) = ( n! / (x_1! ⋯ x_k!) ) ∏_i p_i^{x_i}

– e.g. with k = 6 (a six-sided die) and n = 30
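And the corresponding multinomial PMF sketch (again, names are mine):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(x_1, ..., x_k) for counts x_i over k outcomes with probabilities p_i."""
    coeff = factorial(sum(counts))
    for x in counts:
        coeff //= factorial(x)
    return coeff * prod(p**x for p, x in zip(probs, counts))

# Probability of seeing each face exactly 5 times in 30 rolls of a fair die.
print(multinomial_pmf([5] * 6, [1/6] * 6))  # ~0.0004
```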

Continuous random variables
– When our outcome is a continuous number we need a continuous random variable, e.g. weight, height
– We specify a density function f(x) for random variable X
– Probabilities are obtained by integrating the density over an interval:

  P(a ≤ X ≤ b) = ∫_a^b f(x) dx

– The probability of taking on any single value is 0

Continuous random variables contd.
To define a probability distribution for a continuous variable, we integrate f(x): the cumulative distribution function is

  P(X ≤ x) = ∫_{−∞}^{x} f(u) du

and the density must satisfy f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1.

Examples of continuous distributions
– Gaussian distribution
– Exponential distribution
– Extreme value distribution

Gaussian distribution
The density with mean μ and standard deviation σ is

  f(x) = ( 1 / (σ √(2π)) ) exp( −(x − μ)² / (2σ²) )
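A one-function sketch of this density (stdlib only):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

print(gaussian_pdf(0.0))  # ~0.3989, the peak of the standard normal density
```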

Extreme value distribution
Used for describing the distribution of extreme values (e.g. maxima) drawn from another distribution.
[Figure: density f(x) of the max values from 1000 sets of 500 samples from a standard normal distribution]
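The figure can be reproduced with a small simulation, sketched here with numpy under the stated setup (1000 sets of 500 standard-normal samples):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 sets of 500 standard-normal samples; keep the max of each set.
maxima = rng.standard_normal((1000, 500)).max(axis=1)

# The maxima cluster well above 0 and are right-skewed, as an EVD predicts.
print(maxima.mean(), maxima.std())  # roughly 3.0 and 0.4
```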

END PROBABILITY PRIMER

Guiding principles of scores in alignments
– We need to assess whether an alignment is biologically meaningful
– Compute the probability of seeing the alignment under two models:
  – Related model R
  – Unrelated model U

R: related model
– Assume each pair of aligned positions evolved from a common ancestor
– Let p_ab be the probability of observing the aligned pair {a, b}
– The probability of an alignment between x and y is

  P(x, y | R) = ∏_i p_{x_i y_i}

U: unrelated model
– Assume the amino acid at each position is independent of the amino acid at every other position
– Let q_a be the background probability of amino acid a
– The probability of an n-character alignment of x and y is

  P(x, y | U) = ∏_i q_{x_i} ∏_i q_{y_i}

Determine which model is more likely
The score of an alignment is based on the relative likelihood of the alignment under R versus U, the odds ratio:

  P(x, y | R) / P(x, y | U) = ∏_i p_{x_i y_i} / (q_{x_i} q_{y_i})

Taking the log we get

  S = Σ_i log( p_{x_i y_i} / (q_{x_i} q_{y_i}) )

Score of an alignment
The substitution matrix entry should thus be

  s(a, b) = log( p_ab / (q_a q_b) )

and the score of an alignment is

  S = Σ_i s(x_i, y_i)
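A sketch of these two formulas in code; the pair and background probabilities below are invented purely for illustration (real values come from matrices like BLOSUM62):

```python
from math import log2

def substitution_score(p_ab, q_a, q_b):
    """Log-odds score s(a, b) = log2( p_ab / (q_a * q_b) )."""
    return log2(p_ab / (q_a * q_b))

def alignment_score(x, y, p, q):
    """Score of an ungapped alignment: S = sum over positions of s(x_i, y_i)."""
    score = 0.0
    for a, b in zip(x, y):
        pair = (a, b) if (a, b) in p else (b, a)  # p is symmetric, stored once
        score += substitution_score(p[pair], q[a], q[b])
    return score

# Toy pair and background probabilities over a two-letter alphabet.
p = {("A", "A"): 0.15, ("A", "S"): 0.05, ("S", "S"): 0.10}
q = {"A": 0.3, "S": 0.2}
print(alignment_score("AAS", "ASS", p, q))  # ~1.80
```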

How to estimate the probabilities?
– Need a good set of confirmed alignments
– Estimates depend upon what we know about when the two sequences diverged:
  – p_ab for closely related species is likely to be low if a ≠ b
  – p_ab for species that diverged a long time ago is likely close to the background probabilities

Some common substitution matrices
– BLOSUM matrices [Henikoff and Henikoff, 1992]: BLOSUM45, BLOSUM50, BLOSUM62; the number gives the percent identity used to cluster the sequences from which the matrix was constructed
– PAM matrices [Dayhoff et al., 1978]
– Empirically, BLOSUM62 works the best

BLOSUM matrices
– BLOcks SUbstitution Matrix
– Derived from a set of aligned, ungapped regions from protein families called blocks, which reside in the BLOCKS database
– Cluster the proteins so that sequences within a cluster have no less than L% similarity:
  – BLOSUM50: proteins with >50% similarity are in the same cluster
  – BLOSUM62: proteins with >62% similarity are in the same cluster
– Calculate substitution frequencies A_ab: the number of times a in one cluster is aligned with b in another cluster

Example substitution scoring matrix (BLOSUM62)

Estimating the probabilities in BLOSUM
– p_ab: fraction of aligned pairs that are {a, b}, i.e. p_ab = A_ab / Σ_{c,d} A_cd, where A_ab is the number of occurrences of the pair {a, b}
– q_a: number of occurrences of a divided by the total number of residues

Example of BLOSUM matrix calculation
Seven aligned sequences:

  1  A T C K Q
  2  A T C R N
  3  A S C K N
  4  S S C R N
  5  S D C E Q
  6  S E C E N
  7  T E C R Q

Clustering at 50% similarity yields three clusters; clustering at 62% similarity yields seven clusters (each sequence its own). Probabilities are then estimated at the 62% level, as in the sketch below.
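A sketch of the estimation on these seven sequences, treating each sequence as its own cluster (the 62% case); this simplification ignores the cluster weighting of the full BLOSUM procedure:

```python
from collections import Counter
from itertools import combinations
from math import log2

seqs = ["ATCKQ", "ATCRN", "ASCKN", "SSCRN", "SDCEQ", "SECEN", "TECRQ"]

pair_counts = Counter()   # A_ab: counts of ordered aligned pairs (a, b)
res_counts = Counter()    # counts of individual residues

for s, t in combinations(seqs, 2):   # every pair of sequences (7 choose 2 = 21)
    for a, b in zip(s, t):           # every aligned column
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1     # count both orders so p_ab is symmetric

for s in seqs:
    res_counts.update(s)

p = {k: v / sum(pair_counts.values()) for k, v in pair_counts.items()}  # p_ab
q = {a: v / sum(res_counts.values()) for a, v in res_counts.items()}    # q_a

# Log-odds substitution score s(a, b) = log2( p_ab / (q_a q_b) ), as defined earlier.
print(round(log2(p[("A", "S")] / (q["A"] * q["S"])), 2))
```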

Assessing significance of the alignment score
There are two ways to do this:
– Bayesian framework
– Classical approach

The classical approach to assessing significance: the extreme value distribution
– Suppose we have a particular substitution matrix and set of amino-acid frequencies
– Consider generating random sequences of lengths m and n and finding the best alignment of each pair
– This gives us a distribution over alignment scores for random pairs of sequences
– If the probability of a random score being greater than our alignment score is small, we can consider our score significant

The extreme value distribution
We're picking the best alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like. This is given by an extreme value distribution.
[Figure: extreme value density P(x) versus x]

Assessing significance of sequence score alignments It can be shown that the mode of the distribution for optimal scores is – K, λ estimated from the substitution matrix Probability of observing a score greater than S

Bayes theorem
An extremely useful theorem:

  P(x | y) = P(y | x) P(x) / P(y)

There are many cases when it is hard to estimate P(x | y) directly, but it's not too hard to estimate P(y | x) and P(x).

Bayes theorem example
– MDs usually aren't good at estimating P(Disorder | Symptom)
– They're usually better at estimating P(Symptom | Disorder)
– If we can estimate P(Fever | Flu) and P(Flu), we can use Bayes' theorem to do diagnosis:

  P(Flu | Fever) = P(Fever | Flu) P(Flu) / P(Fever)
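A closing sketch of the diagnosis computation, using the numbers implied by the flu table earlier (P(Fever | Flu) = 0.8, P(Flu) = 0.1, P(Fever | no Flu) = 0.1):

```python
def bayes(p_y_given_x, p_x, p_y):
    """Bayes theorem: P(x | y) = P(y | x) P(x) / P(y)."""
    return p_y_given_x * p_x / p_y

# P(Fever) = P(Fever | Flu) P(Flu) + P(Fever | no Flu) P(no Flu)
p_fever = 0.8 * 0.1 + 0.1 * 0.9
print(bayes(0.8, 0.1, p_fever))  # P(Flu | Fever) ~= 0.47
```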