Pairwise Sequence Alignment (II)

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Dynamic Programming: Sequence alignment
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
Heuristic alignment algorithms and cost matrices
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Sequence Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Introduction to bioinformatics
Sequence similarity.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Developing Sequence Alignment Algorithms in C++ Dr. Nancy Warter-Perez May 21, 2002.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
LCS and Extensions to Global and Local Alignment Dr. Nancy Warter-Perez June 26, 2003.
Dynamic Programming I Definition of Dynamic Programming
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Brandon Andrews.  Longest Common Subsequences  Global Sequence Alignment  Scoring Alignments  Local Sequence Alignment  Alignment with Gap Penalties.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
An Introduction to Bioinformatics
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Sequence similarity, BLAST alignments & multiple sequence alignments
Sequence Alignment.
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Sequence Alignment Using Dynamic Programming
Pairwise sequence Alignment.
Pair Hidden Markov Model
Intro to Alignment Algorithms: Global and Local
Pairwise Sequence Alignment (cont.)
Sequence Alignment.
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Multiple Sequence Alignment (I)
Pairwise Alignment Global & local alignment
Presentation transcript:

Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

Review: Dynamic Programming for LCS T C w Edit graph representation of alignment Path = alignment Incrementally fill in the table Backtrack to find the best alignment 1 2 3 4 5 6 7 A T G v 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 3 1 2 2 3 3 3 3 4 1 2 2 3 4 4 4 5 1 2 2 3 4 4 4 6 1 2 2 3 4 5 5 1 7 2 2 3 4 5 5

The LCS Recurrence Revisited The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty of indels are 0: Insertion/deletion score Matching score si-1, j-1+1 if vi = wj si,j = max si-1, j + 0 si, j-1 + 0 How do we improve scoring?

How do we improve the scoring of alignments? Can we still find an alignment efficiently?

Outline Improve Scoring Variants of Alignment Scoring Matrix Affine Gap Penalty Variants of Alignment Global vs. Local alignment Assessing Score Significance

The same dynamic programming algorithm would still work! Scoring Matrices To generalize scoring, consider a (4+1) x(4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score with comparison of a gap character “-”. This will simplify the scoring algorithm as follows: si-1,j-1 + δ (vi, wj) si,j = max s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) { The same dynamic programming algorithm would still work!

The Global Alignment Problem Find the best alignment between two strings under a given scoring matrix Input : Strings v & w and a scoring matrix δ Output : Alignment of maximum score Algorithm: Dynamic programming si-1,j-1 + δ (vi, wj) si,j = max s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) { The only question left is how to define the scoring matrix…

Measuring Similarity Measuring the extent of similarity between two sequences Based on percent sequence identity Based on conservation

Percent Sequence Identity The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G mismatch indel 70% identical

Simple Scoring When mismatches are penalized by some constant –μ, indels are penalized by some other constant –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels)

Making a Better Scoring Matrix Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations in the sequence. Some of these mutations have little effect on the organism’s function, therefore some penalties, δ(vi , wj), will be less harsh than others.

Scoring Matrix: Example K 5 -2 -1 - 7 3 6 Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids will not greatly change function of protein. AKRANR KAAANK -1 + (-1) + (-2) + 5 + 7 + 3 = 11

Scoring matrices Amino acid substitution matrices PAM BLOSUM DNA substitution matrices DNA: less conserved than protein sequences Less effective to compare coding regions at nucleotide level Simple scoring is often used

PAM some residues may have mutated several times Point Accepted Mutation (Dayhoff et al.) 1 PAM = PAM1 = 1% average change of all amino acid positions After 100 PAMs of evolution, not every residue will have changed some residues may have mutated several times some residues may have returned to their original state some residues may have not changed at all

Think of PAM1 as 1-step transitions and PAM250 as 250-step transitions PAMX PAMx = PAM1x PAM250 = PAM1250 PAM250 is a widely used scoring matrix: Think of PAM1 as 1-step transitions and PAM250 as 250-step transitions Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ... Arg R 3 17 4 3 2 5 3 2 6 3 2 9 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 ... Trp W 0 2 0 0 0 0 0 0 1 0 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 Val V 7 4 4 4 4 4 4 4 5 4 15 10

BLOSUM Blocks Substitution Matrix Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins Matrix name indicates evolutionary distance BLOSUMx was created using sequences sharing no more than x% identity E.g., BLOSUM62 <-> 62% identity

The Blosum50 Scoring Matrix Probability of seeing x aligned with y Val(x,y)=log(p(x,y)/p(x)p(y)) Probability of seeing x (or y) alone

Deficiency in Scoring of Indels A fixed penalty σ is given to every indel: -σ when there is 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels

Deficiency in Scoring of Indels (cont.) In nature, many times indels come as a unit, not just at 1 nucleotide at a time. ATA__GC ATATTGC ATAG_GC AT_GTGC In nature, this is more likely. Normal scoring would give the same score for both alignments

Accounting for Gaps Gaps- contiguous sequence of spaces in one of the rows Score for a gap of length x is: -(ρ + σx), where ρ >0 is the penalty for introducing a gap. ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penalties Gap penalties: -ρ-σ when there is 1 indels, -ρ-2σ when there are 2 indels, -ρ-3σ when there are 3 indels, etc. -ρ- x * σ (-gap opening - x gap extensions) Somehow reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges

Affine Gap Penalty Recurrences si,j = s i-1,j - σ max s i-1,j –(ρ+σ) si,j = s i,j-1 - σ max s i,j-1 –(ρ+σ) si,j = si-1,j-1 + δ (vi, wj) max s i,j s i,j Continue Gap in w (deletion) Start Gap in w (deletion) Continue Gap in v (insertion) Start Gap in v (insertion) Match or Mismatch End deletion End insertion Once again, the same dynamic programming algorithm would work!

Local vs. Global Alignment The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.

Local vs. Global Alignment (cont’d) Local Alignment—better alignment to find conserved segment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignments: Why? Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

The Local Alignment Problem Goal: Find the best local alignment between two strings Input : Strings v, w and scoring matrix δ Output : Alignment of substrings of v & w whose alignment score is maximum among all possible alignment of all possible substrings

Local Alignment in Edit Graph Compute a “mini” Global Alignment to get Local Local alignment Global alignment

The Problem with this Problem Problem of this, long run time O(n4): - There are ~n2 pairs of vertices (i,j) - For each pair of vertices computing an alignment takes O(n2) time. Solution: Dynamic programming again! Question: How do we recursively compute the best score of any local (as opposed to global) alignment for each cell in the edit graph?

The Local Alignment Recurrence The largest value of si,j over the whole edit graph is the score of the best local alignment. The recurrence is shown below: Notice there is only this change from the original recurrence of a Global Alignment si,j = max si-1,j-1 + δ (vi, wj) s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) { This is the well-known Waterman-Smith local alignment algoirthm

Assessing Score Signficance In general, larger s  more significant. The question is how large should s be? Factors to be considered: Sequence length: longer sequences are expected to give higher scores # sequences in the database: the score of the best alignment is expected to be higher for a larger DB Evolution time: longer evolution causes more mismatches, making a lower score more significant The Challenge is how to quantify all these…

Log-odds score of the alignment Two Basic Approaches The classical approach: Extreme value distribution (EVD) Assume a null (random) model for scores MR P(Score > s|MR, a(x, y))=? (a(x,y)=alignment of x, y) The Bayesian approach: Model comparison Assume two models for a(x,y): random R; aligned: M P(M|a(x,y))/P(R|a(x,y))=? Log-odds score of the alignment prior

EVD of the Best Score in Ungapped Local Alignment The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean The probability that there is a match of score greater than S is K and  can be fit using randomly generated data This gives a way to test statistical significance p(x>21)= 0.01 vs. p(x>21)=0.3 Parameters Sequence lengths

Bayesian Model Comparison Assumptions: M is a model for related sequences R is a model for unrelated sequences (random) Ungapped alignment n=m Alignment of each pair is independent BLOSUM Scoring Score S(x,y) Prior (Subjective!) This partially addresses Q1: how to design the scoring function?

Pairwise Alignment Summary (Modeling evolution) Q1: How should we define s? Q2: How should we define A? (Application-specific) Model: scoring function s: A X=x1,…,xn X=x1,…,xn Possible alignments of X and Y: A ={a1,…,ak} Find the best alignment(s) … S(a*)= 21 Y=y1,…,ym Y=y1,…,ym Q4: Is the alignment biologically Meaningful or just the best alignment of two unrelated sequences? Q3: How can we find a* quickly? (Dynamic programming) Q1 & Q4 are related! (Models for scores)

What You Should Know Alignment Scoring Methods (Matrix & Gap) Global vs. Local alignments How the dynamic programming algorithm solves both local and global alignments with a number of scoring strategies Basic idea in assessing score significance