Published by Hailee Harrill. Modified over 10 years ago.
1
Pairwise Sequence Alignment
Sushmita Roy, sroy@biostat.wisc.edu
BMI/CS 576, www.biostat.wisc.edu/bmi576/
Sep 10th, 2013
2
Key concepts in today’s class
What is pairwise sequence alignment?
How to handle gaps?
  – Gap penalties
What are the issues in sequence alignment?
What are the different types of alignment problems?
  – Global alignment
  – Local alignment
Dynamic programming solution to finding the best alignment
Given a scoring function, find the best alignment of two sequences
  – Global
  – Local
3
What is sequence alignment?
The task of locating equivalent regions of two or more sequences to maximize their overall similarity.
4
Why sequence alignment?
Identifying similarity between two sequences
Sequence specifies the function of a protein
Similarity in sequence can imply similarity in function.
  – Assign function to uncharacterized sequences based on characterized sequences
Sequences from different species can be compared to estimate the evolutionary relationships between species
  – We will come back to this when we cover phylogenetic trees.
5
Alignment example
T H I S S E Q U E N C E
T H A T S E Q U E N C E
6
How to compare these two sequences?
T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
7
Need to incorporate gaps
Alignment 1: 3 gaps, 9 matches
_ _ _ T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
Alignment 2: 3 gaps, 10 matches
T H I S _ _ _ S E Q U E N C E
T H A T I S A S E Q U E N C E
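Counting matches and gaps column by column is easy to get wrong by hand, so a small check can help. The following is a minimal sketch; the function name alignment_stats and the use of "_" as the gap character are assumptions made for illustration, not part of the slides.

```python
def alignment_stats(row_a, row_b, gap="_"):
    """Return (matches, gaps) for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A column is a match when both rows hold the same non-gap letter.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    # A column counts as a gap when either row holds the gap character.
    gaps = sum(1 for a, b in zip(row_a, row_b) if gap in (a, b))
    return matches, gaps


# The two candidate alignments of THISSEQUENCE with THATISASEQUENCE.
print(alignment_stats("___THISSEQUENCE", "THATISASEQUENCE"))
print(alignment_stats("THIS___SEQUENCE", "THATISASEQUENCE"))
```

Both alignments use the same number of gap characters, so the match count is what separates them here.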
8
Issues in sequence alignment
What type of alignment?
  – Align the entire sequence or part of it?
How to score an alignment?
  – the sequences we’re comparing typically differ in length
  – some amino acid pairs are more substitutable than others
How to find the alignment?
  – Algorithms for alignment
How to tell if the alignment is biologically meaningful?
  – Computing significance of an alignment
9
Pairwise alignment: task definition
Given
  – a pair of sequences (DNA or protein)
  – a scoring function
Do
  – determine the common substrings in the sequences such that the similarity score is maximized
10
Sequences can change through different mutations
substitutions: ACGA → AGGA
insertions: ACGA → ACGGA
deletions: ACGA → AGA
Insertions and deletions show up as gaps in an alignment.
11
Scoring function
Scoring function comprises
  – Substitution matrix s(a,b): indicates the score of aligning a with b
  – Gap penalty function γ(g), where g is the gap length
DNA has a simple scoring function
  – Consider match and mismatch
Proteins have a more complex scoring function
  – Some amino acid pairs might be more substitutable than others
  – Scores might capture similar physical and chemical properties and evolutionary relationships
More next class
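To make the two ingredients concrete, here is a minimal sketch of scoring one given DNA alignment. The match and mismatch values (+1, -1), the per-character gap cost d = 2, the "_" gap character, and the function names are all assumptions chosen for illustration; a real protein scoring function would look up s(a, b) in a substitution matrix instead.

```python
def s(a, b, match=1, mismatch=-1):
    """Toy DNA substitution score: one value for a match, one for a mismatch."""
    return match if a == b else mismatch


def score_alignment(row_x, row_y, d=2, gap="_"):
    """Sum substitution scores over aligned letters and subtract d per gap character."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            total -= d          # each gap character costs d
        else:
            total += s(a, b)    # aligned pair of letters
    return total


print(score_alignment("AAAC", "AG_C"))
```

Charging d for every gap character is the simplest choice of gap penalty; other penalty functions change only the gap branch.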
12
Gap penalty function
Different gap penalty functions require somewhat different scoring algorithms
Linear score
  – γ(g) = -dg, where d is a fixed cost and g is the gap length
  – The longer the gap, the higher the penalty
Affine score
  – γ(g) = -(d + (g-1)e), where d and e are fixed penalties and e < d
  – d: gap open penalty
  – e: gap extension penalty
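The two penalty functions above can be compared side by side. This is a minimal sketch; the default values d = 2 for the linear case and d = 4, e = 1 for the affine case are arbitrary assumptions, and returning 0 for a gap of length 0 is a convenience added here.

```python
def linear_gap(g, d=2):
    """Linear penalty: gamma(g) = -d * g."""
    return -d * g


def affine_gap(g, d=4, e=1):
    """Affine penalty: gamma(g) = -(d + (g - 1) * e), with e < d."""
    if g == 0:
        return 0  # no gap, no penalty
    return -(d + (g - 1) * e)


# Compare how the two penalties grow with gap length.
for g in (1, 2, 5, 10):
    print(g, linear_gap(g), affine_gap(g))
```

With these assumed values, opening a gap is more expensive under the affine scheme, but each additional gap character is cheaper, so long gaps end up penalized less than under the linear scheme.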
13
Types of pairwise sequence alignment problems
Global alignment (Needleman-Wunsch algorithm)
  – Align the entire sequence
Local alignment (Smith-Waterman algorithm)
  – Align part of the sequence
14
Global alignment Given two sequences u and v and a scoring function, find the alignment with the maximal score. The number of possible alignments is very big.
15
Number of possible alignments
Sequence 1: CAATGA
Sequence 2: ATTGAT
Alignment 1:
CAATGA_
_ATTGAT
Alignment 2:
CAATGA
ATTGAT
Alignment 3:
CAATGA_
AT_TGAT
16
Number of possible alignments
The number of possible global alignments for 2 sequences of length n grows exponentially with n
E.g. two sequences of length 100 already have far too many possible alignments to enumerate
But we can use dynamic programming to find an optimal alignment efficiently
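To get a feel for this growth, here is a minimal sketch that assumes one common counting convention, the central binomial coefficient C(2n, n). The function name count_alignments is hypothetical, and other conventions give different counts that still grow exponentially.

```python
import math


def count_alignments(n):
    """Number of global alignments of two length-n sequences, assuming the C(2n, n) convention."""
    return math.comb(2 * n, n)


for n in (1, 2, 10, 100):
    count = count_alignments(n)
    # Print the number of decimal digits to show how quickly the count grows.
    print(n, count, len(str(count)))
```

Even at n = 100 the count has dozens of digits, which is why exhaustive enumeration is not an option.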
17
Dynamic programming (DP) Divide and conquer Solve a problem using pre-computed solutions for subparts of the problem
18
Dynamic programming for sequence alignment
Two sequences: AAAC, AAG
x: AAAC
  – x_i denotes the i-th letter in x
y: AAG
  – y_j denotes the j-th letter in y
Assume the cost of a gap is d.
Assume we have aligned x_1..x_i to y_1..y_j
There are three possibilities for the last column of the alignment:
  – x_i is aligned to y_j
    AAA
    AAG
  – y_j is aligned to a gap
    AA_
    AAG
  – x_i is aligned to a gap
    AAA
    AA_
19
Dynamic programming idea
Let x’s length be n
Let y’s length be m
Construct an (n+1) × (m+1) matrix F
F(j, i) = score of the best alignment of x_1…x_i with y_1…y_j
[Figure: matrix F for x = AAAC and y = AGC; the highlighted cell holds the score of the best alignment of AAA to AG]
20
DP for global alignment with linear gap penalty
DP recurrence relation:
F(j, i) = max of
  – F(j-1, i-1) + s(x_i, y_j)
  – F(j-1, i) - d
  – F(j, i-1) - d
where F(j, i) is the score of the best partial alignment between x_1..x_i and y_1..y_j
21
DP algorithm sketch: global alignment
Initialize first row and column of matrix
Fill in rest of matrix from top to bottom, left to right
For each F(j, i), save pointer(s) to cell(s) that resulted in best score
F(m, n) holds the optimal alignment score
Trace back from F(m, n) to F(0, 0) to recover alignment
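The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the substitution scores (+1 match, -1 mismatch), the "_" gap character, and the names s and global_align are chosen here, and instead of saving pointers the traceback re-checks which neighbouring cell produced each score. When several alignments tie, it returns just one of them.

```python
def s(a, b, match=1, mismatch=-1):
    """Toy substitution score (assumed values)."""
    return match if a == b else mismatch


def global_align(x, y, d=2):
    """Global alignment with linear gap penalty d; returns (score, x_row, y_row)."""
    n, m = len(x), len(y)
    # F[j][i] = best score for aligning x[:i] with y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i] = -d * i          # first row: x letters against gaps
    for j in range(1, m + 1):
        F[j][0] = -d * j          # first column: y letters against gaps
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                F[j - 1][i - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                F[j - 1][i] - d,                          # y_j aligned to a gap
                F[j][i - 1] - d,                          # x_i aligned to a gap
            )
    # Trace back from F[m][n] to F[0][0], building the rows in reverse.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[j][i] == F[j - 1][i - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[j][i] == F[j][i - 1] - d:
            row_x.append(x[i - 1])
            row_y.append("_")
            i -= 1
        else:
            row_x.append("_")
            row_y.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


score, x_row, y_row = global_align("AAAC", "AGC")
print(score)
print(x_row)
print(y_row)
```

Recomputing the move during traceback trades a little extra work for not storing a pointer matrix; saving pointers, as the sketch on the slide suggests, is the better choice when all optimal alignments are wanted.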
22
Global alignment example
Suppose we choose the following scoring scheme:
d = 2, where d is the gap penalty
23
Initializing matrix: global alignment with linear gap penalty
Columns: x = AAAC; rows: y = AGC
            A     A     A     C
      0    -d   -2d   -3d   -4d
A    -d
G   -2d
C   -3d
24
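Global alignment example
Filled matrix for x = AAAC (columns) and y = AGC (rows) with d = 2; the entries are consistent with a match score of +1 and a mismatch score of -1
            A     A     A     C
      0    -2    -4    -6    -8
A    -2     1    -1    -3    -5
G    -4    -1     0    -2    -4
C    -6    -3    -2    -1    -1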
25
Trace back to retrieve the alignment
[Figure: the filled matrix with the traceback path from the bottom-right cell to the top-left cell]
One optimal alignment:
x: A A A C
y: A G _ C
26
Computational complexity
initialization: O(m), O(n), where the sequence lengths are m, n
filling in rest of matrix: O(mn)
traceback: O(m + n)
Since the sequences have nearly the same length, the computational complexity is O(n^2)
27
Summary of DP
Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1)
works for either DNA or protein sequences, although the substitution matrices used differ
finds an optimal alignment
the exact algorithm (and computational complexity) depends on the gap penalty function
28
Local alignment
Look for local regions of sequence similarity.
  – Aligning substrings of x and y.
Try to find the best alignment over all possible substrings.
This seems difficult computationally.
  – But it can be solved by DP.
29
Local alignment motivation
Comparing protein sequences that share a domain (independently folded unit) but differ elsewhere
Comparing DNA sequences that share a similar sequence pattern but differ elsewhere
More sensitive when comparing highly diverged sequences
  – Selection on the genome differs between regions
  – Unconstrained regions are not very alignable
30
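Local alignment DP
Similar to global alignment with a few exceptions
The DP recurrence is different:
F(j, i) = max of
  – 0
  – F(j-1, i-1) + s(x_i, y_j)
  – F(j-1, i) - d
  – F(j, i-1) - d
New in local: the 0 option starts a new alignment
The top row and left column are initialized to 0s.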
31
Local alignment DP algorithm
Initialization: first row and first column initialized with 0’s
Traceback:
  – Find the maximum value of F(j, i); it can be anywhere in the matrix
  – Stop when we get to a cell with value 0
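The local variant changes only the initialization, the extra 0 option, and where the traceback starts and stops. Here is a minimal sketch under stated assumptions: substitution scores of +1 and -1, a linear gap penalty d = 2, the "_" gap character, and the example pair TTAAG and AAGA are all chosen for illustration, as are the names s and local_align.

```python
def s(a, b, match=1, mismatch=-1):
    """Toy substitution score (assumed values)."""
    return match if a == b else mismatch


def local_align(x, y, d=2):
    """Local alignment with linear gap penalty d; returns (score, x_row, y_row)."""
    n, m = len(x), len(y)
    # First row and first column stay 0: an alignment may start anywhere.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_j, best_i = 0, 0, 0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                0,                                        # start a new alignment
                F[j - 1][i - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                F[j - 1][i] - d,                          # y_j aligned to a gap
                F[j][i - 1] - d,                          # x_i aligned to a gap
            )
            if F[j][i] > best:
                best, best_j, best_i = F[j][i], j, i
    # Trace back from the maximum cell until a cell with value 0 is reached.
    row_x, row_y = [], []
    j, i = best_j, best_i
    while F[j][i] > 0:
        if F[j][i] == F[j - 1][i - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i -= 1
            j -= 1
        elif F[j][i] == F[j][i - 1] - d:
            row_x.append(x[i - 1])
            row_y.append("_")
            i -= 1
        else:
            row_x.append("_")
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(local_align("TTAAG", "AAGA"))
```

Because every cell is clamped at 0, a poorly matching prefix never drags down a good region that follows it, which is what lets the best-scoring pair of substrings surface.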
32
Local alignment example
[Figure: filled local-alignment matrix whose first row and first column are all 0s; one of the sequences is TTAAG]
Best local alignment:
x: A A G
y: A A G