Pairwise Alignment
A superposition of two sequences that reveals a large number of common regions (matches).
Possible alignments of ACATGCGATT and GAGATCTGA:

-AC-ATGC-GATT
GA-GAT-CTGA--    6 matches, 7 gaps, 0 mismatches

-ACATGC-GATT
GAGAT-CTGA--     6 matches, 5 gaps, 1 mismatch

-ACATGCGATT
GAGATCTGA--      5 matches, 3 gaps, 3 mismatches
Pairwise Alignment
An alignment is a hypothesis about the transformations that have converted one sequence into another:

              GATTACA
mutations     GATTAGA
deletions     GAT-ACA
insertions    GATTTACA

(The gaps represent insertions/deletions, also called indels.)
Scoring Function
To evaluate the quality of an alignment, assign scores for matches (m), gaps (g), and mismatches (s):

Score = #matches × m + #gaps × g + #mismatches × s

With m = 2, g = -2, s = -1:

-AC-ATGC-GATT
GA-GAT-CTGA--    Score = 6 × 2 + 7 × (-2) + 0 × (-1) = -2

-ACATGC-GATT
GAGAT-CTGA--     Score = 6 × 2 + 5 × (-2) + 1 × (-1) = 1

-ACATGCGATT
GAGATCTGA--      Score = 5 × 2 + 3 × (-2) + 3 × (-1) = 1
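The scoring function can be checked mechanically. The following is a minimal sketch, assuming the two aligned strings have equal length and use `-` for gaps; the function name `score_alignment` is made up for this illustration, and the default scores are the ones from the slide.

```python
def score_alignment(top, bottom, m=2, g=-2, s=-1):
    """Count matches, gaps and mismatches column by column, then score them."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    matches = gaps = mismatches = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1            # a base paired with a gap
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = matches * m + gaps * g + mismatches * s
    return matches, gaps, mismatches, score


print(score_alignment("-ACATGC-GATT", "GAGAT-CTGA--"))  # (6, 5, 1, 1)
print(score_alignment("-ACATGCGATT", "GAGATCTGA--"))    # (5, 3, 3, 1)
```

Each column counts once, so a gap is charged per position rather than per run of gaps.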
Computing Alignment
Different types of alignment, depending on the research question:
  Global Alignment – find the overall similarity
  Semiglobal Alignment – ignore leading and trailing gaps at the ends of the alignment
  Local Alignment – look for a maximal-scoring common fragment
All can be computed using a variation of the Dynamic Programming (table-filling) algorithm.
Illustrative example – a tour of Manhattan
Manhattan Tour
A sightseeing tour starts at 1st Street and 1st Avenue, (1, 1), and ends at 7th Street and 9th Avenue, (7, 9).
The tourists are allowed to move only South and East.
Goal: see as many landmarks as possible.
Manhattan Tour Strategy
For each crossing, record the maximum number of sites that can be seen on the way to it.
Manhattan Tour Strategy
Let T(s, a) denote the maximum number of sites that can be seen starting from the origin up to intersection (s, a).
Then the previous algorithm uses the fact that

T(s, a) = max of:
    T(s-1, a) + # of sites between streets s-1 and s
    T(s, a-1) + # of sites between avenues a-1 and a

In other words, to get to (s, a) we could have moved one block East, from (s, a-1), or one block South, from (s-1, a).
If we know the max # of sites that could be seen up to (s, a-1) and up to (s-1, a), we just need to add the number of sites along each direction and pick the larger number.
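The recurrence for T(s, a) can be sketched as a table fill. The data layout is an assumption made for this example, not something the slides specify: `south[s][a]` holds the number of sites on the block leading South out of intersection (s, a), `east[s][a]` the number on the block leading East, and indices start at 0 rather than 1.

```python
def manhattan_tour(south, east):
    """Return the table T of best site counts for every intersection.

    south[s][a]: sites on the block from (s, a) to (s + 1, a)
    east[s][a]:  sites on the block from (s, a) to (s, a + 1)
    """
    rows = len(east)            # number of streets
    cols = len(east[0]) + 1     # number of avenues
    T = [[0] * cols for _ in range(rows)]
    for a in range(1, cols):    # along the first street only East moves exist
        T[0][a] = T[0][a - 1] + east[0][a - 1]
    for s in range(1, rows):    # along the first avenue only South moves exist
        T[s][0] = T[s - 1][0] + south[s - 1][0]
    for s in range(1, rows):
        for a in range(1, cols):
            T[s][a] = max(T[s - 1][a] + south[s - 1][a],   # arrived from the North
                          T[s][a - 1] + east[s][a - 1])    # arrived from the West
    return T


# A made-up 3 x 3 grid of intersections.
south = [[1, 0, 2],
         [0, 3, 1]]
east = [[2, 1],
        [0, 4],
        [1, 1]]
print(manhattan_tour(south, east)[-1][-1])  # 7
```

Every cell is computed once from its two already-filled neighbours, so the work grows with the number of intersections rather than with the number of possible routes.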
Global Alignment
How is the Manhattan Tour related to global sequence alignment?
Given strands A and B of lengths m and n (here m is a length, not the match score), align A[1 : m] and B[1 : n]:
  option 1: ignore the last base of A (pair it with a gap) – then align A[1 : m-1] and B[1 : n]      (gap penalty)
  option 2: ignore the last base of B (pair it with a gap) – then align A[1 : m] and B[1 : n-1]      (gap penalty)
  option 3: pair up the last bases of A and B – then align A[1 : m-1] and B[1 : n-1]                 (match/mismatch penalty)
(Pick the best option)
Computing Global Alignment
In other words, if Score(i, j) denotes the best score for aligning A[1 : i] and B[1 : j], then

Score(i, j) = max of:
    Score(i-1, j) + g        align A[i] with GAP
    Score(i, j-1) + g        align B[j] with GAP
    Score(i-1, j-1) + m      if A[i] == B[j]
    Score(i-1, j-1) + s      if A[i] != B[j]

Just like the Manhattan tour, if we use a 2D table, the contents of cell (i, j) depend only on
    the cell above: (i-1, j)
    the cell to the left: (i, j-1)
    the cell diagonally above: (i-1, j-1)
Computing Global Alignment
What do we do when one strand runs out of bases?
  Aligning the first i bases of A, A[1 : i], with the first 0 bases of B (empty): Score(i, 0) = i*g
  Aligning the first 0 bases of A (empty) with the first j bases of B, B[1 : j]: Score(0, j) = j*g
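The recurrence and its two boundary cases translate almost literally into a recursive function. This is a sketch, not the table-filling method of the slides: it runs top-down and relies on memoization (`functools.lru_cache`) so each (i, j) is evaluated once. The name `global_score` is hypothetical, and Python strings are 0-indexed, so A[i] on the slides is `A[i - 1]` here.

```python
from functools import lru_cache


def global_score(A, B, m=2, g=-2, s=-1):
    """Best global alignment score of A and B, straight from the recurrence."""

    @lru_cache(maxsize=None)
    def score(i, j):
        if j == 0:                      # B has run out of bases
            return i * g
        if i == 0:                      # A has run out of bases
            return j * g
        pair = m if A[i - 1] == B[j - 1] else s
        return max(score(i - 1, j) + g,          # align A[i] with a gap
                   score(i, j - 1) + g,          # align B[j] with a gap
                   score(i - 1, j - 1) + pair)   # pair A[i] with B[j]

    return score(len(A), len(B))


print(global_score("CACTAG", "GATTACA"))
```

For long sequences the recursion depth would exceed Python's default limit, which is one practical reason to fill a table bottom-up instead.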
Global Alignment Example
Align CACTAG and GATTACA using g = -2, s = -1, m = 2

       -    G    A    T    T    A    C    A
  -
  C
  A
  C
  T
  A
  G
Global Alignment Example
Align CACTAG and GATTACA using g = -2, s = -1, m = 2
Column 0 is initialized with Score(i, 0) = i*g:

       -    G    A    T    T    A    C    A
  C   -2
  A   -4
  C   -6
  T   -8
  A  -10
  G  -12
Global Alignment Example
Align GCTGC and AGATC using g = -2, s = -1, m = 2

       -    A    G    A    T    C
  -
  G
  C
  T
  G
  C
Global Alignment Example
Align GCTGC and AGATC using g = -2, s = -1, m = 2
Resulting alignment:

GCTGC:  -GCTGC
AGATC:  AGAT-C
Global Alignment Summary
If Score(i, j) denotes the best score for aligning A[1 : i] and B[1 : j]:

Score(i, j) = max of:
    Score(i-1, j) + g        align A[i] with GAP
    Score(i, j-1) + g        align B[j] with GAP
    Score(i-1, j-1) + m      if A[i] == B[j]
    Score(i-1, j-1) + s      if A[i] != B[j]

Score(i, 0) = i * g
Score(0, j) = j * g

Identifying the actual alignment is done by tracing back the pointers, starting at the lower-right corner.
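The traceback step can be sketched by storing, next to each score, a pointer to the neighbour it came from. The pointer labels (`"diag"`, `"up"`, `"left"`), the function name `align_with_traceback`, and the tie-breaking order (diagonal first, then up, then left) are all assumptions of this sketch; when several alignments share the best score, a different tie-break would return a different one.

```python
def align_with_traceback(A, B, m=2, g=-2, s=-1):
    """Fill scores and pointers, then walk the pointers back from the corner."""
    rows, cols = len(A) + 1, len(B) + 1
    score = [[0] * cols for _ in range(rows)]
    ptr = [[None] * cols for _ in range(rows)]    # only cell (0, 0) stays None
    for i in range(1, rows):
        score[i][0], ptr[i][0] = i * g, "up"
    for j in range(1, cols):
        score[0][j], ptr[0][j] = j * g, "left"
    for i in range(1, rows):
        for j in range(1, cols):
            pair = m if A[i - 1] == B[j - 1] else s
            options = [(score[i - 1][j - 1] + pair, "diag"),
                       (score[i - 1][j] + g, "up"),
                       (score[i][j - 1] + g, "left")]
            # max() keeps the first of equally good options: diag, up, left
            score[i][j], ptr[i][j] = max(options, key=lambda o: o[0])

    i, j = len(A), len(B)                         # lower-right corner
    top, bottom = [], []
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":                        # A[i] paired with B[j]
            i, j = i - 1, j - 1
            top.append(A[i])
            bottom.append(B[j])
        elif move == "up":                        # A[i] paired with a gap
            i -= 1
            top.append(A[i])
            bottom.append("-")
        else:                                     # B[j] paired with a gap
            j -= 1
            top.append("-")
            bottom.append(B[j])
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(align_with_traceback("GCTGC", "AGATC"))  # (1, '-GCTGC', 'AGAT-C')
```

The alignment is collected back to front, which is why both strings are reversed before being returned.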
Global Alignment Algorithm
To compute GLOBAL ALIGNMENT given two sequences:
1. create a matrix with rows, cols equal to the lengths of the two sequences plus one, respectively (row 0 and column 0 stand for the empty prefix)
   # initialize the cells of row 0 and column 0 only
2. for each column c, set cell(0, c) to c*gap
3. for each row r, set cell(r, 0) to r*gap
4. for each row in the matrix starting at 1:
5.     for each col in the matrix starting at 1:
6.         calculate option1, option2, option3
7.         set the current cell to the largest value of option1, option2, option3
8. return the matrix (or the score in its lower-right cell)
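The numbered steps map onto Python roughly as follows. This is one possible reading of the pseudocode, with the step numbers kept as comments; the function and parameter names are chosen for this sketch. It allocates one row and one column more than the sequence lengths, so that row 0 and column 0 can hold the scores for the empty prefix.

```python
def global_alignment_table(seq_a, seq_b, match=2, gap=-2, mismatch=-1):
    # 1. matrix with len + 1 rows and columns (index 0 is the empty prefix)
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    # 2. row 0: aligning nothing with the first c bases of seq_b
    for c in range(cols):
        table[0][c] = c * gap
    # 3. column 0: aligning the first r bases of seq_a with nothing
    for r in range(rows):
        table[r][0] = r * gap
    # 4-5. visit the remaining cells row by row
    for r in range(1, rows):
        for c in range(1, cols):
            # 6. the three ways of reaching this cell
            pair = match if seq_a[r - 1] == seq_b[c - 1] else mismatch
            option1 = table[r - 1][c] + gap        # seq_a[r] against a gap
            option2 = table[r][c - 1] + gap        # seq_b[c] against a gap
            option3 = table[r - 1][c - 1] + pair   # seq_a[r] against seq_b[c]
            # 7. keep the best one
            table[r][c] = max(option1, option2, option3)
    # 8. the global score sits in the lower-right cell
    return table


table = global_alignment_table("CACTAG", "GATTACA")
for row in table:
    print(" ".join(f"{value:4d}" for value in row))
print(table[-1][-1])  # 1
```

The global score is read from the lower-right cell, not from the largest entry: this table contains a 4 further up, but the score of the full alignment is 1.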