Sequence Alignment 11/24/2018
Motivation: Types
- Two sequences of the same length in which some characters differ (database search):
    aagtacggaga
    aagcaccgaga
- Two sequences of different lengths, with possible gaps in one of them (database search):
    aaccaccgaga
    aa-caccgaga
Motivation: Types
- Match the longest prefix of one sequence with the suffix of the other (fragment assembly):
    aaacgtcgata
    gatacgatg
- Local alignment: longest matching substrings of the two sequences (homolog search):
    gatacgatgctagtttacg
    agagcgatgcataattcgaatga
Motivation: Types
- Multiple sequence alignment (page 71), used in comparative studies of sequences
Formalizing sequence comparison
- In each column of an alignment, either a character matches the corresponding character (+1),
- or it does not match (-1),
- or a gap has to be inserted (-2)
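The scoring scheme above can be sketched as a small function that scores an alignment that is already given. The names `column_score` and `alignment_score` are illustrative assumptions; only the values +1, -1, -2 come from the slide.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # scores from the slide


def column_score(a: str, b: str) -> int:
    """Score one alignment column; '-' denotes a gap."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(row_s: str, row_t: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have equal length")
    return sum(column_score(a, b) for a, b in zip(row_s, row_t))


# The two example pairs from the motivation slides.
print(alignment_score("aagtacggaga", "aagcaccgaga"))  # 9 matches, 2 mismatches -> 7
print(alignment_score("aaccaccgaga", "aa-caccgaga"))  # 10 matches, 1 gap -> 8
```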
Global Alignment
- Needleman-Wunsch (1970) dynamic programming algorithm
- Scoring matrix for the alignment (p 31)
- Initialize the boundaries (first row and column) of the scoring matrix for gaps in front of either string
- Meaning of an entry of the matrix: entry (i, j) is the best score for aligning s[1..i] with t[1..j]
- The corner element is the final score
Global Alignment
- Three alternatives in each iteration: align s[i] with t[j], align s[i] with a gap, or align t[j] with a gap
- Ordering of calculation: row-wise or column-wise
- The algorithm (p 52)
- Recursive recovery of the alignment from the corner element (m and n, the string lengths, are constants)
- Variable len returned by the algorithm
- Convention for tie breaking
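A minimal sketch of the global alignment dynamic program and its recovery step, assuming the +1/-1/-2 scores from the earlier slide. The function name `global_align` is hypothetical, the recovery is written as a loop rather than recursively, and the tie-breaking order (diagonal, then gap in t, then gap in s) is just one possible convention, not necessarily the one on p 52.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)

    def p(i, j):
        # Score for placing s[i] and t[j] (1-based) in the same column.
        return match if s[i - 1] == t[j - 1] else mismatch

    # a[i][j] = best score for aligning s[1..i] with t[1..j].
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * gap  # s[1..i] against i gaps
    for j in range(1, m + 1):
        a[0][j] = j * gap  # t[1..j] against j gaps
    for i in range(1, n + 1):  # row-wise fill
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + p(i, j),  # s[i] with t[j]
                          a[i - 1][j] + gap,          # s[i] with a gap
                          a[i][j - 1] + gap)          # t[j] with a gap

    # Recover one optimal alignment, walking back from the corner element.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return a[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, row_s, row_t = global_align("aaccaccgaga", "aacaccgaga")
print(score)
print(row_s)
print(row_t)
```

The length of the returned rows plays the role of the alignment length; which of several equally good alignments is returned depends only on the tie-breaking order.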
Local alignment
- The alignment may start and stop anywhere
- So the minimum score is zero, even on the boundaries
- The best local alignment ends where the score is maximal in the matrix
- Recovery starts from that maximum and stops at a zero entry
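The local variant can be sketched by clamping every entry at zero and starting the recovery from the maximum entry. The name `local_align` and the test strings are illustrative assumptions; the scores are the same +1/-1/-2 as before.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a best local alignment."""
    n, m = len(s), len(t)
    # Boundaries stay at zero: an alignment may start anywhere.
    a = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                    # start a fresh alignment here
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
            if a[i][j] > best:
                best, best_i, best_j = a[i][j], i, j

    # Recovery: start at the maximum entry, stop at the first zero entry.
    row_s, row_t = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(local_align("xxxacgtyyy", "zzacgtww"))  # (4, 'acgt', 'acgt')
```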
Semi-global (as-required) alignment
- Four alternatives: penalty-free gaps in front of s, in front of t, at the back of s, or at the back of t
- Prefix-suffix matching by combining these alternatives
- E.g., suffix of s with prefix of t: free gaps at the back of s and in front of t
Semi-global alignment
- Example: p 56
- Free gaps in front: zeros in the row or column representing the string
- Free gaps at the back: recovery starts from the maximum of the row or column representing the string
- The above may be combined as required
- Exercise: how to combine them for matching a suffix of s with a prefix of t
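One possible way to combine the two rules for a suffix of s against a prefix of t, as a sketch only. It assumes rows index s and columns index t, so free gaps in front of t become a zero first column, and free gaps at the back of s become a maximum over the last row. The name `overlap_align` is hypothetical.

```python
def overlap_align(s, t, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a suffix of s with a prefix of t.

    Returns (score, j), where j is the length of the prefix of t used.
    """
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # First column stays 0: characters of s before the overlap cost nothing
    # (free gaps in front of t).
    for j in range(1, m + 1):
        a[0][j] = j * gap  # gaps in front of s are still charged
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
    # Free gaps at the back of s: take the maximum over the last row.
    best_j = max(range(m + 1), key=lambda j: a[n][j])
    return a[n][best_j], best_j


# Fragment-assembly example from the motivation slide:
# the suffix "gata" of s overlaps the prefix "gata" of t.
print(overlap_align("aaacgtcgata", "gatacgatg"))  # (4, 4)
```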
Generalized gap penalty
- A block of k consecutive gaps may cost the same as a single gap, or follow some formula w(k)
- Each block of gaps is treated as one unit (like a character)
- Boundary (first row and column) initialization with w(k)
Generalized gap penalty
- Three interplaying matrices:
  - one for alignments ending in a character match, scored with p(i, j)
  - one for alignments ending in gaps in s
  - one for alignments ending in gaps in t
- Formula on p 63
Affine gap penalty
- Generalized gap penalty with w(k) = h + gk; the first gap of a block costs more (h + g), each further gap costs g
- The formula changes slightly now that w(k) is known
- The block gap matrices compare only the previous elements, so the complexity is reduced
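A sketch of the three-matrix scheme for w(k) = h + gk, computing the score only. The values h = 2 and g = 1, the matrix names `a`, `b`, `c`, and the function name `affine_score` are assumptions for illustration and are not taken from the formula on p 63.

```python
NEG_INF = float("-inf")


def affine_score(s, t, h=2, g=1, match=1, mismatch=-1):
    """Best global score when a block of k gaps costs w(k) = h + g*k."""
    n, m = len(s), len(t)
    # a: alignments ending with s[i] against t[j]
    # b: alignments ending with a gap in s (gap against t[j])
    # c: alignments ending with a gap in t (s[i] against gap)
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)  # one block of j gaps in front of s
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)  # one block of i gaps in front of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Opening a new block costs h + g, extending a block costs g.
            b[i][j] = max(a[i][j - 1] - (h + g),
                          b[i][j - 1] - g,
                          c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g),
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])


print(affine_score("aaccaccgaga", "aacaccgaga"))  # 10 matches, one gap: 10 - 3 = 7
print(affine_score("acgtacgt", "acgt"))           # 4 matches, one block of 4: 4 - 6 = -2
```

Each entry looks only at a constant number of previous entries, which is where the saving over a general w(k) comes from.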
Multiple sequence alignment
- Scoring function for each column: a character or a gap for each sequence
- Combinatorics: 2^k - 1 gap patterns per column for k sequences (-1 because a column of all gaps is not allowed)
- But . . .
Multiple sequence alignment
- The order of arguments of the function should not matter: f(I, -, v) = f(I, v, -)
- Score pairwise on a column
- Combinatorics: (k choose 2) pairs
- For k = 10: 2^k - 1 = 1023, kC2 = 45
- We need gap-to-gap scoring now
Multiple sequence alignment
- The total score can be measured either way: sum over all columns, or sum over all pairs of sequences
- If p(-, -) = 0, both ways of scoring give the same total
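The two ways of totalling the score can be checked on a small example. This sketch assumes the +1/-1/-2 pairwise scores used earlier, p(-, -) = 0, and that "sum over pairs" means scoring each induced pairwise alignment after dropping columns where both rows have a gap. The three aligned rows are made up for illustration.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2


def pairwise(a, b):
    """Ordinary two-sequence column score (no gap-gap case needed)."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_pair(a, b):
    """Pair score inside a multiple-alignment column, with p(-, -) = 0."""
    return 0 if a == b == "-" else pairwise(a, b)


def sp_by_columns(rows):
    """Sum over columns of the sum-of-pairs score of each column."""
    return sum(column_pair(a, b)
               for col in zip(*rows)
               for a, b in combinations(col, 2))


def sp_by_pairs(rows):
    """Sum over sequence pairs of the induced pairwise alignment score."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        # Induced alignment: drop columns where both rows have a gap.
        induced = [(a, b) for a, b in zip(r1, r2) if not (a == b == "-")]
        total += sum(pairwise(a, b) for a, b in induced)
    return total


rows = ["a-c-g",
        "a-cag",
        "atc-g"]
print(sp_by_columns(rows), sp_by_pairs(rows))  # 1 1
```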
Multiple sequence alignment
- Consider the alignment of 3 sequences s1, s2, and s3
- The (i, j, k)-th entry of the scoring matrix is for aligning s1[1..i], s2[1..j], s3[1..k]
- 3D matrix of dimension n x m x l, for |s1| = n, |s2| = m, |s3| = l
Multiple sequence alignment
- Each entry of the scoring matrix sits at a corner of a 3D box
- The optimal score is the max calculated over the other 7 corners: A[i-1, j, k], A[i, j-1, k], A[i, j, k-1], A[i-1, j-1, k], A[i-1, j, k-1], A[i, j-1, k-1], A[i-1, j-1, k-1] [Vector(i, j, k) - bit-vector]
- In each case the sum-of-pairs score of the column is added [EXAMPLE]
- Initialization: (-4)i for 1 <= i <= n, for two gaps against each character of the prefix of s1; likewise for s2 and s3
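A sketch of the 3D recurrence, score only, assuming sum-of-pairs column scores with +1/-1/-2 and p(-, -) = 0. The 7 bit-vectors are generated with `itertools.product`; the boundary values are not set separately here but fall out of the same recurrence. The function name `msa3_score` is hypothetical.

```python
from itertools import combinations, product

NEG_INF = float("-inf")


def msa3_score(s1, s2, s3, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for aligning three sequences."""

    def pair(a, b):
        if a == "-" and b == "-":
            return 0  # p(-, -) = 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    n, m, l = len(s1), len(s2), len(s3)
    A = [[[NEG_INF] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    A[0][0][0] = 0
    # The 7 non-zero bit-vectors: which sequences contribute a character.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                if i == j == k == 0:
                    continue
                best = NEG_INF
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # this corner lies outside the matrix
                    col = (s1[i - 1] if di else "-",
                           s2[j - 1] if dj else "-",
                           s3[k - 1] if dk else "-")
                    sp = sum(pair(a, b) for a, b in combinations(col, 2))
                    best = max(best, A[pi][pj][pk] + sp)
                A[i][j][k] = best
    return A[n][m][l]


print(msa3_score("aaa", "", ""))        # boundary case: (-4) * 3 = -12
print(msa3_score("acg", "ag", "acg"))   # 3
```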
Multiple sequence alignment
- For k sequences, a k-dimensional matrix
- Each entry is a calculation over the 2^k - 1 other corners of the "box"
- Formula on page 72
Alignment improvements
- Alignment can also be computed from the back: s[i+1..n], t[j+1..m]
- Front and back alignments can be combined to "cut" the alignment: compute the two matrices, add them, and align according to the summed matrix
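A sketch of combining the front and back matrices, assuming the +1/-1/-2 scores. The backward matrix is obtained here by running the forward computation on the reversed strings, which is one convenient way to get it; the names `forward_matrix`, `backward_matrix`, and `best_cut` are hypothetical.

```python
def forward_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """a[i][j] = best score for aligning s[1..i] with t[1..j]."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    return a


def backward_matrix(s, t):
    """b[i][j] = best score for aligning s[i+1..n] with t[j+1..m]."""
    n, m = len(s), len(t)
    r = forward_matrix(s[::-1], t[::-1])
    return [[r[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def best_cut(s, t, i):
    """Column j at which an optimal alignment can be cut at row i, and its score."""
    a, b = forward_matrix(s, t), backward_matrix(s, t)
    sums = [a[i][j] + b[i][j] for j in range(len(t) + 1)]
    j = max(range(len(sums)), key=sums.__getitem__)
    return j, sums[j]


s, t = "aagtacggaga", "aacaccgaga"
total = forward_matrix(s, t)[-1][-1]
# In every row, the best entry of the summed matrix equals the global optimum.
print(all(best_cut(s, t, i)[1] == total for i in range(len(s) + 1)))
```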
Alignment improvements
- When the lengths of the two sequences are comparable and a good global alignment is expected:
  - Retrieval is mostly along the diagonal
  - Computation can focus on a strip of fixed width k around the diagonal: the k-band
  - More efficient: only the relevant cells are used
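A sketch of the k-band idea, assuming the +1/-1/-2 scores: only cells with |i - j| <= k are computed and everything outside the strip is treated as unreachable. For simplicity this version still allocates the full matrix, so it saves time but not memory; the name `kband_score` is hypothetical. The result equals the true global score only when some optimal alignment stays inside the band.

```python
NEG_INF = float("-inf")


def kband_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best global score over alignments whose path stays within |i - j| <= k."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("the corner (n, m) lies outside the band")
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for i in range(n + 1):
        # Visit only the cells of row i that lie inside the band.
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, a[i - 1][j - 1] + p)
            if i > 0:
                best = max(best, a[i - 1][j] + gap)  # may read -inf outside the band
            if j > 0:
                best = max(best, a[i][j - 1] + gap)
            a[i][j] = best
    return a[n][m]


print(kband_score("aagtacggaga", "aagcaccgaga", k=0))  # diagonal only: 7
print(kband_score("aaccaccgaga", "aacaccgaga", k=1))   # one gap fits in the band: 8
```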
Multiple sequence alignment: Star alignment
- One sequence at the center: all others are aligned pairwise against it
- Which sequence to put at the center?
- Try each: create a 2D matrix of scores for all pairs and pick the row with the best sum (largest for similarity scores, least for distances) [page 79]
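Choosing the center can be sketched as follows, assuming the matrix holds pairwise global similarity scores (+1/-1/-2), so the best row is the one with the largest sum; with a distance measure the minimum would be taken instead. The names `global_score` and `choose_center` and the three sequences are illustrative assumptions.

```python
def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score of s and t, keeping only one row at a time."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + p, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Return (index of center, row sums) from the pairwise score matrix."""
    k = len(seqs)
    sim = [[0] * k for _ in range(k)]
    for x in range(k):
        for y in range(x + 1, k):
            sim[x][y] = sim[y][x] = global_score(seqs[x], seqs[y])
    # The diagonal is left at 0, so each row sum covers the other sequences only.
    sums = [sum(row) for row in sim]
    center = max(range(k), key=sums.__getitem__)
    return center, sums


print(choose_center(["aaaa", "aaat", "attt"]))  # (1, [0, 2, -2])
```

The remaining sequences would then be aligned pairwise against `seqs[center]` and merged column by column.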
Multiple sequence alignment: Tree alignment
- A spanning tree over the sequences: the nodes are sequences
- Each edge is labeled with the similarity between its pair of nodes
- The total tree score, aggregated over the edges, should be maximal
- Star is a special tree
PAM matrix for matching residues
BLAST search engine