Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:

Sequence Alignment

Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d Alternative definition: minimal edit distance “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”

The Needleman-Wunsch Matrix x 1 ……………………………… x M y 1 ……………………………… y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences An optimal alignment is composed of optimal subalignments

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For each j = 1……N F(i – 1,j – 1) + s(x i, y j ) [case 1] F(i, j) = max F(i – 1, j) – d [case 2] F(i, j – 1) – d [case 3] DIAG, if [case 1] Ptr(i, j)= LEFT,if [case 2] UP,if [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(N) xixi Then,|implies | i – j | < k(N) yj yj We can align x and y more efficiently: Time, Space: O(N  k(N)) << O(N 2 )

Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(x i, y j ) F(i, j) = maxF(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination:same x 1 ………………………… x M y 1 ………………………… y N k(N)

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC ||||||| |||| | || || GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- Then, we don’t want to penalize gaps in the ends

Different types of overlaps Example: 2 overlapping“reads” from a sequencing project Example: Search for a mouse gene within a human chromosome

The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y 1 ……………………………… y N

The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum x = aaaacccccggggtta y = ttcccgggaaccaacc

Why local alignment – examples Genes are shuffled between genomes Portions of proteins (domains) are often conserved

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) Find F OPT and trace back 2.If we want all local alignments scoring > t ??For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments Waterman–Eggert ’87: find all non-overlapping local alignments with minimal recalculation of the DP matrix

Scoring the gaps more accurately Current model: Gap of length n incurs penaltyn  d However, gaps usually occur in bunches Convex gap penalty function:  (n): for all n,  (n + 1) -  (n)   (n) -  (n – 1)  (n)

Convex gap dynamic programming Initialization:same Iteration: F(i – 1, j – 1) + s(x i, y j ) F(i, j) = maxmax k=0…i-1 F(k, j) –  (i – k) max k=0…j-1 F(i, k) –  (j – k) Termination: same Running Time: O(N 2 M)(assume N>M) Space: O(NM)

Compromise: affine gaps  (n) = d + (n – 1)  e || gap gap open extend To compute optimal alignment, At position i, j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i aligns to a gap after y j if H(i, j): score if y j aligns to a gap after x i V(i, j) = best score of alignment x 1 …x i to y 1 …y j d e  (n)

Needleman-Wunsch with affine gaps Why do we need matrices F, G, H? x i aligns to y j x 1 ……x i-1 x i x i+1 y 1 ……y j-1 y j - x i aligns to a gap after y j x 1 ……x i-1 x i x i+1 y 1 ……y j …- - Add -d Add -e G(i+1, j) = F(i, j) – d G(i+1, j) = G(i, j) – e Because, perhaps G(i, j) < V(i, j) (it is best to align x i to y j if we were aligning only x 1 …x i to y 1 …y j and not the rest of x, y), but on the contrary G(i, j) – e > V(i, j) – d (i.e., had we “fixed” our decision that x i aligns to y j, we could regret it at the next step when aligning x 1 …x i+1 to y 1 …y j )

Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i – 1, j – 1) + s(x i, y j ) V(i – 1, j) – d G(i, j) = max G(i – 1, j) – e V(i, j – 1) – d H(i, j) = max H(i, j – 1) – e Termination: V(i, j) has the best alignment Time? Space?

To generalize a bit… … think of how you would compute optimal alignment with this gap function ….in time O(MN)  (n)

Linear-Space Alignment

Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix string v (similarly, x’ = x i …x j, for some 1  i  j  |x|) A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters (x’ = x i1 …x ik, for some 1  i 1  …  i k  |x|) Note: a substring is always a subsequence Example: x = abracadabra y = cadabr; substring z = brcdbr;subseqence, not substring

Hirschberg’s algortihm Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, … Longest common subsequence  Given strings x = x 1 x 2 … x M, y = y 1 y 2 … y N,  Find longest common subsequence u = u 1 … u k Algorithm: F(i – 1, j) F(i, j) = maxF(i, j – 1) F(i – 1, j – 1) + [1, if x i = y j ; 0 otherwise] Ptr(i, j) = (same as in N-W) Termination: trace back from Ptr(M, N), and prepend a letter to u whenever Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j) Hirschberg’s original algorithm solves this in linear space

F(i,j) Introduction: Compute optimal score It is easy to compute F(M, N) in linear space Allocate ( column[1] ) Allocate ( column[2] ) For i = 1….M If i > 1, then: Free( column[ i – 2 ] ) Allocate( column[ i ] ) For j = 1…N F(i, j) = …

Linear-space alignment To compute both the optimal score and the optimal alignment: Divide & Conquer approach: Notation: x r, y r : reverse of x, y E.g.x = accgg; x r = ggcca F r (i, j): optimal score of aligning x r 1 …x r i & y r 1 …y r j same as aligning x M-i+1 …x M & y N-j+1 …y N

Linear-space alignment Lemma: (assume M is even) F(M, N) = max k=0…N ( F(M/2, k) + F r (M/2, N – k) ) x y M/2 k*k* F(M/2, k) F r (M/2, N – k) Example: ACC-GGTGCCCAGGACTG--CAT ACCAGGTG----GGACTGGGCAG k * = 8

Linear-space alignment Now, using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers x1x1 …x M/2 y1y1 xMxM yNyN x1x1 …x M/2+1 xMxM … y1y1 yNyN …

Linear-space alignment Now, we can find k * maximizing F(M/2, k) + F r (M/2, N-k) Also, we can trace the path exiting column M/2 from k * k*k* k * +1 0 1 …… M/2 M/2+1 …… M M+1

Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*

Linear-space alignment Hirschberg’s Linear-space algorithm: MEMALIGN(l, l’, r, r’):(aligns x l …x l’ with y r …y r’ ) 1.Let h =  (l’-l)/2  2.Find (in Time O((l’ – l)  (r’ – r)), Space O(r’ – r)) the optimal path,L h, entering column h – 1, exiting column h Let k 1 = pos’n at column h – 2 where L h enters k 2 = pos’n at column h + 1 where L h exits 3.MEMALIGN(l, h – 2, r, k 1 ) 4.Output L h 5.MEMALIGN(h + 1, l’, k 2, r’) Top level call: MEMALIGN(1, M, 1, N)

Linear-space alignment Time, Space analysis of Hirschberg’s algorithm: To compute optimal path at middle column, For box of size M  N, Space: 2N Time:cMN, for some constant c Then, left, right calls cost c( M/2  k * + M/2  (N – k * ) ) = cMN/2 All recursive calls cost Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN) Total Space: O(N) for computation, O(N + M) to store the optimal alignment

Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs of Words, Patterns 3.Systems for local alignment

Indexing-based local alignment Dictionary: All words of length k (~10) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold …… query DB query scan

Indexing-based local alignment— Extensions A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A Gapped extensions until threshold Extensions with gaps until score < C below best score so far Output: GTAAGGTCCAGT GTTAGGTC-AGT

Sensitivity-Speed Tradeoff long words (k = 15) short words (k = 7) Sensitivity Speed Kent WJ, Genome Research 2002 Sens. Speed X%

Sensitivity-Speed Tradeoff Methods to improve sensitivity/speed 1.Using pairs of words 2.Using inexact words 3.Patterns—non consecutive positions ……ATAACGGACGACTGATTACACTGATTCTTAC…… ……GGCACGGACCAGTGACTACTCTGATTCCCAG…… ……ATAACGGACGACTGATTACACTGATTCTTAC…… ……GGCGCCGACGAGTGATTACACAGATTGCCAG…… TTTGATTACACAGAT T G TT CAC G

Measured improvement Kent WJ, Genome Research 2002

Non-consecutive words—Patterns Patterns increase the likelihood of at least one match within a long conserved region 3 common 5 common 7 common Consecutive PositionsNon-Consecutive Positions 6 common On a 100-long 70% conserved region: Consecutive Non-consecutive Expected # hits: 1.070.97 Prob[at least one hit]:0.300.47

Advantage of Patterns 11 positions 10 positions

Multiple patterns K patterns  Takes K times longer to scan  Patterns can complement one another Computational problem:  Given: a model (prob distribution) for homology between two regions  Find: best set of K patterns that maximizes Prob(at least one match) TTTGATTACACAGAT T G TT CAC G T G T C CAG TTGATT A G Buhler et al. RECOMB 2003 Sun & Buhler RECOMB 2004 How long does it take to search the query?

Variants of BLAST NCBI BLAST: search the universe http://www.ncbi.nlm.nih.gov/BLAST/ http://www.ncbi.nlm.nih.gov/BLAST/ MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html http://genopole.toulouse.inra.fr/blast/megablast.html  Optimized to align very similar sequences Works best when k = 4i  16 Linear gap penalty WU-BLAST: (Wash U BLAST) http://blast.wustl.edu/ http://blast.wustl.edu/  Very good optimizations  Good set of features & command line arguments BLAT http://genome.ucsc.edu/cgi-bin/hgBlat http://genome.ucsc.edu/cgi-bin/hgBlat  Faster, less sensitive than BLAST  Good for aligning huge numbers of queries CHAOS http://www.cs.berkeley.edu/~brudno/chaos http://www.cs.berkeley.edu/~brudno/chaos  Uses inexact k-mers, sensitive PatternHunter http://www.bioinformaticssolutions.com/products/ph/index.php http://www.bioinformaticssolutions.com/products/ph/index.php  Uses patterns instead of k-mers BlastZ http://www.psc.edu/general/software/packages/blastz/ http://www.psc.edu/general/software/packages/blastz/  Uses patterns, good for finding genes Typhon http://typhon.stanford.edu http://typhon.stanford.edu  Uses multiple alignments to improve sensitivity/speed tradeoff

Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:

Similar presentations

Presentation on theme: "Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:

Similar presentations

Presentation on theme: "Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:"— Presentation transcript:

Similar presentations

About project

Feedback