Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:

Slides:



Advertisements
Similar presentations
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Advertisements

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Seeds for Similarity Search Presentation by: Anastasia Fedynak.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs of Words, Patterns 3.Systems.
Linear-Space Alignment. Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Heuristic alignment algorithms and cost matrices
Space Efficient Alignment Algorithms and Affine Gap Penalties
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
Linear-Space Alignment. Linear-space alignment Using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Sequence Alignment Cont’d. Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Time Warping Hidden Markov Models Lecture 2, Thursday April 3, 2003.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Index-based search of single sequences Omkar Mate CS 374 Stanford University.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Sequence Alignment Cont’d. Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
CS262 Lecture 4, Win07, Batzoglou Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs.
Alignment II Dynamic Programming
Sequence Alignment Cont’d. CS262 Lecture 4, Win06, Batzoglou Indexing-based local alignment (BLAST- Basic Local Alignment Search Tool) 1.SEED Construct.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Sequence Alignment. 2 Sequence Comparison Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
Minimum Edit Distance Definition of Minimum Edit Distance.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics I: Sequence Alignment Eric Xing Lecture.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Sequence Alignment Tanya Berger-Wolf CS502: Algorithms in Computational Biology January 25, 2011.
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Doug Raiford Phage class: introduction to sequence databases.
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
1 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y.
Definition of Minimum Edit Distance
Homology Search Tools Kun-Mao Chao (趙坤茂)
Presentation transcript:

Sequence Alignment

Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d Alternative definition: minimal edit distance “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”

The Needleman-Wunsch Matrix x 1 ……………………………… x M y 1 ……………………………… y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences An optimal alignment is composed of optimal subalignments

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For each j = 1……N F(i – 1,j – 1) + s(x i, y j ) [case 1] F(i, j) = max F(i – 1, j) – d [case 2] F(i, j – 1) – d [case 3] DIAG, if [case 1] Ptr(i, j)= LEFT,if [case 2] UP,if [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(N) xixi Then,|implies | i – j | < k(N) yj yj We can align x and y more efficiently: Time, Space: O(N  k(N)) << O(N 2 )

Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(x i, y j ) F(i, j) = maxF(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination:same x 1 ………………………… x M y 1 ………………………… y N k(N)

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC ||||||| |||| | || || GCGAGTTCATCTATCAC--GACCGC--GGTCG Then, we don’t want to penalize gaps in the ends

Different types of overlaps Example: 2 overlapping“reads” from a sequencing project Example: Search for a mouse gene within a human chromosome

The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y 1 ……………………………… y N

The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum x = aaaacccccggggtta y = ttcccgggaaccaacc

Why local alignment – examples Genes are shuffled between genomes Portions of proteins (domains) are often conserved

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) Find F OPT and trace back 2.If we want all local alignments scoring > t ??For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments Waterman–Eggert ’87: find all non-overlapping local alignments with minimal recalculation of the DP matrix

Scoring the gaps more accurately Current model: Gap of length n incurs penaltyn  d However, gaps usually occur in bunches Convex gap penalty function:  (n): for all n,  (n + 1) -  (n)   (n) -  (n – 1)  (n)

Convex gap dynamic programming Initialization:same Iteration: F(i – 1, j – 1) + s(x i, y j ) F(i, j) = maxmax k=0…i-1 F(k, j) –  (i – k) max k=0…j-1 F(i, k) –  (j – k) Termination: same Running Time: O(N 2 M)(assume N>M) Space: O(NM)

Compromise: affine gaps  (n) = d + (n – 1)  e || gap gap open extend To compute optimal alignment, At position i, j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i aligns to a gap after y j if H(i, j): score if y j aligns to a gap after x i V(i, j) = best score of alignment x 1 …x i to y 1 …y j d e  (n)

Needleman-Wunsch with affine gaps Why do we need matrices F, G, H? x i aligns to y j x 1 ……x i-1 x i x i+1 y 1 ……y j-1 y j - x i aligns to a gap after y j x 1 ……x i-1 x i x i+1 y 1 ……y j …- - Add -d Add -e G(i+1, j) = F(i, j) – d G(i+1, j) = G(i, j) – e Because, perhaps G(i, j) < V(i, j) (it is best to align x i to y j if we were aligning only x 1 …x i to y 1 …y j and not the rest of x, y), but on the contrary G(i, j) – e > V(i, j) – d (i.e., had we “fixed” our decision that x i aligns to y j, we could regret it at the next step when aligning x 1 …x i+1 to y 1 …y j )

Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i – 1, j – 1) + s(x i, y j ) V(i – 1, j) – d G(i, j) = max G(i – 1, j) – e V(i, j – 1) – d H(i, j) = max H(i, j – 1) – e Termination: V(i, j) has the best alignment Time? Space?

To generalize a bit… … think of how you would compute optimal alignment with this gap function ….in time O(MN)  (n)

Linear-Space Alignment

Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix string v (similarly, x’ = x i …x j, for some 1  i  j  |x|) A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters (x’ = x i1 …x ik, for some 1  i 1  …  i k  |x|) Note: a substring is always a subsequence Example: x = abracadabra y = cadabr; substring z = brcdbr;subseqence, not substring

Hirschberg’s algortihm Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, … Longest common subsequence  Given strings x = x 1 x 2 … x M, y = y 1 y 2 … y N,  Find longest common subsequence u = u 1 … u k Algorithm: F(i – 1, j) F(i, j) = maxF(i, j – 1) F(i – 1, j – 1) + [1, if x i = y j ; 0 otherwise] Ptr(i, j) = (same as in N-W) Termination: trace back from Ptr(M, N), and prepend a letter to u whenever Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j) Hirschberg’s original algorithm solves this in linear space

F(i,j) Introduction: Compute optimal score It is easy to compute F(M, N) in linear space Allocate ( column[1] ) Allocate ( column[2] ) For i = 1….M If i > 1, then: Free( column[ i – 2 ] ) Allocate( column[ i ] ) For j = 1…N F(i, j) = …

Linear-space alignment To compute both the optimal score and the optimal alignment: Divide & Conquer approach: Notation: x r, y r : reverse of x, y E.g.x = accgg; x r = ggcca F r (i, j): optimal score of aligning x r 1 …x r i & y r 1 …y r j same as aligning x M-i+1 …x M & y N-j+1 …y N

Linear-space alignment Lemma: (assume M is even) F(M, N) = max k=0…N ( F(M/2, k) + F r (M/2, N – k) ) x y M/2 k*k* F(M/2, k) F r (M/2, N – k) Example: ACC-GGTGCCCAGGACTG--CAT ACCAGGTG----GGACTGGGCAG k * = 8

Linear-space alignment Now, using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers x1x1 …x M/2 y1y1 xMxM yNyN x1x1 …x M/2+1 xMxM … y1y1 yNyN …

Linear-space alignment Now, we can find k * maximizing F(M/2, k) + F r (M/2, N-k) Also, we can trace the path exiting column M/2 from k * k*k* k * …… M/2 M/2+1 …… M M+1

Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*

Linear-space alignment Hirschberg’s Linear-space algorithm: MEMALIGN(l, l’, r, r’):(aligns x l …x l’ with y r …y r’ ) 1.Let h =  (l’-l)/2  2.Find (in Time O((l’ – l)  (r’ – r)), Space O(r’ – r)) the optimal path,L h, entering column h – 1, exiting column h Let k 1 = pos’n at column h – 2 where L h enters k 2 = pos’n at column h + 1 where L h exits 3.MEMALIGN(l, h – 2, r, k 1 ) 4.Output L h 5.MEMALIGN(h + 1, l’, k 2, r’) Top level call: MEMALIGN(1, M, 1, N)

Linear-space alignment Time, Space analysis of Hirschberg’s algorithm: To compute optimal path at middle column, For box of size M  N, Space: 2N Time:cMN, for some constant c Then, left, right calls cost c( M/2  k * + M/2  (N – k * ) ) = cMN/2 All recursive calls cost Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN) Total Space: O(N) for computation, O(N + M) to store the optimal alignment

Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs of Words, Patterns 3.Systems for local alignment

Indexing-based local alignment Dictionary: All words of length k (~10) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold …… query DB query scan

Indexing-based local alignment— Extensions A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A Gapped extensions until threshold Extensions with gaps until score < C below best score so far Output: GTAAGGTCCAGT GTTAGGTC-AGT

Sensitivity-Speed Tradeoff long words (k = 15) short words (k = 7) Sensitivity Speed Kent WJ, Genome Research 2002 Sens. Speed X%

Sensitivity-Speed Tradeoff Methods to improve sensitivity/speed 1.Using pairs of words 2.Using inexact words 3.Patterns—non consecutive positions ……ATAACGGACGACTGATTACACTGATTCTTAC…… ……GGCACGGACCAGTGACTACTCTGATTCCCAG…… ……ATAACGGACGACTGATTACACTGATTCTTAC…… ……GGCGCCGACGAGTGATTACACAGATTGCCAG…… TTTGATTACACAGAT T G TT CAC G

Measured improvement Kent WJ, Genome Research 2002

Non-consecutive words—Patterns Patterns increase the likelihood of at least one match within a long conserved region 3 common 5 common 7 common Consecutive PositionsNon-Consecutive Positions 6 common On a 100-long 70% conserved region: Consecutive Non-consecutive Expected # hits: Prob[at least one hit]:

Advantage of Patterns 11 positions 10 positions

Multiple patterns K patterns  Takes K times longer to scan  Patterns can complement one another Computational problem:  Given: a model (prob distribution) for homology between two regions  Find: best set of K patterns that maximizes Prob(at least one match) TTTGATTACACAGAT T G TT CAC G T G T C CAG TTGATT A G Buhler et al. RECOMB 2003 Sun & Buhler RECOMB 2004 How long does it take to search the query?

Variants of BLAST NCBI BLAST: search the universe MEGABLAST:  Optimized to align very similar sequences Works best when k = 4i  16 Linear gap penalty WU-BLAST: (Wash U BLAST)  Very good optimizations  Good set of features & command line arguments BLAT  Faster, less sensitive than BLAST  Good for aligning huge numbers of queries CHAOS  Uses inexact k-mers, sensitive PatternHunter  Uses patterns instead of k-mers BlastZ  Uses patterns, good for finding genes Typhon  Uses multiple alignments to improve sensitivity/speed tradeoff