Sequence Alignment Lecture 2, Thursday April 3, 2003.

Slides:



Advertisements
Similar presentations
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Advertisements

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Linear-Space Alignment. Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix.
Lecture 6, Thursday April 17, 2003
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Space Efficient Alignment Algorithms and Affine Gap Penalties
Space Efficient Alignment Algorithms Dr. Nancy Warter-Perez June 24, 2005.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
CSE 421 Algorithms Richard Anderson Lecture 19 Longest Common Subsequence.
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Linear-Space Alignment. Linear-space alignment Using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
CS 6293 Advanced Topics: Current Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
CISC667, F05, Lec6, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Pairwise sequence alignment Smith-Waterman (local alignment)
Alignment II Dynamic Programming
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Class 2: Basic Sequence Alignment
Space Efficient Alignment Algorithms Dr. Nancy Warter-Perez.
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
Needleman Wunsch Sequence Alignment
Sequence Alignment.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Sequence Alignment. 2 Sequence Comparison Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.
ADA: 7. Dynamic Prog.1 Objective o introduce DP, its two hallmarks, and two major programming techniques o look at two examples: the fibonacci.
Minimum Edit Distance Definition of Minimum Edit Distance.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics I: Sequence Alignment Eric Xing Lecture.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Sequence Alignment Tanya Berger-Wolf CS502: Algorithms in Computational Biology January 25, 2011.
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
1 Выравнивание двух последовательностей. 2 AGC A A A C
Sequence Similarity.
A Sub-quadratic Sequence Alignment Algorithm. Global alignment Alignment graph for S = aacgacga, T = ctacgaga Complexity: O(n 2 ) V(i,j) = max { V(i-1,j-1)
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.
Local Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y.
Definition of Minimum Edit Distance
Lecture 5: Local Sequence Alignment Algorithms
Pairwise sequence Alignment.
Intro to Alignment Algorithms: Global and Local
Presentation transcript:

Sequence Alignment Lecture 2, Thursday April 3, 2003

Review of Last Lecture Lecture 2, Thursday April 3, 2003

Sequence conservation implies function Interleukin region in human and mouse

Lecture 2, Thursday April 3, 2003 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- || ||||||| |||| | || ||| ||||| TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Lecture 2, Thursday April 3, 2003 The Needleman-Wunsch Matrix x 1 ……………………………… x M y 1 ……………………………… y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences

Lecture 2, Thursday April 3, 2003 The Needleman-Wunsch Algorithm x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA A10 -2 T 0010 A-3 02 F(i,j) i = j = Optimal Alignment: F(4,3) = 2 AGTA A - TA

Lecture 2, Thursday April 3, 2003 The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For eachj = 1……N F(i-1,j) – d [case 1] F(i, j) = max F(i, j-1) – d [case 2] F(i-1, j-1) + s(x i, y j ) [case 3] UP, if [case 1] Ptr(i,j)= LEFTif [case 2] DIAGif [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Lecture 2, Thursday April 3, 2003 The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y 1 ……………………………… y N

Lecture 2, Thursday April 3, 2003 Today Structure of a genome, and cross-species similarity Local alignment More elaborate scoring function Linear-Space Alignment The Four-Russian Speedup

Lecture 2, Thursday April 3, 2003 Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation mature mRNA protein a gene

Lecture 2, Thursday April 3, 2003 Structure of a genome ABMake DC If B then NOT D If A and B then D Make BD If D then B C gene D gene B short sequences regulate expression of genes lots of “junk” sequence e.g. ~50% repeats selfish DNA

Lecture 2, Thursday April 3, 2003 Cross-species genome similarity 98% of genes are conserved between any two mammals ~75% average similarity in protein sequence hum_a : 57331/ mus_a : 78560/ rat_a : / fug_a : 36008/68174 hum_a : 57381/ mus_a : 78610/ rat_a : / fug_a : 36058/68174 hum_a : 57431/ mus_a : 78659/ rat_a : / fug_a : 36084/68174 hum_a : 57481/ mus_a : 78708/ rat_a : / fug_a : 36097/68174 “atoh” enhancer in human, mouse, rat, fugu fish

Lecture 2, Thursday April 3, 2003 The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g.x = aaaacccccgggg y = cccgggaaccaacc

Lecture 2, Thursday April 3, 2003 Why local alignment Genes are shuffled between genomes Portions of proteins (domains) are often conserved

Lecture 2, Thursday April 3, 2003 The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

Lecture 2, Thursday April 3, 2003 The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) 2.If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

Lecture 2, Thursday April 3, 2003 Scoring the gaps more accurately Current model: Gap of length n incurs penaltyn  d However, gaps usually occur in bunches Convex gap penalty function:  (n): for all n,  (n + 1) -  (n)   (n) -  (n – 1)  (n)

Lecture 2, Thursday April 3, 2003 General gap dynamic programming Initialization:same Iteration: F(i-1, j-1) + s(x i, y j ) F(i, j) = maxmax k=0…i-1 F(k,j) –  (i-k) max k=0…j-1 F(i,k) –  (j-k) Termination: same Running Time: O(N 2 M)(assume N>M) Space:O(NM)

Lecture 2, Thursday April 3, 2003 Compromise: affine gaps  (n) = d + (n – 1)  e |gap openextend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i, or y j, aligns to a gap d e  (n)

Lecture 2, Thursday April 3, 2003 Needleman-Wunsch with affine gaps Initialization:F(i, 0) = d + (i – 1)  e F(0, j) = d + (j – 1)  e Iteration: F(i – 1, j – 1) + s(x i, y j ) F(i, j) = max G(i – 1, j – 1) + s(x i, y j ) F(i – 1, j) – d F(i, j – 1) – d G(i, j) = max G(i, j – 1) – e G(i – 1, j) – e Termination: same

Lecture 2, Thursday April 3, 2003 To generalize a little… … think of how you would compute optimal alignment with this gap function ….in time O(MN)  (n)

Lecture 2, Thursday April 3, 2003 Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) M ) xixi Then,|implies | i – j | < k(N) yj yj We can align x and y more efficiently: Time, Space: O(N  k(N)) << O(N 2 )

Lecture 2, Thursday April 3, 2003 Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(x i, y j ) F(i, j) = maxF(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination:same Easy to extend to the affine gap case x 1 ………………………… x M y 1 ………………………… y N k(N)

Linear-Space Alignment

Lecture 2, Thursday April 3, 2003 F(i,j) Introduction: Compute the optimal score It is easy to compute F(M, N) in linear space Allocate ( column[1] ) Allocate ( column[2] ) For i = 1….M If i > 1, then: Free( column[i – 2] ) Allocate( column[ i ] ) For j = 1…N F(i, j) = …

Lecture 2, Thursday April 3, 2003 Linear-space alignment To compute both the optimal score and the optimal alignment: Divide & Conquer approach: Notation: x r, y r : reverse of x, y E.g.x = accgg; x r = ggcca F r (i, j): optimal score of aligning x r 1 …x r i & y r 1 …y r j same as F(M-i+1, N-j+1)

Lecture 2, Thursday April 3, 2003 Linear-space alignment Lemma: F(M, N) = max k=0…N ( F(M/2, k) + F r (M/2, N-k) ) x y M/2 k*k* F(M/2, k) F r (M/2, N-k)

Lecture 2, Thursday April 3, 2003 Linear-space alignment Now, using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, k) PLUS the backpointers

Lecture 2, Thursday April 3, 2003 Linear-space alignment Now, we can find k * maximizing F(M/2, k) + F r (M/2, k) Also, we can trace the path exiting column M/2 from k * Conclusion: In O(NM) time,O(N) space, we found optimal alignment path at column M/2

Lecture 2, Thursday April 3, 2003 Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*

Lecture 2, Thursday April 3, 2003 Linear-space alignment Hirschberg’s Linear-space algorithm: MEMALIGN(l, l’, r, r’):(aligns x l …x l’ with y r …y r’ ) 1.Let h =  (l’-l)/2  2.Find in Time O((l’ – l)  (r’-r)), Space O(r’-r) the optimal path,L h, at column h Let k 1 = pos’n at column h – 1 where L h enters k 2 = pos’n at column h + 1 where L h exits 1.MEMALIGN(l, h-1, r, k 1 ) 2.Output L h 3.MEMALIGN(h+1, l’, k 2, r’)

Lecture 2, Thursday April 3, 2003 Linear-space Alignment Time, Space analysis of Hirschberg’s algorithm: To compute optimal path at middle column, For box of size M  N, Space: 2N Time:cMN, for some constant c Then, left, right calls cost c( M/2  k * + M/2  (N-k * ) ) = cMN/2 All recursive calls cost Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN) Total Space: O(N) for computation, O(N+M) to store the optimal alignment

The Four-Russian Algorithm A useful speedup of Dynamic Programming

Lecture 2, Thursday April 3, 2003 Main Observation Within a rectangle of the DP matrix, values of D depend only on the values of A, B, C, and substrings x l...l’, y r…r’ Definition: A t-block is a t  t square of the DP matrix Idea: Divide matrix in t-blocks, Precompute t-blocks Speedup: O(t) A B C D xlxl x l’ yryr y r’ t

Lecture 2, Thursday April 3, 2003 The Four-Russian Algorithm Main structure of the algorithm: Divide N  N DP matrix into K  K log 2 N-blocks that overlap by 1 column & 1 row For i = 1……K For j = 1……K Compute D i,j as a function of A i,j, B i,j, C i,j, x[l i …l’ i ], y[r j …r’ j ] Time: O(N 2 / log 2 N) times the cost of step 4

Lecture 2, Thursday April 3, 2003 The Four-Russian Algorithm Another observation: ( Assume m = 1, s = 1, d = 1 ) Two adjacent cells of F(.,.) differ by at most 1.

Lecture 2, Thursday April 3, 2003 The Four-Russian Algorithm Definition: The offset vector is a t-long vector of values from {-1, 0, 1}, where the first entry is 0 If we know the value at A, and the top row, left column offset vectors, and x l ……x l’, y r ……y r’, Then we can find D A B C D xlxl x l’ yryr y r’ t

Lecture 2, Thursday April 3, 2003 The Four-Russian Algorithm Definition: The offset function of a t- block is a function that for any given offset vectors of top row, left column, and x l ……x l’, y r ……y r’, produces offset vectors of bottom row, right column A B C D xlxl x l’ yryr y r’ t

Lecture 2, Thursday April 3, 2003 The Four-Russian Algorithm We can pre-compute the offset function: 3 2(t-1) possible input offset vectors 4 2t possible strings x l ……x l’, y r ……y r’ Therefore 3 2(t-1)  4 2t values to pre-compute We can keep all these values in a table, and look up in linear time, or in O(1) time if we assume constant-lookup RAM for log-sized inputs