CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.



Last lecture
- Local Sequence Alignment
- Bounded Dynamic Programming
- Linear Space Sequence Alignment

The Smith-Waterman algorithm

Initialization: $F(0, j) = F(i, 0) = 0$

Iteration:
$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

The Smith-Waterman algorithm

Termination:
1. If we want the best local alignment: $F_{OPT} = \max_{i,j} F(i, j)$
2. If we want all local alignments scoring > t: for all i, j with $F(i, j) > t$, trace back
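The recurrence and the first termination rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `smith_waterman_score` and the default scores (match 1, mismatch -1, gap penalty d = 2) are assumptions chosen for the example, and d is a positive penalty that is subtracted, as on the slide.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, d=2):
    """Return (best local score, F) for a linear gap penalty d > 0.

    The default scores are illustrative assumptions, not values from the slides.
    """
    M, N = len(x), len(y)
    # Initialization: F(0, j) = F(i, 0) = 0
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                        # start a new local alignment
                F[i - 1][j] - d,          # x_i aligned to a gap
                F[i][j - 1] - d,          # y_j aligned to a gap
                F[i - 1][j - 1] + sigma,  # x_i aligned to y_j
            )
            # Termination rule 1: F_OPT = max over all cells
            best = max(best, F[i][j])
    return best, F


best, F = smith_waterman_score("TTACGTT", "GGACGGG")
print(best)
```

The two example strings share only the substring ACG, so the best local alignment is that three-character match.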

Bounded Dynamic Programming
- O(kM) time
- O(kN) memory

[Figure: the DP matrix for x_1 ... x_M against y_1 ... y_N, restricted to a band of width k.]

Linear-space alignment
- O(M+N) memory
- 2MN time

[Figure: the DP matrix split at column M/2 into two subproblems, labeled k* and N-k*.]

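Homework Problem 5 hints

Dot matrix for visualizing sequence similarities
- Seq1: x[1..m], Seq2: y[1..n]
- Simplest form: $A(i, j) = 1$ if $\delta(x_i, y_j) = 1$
- Window of 10: $A(i, j) = 1$ if $\sum_{k=1}^{10} \delta(x_{i+k}, y_{j+k}) > 7$
- Window of 20: $A(i, j) = 1$ if $\sum_{k=1}^{20} \delta(x_{i+k}, y_{j+k}) > 15$

Here $\delta(a, b)$ is 1 when a and b are the same character and 0 otherwise.

A dot matrix does not do any alignment (global or local). It helps to detect strongly conserved regions.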
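As a rough sketch of the windowed dot matrix described above (not a solution to the homework), the following Python function marks a cell when more than `threshold` of the next `window` positions match. The name `dot_matrix` and the 0-based indexing (the window covers offsets 0 to window-1 rather than 1 to window) are assumptions made for the example.

```python
def dot_matrix(x, y, window=10, threshold=7):
    """Return A where A[i][j] = 1 if more than `threshold` of the `window`
    character pairs starting at x[i], y[j] are identical.

    Indices are 0-based here; the slide's sum runs over k = 1..window.
    """
    m, n = len(x), len(y)
    A = [[0] * n for _ in range(m)]
    for i in range(m - window + 1):
        for j in range(n - window + 1):
            # Count identical characters along the diagonal starting at (i, j)
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches > threshold:
                A[i][j] = 1
    return A


# window=1, threshold=0 reduces to the simplest form: a dot wherever x_i == y_j
for row in dot_matrix("ACGA", "ACG", window=1, threshold=0):
    print("".join("*" if cell else "." for cell in row))
```

Larger windows with a high threshold suppress the isolated dots that single-character matches produce, which is why the conserved diagonals stand out.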

[Figure: dot matrix plot of Seq1 against Seq2.]

Today
- How to model gaps more accurately?
- Statistics of alignments (not today)
  - Where does $\sigma(x_i, y_j)$ come from?
  - Are two aligned sequences actually related?

What's a better alignment?

GACGCCGAACG
|||||   |||
GACGC---ACG

GACGCCGAACG
|||| | | ||
GACG-C-A-CG

Under the linear gap model both alignments get the same score: Score = 8 x m - 3 x d

However, gaps usually occur in bunches.
- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequences vs. cDNAs (reverse complementary to mRNAs)

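Model gaps more accurately

Current model:
- Gap of length n incurs penalty $n \times d$

General:
- Convex function (in the alignment literature: each additional gap position costs no more than the previous one)
- E.g. $\gamma(n) = c \sqrt{n}$

[Figure: plots of the two penalty functions $\gamma(n)$ against n.]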

General gap dynamic programming

Initialization: same

Iteration:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k=0 \ldots i-1} F(k, j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i, k) - \gamma(j-k) \end{cases}$$

Termination: same

Running time: O((M+N)MN) (cubic)
Space: O(NM) (linear-space algorithm not applicable)
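To make the cubic cost concrete, here is a small Python sketch of the general-gap recurrence for global alignment. It is an illustration under stated assumptions: the name `general_gap_score` is hypothetical, `sigma` and `gamma` are caller-supplied functions, `gamma(n)` returns a positive penalty that is subtracted, and the boundary cells are set to `-gamma(i)` and `-gamma(j)` (one reading of "Initialization: same").

```python
def general_gap_score(x, y, sigma, gamma):
    """Global alignment score with an arbitrary gap penalty gamma(n) > 0."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a single leading gap of length i (or j)
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1])
            # A gap of length i-k ending at x_i: scan every possible start k
            gap_in_y = max(F[k][j] - gamma(i - k) for k in range(i))
            # A gap of length j-k ending at y_j
            gap_in_x = max(F[i][k] - gamma(j - k) for k in range(j))
            F[i][j] = max(diag, gap_in_y, gap_in_x)
    return F[M][N]


sigma = lambda a, b: 2 if a == b else -2   # illustrative scores
gamma = lambda n: 5 + (n - 1) * 1          # an affine penalty as a special case
print(general_gap_score("GACGCCGAACG", "GACGCACG", sigma, gamma))
```

The two inner `max` scans are what add the extra factor of (M+N) to the running time; the affine model discussed next removes them.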

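Compromise: affine gaps

$$\gamma(n) = d + (n - 1) \times e$$

where d is the gap open penalty and e is the gap extension penalty.

[Figure: plot of $\gamma(n)$ against n, annotated with d and e.]

Example: match = 2, gap open d = -5, gap extension e = -1 (the penalties are given as negative scores, so they are added to the total).

GACGCCGAACG
|||||   |||
GACGC---ACG
Score: 8x2 - 5 - 2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score: 8x2 - 3x5 = 1

We want to find the optimal alignment with affine gap penalty in
- O(MN) time
- O(MN) or better O(M+N) memory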
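The two scores above can be checked with a short Python sketch that scores an already-gapped alignment under the affine model. The function name `affine_alignment_score` is hypothetical, and the mismatch score of -2 is borrowed from the exercise later in these slides (neither example alignment contains a mismatch, so it does not affect the result).

```python
def affine_alignment_score(a, b, match=2, mismatch=-2, d=-5, e=-1):
    """Score two gapped rows of equal length; d and e are negative scores."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # which row held '-' in the previous column, if any
    for ca, cb in zip(a, b):
        if ca == "-" and cb == "-":
            raise ValueError("a column cannot contain two gaps")
        if ca == "-":
            # Extending a gap in the same row costs e, opening one costs d
            score += e if gap_row == "a" else d
            gap_row = "a"
        elif cb == "-":
            score += e if gap_row == "b" else d
            gap_row = "b"
        else:
            score += match if ca == cb else mismatch
            gap_row = None
    return score


print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # one gap of length 3
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # three gaps of length 1
```

Both alignments have eight matches and three gap positions, so only the affine model separates them.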

Allowing affine gap penalties

Still three cases
- $x_i$ aligned with $y_j$
- $x_i$ aligns to a gap
  - Are we continuing a gap in x? (if no, start is more expensive)
- $y_j$ aligns to a gap
  - Are we continuing a gap in y? (if no, start is more expensive)

We can use a finite state machine to represent the three cases as three states
- The machine has two heads, reading the chars on the two strings separately
- At every step, each head reads 0 or 1 char from its sequence
- Depending on what it reads, the machine goes to a different state and produces a different score

Finite State Machine

States:
- F: have just read 1 char from each seq ($x_i$ aligned to $y_j$)
- Ix: have read 0 char from x ($y_j$ aligned to a gap)
- Iy: have read 0 char from y ($x_i$ aligned to a gap)

[Figure: the three states F, Ix and Iy; each transition is labeled Input / Output.]

[Figure: FSM with start state F. Transition labels (Input / Output): $(x_i, y_j)$ / $\sigma$; $(x_i, -)$ / d; $(x_i, -)$ / e; $(-, y_j)$ / d; $(-, y_j)$ / e.]

Current state   Input         Output   Next state
F               (x_i, y_j)    σ        F
F               (-, y_j)      d        Ix
F               (x_i, -)      d        Iy
Ix              (-, y_j)      e        Ix
...             ...           ...      ...

Example: x = AAC, y = ACT

AAC
ACT
State path: F-F-F-F

AAC-
-ACT
State path: F-Iy-F-F-Ix

AAC-
A-CT
State path: F-F-Iy-F-Ix

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM.

Optimal alignment: find a state path to read the two sequences such that the total output score is the highest.
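The correspondence between alignments and state paths can be sketched directly: each alignment column determines the next state. The helper name `state_path` is hypothetical; the first argument is the gapped x row and the second the gapped y row, and the path begins in the start state F as in the examples above.

```python
def state_path(x_row, y_row):
    """Map a gapped alignment to its FSM state path, starting in state F."""
    path = ["F"]  # start state
    for cx, cy in zip(x_row, y_row):
        if cx == "-":
            path.append("Ix")  # read 0 chars from x: y_j aligned to a gap
        elif cy == "-":
            path.append("Iy")  # read 0 chars from y: x_i aligned to a gap
        else:
            path.append("F")   # read one char from each: x_i aligned to y_j
    return "-".join(path)


for x_row, y_row in [("AAC", "ACT"), ("AAC-", "-ACT"), ("AAC-", "A-CT")]:
    print(x_row, y_row, state_path(x_row, y_row))
```

Because the mapping is one state per column, searching over alignments and searching over state paths are the same problem.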

Dynamic programming

We encode this information in three different matrices. For each element (i, j) we use three variables:
- $F(i, j)$: best alignment (score) of $x_1 \ldots x_i$ and $y_1 \ldots y_j$ if $x_i$ aligns to $y_j$
- $I_x(i, j)$: best alignment of $x_1 \ldots x_i$ and $y_1 \ldots y_j$ if $y_j$ aligns to a gap
- $I_y(i, j)$: best alignment of $x_1 \ldots x_i$ and $y_1 \ldots y_j$ if $x_i$ aligns to a gap

[Figure: the last alignment column in each of the three cases F(i, j), Ix(i, j) and Iy(i, j).]

Transitions into F (FSM edges labeled $(x_i, y_j)$ / $\sigma$):
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ I_x(i-1, j-1) + \sigma(x_i, y_j) \\ I_y(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

Transitions into Ix (FSM edges labeled $(-, y_j)$ / d and $(-, y_j)$ / e):
$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d \\ I_x(i, j-1) + e \end{cases}$$

Transitions into Iy (FSM edges labeled $(x_i, -)$ / d and $(x_i, -)$ / e):
$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d \\ I_y(i-1, j) + e \end{cases}$$

$$F(i, j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in x} \\ I_y(i-1, j-1) & \text{closing gaps in y} \end{cases}$$

$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d & \text{opening a gap in x} \\ I_x(i, j-1) + e & \text{gap extension in x} \end{cases}$$

$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d & \text{opening a gap in y} \\ I_y(i-1, j) + e & \text{gap extension in y} \end{cases}$$

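Data dependency
- F(i, j) depends on F, Ix and Iy at (i-1, j-1)
- Ix(i, j) depends on F and Ix at (i, j-1)
- Iy(i, j) depends on F and Iy at (i-1, j)

[Figure: the three matrices F, Ix and Iy stacked, with arrows between cells (i, j), (i-1, j), (i, j-1) and (i-1, j-1).]

If we stack all three matrices
- No cyclic dependency
- Therefore, we can fill in all three matrices in order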

Algorithm

for i = 1:M
    for j = 1:N
        Fill in F(i, j), Ix(i, j), Iy(i, j)
    end
end
Optimal score = max(F(M, N), Ix(M, N), Iy(M, N))

Time: O(MN)
Space: O(MN), or O(N) when combined with the linear-space algorithm
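A minimal Python sketch of this fill order follows. It uses the three recurrences as stated in these slides, with d and e as negative scores. The function name `affine_gap_score`, the default scores (taken from the exercise in these slides) and the boundary values are assumptions: F(0, 0) = 0, Ix(0, j) = d + (j-1)e, Iy(i, 0) = d + (i-1)e, and every other boundary cell is minus infinity.

```python
NEG_INF = float("-inf")


def affine_gap_score(x, y, match=2, mismatch=-2, d=-5, e=-1):
    """Fill F, Ix, Iy row by row and return (optimal score, F, Ix, Iy)."""
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary conditions (see the lead-in)
    F[0][0] = 0
    for j in range(1, N + 1):
        Ix[0][j] = d + (j - 1) * e  # y_1..y_j all aligned to gaps
    for i in range(1, M + 1):
        Iy[i][0] = d + (i - 1) * e  # x_1..x_i all aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = sigma + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    return max(F[M][N], Ix[M][N], Iy[M][N]), F, Ix, Iy


score, F, Ix, Iy = affine_gap_score("GCAC", "GCC")
print(score)
```

Each cell only reads cells from the previous row or from earlier in the current row, which is why a single pass in row order is enough and why the work stays at O(MN).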

Exercise

x = GCAC
y = GCC
m = 2 (match), s = -2 (mismatch), d = -5 (gap open), e = -1 (gap extension)

0 -- -- -- -- -- -- -- -- -- -- -- - - -- -- -- F: aligned on bothIy: Insertion on y F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix: Insertion on x  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- 2 -- -- -- -- -- -- -- - - -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- 2-7 -- -- -- -- -- -- -- - - -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- -- - - -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- - -- -3 -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -7 -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -12 -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -12 -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -12 -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - - - -- -- -- -- -- -- - -- - -- -12 -- -- - FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - -- -- -- -- -- -- - -- - -- -12 -- -- - FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - -- -- -- -- -- -- - -- - -- -12 -- -- - FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - - -- -- -5 -- -- -- - -- - -- -12 -- -- - -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - - -- -- -5 -- -- -- - -- - -- -12 -- -- - -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - - -- -- -5 -- -- -- - -- - -- -12 -- -- - -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC GCAC || | GC-C x = y = x = y = x = y = x y GCACGCAC G C C x = y = m = 2 s = -2 d = -5 e = -1