Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x3-3334 Office hours: Monday 2:00-3:30 TA:

Slides:



Advertisements
Similar presentations
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Advertisements

Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Lecture 6, Thursday April 17, 2003
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Sequence Alignment Tutorial #2
Lecture outline Sequence alignment Gaps and scoring matrices
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
CS 6293 Advanced Topics: Current Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
CISC667, F05, Lec6, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Pairwise sequence alignment Smith-Waterman (local alignment)
Alignment II Dynamic Programming
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Pairwise alignment Computational Genomics and Proteomics.
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Variants of HMMs. Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 =
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Deoxyribonucleic acid (DNA) Biometrics CPSC 4600 Biometrics and Cryptography.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Sequence Alignment. 2 Sequence Comparison Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Introduction to Bioinformatics Algorithms Sequence Alignment.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Minimum Edit Distance Definition of Minimum Edit Distance.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics I: Sequence Alignment Eric Xing Lecture.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
1 Выравнивание двух последовательностей. 2 AGC A A A C
Sequence Similarity.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.
Local Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y.
Definition of Minimum Edit Distance
Lecture 5: Local Sequence Alignment Algorithms
Pairwise sequence Alignment.
Intro to Alignment Algorithms: Global and Local
Presentation transcript:

Sequence Alignment

Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA: Omkar Deshpande x Office hours: Wednesday 4:00-6:00 Class Website: Books: they should be in by Monday Let’s schedule Weekly Discussion Session Teams: ~3, <= 4 people: please use su.class.cs262  Individual writeups Lecture notes: prose + figures, ~4 – 8 pages  Previous year’s lecture notes are available

Comparing Genomes

Evolution at the DNA level …ACGGTGCAGTCACCA… …ACGTTGCAGTCCACCA… C SEQUENCE EDITSREARRANGEMENTS

Sequence conservation implies function Interleukin region in human and mouse 100% 40%

Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Too many possible alignments: O( 2 M+N )

Alignment is additive Observation: The score of aligningx 1 ……x M y 1 ……y N is additive Say thatx 1 …x i x i+1 …x M aligns to y 1 …y j y j+1 …y N The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Dynamic Programming We will now describe a dynamic programming algorithm Suppose we wish to align x 1 ……x M y 1 ……y N Let F(i,j) = optimal score of aligning x 1 ……x i y 1 ……y j

Dynamic Programming (cont’d) Notice three possible cases: 1.x i aligns to y j x 1 ……x i-1 x i y 1 ……y j-1 y j 2.x i aligns to a gap x 1 ……x i-1 x i y 1 ……y j - 3.y j aligns to a gap x 1 ……x i - y 1 ……y j-1 y j m, if x i = y j F(i,j) = F(i-1, j-1) + -s, if not F(i,j) = F(i-1, j) - d F(i,j) = F(i, j-1) - d

Dynamic Programming (cont’d) How do we know which case is correct? Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal Then, F(i-1, j-1) + s(x i, y j ) F(i, j) = max F(i-1, j) – d F( i, j-1) – d Where s(x i, y j ) = m, if x i = y j ;-s, if not

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA A10 -2 T 0010 A-3 02 F(i,j) i = j = Optimal Alignment: F(4,3) = 2 AGTA A - TA

The Needleman-Wunsch Matrix x 1 ……………………………… x M y 1 ……………………………… y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences Can think of it as a divide-and-conquer algorithm

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For eachj = 1……N F(i-1,j-1) + s(x i, y j ) [case 1] F(i, j) = max F(i-1, j) – d [case 2] F(i, j-1) – d [case 3] DIAG, if [case 1] Ptr(i,j)= LEFT,if [case 2] UP,if [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG Then, we don’t want to penalize gaps in the ends

Different types of overlaps

The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y 1 ……………………………… y N

The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g.x = aaaacccccgggg y = cccgggaaccaacc

Why local alignment Genes are shuffled between genomes Portions of proteins (domains) are often conserved

Cross-species genome similarity 98% of genes are conserved between any two mammals >70% average similarity in protein sequence hum_a : 57331/ mus_a : 78560/ rat_a : / fug_a : 36008/68174 hum_a : 57381/ mus_a : 78610/ rat_a : / fug_a : 36058/68174 hum_a : 57431/ mus_a : 78659/ rat_a : / fug_a : 36084/68174 hum_a : 57481/ mus_a : 78708/ rat_a : / fug_a : 36097/68174 “atoh” enhancer in human, mouse, rat, fugu fish

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) 2.If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

Scoring the gaps more accurately Current model: Gap of length n incurs penaltyn  d However, gaps usually occur in bunches Convex gap penalty function:  (n): for all n,  (n + 1) -  (n)   (n) -  (n – 1)  (n)

General gap dynamic programming Initialization:same Iteration: F(i-1, j-1) + s(x i, y j ) F(i, j) = maxmax k=0…i-1 F(k,j) –  (i-k) max k=0…j-1 F(i,k) –  (j-k) Termination: same Running Time: O(N 2 M)(assume N>M) Space:O(NM)

Compromise: affine gaps  (n) = d + (n – 1)  e | | gap gap open extend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i, or y j, aligns to a gap d e  (n)

Needleman-Wunsch with affine gaps Initialization:F(i, 0) = d + (i – 1)  e F(0, j) = d + (j – 1)  e Iteration: F(i – 1, j – 1) + s(x i, y j ) F(i, j) = max G(i – 1, j – 1) + s(x i, y j ) F(i – 1, j) – d F(i, j – 1) – d G(i, j) = max G(i, j – 1) – e G(i – 1, j) – e Termination: same