Genomic Sequence Alignment

Overview
• Dynamic programming & the Needleman-Wunsch algorithm
• Local alignment: BLAST
• Fast global alignment
• Multiple sequence alignment
• Rearrangements in genomic sequences

Biology in One Slide – Twentieth Century …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT… …and today

Complete DNA Sequences About 300 complete genomes have been sequenced

Evolution

Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

Evolutionary Rates (figure labels: OK, X, X, Still OK?, next generation)

Sequence conservation implies function Alignment is the key to Finding important regions Determining function Uncovering the evolutionary forces

Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
Unaligned sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

What is a good alignment?
Alignment: the “best” way to match the letters of one sequence with those of the other. How do we define “best”?
Alignment: a hypothesis that the two sequences come from a common ancestor through sequence edits.
Parsimonious explanation: find the minimum number of edits that transform one sequence into the other.

Scoring Function
Sequence edits, starting from AGGCCTC:
• Mutations: AGGACTC
• Insertions: AGGGCCTC
• Deletions: AGG.CTC
Scoring function: Match: $+m$; Mismatch: $-s$; Gap: $-d$
Score $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
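
The scoring function above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `alignment_score` and the default values m = s = d = 1 are assumptions chosen for the example.

```python
def alignment_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Score two gapped rows of equal length.

    +m per match, -s per mismatch, -d per gap (m, s, d are illustrative defaults).
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for ca, cb in zip(a, b):
        if ca == "-" and cb == "-":
            raise ValueError("a column cannot pair a gap with a gap")
        if ca == "-" or cb == "-":
            score -= d          # gap in either row
        elif ca == cb:
            score += m          # match
        else:
            score -= s          # mismatch
    return score


# The mutation example from the slide: 6 matches and 1 mismatch.
print(alignment_score("AGGCCTC", "AGGACTC"))  # 5
```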

How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^N)$

Dynamic Programming
Given two sequences $x = x_1 \ldots x_M$ and $y = y_1 \ldots y_N$, let $F(i, j)$ be the score of the best alignment of $x_1 \ldots x_i$ to $y_1 \ldots y_j$. Then $F(M, N)$ is the score of the best alignment.
Idea:
• Compute $F(i, j)$ for all $i$ and $j$
• Do this by using $F(i-1, j)$, $F(i, j-1)$, $F(i-1, j-1)$

Dynamic Programming (cont’d)
Notice three possible cases:
1. $x_i$ aligns to $y_j$ ($x_1 \ldots x_{i-1}\, x_i$ over $y_1 \ldots y_{j-1}\, y_j$): $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
2. $x_i$ aligns to a gap ($x_1 \ldots x_{i-1}\, x_i$ over $y_1 \ldots y_j\, -$): $F(i, j) = F(i-1, j) - d$
3. $y_j$ aligns to a gap ($x_1 \ldots x_i\, -$ over $y_1 \ldots y_{j-1}\, y_j$): $F(i, j) = F(i, j-1) - d$

Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: $F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal. Then
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not.
(Figure: cell $(i, j)$ with its three neighbors $(i-1, j-1)$, $(i-1, j)$, $(i, j-1)$.)

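Example
x = AGTA, y = ATA; m = 1, s = 1, d = 1 (a match scores +1; a mismatch or a gap scores -1)
F(i, j), with i running over x (columns) and j over y (rows):
           i=0   1 (A)  2 (G)  3 (T)  4 (A)
j=0          0    -1     -2     -3     -4
j=1 (A)     -1     1      0     -1     -2
j=2 (T)     -2     0      0      1      0
j=3 (A)     -3    -1     -1      0      2
Optimal alignment: F(4, 3) = 2
AGTA
A-TA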
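
As a check on this example, the matrix can be filled in with a short Python sketch of the recurrence. The function name `fill_matrix` is an assumption made for illustration, and s and d are taken as positive penalties (s = d = 1) so that a mismatch or gap contributes -1, in line with the scoring-function slide.

```python
def fill_matrix(x: str, y: str, m: int = 1, s: int = 1, d: int = 1) -> list[list[int]]:
    """Return F, where F[i][j] scores the best alignment of x[:i] with y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                      # x[:i] aligned entirely to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d                      # y[:j] aligned entirely to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(
                F[i - 1][j - 1] + sub,        # x_i aligned to y_j
                F[i - 1][j] - d,              # x_i aligned to a gap
                F[i][j - 1] - d,              # y_j aligned to a gap
            )
    return F


F = fill_matrix("AGTA", "ATA")
print(F[4][3])                                # 2
print([F[i][3] for i in range(5)])            # [-3, -1, -1, 0, 2]
```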

The Needleman-Wunsch Algorithm
1. Initialization.
   a. $F(0, 0) = 0$
   b. $F(0, j) = -j \times d$
   c. $F(i, 0) = -i \times d$
2. Main Iteration. Filling in partial alignments
   a. For each $i = 1 \ldots M$, for each $j = 1 \ldots N$:
      $$F(i, j) = \max \begin{cases} F(i-1, j) - d & \text{[case 1]} \\ F(i, j-1) - d & \text{[case 2]} \\ F(i-1, j-1) + s(x_i, y_j) & \text{[case 3]} \end{cases}$$
      $$Ptr(i, j) = \begin{cases} \text{UP} & \text{if [case 1]} \\ \text{LEFT} & \text{if [case 2]} \\ \text{DIAG} & \text{if [case 3]} \end{cases}$$
3. Termination. $F(M, N)$ is the optimal score, and from $Ptr(M, N)$ we can trace back the optimal alignment
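
The three steps (initialization, iteration with pointers, traceback) might look like this in Python. It is a sketch under stated assumptions rather than a reference implementation: the name `needleman_wunsch`, the penalty defaults m = s = d = 1, and the tie-breaking order (DIAG before UP before LEFT) are all choices made for the example.

```python
def needleman_wunsch(x: str, y: str, m: int = 1, s: int = 1, d: int = 1):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # 2. Main iteration: keep the best of the three cases and remember which.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),   # x_i aligned to y_j
                (F[i - 1][j] - d, "UP"),           # x_i aligned to a gap
                (F[i][j - 1] - d, "LEFT"),         # y_j aligned to a gap
            ]
            # max() returns the first maximal entry, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: follow pointers back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Storing one pointer per cell is what makes the space cost O(NM) alongside the O(NM) time.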

Performance Time: O(NM) Space: O(NM)

Alignment on a Large Scale
Given a gene that we care about, how can we compare it to all existing DNA? Assume we use dynamic programming:
• Gene of interest: ~$10^5$
• The entire genomic database: ~$10^{11}$
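
A rough back-of-the-envelope estimate shows why full dynamic programming is impractical at this scale. The rate of $10^9$ cells of $F$ filled per second is a hypothetical figure assumed only for illustration:

$$10^5 \times 10^{11} = 10^{16}\ \text{cells}, \qquad \frac{10^{16}\ \text{cells}}{10^{9}\ \text{cells/s}} = 10^{7}\ \text{s} \approx 116\ \text{days per query}$$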

Index-based Local Alignment
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running time:
• Theoretical worst case: $O(MN)$
• Fast in practice
(Figure: query versus DB.)
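
The two steps of the main idea can be sketched as a dictionary of query words plus a single scan of the database. The names `build_index` and `find_seeds` are hypothetical, and only exact word matches are handled. The `db` string is an assumption: it is the DB string from the BLAST example in this deck read in reverse, which is the orientation that contains the reported hit.

```python
from collections import defaultdict


def build_index(query: str, k: int) -> dict[str, list[int]]:
    """Map every length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query: str, db: str, k: int) -> list[tuple[int, int]]:
    """Return (query_pos, db_pos) for every exact k-word match."""
    index = build_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):          # one pass over the database
        for i in index.get(db[j:j + k], ()):
            seeds.append((i, j))
    return seeds


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"                       # assumed orientation, see lead-in
print(find_seeds(query, db, k=4))              # [(8, 6), (9, 7), (10, 8)]
```

The three seeds are the overlapping words AGGT, GGTC, and GTCC; each would initiate a local alignment in step 2.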

Index-based Local Alignment: BLAST
Dictionary: all words of length k (~11). Alignment is initiated between exact-matching words (more generally, between words of alignment score > T)
Alignment: ungapped extensions until the score falls below a statistical threshold
Output: all local alignments with score > statistical threshold
(Figure: the query is scanned against the DB.)

Index-based Local Alignment: BLAST
Query: A C G A A G T A A G G T C C A G T
DB: C C C T T C C T G G A T T G C G A
Example: k = 4, T = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until the alignment falls below 50%
Output:
GTAAGGTCC
GTTAGGTCC
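
Ungapped extension of a seed can be sketched as follows. Several things here are assumptions rather than the slide's exact rule: the stopping criterion is a simple drop-off (`xdrop`, stop once the running score falls more than 2 below the best seen) swapped in for the slide's 50% rule, the scores are m = s = 1, and `db` is the slide's DB string read in reverse. The function names are hypothetical.

```python
def _extend(q, d, qi, dj, step, m, s, xdrop):
    """Walk from (qi, dj) in direction step; return (best gain, length at best)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(q) and 0 <= dj < len(d):
        gain += m if q[qi] == d[dj] else -s
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length   # new best extension
        elif best_gain - gain > xdrop:
            break                                # fell too far below the best
        qi += step
        dj += step
    return best_gain, best_len


def extend_seed(q, d, i, j, k, m=1, s=1, xdrop=2):
    """Extend the seed q[i:i+k] == d[j:j+k] left and right without gaps."""
    right_gain, right_len = _extend(q, d, i + k, j + k, +1, m, s, xdrop)
    left_gain, left_len = _extend(q, d, i - 1, j - 1, -1, m, s, xdrop)
    score = k * m + left_gain + right_gain
    q_start, d_start = i - left_len, j - left_len
    length = left_len + k + right_len
    return score, q[q_start:q_start + length], d[d_start:d_start + length]


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"                          # assumed orientation
# The word GGTC starts at query position 9 and db position 7.
print(extend_seed(query, db, 9, 7, k=4))          # (7, 'GTAAGGTCC', 'GTTAGGTCC')
```

Under these assumptions the extension stops at the same pair of strings as the slide's output: eight matches and one mismatch.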

Gapped BLAST
Query: A C G A A G T A A G G T C C A G T
DB: C T G A T C C T G G A T T G C G A
Added features:
• Pairs of words can initiate alignment
• Nearby alignments are merged
• Extensions with gaps until the score drops more than T below the best score so far
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT

Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences); 1,726,556 sequences; 8,074,398,388 total letters
>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24
         ||||||| |||||||||||||
Sbjct:   tacacccagattacaccccga
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24
         ||||||| |||||||||||||
Sbjct:   tacacccagattacaccccga
>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga

Efficient global alignment

Global alignment with the chaining approach
1. Find local alignments
2. Chain them into a rough global map
3. Align regions in-between
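
Step 2, chaining, can be illustrated with a simple quadratic-time dynamic program over the local alignments. This is a sketch and not LAGAN's actual procedure: the tuple layout `(x_start, x_end, y_start, y_end, score)` with exclusive ends, the function name `best_chain`, and the toy alignments are all assumptions, and every alignment is assumed to have positive length.

```python
def best_chain(alignments):
    """Highest-scoring chain of local alignments that are ordered in both sequences.

    alignments: list of (x_start, x_end, y_start, y_end, score), ends exclusive.
    Returns (total_score, chain) with the chain listed left to right.
    """
    order = sorted(range(len(alignments)), key=lambda t: alignments[t][0])
    best, prev = {}, {}
    for pos, a in enumerate(order):
        xs, _, ys, _, score = alignments[a]
        best[a], prev[a] = score, None            # chain consisting of a alone
        for b in order[:pos]:                     # candidates that start earlier in x
            _, bxe, _, bye, _ = alignments[b]
            # b may precede a only if it ends before a starts in both x and y
            if bxe <= xs and bye <= ys and best[b] + score > best[a]:
                best[a], prev[a] = best[b] + score, b
    if not best:
        return 0, []
    cur = max(best, key=best.get)
    total, chain = best[cur], []
    while cur is not None:                        # walk the predecessor links back
        chain.append(alignments[cur])
        cur = prev[cur]
    return total, chain[::-1]


hits = [
    (0, 10, 0, 10, 8),
    (12, 20, 50, 60, 9),    # off-diagonal hit, inconsistent with the later ones
    (15, 25, 12, 22, 7),
    (30, 40, 25, 35, 6),
]
print(best_chain(hits)[0])  # 21
```

The gaps between consecutive chained hits are the "regions in-between" that step 3 aligns with restricted dynamic programming.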

LAGAN: 1. FIND Local Alignments 1.Find Local Alignments 2.Chain Local Alignments 3.Restricted DP Mike Brudno, Chuong Do, et al.

LAGAN: 2. CHAIN Local Alignments 1.Find Local Alignments 2.Chain Local Alignments 3.Restricted DP Mike Brudno, Chuong Do, et al.

LAGAN: 3. Restricted DP 1.Find Local Alignments 2.Chain Local Alignments 3.Restricted DP Mike Brudno, Chuong Do, et al.

Multiple Alignment

Definition
Given N sequences $x_1, x_2, \ldots, x_N$:
• Insert gaps (-) in each sequence $x_i$, such that
  • All sequences have the same length L
  • Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can help improve the pairwise alignments

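Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment. A pairwise alignment induced by the multiple alignment (columns in which both sequences have a gap are dropped)
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
Given sequences $x_1, \ldots, x_N$ aligned in a multiple alignment $m$,
$$S(m) = \sum_{k<l} w_{kl}\, s(x_k, x_l)$$
where $s(x_k, x_l)$ is the score of the induced pairwise alignment of $x_k$ and $x_l$, and $w_{kl}$ is its weight.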
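
A small Python sketch of the sum-of-pairs score, using the three-sequence example from this slide. The function names, the unit weights $w_{kl} = 1$, and the pairwise scores m = s = d = 1 are assumptions chosen for illustration.

```python
from itertools import combinations


def induced_pair(a: str, b: str) -> tuple[str, str]:
    """Pairwise alignment induced by two rows: drop columns where both are gaps."""
    cols = [(ca, cb) for ca, cb in zip(a, b) if (ca, cb) != ("-", "-")]
    return "".join(ca for ca, _ in cols), "".join(cb for _, cb in cols)


def pair_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """+m per match, -s per mismatch, -d per gap column."""
    score = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            score -= d
        elif ca == cb:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, weight=lambda k, l: 1.0) -> float:
    """S(m) = sum over k < l of w_kl * s(x_k, x_l) on the induced alignments."""
    total = 0.0
    for k, l in combinations(range(len(rows)), 2):
        a, b = induced_pair(rows[k], rows[l])
        total += weight(k, l) * pair_score(a, b)
    return total


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z from the slide
print(induced_pair(rows[0], rows[1]))            # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(rows))                        # 5.0
```

With these scores the three induced pairs contribute 2 (x, y), -1 (x, z), and 4 (y, z).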

A Profile Representation
Given a multiple alignment $M = m_1 \ldots m_n$:
• Replace each column $m_i$ with profile entry $p_i$
  • Frequency of each letter in $\Sigma$
  • # gaps
• Can think of this as a “likelihood” of each letter in each position
- A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A C G T
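
Building a profile from an alignment is a per-column frequency count. The sketch below assumes the alphabet `ACGT` plus the gap symbol `-` and reuses the three-row alignment from the sum-of-pairs slide as input. The function name `profile` is hypothetical.

```python
from collections import Counter


def profile(rows: list[str], alphabet: str = "ACGT-") -> list[dict[str, float]]:
    """One entry per column: the frequency of each letter and of the gap symbol."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of a multiple alignment must have the same length")
    entries = []
    for column in zip(*rows):                     # iterate over alignment columns
        counts = Counter(column)
        entries.append({c: counts[c] / len(rows) for c in alphabet})
    return entries


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
p = profile(rows)
print(p[1]["C"])   # 1.0: every row has C in the second column
```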

Multiple Sequence Alignments Algorithms

Multidimensional DP
Generalization of Needleman-Wunsch:
• $S(m) = \sum_i S(m_i)$ (sum of column scores)
• $F(i_1, i_2, \ldots, i_N)$: optimal alignment up to $(i_1, \ldots, i_N)$
• $F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

Multidimensional DP
Example: in 3D (three sequences): 7 neighbors/cell
$$F(i, j, k) = \max \begin{cases} F(i-1, j-1, k-1) + S(x_i, x_j, x_k) \\ F(i-1, j-1, k) + S(x_i, x_j, -) \\ F(i-1, j, k-1) + S(x_i, -, x_k) \\ F(i-1, j, k) + S(x_i, -, -) \\ F(i, j-1, k-1) + S(-, x_j, x_k) \\ F(i, j-1, k) + S(-, x_j, -) \\ F(i, j, k-1) + S(-, -, x_k) \end{cases}$$

Multidimensional DP
Running time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
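
The neighbor count can be made concrete by enumerating the predecessor offsets of a cell: every 0/1 vector of length N except the all-zero one. This sketch only counts work and does not perform the alignment. The names `predecessor_offsets` and `dp_work` are hypothetical.

```python
from itertools import product


def predecessor_offsets(n: int) -> list[tuple[int, ...]]:
    """Nonzero 0/1 vectors of length n.

    A 1 in position t means sequence t contributes a letter to the column;
    a 0 means it contributes a gap. The all-zero vector is excluded.
    """
    return [delta for delta in product((0, 1), repeat=n) if any(delta)]


def dp_work(length: int, n: int) -> int:
    """Cells times neighbors per cell: L^N * (2^N - 1)."""
    return length ** n * (2 ** n - 1)


print(len(predecessor_offsets(3)))   # 7
print(dp_work(100, 3))               # 7000000
```

Even three sequences of length 100 need seven million neighbor evaluations, and each added sequence multiplies the cell count by L.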

Progressive Alignment
When the evolutionary tree is known:
• Align closest first, in the order of the tree
• In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{result}$
Weighted version:
• Tree edges have weights, proportional to the divergence in that edge
• New profile is a weighted average of two old profiles
(Figure: tree over leaves x, w, y, z with internal profiles $p_{xy}$, $p_{zw}$, $p_{xyzw}$.)

Progressive Alignment
When the evolutionary tree is unknown:
• Perform all pairwise alignments
• Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
• Construct a tree
• Align on the tree
(Figure: leaves x, w, y, z joined by an unknown tree.)
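
The "construct a tree" step could be sketched with average-linkage clustering (UPGMA-style) over the distance matrix D. The slide does not name a tree-building method, so this choice is an assumption, as are the function name `guide_tree` and the toy distances. D is taken as already computed from the pairwise alignments.

```python
def guide_tree(names: list[str], D: list[list[float]]):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    clusters = {i: (names[i], 1) for i in range(len(names))}   # id -> (subtree, size)
    dist = {(i, j): D[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)            # closest pair, stored with a < b
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:                        # distance from the merged cluster
            dac = dist.pop((min(a, c), max(a, c)))
            dbc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        del dist[(a, b)]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]


names = ["x", "y", "z", "w"]
D = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 3],
    [8, 8, 3, 0],
]
print(guide_tree(names, D))   # (('x', 'y'), ('z', 'w'))
```

Reading the nested tuples from the inside out gives the alignment order: x with y, z with w, then the two resulting profiles with each other.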

Some useful sites
Genome browsers
• Ensembl:
• UCSC: genome.ucsc.edu/cgi-bin/hgGateway
Genomic alignment
• LAGAN: lagan.stanford.edu
• MAVID: baboon.math.berkeley.edu/mavid
Protein multiple alignment
• MUSCLE:
• ProbCons: probcons.stanford.edu

Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

Local & Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Local Global

Glocal Alignment Problem
Find least cost transformation of one sequence into another using shuffle operations:
• Sequence edits
• Inversions
• Translocations
• Duplications
• Combinations of above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

SLAGAN: 1. Find Local Alignments 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts Mike Brudno, Sanket Malde, et al.

SLAGAN: 2. Build Homology Map 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts Mike Brudno, Sanket Malde, et al.

SLAGAN: 3. Global Alignment 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts Mike Brudno, Sanket Malde, et al.

SLAGAN Example: Chromosome 20 Human Chromosome 20 versus Mouse Chromosome Segments of conserved synteny 70 Inversions

SLAGAN example: HOX cluster 10 paralogous genes Conserved order in Human/Mouse/Rat

Examples of shuffled regions Hum/Mus Hum/Rat

More DNA is coming…