Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

Presentation transcript:

Rapid Global Alignments How to align genomic sequences in (more or less) linear time

Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

The Problem: Find a Chain of Local Alignments
(x, y) → (x', y') requires x < x' and y < y'
Each local alignment has a weight
FIND the chain with highest total weight

Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of chain ending in j
L: list of triplets (l_j, V(j), j)
 - L is sorted by l_j
 - L is implemented as a balanced binary tree

Sparse DP for rectangle chaining
Main idea:
Sweep through x-coordinates
To the right of b, anything chainable to a is chainable to b
Therefore, if V(b) > V(a), rectangle a is “useless”: remove it
In L, keep rectangles j sorted with increasing l_j-coordinates, which also keeps them sorted with increasing V(j)

Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from left to right:
1. When on the leftmost end of rectangle i, compute V(i)
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i, possibly store V(i) in L:
   a. j: rectangle in L, with largest l_j ≤ l_i
   b. If V(i) > V(j):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_k, V(k), k) with V(k) ≤ V(i) & l_k ≥ l_i
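The sweep above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the tuple layout `(x_start, x_end, h, l, w)`, the function name `chain_rectangles`, and the assumption of positive weights are all choices made here. A sorted Python list stands in for the balanced binary tree, so searches are O(log N) but list insertions are not; a real balanced tree would be needed to reach the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_start, x_end, h, l, w) with x_start <= x_end, h <= l, w > 0.
    Returns (best_score, chain) where chain is a list of rectangle indices."""
    n = len(rects)
    if n == 0:
        return 0, []
    V = [0] * n            # V[i]: best score of a chain ending in rectangle i
    prev = [None] * n      # predecessor of i in that chain
    events = []
    for i, (xs, xe, h, l, w) in enumerate(rects):
        events.append((xs, 0, i))   # 0 = left end, sorts before a right end at the same x
        events.append((xe, 1, i))   # 1 = right end
    events.sort()
    ls, ids = [], []       # the list L: l-values (increasing) and their rectangle ids
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            p = bisect_left(ls, h) - 1          # entry with largest l_j < h_i
            if p >= 0:
                V[i], prev[i] = w + V[ids[p]], ids[p]
            else:
                V[i] = w
        else:
            p = bisect_right(ls, l) - 1         # entry with largest l_j <= l_i
            if p < 0 or V[i] > V[ids[p]]:
                start = bisect_left(ls, l)
                end = start
                # dominated entries (l_k >= l_i, V(k) <= V(i)) are consecutive
                while end < len(ls) and V[ids[end]] <= V[i]:
                    end += 1
                ls[start:end] = [l]
                ids[start:end] = [i]
    best = max(range(n), key=V.__getitem__)
    chain, i = [], best
    while i is not None:
        chain.append(i)
        i = prev[i]
    return V[best], chain[::-1]

rects = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (6, 8, 1, 2, 4), (6, 9, 6, 8, 3)]
print(chain_rectangles(rects))
```

Left-end events are processed before right-end events at the same x so that the strict requirement x < x' is respected.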

Example (figure: rectangles in the x-y plane, labeled rectangle: weight)
1: 5, 2: 6, 3: 3, 4: 4, 5:

Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   - Searching L takes log N
   - Inserting to L takes log N
   - All deletions are consecutive, so log N per deletion
   - Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

Putting it All Together: Fast Global Alignment Algorithms
1. FIND local alignments
2. CHAIN local alignments

Program | FIND | CHAIN
GLASS | k-mers | hierarchical DP
MumMer | Suffix Tree | sparse DP
Avid | Suffix Tree | hierarchical DP
LAGAN | CHAOS | sparse DP

LAGAN: Pairwise Alignment
1. FIND local alignments
2. CHAIN local alignments
3. DP restricted around chain

LAGAN
1. Find local alignments
2. Chain: O(N log N) L.I.S.
3. Restricted DP
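Reading "L.I.S." as longest increasing subsequence, the chaining step can be illustrated for the simplified case where every local alignment is reduced to a single unweighted anchor point (x, y). The sketch below is an assumption-laden illustration, not LAGAN's code: the name `longest_increasing_chain` and the point representation are hypothetical.

```python
from bisect import bisect_left

def longest_increasing_chain(anchors):
    """anchors: list of (x, y) points.
    Returns a longest chain with strictly increasing x and strictly increasing y."""
    # Sort by x; for equal x, sort y descending so two anchors sharing an x
    # can never both be picked by the strictly-increasing-y search.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tails_y = []     # tails_y[k]: smallest final y over all chains of length k + 1
    tails_idx = []   # index in pts of the anchor achieving tails_y[k]
    prev = [None] * len(pts)
    for i, (_, y) in enumerate(pts):
        k = bisect_left(tails_y, y)            # O(log N) search per anchor
        prev[i] = tails_idx[k - 1] if k > 0 else None
        if k == len(tails_y):
            tails_y.append(y)
            tails_idx.append(i)
        else:
            tails_y[k] = y
            tails_idx[k] = i
    chain = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]

print(longest_increasing_chain([(1, 1), (2, 5), (3, 2), (4, 3), (5, 7), (6, 4)]))
```

Sorting costs O(N log N) and each anchor needs one binary search, which matches the bound quoted on the slide.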

LAGAN: recursive call
What if a box is too large?
 - Recursive application of LAGAN, more sensitive word search

A trick to save on memory “necks” have tiny tracebacks …only store tracebacks

Multiple Sequence Alignments

Sequence Comparison
Introduction
Comparison
 - Homology -- Analogy
 - Identity -- Similarity
 - Pairwise -- Multiple
 - Scoring matrices
 - Gap -- indel
 - Global -- Local
Manual alignment, dot plot
 - visual inspection
Dynamic programming
 - Needleman-Wunsch: exhaustive global alignment
 - Smith-Waterman: exhaustive local alignment
Multiple alignment
Database search
 - BLAST
 - FASTA

Sequence Comparison: Multiple alignment (Multiple sequence alignment: MSA)

Application | Procedure
Extrapolation | Allocation of an uncharacterized sequence to a protein family.
Phylogenetic analysis | Reconstruction of the history of closely related proteins and protein families.
Pattern identification | Identification of regions characteristic of a function by conserved positions.
Domain identification | Turning MSA into a domain or protein family specific profile may be useful in identifying new or remote family members.
DNA regulatory elements | Turning DNA-MSAs of a binding site into a weight matrix may be used in scanning other DNA sequences for potential similar binding sites.
Structure prediction | Good MSAs yield high quality prediction of secondary structure and help building 3D models.
PCR analysis | Identification of less degenerate regions of a protein family is useful in fishing out new members by PCR (primer design).

Overview Definition Scoring Schemes Algorithms

Definition
Given N sequences x_1, x_2, …, x_N:
 - Insert gaps (-) in each sequence x_i, such that
   - All sequences have the same length L
   - Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can help improve the pairwise alignments

Scoring Function
Ideally:
 - Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model
More on phylogenetic models later
(Figure: sequences x, y, z, w, v and an unknown ancestor “?”)

Scoring Function
A comprehensive model would have too many parameters and be too inefficient to optimize
Possible simplifications
 - Ignore phylogenetic tree
 - Statistically independent columns:
   S(m) = G(m) + Σ_i S(m_i)
   m: alignment matrix
   G: function penalizing gaps

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
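An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A small sketch of that projection, using a hypothetical helper name `induced_pairwise`:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment
    by removing the columns where both rows contain a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for name, (p, q) in {"x,y": (x, y), "x,z": (x, z), "y,z": (y, z)}.items():
    print(name, induced_pairwise(p, q))
```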

Sum Of Pairs (cont’d)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
S(m) = Σ_{k<l} s(m_k, m_l)
s(m_k, m_l): score of induced alignment (k, l)

Sum Of Pairs (cont’d)
Heuristic way to incorporate evolution tree:
(Figure: tree with leaves Human, Mouse, Chicken, Duck)
Weighted SOP:
S(m) = Σ_{k<l} w_kl s(m_k, m_l)
w_kl: weight decreasing with distance
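The plain and weighted sum-of-pairs scores can be computed directly from the rows of an alignment. The sketch below assumes a simple pairwise scoring scheme (match +1, mismatch -1, linear gap -2) that is not taken from the slides; the function names and the `weights` dictionary keyed by row-index pairs are likewise illustrative.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by two rows.
    Columns where both rows have a gap are skipped (they are not part of it)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(m_k, m_l); w_kl defaults to 1."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(k, l)]
        total += w * pair_score(rows[k], rows[l])
    return total

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(rows))
print(sum_of_pairs(rows, weights={(0, 1): 1, (0, 2): 0.5, (1, 2): 2}))
```

In the weighted call, the pair (1, 2) counts double and the pair (0, 2) counts half, mimicking weights that decrease with evolutionary distance.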

Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find optimal consensus string m* to maximize
S(m) = Σ_i s(m*, m_i)
s(m_k, m_l): score of pairwise alignment (k, l)

Multiple Sequence Alignments Algorithms

Multiple sequence alignment - Computational complexity
Sequence Comparison: Multiple alignment
(Figure labels: V S N S; ANSANS; _ S N A)

Multiple sequence alignment - Computational complexity
Sequence Comparison: Multiple alignment
Alignment of protein sequences with 200 amino acids using dynamic programming

# of sequences | CPU time (approx.)
2 | 1 sec
 | sec – 2.8 hours
 | sec – 11.6 days
 | sec – 3.2 years
 | sec – 371 years

Approximate methods for MSA
Sequence Comparison: Multiple alignment
 - Multidimensional dynamic programming (MSA, Lipman 1988)
 - Progressive alignments (Clustalw, Higgins 1996; PileUp, Genetics Computer Group (GCG))
 - Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others)
 - Iterative methods (e.g. PRRP, Gotoh 1996)
 - Statistical methods (e.g. Bayesian Hidden Markov Models)

Multiple sequence alignment - Programs
Sequence Comparison: Multiple alignment
(Figure: programs grouped by method)
Programs: OMA, Combalign, DCA, T-Coffee, Clustal, Dalign, MSA, Interalign, Prrp, Sam, HMMER, GA, SAGA
Methods: Multidimensional dynamic programming, Progressive, Iterative, HMMs, GAs
Groupings: Non tree based, Tree based

Multiple sequence alignment - Computational complexity
Sequence Comparison: Multiple alignment

Program | Seq type | Alignment | Method | Comment
ClustalW | Prot/DNA | Global | Progressive | No format limitation. Runs on Windows too!
PileUp | Prot/DNA | Global | Progressive | Limited by the format and UNIX based
MultAlin | Prot/DNA | Global | Progressive/Iterative | Limited by the format
T-COFFEE | Prot/DNA | Global/local | Progressive | Can be slow

1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
S(m) = Σ_i S(m_i) (sum of column scores)
F(i_1, i_2, …, i_N) = max over all neighbors of the cube of (F(nbr) + S(nbr))

1. Multidimensional Dynamic Programming
Example: in 3D (three sequences): 7 neighbors/cell
F(i,j,k) = max{ F(i-1,j-1,k-1) + S(x_i, x_j, x_k),
                F(i-1,j-1,k  ) + S(x_i, x_j, -  ),
                F(i-1,j  ,k-1) + S(x_i, -,   x_k),
                F(i-1,j  ,k  ) + S(x_i, -,   -  ),
                F(i  ,j-1,k-1) + S(-,   x_j, x_k),
                F(i  ,j-1,k  ) + S(-,   x_j, -  ),
                F(i  ,j  ,k-1) + S(-,   -,   x_k) }

1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: L^N, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: 2^N - 1
Therefore: O(2^N L^N)
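For three sequences the recurrence can be written out in a few lines. The sketch below is an illustration under stated assumptions: the column score is taken to be the sum of pairwise scores within the column (match +1, mismatch -1, gap -2, gap against gap ignored), which the slides do not prescribe, and the function names are hypothetical. It returns only the optimal score, not the alignment.

```python
from itertools import product

def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Assumed column score: sum of the three pairwise scores in the column."""
    score = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        score += gap if "-" in (p, q) else (match if p == q else mismatch)
    return score

def align3_score(x, y, z):
    """Optimal score of a three-way global alignment by 3D dynamic programming."""
    # The 2^3 - 1 = 7 neighbor moves: each sequence either advances (1) or gaps (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    neg_inf = float("-inf")
    F = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    # Lexicographic order guarantees every predecessor cell is filled first.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            cand = F[pi][pj][pk] + column_score(a, b, c)
            if cand > F[i][j][k]:
                F[i][j][k] = cand
    return F[len(x)][len(y)][len(z)]

print(align3_score("AC", "AC", "AC"))
```

The table has (L+1)^3 cells and each cell inspects 7 neighbors, which is the N = 3 instance of the O(2^N L^N) bound.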

2. Progressive Alignment
Multiple Alignment is NP-complete
Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences x_i, x_j
2. Fix that alignment
3. Align a third sequence x_k to the alignment x_i, x_j
4. Repeat until all sequences are aligned
Running Time: O(N L^2)

2. Progressive Alignment
When evolutionary tree is known:
 - Align closest first, in the order of the tree
Example:
Order of alignments:
1. (x, y)
2. (z, w)
3. (xy, zw)
(Figure: tree with leaves x, y, z, w)

CLUSTALW: progressive alignment
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all d_ij: alignment dist (x_i, x_j)
2. Construct a tree (Neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics

CLUSTALW & the CINEMA viewer

MLAGAN: progressive alignment of DNA
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
(Figure: tree with leaves Human, Baboon, Mouse, Rat)

MLAGAN: main steps
Given a collection of sequences, and a phylogenetic tree
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment
   - Multi-Anchoring based on reconciling the pairwise anchors
   - LAGAN-style limited-area DP
4. Optional refinement steps

MLAGAN: multi-anchoring
To anchor the (X/Y) and (Z) alignments:
(Figure panels: X vs. Z, Y vs. Z, X/Y vs. Z)

Heuristics to improve multiple alignments
 - Iterative refinement schemes
 - A*-based search
 - Consistency
 - Simulated Annealing
 - …

Iterative Refinement
One problem of progressive alignment:
Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT
Frozen!
z: GAACTG
w: GTACTG
Now it is clear that the correct y = GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar x_i, x_j
2. Align x_k most similar to (x_i x_j)
3. Repeat 2 until (x_1 … x_N) are aligned
4. For j = 1 to N, remove x_j, and realign to x_1 … x_{j-1} x_{j+1} … x_N
5. Repeat 4 until convergence
Note: Guaranteed to converge

Iterative Refinement
For each sequence y
1. Remove y
2. Realign y (while rest fixed)
(Figure: x, z fixed projection; allow y to vary)

Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA

Iterative Refinement
Example not handled well:
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Realigning any single y_i changes nothing

Restricted MDP
Here is another way to improve a multiple alignment:
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: O(2^N R^(N-1) L)
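The restriction idea is easiest to see for N = 2, where the formula gives O(R L) work. The sketch below is a two-sequence illustration only, not the multidimensional procedure itself: it takes an existing pairwise alignment as the seed m, allows only DP cells whose column index is within `radius` of a cell on the seed path, and reruns the DP over those cells. The scoring values and all function names are assumptions made here.

```python
def path_cells(row_a, row_b):
    """DP cells (i, j) visited by an existing pairwise alignment."""
    i = j = 0
    cells = [(0, 0)]
    for a, b in zip(row_a, row_b):
        i += a != "-"
        j += b != "-"
        cells.append((i, j))
    return cells

def restricted_nw(x, y, seed_a, seed_b, radius, match=1, mismatch=-1, gap=-2):
    """Global alignment score of x and y using only cells near the seed alignment."""
    allowed = set()
    for i, j in path_cells(seed_a, seed_b):
        for dj in range(-radius, radius + 1):
            if 0 <= j + dj <= len(y):
                allowed.add((i, j + dj))
    F = {(0, 0): 0}
    for i, j in sorted(allowed):          # lexicographic order: predecessors come first
        if (i, j) == (0, 0):
            continue
        cands = []
        if (i - 1, j - 1) in F:
            cands.append(F[(i - 1, j - 1)] + (match if x[i - 1] == y[j - 1] else mismatch))
        if (i - 1, j) in F:
            cands.append(F[(i - 1, j)] + gap)
        if (i, j - 1) in F:
            cands.append(F[(i, j - 1)] + gap)
        if cands:
            F[(i, j)] = max(cands)
    return F[(len(x), len(y))]

# A deliberately shifted seed alignment of two identical sequences.
print(restricted_nw("ACGT", "ACGT", "ACGT-", "-ACGT", radius=0))
print(restricted_nw("ACGT", "ACGT", "ACGT-", "-ACGT", radius=1))
```

With radius 0 the DP can only retrace the seed; with radius 1 the band already contains the main diagonal, so the shifted seed is repaired.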

Restricted MDP
Run MDP, restricted to radius R from m
(Figure: axes x, y, z)
Running Time: O(2^N R^(N-1) L)

Restricted MDP
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Within radius 1 of the optimal → Restricted MDP will fix it.

Optional refinement steps in MLAGAN
 - Limited-area iterative refinement
 - Radius-r 3-sequence refinement on each node of the tree