Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)

The Problem: Find a Chain of Local Alignments
(x,y) → (x',y') requires x < x' and y < y'
Each local alignment has a weight
FIND the chain with highest total weight

Sparse DP for rectangle chaining
- 1, …, N: rectangles
- (h_j, l_j): y-coordinates of rectangle j
- w(j): weight of rectangle j
- V(j): optimal score of chain ending in j
- L: list of triplets (l_j, V(j), j)
  - L is sorted by l_j
  - L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

Sparse DP for rectangle chaining
Main idea:
- Sweep through x-coordinates
- To the right of b, anything chainable to a is chainable to b
- Therefore, if V(b) > V(a), rectangle a is "useless": remove it
- In L, keep rectangles j sorted with increasing l_j-coordinates ⇒ sorted with increasing V(j)
[Figure: rectangles a and b with scores V(a) and V(b)]

Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from left to right:
1. When on the leftmost end of rectangle i, compute V(i)
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j) (take V(j) = 0 if no such j exists)
2. When on the rightmost end of i, possibly store V(i) in L:
   a. j: rectangle in L, with largest l_j ≤ l_i
   b. If V(i) > V(j):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_k, V(k), k) with V(k) ≤ V(i) and l_k ≥ l_i
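The sweep above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation: L is held in plain sorted Python lists searched with `bisect` (so insertions and deletions are not O(log N) as they would be in a balanced binary tree), weights are assumed positive, left ends are processed before right ends at equal x, and the rectangles in `example_rects` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_start, x_end, h, l, w) with h <= l.
    Returns (best_score, chain), where chain lists rectangle indices in order."""
    events = []
    for i, (xs, xe, h, l, w) in enumerate(rects):
        events.append((xs, 0, i))  # 0 = left end, sorts before right ends at equal x
        events.append((xe, 1, i))  # 1 = right end
    events.sort()

    keys, vals, ids = [], [], []   # the list L: l_j, V(j), j, sorted by l_j
    V = [0] * len(rects)
    pred = [None] * len(rects)

    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            # Step 1: best predecessor j is the entry with the largest l_j < h_i.
            p = bisect_left(keys, h) - 1
            V[i] = w + (vals[p] if p >= 0 else 0)
            pred[i] = ids[p] if p >= 0 else None
        else:
            # Step 2: compare against the entry with the largest l_j <= l_i.
            p = bisect_right(keys, l) - 1
            if p >= 0 and vals[p] >= V[i]:
                continue  # rectangle i is dominated, do not store it
            lo = bisect_left(keys, l)  # first entry with l_k >= l_i
            hi = lo
            while hi < len(keys) and vals[hi] <= V[i]:
                hi += 1                # dominated entries are consecutive
            keys[lo:hi] = [l]
            vals[lo:hi] = [V[i]]
            ids[lo:hi] = [i]

    if not rects:
        return 0, []
    best = max(range(len(rects)), key=V.__getitem__)
    chain, node = [], best
    while node is not None:
        chain.append(node)
        node = pred[node]
    return V[best], chain[::-1]

# Hypothetical rectangles: (x_start, x_end, h, l, weight)
example_rects = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3),
                 (6, 8, 1, 2, 4), (6, 9, 9, 10, 2)]
print(chain_rectangles(example_rects))
```

Replacing the three parallel lists with a balanced search tree keyed on l_j would recover the O(N log N) bound discussed in the time analysis.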

Example
[Figure: five rectangles in the x-y plane with weights 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; coordinate ticks 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   - Searching L takes log N
   - Inserting to L takes log N
   - All deletions are consecutive, so log N per deletion
   - Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

Putting it All Together: Fast Global Alignment Algorithms
1. FIND local alignments
2. CHAIN local alignments
Program   FIND          CHAIN
GLASS     k-mers        hierarchical DP
MUMmer    Suffix Tree   sparse DP
Avid      Suffix Tree   hierarchical DP
LAGAN     CHAOS         sparse DP

LAGAN: Pairwise Alignment
1. FIND local alignments
2. CHAIN local alignments
3. DP restricted around chain

LAGAN
1. Find local alignments
2. Chain: O(N log N) L.I.S. (longest increasing subsequence)
3. Restricted DP
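The O(N log N) L.I.S. step can be illustrated with the standard binary-search formulation of longest increasing subsequence. This is a sketch under simplifying assumptions that are not from the slides: anchors are reduced to points already sorted by x, all anchors have equal weight, and `values` holds their y-coordinates.

```python
from bisect import bisect_left

def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence of values."""
    tails = []       # tails[k]: smallest last value of an increasing run of length k + 1
    tails_idx = []   # index in values of that last value
    prev = [None] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tails, v)          # O(log N) search
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else None
    out = []
    node = tails_idx[-1] if tails_idx else None
    while node is not None:                # follow back-pointers from the last element
        out.append(values[node])
        node = prev[node]
    return out[::-1]

# y-coordinates of hypothetical anchors, listed in increasing x order
print(longest_increasing_subsequence([3, 1, 4, 2, 5, 9, 6]))
```

With weighted anchors or rectangles the problem becomes the sparse DP chaining described earlier rather than plain L.I.S.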

LAGAN: recursive call
What if a box is too large?
→ Recursive application of LAGAN, more sensitive word search

A trick to save on memory “necks” have tiny tracebacks …only store tracebacks

Multiple Sequence Alignments

Sequence Comparison: Introduction
Comparison
- Homology -- Analogy
- Identity -- Similarity
- Pairwise -- Multiple
- Scoring matrices
- Gap -- indel
- Global -- Local
Manual alignment, dot plot
- visual inspection
Dynamic programming
- Needleman-Wunsch: exhaustive global alignment
- Smith-Waterman: exhaustive local alignment
Multiple alignment
Database search
- BLAST
- FASTA

Sequence Comparison: Multiple alignment (Multiple sequence alignment: MSA)
Application: Procedure
- Extrapolation: Allocation of an uncharacterized sequence to a protein family.
- Phylogenetic analysis: Reconstruction of the history of closely related proteins and protein families.
- Pattern identification: Identification of regions characteristic of a function by conserved positions.
- Domain identification: Turning an MSA into a domain or protein family specific profile may be useful in identifying new or remote family members.
- DNA regulatory elements: Turning DNA MSAs of a binding site into a weight matrix may be used in scanning other DNA sequences for potential similar binding sites.
- Structure prediction: Good MSAs yield high quality prediction of secondary structure and help in building 3D models.
- PCR analysis: Identifying less degenerate regions of a protein family is useful for fishing out new members by PCR (primer design).

Overview
- Definition
- Scoring Schemes
- Algorithms

Definition
Given N sequences x_1, x_2, …, x_N:
- Insert gaps (-) in each sequence x_i, such that
  - All sequences have the same length L
  - Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can help improve the pairwise alignments

Scoring Function
Ideally:
- Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model
More on phylogenetic models later
[Figure: tree relating sequences x, y, z, w, v, with an unknown node marked "?"]

Scoring Function
A comprehensive model would have too many parameters and be too inefficient to optimize
Possible simplifications:
- Ignore phylogenetic tree
- Statistically independent columns: $S(m) = G(m) + \sum_i S(m_i)$
  m: alignment matrix
  G: function penalizing gaps

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
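Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch (the function name `induced_pairwise` is chosen here for illustration):

```python
def induced_pairwise(row_a, row_b):
    """Project two rows of a multiple alignment onto a pairwise alignment
    by removing the columns where both rows contain a gap."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for first, second in ((x, y), (x, z), (y, z)):
    print(induced_pairwise(first, second))
```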

Sum Of Pairs (cont'd)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
$S(m) = \sum_{k<l} s(m_k, m_l)$
$s(m_k, m_l)$: score of induced alignment (k,l)
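A sum-of-pairs score can be sketched as below. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions made for illustration, not taken from the slides, and the optional `weights` argument is a hypothetical dictionary keyed by row-index pairs (k, l) that turns the plain sum into a weighted one.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2   # assumed scoring scheme

def pair_score(row_a, row_b):
    """Score of the pairwise alignment induced by two rows."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue               # column vanishes in the induced alignment
        if a == "-" or b == "-":
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

def sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(m_k, m_l), with w_kl = 1 by default."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(k, l)]
        total += w * pair_score(rows[k], rows[l])
    return total

msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(msa))
```

Under these assumed scores the three induced alignments contribute 0, -4 and 3, so the unweighted total is -1.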

Sum Of Pairs (cont'd)
Heuristic way to incorporate evolution tree:
[Figure: tree with leaves Human, Mouse, Chicken, Duck]
Weighted SOP: $S(m) = \sum_{k<l} w_{kl}\, s(m_k, m_l)$
$w_{kl}$: weight decreasing with distance

Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find optimal consensus string m* to maximize $S(m) = \sum_i s(m^*, m_i)$
$s(m_k, m_l)$: score of pairwise alignment (k,l)
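For an alignment that is already fixed, a consensus can be sketched column by column. The sketch below assumes a simplified score in which s counts 1 for each column where the consensus symbol equals the row symbol and 0 otherwise, with the gap treated as an ordinary symbol; under that assumption the most frequent symbol in each column maximizes the sum. Ties are broken by whichever symbol appears first in the column. This is an illustration, not necessarily the scoring intended on the slide.

```python
from collections import Counter

def column_consensus(rows):
    """Most frequent symbol per column of an aligned set of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of the alignment must have the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

rows = [
    "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---",
    "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC",
    "CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC",
    "CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC",
]
print(column_consensus(rows))
```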

Multiple Sequence Alignments Algorithms

Multiple sequence alignment - Computational complexity
[Figure: dynamic programming lattice for short example sequences built from the residues V, S, N, A]

Multiple sequence alignment - Computational complexity
Alignment of protein sequences with 200 amino acids using dynamic programming
# of sequences   CPU time (approx.)
2                1 sec
4                10^4 sec (2.8 hours)
5                10^6 sec (11.6 days)
6                10^8 sec (3.2 years)
7                10^10 sec (317 years)

Approximate methods for MSA
- Multidimensional dynamic programming (MSA, Lipman 1988)
- Progressive alignments (ClustalW, Higgins 1996; PileUp, Genetics Computer Group (GCG))
- Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others)
- Iterative methods (e.g. PRRP, Gotoh 1996)
- Statistical methods (e.g. Bayesian Hidden Markov Models)

Multiple sequence alignment - Programs
[Figure: programs grouped by method (multidimensional dynamic programming, progressive, iterative, HMMs, GAs) and by whether they are tree based or non tree based. Programs shown: OMA, Combalign, DCA, T-Coffee, Clustal, Dalign, MSA, Interalign, Prrp, Sam, HMMER, GA, SAGA]

Multiple sequence alignment - Computational complexity
Program    Seq type   Alignment      Method                  Comment
ClustalW   Prot/DNA   Global         Progressive             No format limitation; runs on Windows too
PileUp     Prot/DNA   Global         Progressive             Limited by the format and UNIX based
MultAlin   Prot/DNA   Global         Progressive/Iterative   Limited by the format
T-COFFEE   Prot/DNA   Global/local   Progressive             Can be slow

1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

1. Multidimensional Dynamic Programming
Example: in 3D (three sequences): 7 neighbors/cell
$$
\begin{aligned}
F(i,j,k) = \max\{\; & F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
& F(i-1,j-1,k) + S(x_i, x_j, -), \\
& F(i-1,j,k-1) + S(x_i, -, x_k), \\
& F(i-1,j,k) + S(x_i, -, -), \\
& F(i,j-1,k-1) + S(-, x_j, x_k), \\
& F(i,j-1,k) + S(-, x_j, -), \\
& F(i,j,k-1) + S(-, -, x_k) \;\}
\end{aligned}
$$
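The three-sequence recurrence can be sketched directly in Python. The column score S is assumed here to be a sum-of-pairs score with match +1, mismatch -1, and gap -2 (a gap paired with a gap scores 0); these values and the function names are illustrative choices. The sketch returns only the optimal score, without traceback.

```python
from itertools import combinations, product

MATCH, MISMATCH, GAP = 1, -1, -2   # assumed scoring scheme

def column_score(a, b, c):
    """Sum-of-pairs score of one alignment column."""
    score = 0
    for p, q in combinations((a, b, c), 2):
        if p == "-" and q == "-":
            continue
        score += GAP if "-" in (p, q) else (MATCH if p == q else MISMATCH)
    return score

# The 2^3 - 1 = 7 neighbor moves: each sequence either advances (1) or gets a gap (0).
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences, O(7 * L^3) time."""
    F = {(0, 0, 0): 0}
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            cand = F[(pi, pj, pk)] + column_score(a, b, c)
            if best is None or cand > best:
                best = cand
        F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]

print(align3_score("GAAGTT", "GACTT", "GAACTG"))
```

The dictionary `F` holds every cell of the cube, which is what makes the full method impractical beyond a handful of sequences.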

1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
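As a rough worked instance of this bound, counting one update per neighbor per cell and assuming sequences of length L = 200 (the length used in the CPU-time table above):

```latex
\[
N = 3:\quad L^N \,(2^N - 1) = 200^3 \cdot 7 = 8 \times 10^{6} \cdot 7 = 5.6 \times 10^{7}
\]
\[
N = 7:\quad L^N \,(2^N - 1) = 200^7 \cdot 127 = 1.28 \times 10^{16} \cdot 127 \approx 1.6 \times 10^{18}
\]
```

Each additional sequence multiplies the number of cells by L = 200 and roughly doubles the neighbors per cell.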

2. Progressive Alignment
Multiple Alignment is NP-complete
Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences x_i, x_j
2. Fix that alignment
3. Align a third sequence x_k to the alignment x_i, x_j
4. Repeat until all sequences are aligned
Running Time: $O(N L^2)$

2. Progressive Alignment
When evolutionary tree is known:
- Align closest first, in the order of the tree
Example: Order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
[Figure: tree with leaves x, y, z, w]
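The order of alignments can be read off a known tree with a post-order traversal. In this sketch the tree is represented as nested 2-tuples of leaf names, which is an assumed encoding; without branch lengths the traversal follows the tree structure but cannot decide which of two independent subtrees is "closest".

```python
def alignment_order(tree):
    """Return the list of (left_group, right_group) merges in post-order.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        steps.append((tuple(left_leaves), tuple(right_leaves)))
        return left_leaves + right_leaves

    visit(tree)
    return steps

# The four-leaf example: ((x, y), (z, w))
for step in alignment_order((("x", "y"), ("z", "w"))):
    print(step)
```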

CLUSTALW: progressive alignment
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all d_ij: alignment dist(x_i, x_j)
2. Construct a tree (Neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics

CLUSTALW & the CINEMA viewer

MLAGAN: progressive alignment of DNA
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
[Figure: tree with leaves Human, Baboon, Mouse, Rat]

MLAGAN: main steps
Given a collection of sequences, and a phylogenetic tree
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment
   - Multi-Anchoring based on reconciling the pairwise anchors
   - LAGAN-style limited-area DP
4. Optional refinement steps

MLAGAN: multi-anchoring
To anchor the (X/Y) and (Z) alignments:
[Figure: panels labeled X vs. Z, Y vs. Z, and X/Y vs. Z]

Heuristics to improve multiple alignments
- Iterative refinement schemes
- A*-based search
- Consistency
- Simulated Annealing
- …

Iterative Refinement
One problem of progressive alignment:
- Initial alignments are "frozen" even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT
  z: GAACTG
  w: GTACTG
Frozen!
Now clear: correct y = GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar x_i, x_j
2. Align x_k most similar to (x_i x_j)
3. Repeat 2 until (x_1 … x_N) are aligned
4. For j = 1 to N, remove x_j, and realign to x_1 … x_{j-1} x_{j+1} … x_N
5. Repeat 4 until convergence
Note: Guaranteed to converge

Iterative Refinement
For each sequence y:
1. Remove y
2. Realign y (while rest fixed)
[Figure: x, y, z alignment; the x,z projection is fixed while y is allowed to vary]

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   + 3 matches
  z: GAACTGA
  w: GTACTGA
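The "remove one sequence and realign it while the rest stays fixed" step can be sketched as a dynamic program between the ungapped sequence and the columns of the fixed alignment. The scores (match +1, mismatch -1, gap -2, summed over the fixed rows) and the function names are assumptions made for illustration; only the inner realignment step is shown, not the full refinement loop.

```python
MATCH, MISMATCH, GAP = 1, -1, -2   # assumed scoring scheme

def residue_vs_column(res, column):
    """Score of placing residue res against one column of the fixed rows."""
    return sum(GAP if c == "-" else (MATCH if c == res else MISMATCH) for c in column)

def gap_vs_column(column):
    """Score of placing a gap against one column of the fixed rows."""
    return sum(GAP for c in column if c != "-")

def realign(seq, fixed_rows):
    """Optimally realign ungapped seq to the fixed rows.
    Returns (score, new_row_for_seq, new_fixed_rows)."""
    ncols, n = len(fixed_rows[0]), len(seq)
    columns = ["".join(r[c] for r in fixed_rows) for c in range(ncols)]
    insert_cost = GAP * len(fixed_rows)    # residue in a brand-new column
    F = [[None] * (ncols + 1) for _ in range(n + 1)]
    back = [[None] * (ncols + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for c in range(ncols + 1):
            if i == 0 and c == 0:
                continue
            options = []
            if i > 0 and c > 0:
                options.append((F[i - 1][c - 1] + residue_vs_column(seq[i - 1], columns[c - 1]), "match"))
            if c > 0:
                options.append((F[i][c - 1] + gap_vs_column(columns[c - 1]), "gap"))
            if i > 0:
                options.append((F[i - 1][c] + insert_cost, "insert"))
            F[i][c], back[i][c] = max(options)
    new_seq, new_cols = [], []
    i, c = n, ncols
    while i > 0 or c > 0:                  # traceback from the last cell
        move = back[i][c]
        if move == "match":
            new_seq.append(seq[i - 1]); new_cols.append(columns[c - 1]); i -= 1; c -= 1
        elif move == "gap":
            new_seq.append("-"); new_cols.append(columns[c - 1]); c -= 1
        else:
            new_seq.append(seq[i - 1]); new_cols.append("-" * len(fixed_rows)); i -= 1
    new_seq.reverse(); new_cols.reverse()
    rows = ["".join(col[r] for col in new_cols) for r in range(len(fixed_rows))]
    return F[n][ncols], "".join(new_seq), rows

fixed = ["GAAGTTA", "GAACTGA", "GTACTGA"]          # x, z, w held fixed
print(realign("GAC-TTA".replace("-", ""), fixed))  # y with its gap removed
```

Under these assumed scores the original placement GAC-TTA scores 0 against the fixed rows, while the realigned row scores 6.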

Iterative Refinement
Example not handled well:
  x:   GAAGTTA
  y_1: GAC-TTA
  y_2: GAC-TTA
  y_3: GAC-TTA
  z:   GAACTGA
  w:   GTACTGA
Realigning any single y_i changes nothing

Restricted MDP
Here is another way to improve a multiple alignment:
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$

Restricted MDP
Run MDP, restricted to radius R from m
[Figure: region of radius R around the alignment m, with axes x, y, z]
Running Time: $O(2^N R^{N-1} L)$
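For intuition, the N = 2 case of a radius-restricted DP can be sketched as follows. This is a two-sequence analogue, not the multidimensional method itself: the initial alignment m is given as two gapped rows, only cells within `radius` of m (measured along the second sequence) are filled, and the scores (match +1, mismatch -1, gap -2) are assumptions for illustration. Because m itself always lies inside the allowed region, the result is never worse than the score of m.

```python
MATCH, MISMATCH, GAP = 1, -1, -2   # assumed scoring scheme

def path_cells(row_a, row_b):
    """DP cells (i, j) visited by an existing pairwise alignment."""
    i = j = 0
    cells = [(0, 0)]
    for a, b in zip(row_a, row_b):
        i += a != "-"
        j += b != "-"
        cells.append((i, j))
    return cells

def restricted_nw(x, y, cells, radius):
    """Global alignment score using only cells within radius of the given path."""
    allowed = {(i, jj)
               for i, j in cells
               for jj in range(max(0, j - radius), min(len(y), j + radius) + 1)}
    F = {}
    for i, j in sorted(allowed):           # row-major order: predecessors come first
        if (i, j) == (0, 0):
            F[(i, j)] = 0
            continue
        cands = []
        if (i - 1, j - 1) in F:
            cands.append(F[(i - 1, j - 1)] + (MATCH if x[i - 1] == y[j - 1] else MISMATCH))
        if (i - 1, j) in F:
            cands.append(F[(i - 1, j)] + GAP)
        if (i, j - 1) in F:
            cands.append(F[(i, j - 1)] + GAP)
        if cands:                          # cells unreachable inside the band are skipped
            F[(i, j)] = max(cands)
    return F[(len(x), len(y))]

x, y = "GAACTGA", "GACTTA"
m = path_cells("GAACTGA", "GAC-TTA")       # initial alignment m
print(restricted_nw(x, y, m, 0), restricted_nw(x, y, m, 1))
```

With radius 0 only m itself is scored (0 under these assumed values); widening the radius to 1 already admits the better placement of the gap and raises the score to 2.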

Restricted MDP
  x:   GAAGTTA
  y_1: GAC-TTA
  y_2: GAC-TTA
  y_3: GAC-TTA
  z:   GAACTGA
  w:   GTACTGA
Within radius 1 of the optimal ⇒ Restricted MDP will fix it.

Optional refinement steps in MLAGAN
- Limited-area iterative refinement
- Radius-r 3-sequence refinement on each node of the tree
