1
Rapid Global Alignments: How to align genomic sequences in (more or less) linear time
2
Methods to CHAIN Local Alignments: Sparse Dynamic Programming, O(N log N)
3
The Problem: Find a Chain of Local Alignments
(x,y) → (x’,y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
4
Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of chain ending in j
L: list of triplets (l_j, V(j), j)
   L is sorted by l_j
   L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]
5
Sparse DP for rectangle chaining
Main idea: sweep through x-coordinates
To the right of b, anything chainable to a is chainable to b (when l_b ≤ l_a)
Therefore, if V(b) > V(a), rectangle a is “useless”: remove it
In L, keep rectangles j sorted with increasing l_j-coordinates; removing useless rectangles keeps them sorted with increasing V(j) as well
[Figure: rectangles a and b with scores V(a) and V(b)]
6
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from left to right:
1. When on the leftmost end of rectangle i, compute V(i)
   a. j: rectangle in L with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i, possibly store V(i) in L:
   a. j: rectangle in L with largest l_j ≤ l_i
   b. If V(i) > V(j):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_k, V(k), k) with V(k) ≤ V(i) and l_k ≥ l_i
[Figure: rectangles i and j]
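The sweep on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the slide author's code: the tuple layout `(x_left, x_right, h, l, w)`, the function name `chain_rectangles`, and the tie rule (left ends are processed before right ends at equal x) are assumptions. L is stored as two parallel sorted Python lists, so insertions cost O(N) here; a balanced binary search tree would be needed to reach the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, w) with h <= l.
    Rectangle j can precede i when x_right_j < x_left_i and l_j < h_i.
    Returns (best_score, chain) with chain as a list of indices."""
    if not rects:
        return 0, []
    events = []
    for i, (xl, xr, h, l, w) in enumerate(rects):
        events.append((xl, 0, i))  # kind 0: left end (sorted before right ends at equal x)
        events.append((xr, 1, i))  # kind 1: right end
    events.sort()

    V = [0] * len(rects)         # V[i]: best score of a chain ending in i
    prev = [None] * len(rects)   # predecessor of i in that chain
    ls, ids = [], []             # L: l-values (sorted) and matching rectangle ids

    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)            # ls[:p] holds all l_j < h_i
            if p > 0:
                V[i], prev[i] = w + V[ids[p - 1]], ids[p - 1]
            else:
                V[i] = w
        else:
            p = bisect_right(ls, l)           # ls[:p] holds all l_j <= l_i
            if p == 0 or V[i] > V[ids[p - 1]]:
                q = bisect_left(ls, l)        # first entry with l_k >= l_i
                end = q
                while end < len(ls) and V[ids[end]] <= V[i]:
                    end += 1                  # dominated entries are consecutive
                ls[q:end] = [l]               # remove them and insert i
                ids[q:end] = [i]

    best = max(range(len(rects)), key=V.__getitem__)
    chain, i = [], best
    while i is not None:
        chain.append(i)
        i = prev[i]
    return V[best], chain[::-1]

# Hypothetical input: five rectangles as (x_left, x_right, h, l, w)
example = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3),
           (6, 8, 4, 7, 4), (9, 10, 6, 9, 2)]
print(chain_rectangles(example))  # (13, [0, 1, 4])
```

Because V increases along L, the dominated entries always form one contiguous run starting at the insertion point, which is why a single forward scan suffices for the REMOVE step.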
7
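Example
[Figure: five rectangles in the x-y plane, labeled rectangle: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; axis tick values 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]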
8
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   Searching L takes log N
   Inserting to L takes log N
   All deletions are consecutive, so log N per deletion
   Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree
9
Putting it All Together: Fast Global Alignment Algorithms
1. FIND local alignments
2. CHAIN local alignments

Program    FIND          CHAIN
GLASS      k-mers        hierarchical DP
MUMmer     suffix tree   sparse DP
AVID       suffix tree   hierarchical DP
LAGAN      CHAOS         sparse DP
10
LAGAN: Pairwise Alignment
1. FIND local alignments
2. CHAIN local alignments
3. DP restricted around chain
11
LAGAN
1. Find local alignments
2. Chain: O(N log N) longest increasing subsequence (L.I.S.)
3. Restricted DP
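An O(N log N) longest-increasing-subsequence routine can be sketched as follows. It is an unweighted illustration only: the function name and the idea of feeding it the y-coordinates of anchors already sorted by x are assumptions, and the scoring LAGAN actually applies to its anchors is not reproduced here.

```python
from bisect import bisect_left

def longest_increasing_subsequence(ys):
    """Return one longest strictly increasing subsequence of ys."""
    tails = []                 # tails[k]: smallest last value of a length-(k+1) subsequence
    tail_idx = []              # index in ys of tails[k]
    prev = [None] * len(ys)    # back-pointers for the traceback
    for i, y in enumerate(ys):
        k = bisect_left(tails, y)   # O(log N) search per element
        if k == len(tails):
            tails.append(y)
            tail_idx.append(i)
        else:
            tails[k] = y
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(ys[i])
        i = prev[i]
    return out[::-1]

# Hypothetical anchors (x, y), sorted by x; chain those whose y also increases
anchors = [(1, 3), (2, 1), (3, 2), (4, 5), (5, 4), (6, 6)]
print(longest_increasing_subsequence([y for _, y in anchors]))  # [1, 2, 4, 6]
```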
12
LAGAN: recursive call What if a box is too large? Recursive application of LAGAN, more sensitive word search
13
A trick to save on memory “necks” have tiny tracebacks …only store tracebacks
14
Multiple Sequence Alignments
15
Sequence Comparison
Introduction
Comparison
   Homology -- Analogy
   Identity -- Similarity
   Pairwise -- Multiple
   Scoring matrices
   Gap -- indel
   Global -- Local
Manual alignment, dot plot
   visual inspection
Dynamic programming
   Needleman-Wunsch: exhaustive global alignment
   Smith-Waterman: exhaustive local alignment
Multiple alignment
Database search
   BLAST
   FASTA
16
Sequence Comparison: Multiple alignment (Multiple sequence alignment: MSA)
Application: Procedure
Extrapolation: Allocation of an uncharacterized sequence to a protein family.
Phylogenetic analysis: Reconstruction of the history of closely related proteins and protein families.
Pattern identification: Identification of regions characteristic of a function by conserved positions.
Domain identification: Turning an MSA into a domain- or protein-family-specific profile may be useful in identifying new or remote family members.
DNA regulatory elements: Turning DNA MSAs of a binding site into a weight matrix may be used in scanning other DNA sequences for potential similar binding sites.
Structure prediction: Good MSAs yield high-quality prediction of secondary structure and help in building 3D models.
PCR analysis: Identifying less degenerate regions of a protein family is useful in fishing out new members by PCR (primer design).
18
Overview
Definition
Scoring Schemes
Algorithms
19
Definition
Given N sequences $x_1, x_2, \ldots, x_N$:
Insert gaps (-) in each sequence $x_i$, such that
   All sequences have the same length L
   Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can help improve the pairwise alignments
20
Scoring Function
Ideally: Find the alignment that maximizes the probability that the sequences evolved from a common ancestor, according to some phylogenetic model
More on phylogenetic models later
[Figure: phylogenetic tree with leaves x, y, z, w, v and an unknown ancestor “?”]
21
Scoring Function
A comprehensive model would have too many parameters and be too inefficient to optimize
Possible simplifications:
   Ignore phylogenetic tree
   Statistically independent columns: $S(m) = G(m) + \sum_i S(m_i)$
m: alignment matrix ($m_i$: column i)
G: function penalizing gaps
22
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment (columns in which both sequences have a gap are dropped)
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C;  x: AC-GCGG-C;  y: AC-GCGAG
y: ACGC-GAG;  z: GCCGC-GAG;  z: GCCGCGAG
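Projecting a multiple alignment onto two of its rows can be sketched as below. The function name `induced_pairwise` and the representation of the alignment as a list of equal-length strings are assumptions made for illustration.

```python
def induced_pairwise(msa, k, l):
    """Project rows k and l of a multiple alignment, dropping columns
    in which both rows have a gap."""
    kept = [(a, b) for a, b in zip(msa[k], msa[l])
            if not (a == '-' and b == '-')]
    return (''.join(a for a, _ in kept), ''.join(b for _, b in kept))

msa = ["AC-GCGG-C",   # x
       "AC-GC-GAG",   # y
       "GCCGC-GAG"]   # z
print(induced_pairwise(msa, 0, 1))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(msa, 0, 2))  # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(msa, 1, 2))  # ('AC-GCGAG', 'GCCGCGAG')
```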
23
Sum Of Pairs (cont’d)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
$S(m) = \sum_{k<l} s(m^k, m^l)$
$s(m^k, m^l)$: score of induced alignment (k,l)
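A small sketch of the sum-of-pairs score follows. The pairwise scoring values (match +1, mismatch -1, gap -2, gap-gap columns ignored) are assumptions chosen only for illustration; the slides do not fix a scoring scheme.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by rows a and b.
    Columns where both rows have a gap are skipped."""
    score = 0
    for p, q in zip(a, b):
        if p == '-' and q == '-':
            continue
        if p == '-' or q == '-':
            score += gap
        elif p == q:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(msa, **params):
    """Sum of pair_score over all pairs of rows k < l."""
    return sum(pair_score(msa[k], msa[l], **params)
               for k, l in combinations(range(len(msa)), 2))

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(rows))  # 0 + (-4) + 3 = -1
```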
24
Sum Of Pairs (cont’d)
Heuristic way to incorporate evolution tree:
[Figure: tree with leaves Human, Mouse, Chicken, Duck]
Weighted SOP: $S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
$w_{kl}$: weight decreasing with distance
25
Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find optimal consensus string $m^*$ to maximize $S(m) = \sum_i s(m^*, m^i)$
$s(m^k, m^l)$: score of pairwise alignment (k,l)
26
Multiple Sequence Alignments Algorithms
27
Multiple sequence alignment - Computational complexity
[Figure: dynamic-programming lattice for short example sequences over the residues V, S, N, A]
28
Multiple sequence alignment - Computational complexity
Alignment of protein sequences with 200 amino acids using dynamic programming

# of sequences    CPU time (approx.)
2                 1 sec
4                 $10^4$ sec ≈ 2.8 hours
5                 $10^6$ sec ≈ 11.6 days
6                 $10^8$ sec ≈ 3.2 years
7                 $10^{10}$ sec ≈ 317 years
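The right-hand figures in this table are plain unit conversions from seconds. A quick way to check them, assuming a 365.25-day year (the helper name `humanize` is made up for this sketch):

```python
def humanize(seconds):
    """Convert seconds to the largest fitting unit, rounded to one decimal."""
    units = [("years", 365.25 * 24 * 3600),
             ("days", 24 * 3600),
             ("hours", 3600)]
    for name, size in units:
        if seconds >= size:
            return round(seconds / size, 1), name
    return seconds, "sec"

for n_seqs, secs in [(2, 1), (4, 10**4), (5, 10**6), (6, 10**8), (7, 10**10)]:
    value, unit = humanize(secs)
    print(n_seqs, "sequences:", secs, "sec =", value, unit)
```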
29
Approximate methods for MSA
Multidimensional dynamic programming (MSA, Lipman 1988)
Progressive alignments (ClustalW, Higgins 1996; PileUp, Genetics Computer Group (GCG))
Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others)
Iterative methods (e.g. PRRP, Gotoh 1996)
Statistical methods (e.g. Bayesian Hidden Markov Models)
30
Multiple sequence alignment - Programs
[Figure: MSA programs grouped by method (multidimensional dynamic programming, progressive, iterative, HMMs, GAs) and by tree based vs. non tree based. Programs shown: OMA, Combalign, DCA, T-Coffee, Clustal, Dalign, MSA, Interalign, Prrp, Sam, HMMER, GA, SAGA]
31
Multiple sequence alignment - Computational complexity

Program     Seq type    Alignment      Method                   Comment
ClustalW    Prot/DNA    Global         Progressive              No format limitation; runs on Windows too
PileUp      Prot/DNA    Global         Progressive              Limited by the format; UNIX based
MultAlin    Prot/DNA    Global         Progressive/Iterative    Limited by the format
T-COFFEE    Prot/DNA    Global/local   Progressive              Can be slow
32
1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$
33
1. Multidimensional Dynamic Programming
Example: in 3D (three sequences): 7 neighbors/cell
\[
F(i,j,k) = \max \left\{ \begin{array}{l}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{array} \right.
\]
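The three-sequence recurrence can be written out directly. The sketch below assumes a sum-of-pairs column score with match +1, mismatch -1, gap -2 and gap-gap 0; these values and the names `col_score` and `align3` are illustrative choices, not part of the slides. It fills all (|x|+1)(|y|+1)(|z|+1) cells, so it is only practical for very short sequences.

```python
from itertools import product

def col_score(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column of three symbols."""
    s = 0
    for a in range(3):
        for b in range(a + 1, 3):
            p, q = col[a], col[b]
            if p == '-' and q == '-':
                continue
            s += gap if '-' in (p, q) else (match if p == q else mismatch)
    return s

def align3(x, y, z):
    """Exact alignment of three sequences; returns (score, [row_x, row_y, row_z])."""
    seqs = (x, y, z)
    n = tuple(len(s) for s in seqs)
    # The 7 neighbors: every non-zero 0/1 step vector in three dimensions
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    back = {}
    # Lexicographic order guarantees all predecessors are filled first
    for cell in product(range(n[0] + 1), range(n[1] + 1), range(n[2] + 1)):
        if cell == (0, 0, 0):
            continue
        best, arg = float('-inf'), None
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            col = tuple(seqs[t][cell[t] - 1] if m[t] else '-' for t in range(3))
            cand = F[prev] + col_score(col)
            if cand > best:
                best, arg = cand, (prev, col)
        F[cell], back[cell] = best, arg
    cols, cell = [], n
    while cell != (0, 0, 0):          # traceback from the far corner
        cell, col = back[cell]
        cols.append(col)
    cols.reverse()
    return F[n], [''.join(c[t] for c in cols) for t in range(3)]

score, rows3 = align3("ACGT", "AGT", "ACT")
print(score)
print("\n".join(rows3))
```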
34
1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
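The two factors in this bound can be checked by enumeration. In the sketch below, `neighbor_offsets` and `dp_work` are hypothetical helper names, and the cell count follows the slide's $L^N$ approximation rather than the exact $(L+1)^N$.

```python
from itertools import product

def neighbor_offsets(n):
    """All predecessor step vectors of a cell in an n-dimensional DP lattice:
    every 0/1 vector except the all-zero one, so 2**n - 1 of them."""
    return [m for m in product((0, 1), repeat=n) if any(m)]

def dp_work(L, n):
    """Approximate number of cell updates: L**n cells times 2**n - 1 neighbors."""
    return L ** n * (2 ** n - 1)

print(len(neighbor_offsets(3)))   # 7
print(dp_work(200, 3))            # 56000000
```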
35
2. Progressive Alignment
Multiple Alignment is NP-complete
Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences $x_i$, $x_j$
2. Fix that alignment
3. Align a third sequence $x_k$ to the alignment $x_i, x_j$
4. Repeat until all sequences are aligned
Running Time: $O(N L^2)$
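A minimal sketch of these four steps is given below. It makes several simplifying assumptions that are not in the slides: sequences are added in input order (no guide tree), each new sequence is aligned to the frozen alignment with a Needleman-Wunsch-style DP over columns, and a column is scored by summing match +1, mismatch -1, gap -2 against every row. The function names are hypothetical.

```python
def align_to_alignment(rows, seq, match=1, mismatch=-1, gap=-2):
    """Align seq to an existing alignment (equal-length rows), keeping rows fixed."""
    m, n, k = len(rows[0]), len(seq), len(rows)
    cols = list(zip(*rows))

    def sub(col, c):      # seq symbol c placed against an existing column
        return sum(0 if r == '-' else (match if r == c else mismatch) for r in col)

    def skip(col):        # gap in seq opposite an existing column
        return gap * sum(r != '-' for r in col)

    ins = gap * k         # seq symbol in a new column, gaps in all old rows

    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + skip(cols[i - 1])
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(cols[i - 1], seq[j - 1]),
                          F[i - 1][j] + skip(cols[i - 1]),
                          F[i][j - 1] + ins)

    old = [[] for _ in range(k)]   # rebuilt rows, collected right to left
    new = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(cols[i - 1], seq[j - 1]):
            col, c, i, j = cols[i - 1], seq[j - 1], i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + skip(cols[i - 1]):
            col, c, i = cols[i - 1], '-', i - 1
        else:
            col, c, j = ('-',) * k, seq[j - 1], j - 1
        for r in range(k):
            old[r].append(col[r])
        new.append(c)
    return [''.join(reversed(r)) for r in old] + [''.join(reversed(new))]

def progressive_align(seqs):
    """Add sequences one at a time; earlier alignments are never revisited."""
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = align_to_alignment(rows, s)
    return rows

print("\n".join(progressive_align(["GAAGTT", "GACTT", "GAACTG", "GTACTG"])))
```

Because earlier rows are only ever padded with gaps and never realigned, this sketch also exhibits the "frozen" behavior that refinement methods try to repair.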
36
2. Progressive Alignment
When evolutionary tree is known: align closest first, in the order of the tree
Example: order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
[Figure: tree with leaves x, y, z, w]
37
CLUSTALW: progressive alignment
CLUSTALW: most popular multiple protein alignment program
Algorithm:
1. Find all $d_{ij}$: alignment dist($x_i$, $x_j$)
2. Construct a tree (Neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
38
CLUSTALW & the CINEMA viewer
39
MLAGAN: progressive alignment of DNA
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
[Figure: tree with leaves Human, Baboon, Mouse, Rat]
40
MLAGAN: main steps
Given a collection of sequences, and a phylogenetic tree
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment
   Multi-anchoring based on reconciling the pairwise anchors
   LAGAN-style limited-area DP
4. Optional refinement steps
41
MLAGAN: multi-anchoring
To anchor the (X/Y) and (Z) alignments:
[Figure: pairwise anchor panels X vs. Z and Y vs. Z, combined into X/Y vs. Z]
42
Heuristics to improve multiple alignments
Iterative refinement schemes
A*-based search
Consistency
Simulated Annealing
…
43
Iterative Refinement
One problem of progressive alignment: initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (x, y alignment: frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT
44
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar $x_i$, $x_j$
2. Align $x_k$ most similar to ($x_i x_j$)
3. Repeat 2 until ($x_1 \ldots x_N$) are aligned
4. For j = 1 to N, remove $x_j$ and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
5. Repeat 4 until convergence
Note: Guaranteed to converge
45
Iterative Refinement
For each sequence y:
1. Remove y
2. Realign y (while rest fixed)
[Figure: axes x, y, z; the x,z projection is fixed, y is allowed to vary]
46
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+3 matches)
z: GAACTGA
w: GTACTGA
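The "+3 matches" figure can be reproduced by counting identical, non-gap positions between y and each of the other rows before and after the realignment. The helper names below are made up for this check.

```python
def matches(a, b):
    """Number of columns where two aligned rows carry the same non-gap symbol."""
    return sum(p == q and p != '-' for p, q in zip(a, b))

def matches_against(row, others):
    """Total matches between one row and each of the other rows."""
    return sum(matches(row, o) for o in others)

others = ["GAAGTTA",   # x
          "GAACTGA",   # z
          "GTACTGA"]   # w
before = matches_against("GAC-TTA", others)   # 5 + 4 + 3 = 12
after = matches_against("G-ACTTA", others)    # 5 + 5 + 5 = 15
print(before, after, after - before)          # 12 15 3
```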
47
Iterative Refinement
Example not handled well:
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Realigning any single $y_i$ changes nothing
48
Restricted MDP
Here is another way to improve a multiple alignment:
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$
49
Restricted MDP
Run MDP, restricted to radius R from m
[Figure: band of radius R around the alignment m in the x, y, z lattice]
Running Time: $O(2^N R^{N-1} L)$
50
Restricted MDP
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Within radius 1 of the optimal alignment: Restricted MDP will fix it.
51
Optional refinement steps in MLAGAN
Limited-area iterative refinement
Radius-r 3-sequence refinement on each node of the tree