Download presentation
Presentation is loading. Please wait.
1
Multiple Sequence Alignments Algorithms
2
MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Human Baboon Mouse Rat
3
MLAGAN: main steps Given a collection of sequences, and a phylogenetic tree 1.Find local alignments for every pair of sequences x, y 2.Find anchors between every pair of sequences, similar to LAGAN anchoring 3.Progressive alignment Multi-Anchoring based on reconciling the pairwise anchors LAGAN-style limited-area DP 4.Optional refinement steps
4
MLAGAN: multi-anchoring X Z Y Z X/Y Z To anchor the (X/Y), and (Z) alignments:
5
Whole-genome alignment Human/Mouse/Rat
6
Insertion/Deletion Rate Analysis
7
Heuristics to improve multiple alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …
8
Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT
9
Iterative Refinement Algorithm (Barton-Stenberg): 1.Align most similar x i, x j 2.Align x k most similar to (x i x j ) 3.Repeat 2 until (x 1 …x N ) are aligned 4.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 5.Repeat 4 until convergence Note: Guaranteed to converge
10
Iterative Refinement For each sequence y 1.Remove y 2.Realign y (while rest fixed) x y z x,z fixed projection allow y to vary
11
Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA
12
Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing
13
Restricted MDP Run MDP, restricted to radius R from m x y z Running Time: O(2 N R N-1 L)
14
Tree Refinement Run 3D-DP, restricted to radius R from m, for each tree node x y z Running Time: ~7R 2 LN R: Radius L: Alignment Length N: Number of Sequences
15
A* for Multiple Alignments Review of the A* algorithm v START GOAL g(v) h(v) g(v) is the cost so far h(v) is an estimate of the minimum cost from v to GOAL f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v 1. Expand v with the smallest f(v) 2. Never expand v, if f(v) ≥ shortest path to the goal found so far
16
A* for Multiple Alignments Nodes: Cells in the DP matrix g(v): alignment cost so far h(v): sum-of-pairs of individual pairwise alignments Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments v START GOAL g(v) h(v)
17
Consistency – T-Coffee z x y xixi yjyj y j’ zkzk
19
T – Coffee Layout LALIGNCLUSTALW
20
Generating Primary Library A AA B BB A C B BB C CC ClustalW Primary Library Lalign Primary Library (10 top scoring non–intersecting Local (Global pairwise alignment)(Pairwise alignment) Library has information for each N(N-1)/2 sequence pairs.
21
Primary Library Seq A GARFIELD THE LAST FAT CAT Seq B GARFIELD THE FAST CAT - - - Prim. Weight = 88 Seq A GARFIELD THE LAST FA-T CAT Seq C GARFIELD THE VERY FAST --- Prim. Weight = 77 Seq A GARFIELD THE LAST FAT CAT Seq D - - - -- - -- THE - - - - FAT CAT Prim. Weight = 100 Seq B GARFIELD THE - - - - FAST CAT Seq C GARFIELD THE VERY FAST CAT Prim. Weight = 100 Seq B GARFIELD THE FAST CAT Seq D - - - - -- - THE FA-T CAT Prim. Weight = 100 Seq C GARFIELD THE VERY FAST CAT Seq D - - - -- -- - THE - - - - FA-T CAT Prim. Weight = 100
22
Combining the libraries Seq A GARFIELD THE LAST FAT CAT Seq B GARFIELD THE FAST CAT - - - Primary weight(ClustalW)=88 Primary Weight(Lalign)=88 W(A(G),B(G)) = 88 + 88 = 176 If a pair is duplicated across the two libraries, it is merged into single entry with weight = sum of two weights pairs of residue that did not occur are not present ( weight 0 )
23
Library Extension Complete extension requires examination of all triplets. Not all bring information ( eg. A and B through D ). Weight of a pair = weights gathered through examination of all triplets involving that pair.
24
Running Time Complexity of entire procedure: O(N 2 * L 2 ) + O(N 3 *L) + O(N 3 ) + O(N*L 2 ) O(N 2 L 2 ) - pair-wise library computation O(N 3 L) - library extension O(N 3 ) - computation of NJ tree O(N L 2 ) - progressive alignment computation Where: L – average sequence length N – number of sequences
25
T-Coffee compared with other methods MethodCat1(81)Cat2(23)Cat3(4)Cat4(12)Cat5(11)Total(141) Dialign71.025.235.174.780.461.5 ClustalW78.532.242.565.774.366.4 Prrp78.632.550.251.182.766.4 T-Coffee80.737.352.983.288.772.1
26
Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov
27
Reading GENSCAN EasyGene SLAM Twinscan Optional: Chris Burge’s Thesis
28
Gene expression Protein RNA DNA transcription translation CCTGAGCCAACTATTGATGAA PEPTIDEPEPTIDE CCUGAGCCAACUAUUGAUGAA
29
Gene structure exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding
30
Finding genes Start codon ATG 5’ 3’ Exon 1 Exon 2 Exon 3 Intron 1Intron 2 Stop codon TAG/TGA/TAA Splice sites
33
Approaches to gene finding Homology BLAST, Procrustes. Ab initio Genscan, Genie, GeneID. Hybrids GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.
34
HMMs for single species gene finding: Generalized HMMs
35
HMMs for gene finding GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC exon intron intergene
36
GHMM for gene finding TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC Exon1Exon2Exon3 duration
37
Observed duration times
38
Better way to do it: negative binomial EasyGene: Prokaryotic gene-finder Larsen TS, Krogh A Negative binomial with n = 3
39
Biology of Splicing (http://genes.mit.edu/chris/)
40
Consensus splice sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html) Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996)
41
Splice site detection 5’ 3’ Donor site Position
42
Splice Site Models WMM: weight matrix model = PSSM (Staden 1984) WAM: weight array model = 1 st order Markov (Zhang & Marr 1993) MDD: maximal dependence decomposition (Burge & Karlin 1997) decision-tree like algorithm to take significant pairwise dependencies into account
43
atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.