Multiple Sequence Alignment


CS 5263 & CS 4233 Bioinformatics: Multiple Sequence Alignment

Multiple Sequence Alignment
Motivation: a faint similarity between two sequences becomes very significant if it is present in many sequences.
Definition: given N sequences x1, x2, …, xN, insert gaps (-) in each sequence xi such that
  all sequences have the same length L, and
  the score of the alignment is maximum.
Two issues:
  How to score an alignment?
  How to find a (nearly) optimal alignment?

Scoring Function: Sum Of Pairs
Definition: an induced pairwise alignment is the pairwise alignment obtained from the multiple alignment by keeping two of its rows and dropping every column in which both rows have a gap.
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
  y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Sum Of Pairs (cont'd)
The sum-of-pairs (SP) score of an alignment m is the sum of the scores of all induced pairwise alignments:
  $S(m) = \sum_{k<l} s(m_k, m_l)$
where $s(m_k, m_l)$ is the score of the induced alignment of rows k and l.

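Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Scoring matrix over {A, C, G, T, -}: match = 1, mismatch or gap = -1. The column totals count a gap paired with a gap as 0.
Column scores:
  column 1: (A,A) + (A,G) x 2 = -1
  column 4: (G,G) x 3 = 3
  column 8: (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5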
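The column-by-column arithmetic of this example can be checked with a short script. This is a minimal sketch, not code from the lecture: the function names `pair_score` and `sp_score` are made up for illustration, and the scoring (match +1, mismatch or residue/gap -1, gap/gap 0) is the one the slide's totals imply.

```python
from itertools import combinations

def pair_score(a: str, b: str) -> int:
    """Score one aligned pair of symbols (assumed scheme: +1 match, -1 otherwise, 0 for gap/gap)."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def sp_score(rows: list[str]) -> int:
    """Sum-of-pairs score: add pair_score over every column and every pair of rows."""
    assert len({len(r) for r in rows}) == 1, "all rows must have the same length"
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )

alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sp_score(alignment))  # 5
```

Summing pair scores column by column gives the same total as scoring the three induced pairwise alignments separately, because a gap/gap column contributes 0 either way.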

Multiple Sequence Alignment Algorithms
Can also be global or local; we only talk about global for now.
A simple method:
  Do pairwise alignment between all pairs.
  Combine the pairwise alignments into a single multiple alignment.
Is this going to work?

Compatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG
Pairwise alignments:
  AAAATTTT----     AAAATTTT----     ----TTTTGGGG
  ----TTTTGGGG     AAAA----GGGG     AAAA----GGGG
Combined multiple alignment:
  AAAATTTT----
  ----TTTTGGGG
  AAAA----GGGG

Incompatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA
Pairwise alignments:
  AAAATTTT----     ----AAAATTTT     TTTTGGGG----
  ----TTTTGGGG     GGGGAAAA----     ----GGGGAAAA
Combined multiple alignment: ?

Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
  Find the longest path in a high-dimensional cube, as opposed to a two-dimensional grid.
  Uses an N-dimensional matrix, as opposed to a two-dimensional array.
  Entry F(i1, …, iN) is the score of the optimal alignment of s1[1..i1], …, sN[1..iN].
  F(i1, i2, …, iN) = max over all neighbors nbr of the cell of (F(nbr) + S(current)), where S(current) is the score of the column added by the move from nbr.

Multidimensional Dynamic Programming (MDP)
Example in 3D (three sequences): $2^3 - 1 = 7$ neighbors per cell:
  (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)
$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
F(i-1,j-1,k) + S(x_i, y_j, -) \\
F(i-1,j,k-1) + S(x_i, -, z_k) \\
F(i,j-1,k-1) + S(-, y_j, z_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k) + S(-, y_j, -) \\
F(i,j,k-1) + S(-, -, z_k)
\end{cases}
$$
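The three-sequence recurrence can be written out directly. The sketch below is illustrative only: `align3` and `column_score` are hypothetical names, and the column score S is assumed to be the sum-of-pairs score with match +1, mismatch or gap -1, and gap/gap 0. It fills the full cube, so it is only usable for very short sequences.

```python
from itertools import combinations, product

def column_score(col):
    """Assumed S(.,.,.): sum-of-pairs over one column (+1 match, -1 otherwise, 0 gap/gap)."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        total += 1 if a == b else -1
    return total

def align3(x, y, z):
    """Optimal SP score and alignment of three sequences by 3D dynamic programming."""
    seqs = (x, y, z)
    # Each move consumes a residue (1) or places a gap (0) per sequence: 2^3 - 1 = 7 moves.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    back = {}
    # Lexicographic order guarantees every predecessor cell is filled first.
    for cell in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if cell == (0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            col = tuple(s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move))
            cand = F[prev] + column_score(col)
            if best is None or cand > best:
                best, back[cell] = cand, (prev, col)
        F[cell] = best
    # Trace back from the far corner, collecting columns in reverse.
    end = (len(x), len(y), len(z))
    cell, cols = end, []
    while cell != (0, 0, 0):
        cell, col = back[cell]
        cols.append(col)
    rows = ["".join(r) for r in zip(*reversed(cols))]
    return F[end], rows

score, rows = align3("ACGCGGC", "ACGCGAG", "GCCGCGAG")
print(score)
print("\n".join(rows))
```

Because the search is exhaustive, the score it reports for these three sequences can never be lower than the score of any hand-built alignment of them under the same scoring.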

Multidimensional Dynamic Programming (MDP)
Running time:
  Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
  Neighbors per cell: $2^N - 1$
  Therefore: $O(2^N L^N)$
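To get a feel for how quickly this grows, the two factors can be multiplied out for a few values of N. This is a rough sketch: `mdp_cost` is a made-up helper, and the length of 250 is an arbitrary illustrative choice, not a figure tied to this slide.

```python
def mdp_cost(num_seqs: int, length: int) -> tuple[int, int, int]:
    """Return (matrix cells, neighbors per cell, total cell updates) for full MDP."""
    cells = length ** num_seqs       # L^N entries in the N-dimensional matrix
    neighbors = 2 ** num_seqs - 1    # 2^N - 1 predecessor cells per entry
    return cells, neighbors, cells * neighbors

for n in (2, 3, 6):
    cells, neighbors, work = mdp_cost(n, 250)
    print(f"N={n}: {cells:.2e} cells x {neighbors} neighbors = {work:.2e} updates")
```

Going from two to six sequences multiplies the work by many orders of magnitude, which is why the unrestricted method is only feasible for a handful of short sequences.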

Faster MDP
Carrillo & Lipman, 1988
  Branch and bound
  Other heuristics
  Implemented in a tool called MSA
  Practical for about 6 sequences of length about 200-300

Faster MDP
Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed.
Upper bound: the sum of the optimal pairwise alignment scores, i.e.
  $S(m) = \sum_{k<l} s(m_k, m_l) \le \sum_{k<l} s(k, l)$
  where m is the optimal multiple alignment, $s(m_k, m_l)$ is the score of the alignment between k and l induced by m, and $s(k, l)$ is the score of the optimal alignment between k and l.
Lower bound: the score computed by any approximate algorithm (such as the ones we'll talk about next).
For any partial path, if S_current + S_prospective < lower bound, where S_prospective bounds from above what the rest of the path can still score, that path can be given up.
Guarantees optimality.
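The upper bound is cheap to compute because it only needs pairwise alignments. The sketch below illustrates the inequality, not the branch-and-bound search itself and not the MSA tool's code; `nw_score`, `sp_score` and `upper_bound` are hypothetical names, and the scoring (match +1, mismatch -1, linear gap cost -1, gap/gap 0) is an assumption.

```python
from itertools import combinations

def pair_score(a, b):
    """Assumed scheme: +1 match, -1 mismatch or residue/gap, 0 gap/gap."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def nw_score(s, t):
    """Optimal global pairwise score (Needleman-Wunsch), keeping only one row at a time."""
    prev = [-j for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [-i]
        for j in range(1, len(t) + 1):
            cur.append(max(
                prev[j - 1] + pair_score(s[i - 1], t[j - 1]),  # align s[i] with t[j]
                prev[j] - 1,                                   # gap in t
                cur[j - 1] - 1,                                # gap in s
            ))
        prev = cur
    return prev[-1]

def sp_score(rows):
    """Sum-of-pairs score of a multiple alignment."""
    return sum(pair_score(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def upper_bound(seqs):
    """Sum of optimal pairwise scores: no multiple alignment can score higher."""
    return sum(nw_score(s, t) for s, t in combinations(seqs, 2))

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
seqs = [r.replace("-", "") for r in rows]
print(sp_score(rows), "<=", upper_bound(seqs))
```

The SP score of any multiple alignment of the same sequences can serve as the lower bound, so a heuristic alignment plus this pairwise sum brackets the optimal score from both sides.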

Progressive Alignment
Multiple alignment is NP-hard.
Most used heuristic: progressive alignment.
Algorithm:
  Align two of the sequences xi, xj.
  Fix that alignment.
  Align a third sequence xk to the alignment (xi, xj).
  Repeat until all sequences are aligned.
Running time: $O(NL^2)$, since each alignment takes $O(L^2)$ and is repeated N times.

Progressive Alignment
When the evolutionary tree is known: align closest first, in the order of the tree.
[Figure: evolutionary tree with leaves x, y, z, w]
Example, order of alignments:
  1. (x,y)
  2. (z,w)
  3. (xy, zw)

Progressive Alignment: CLUSTALW
CLUSTALW: the most popular multiple protein alignment tool.
Algorithm:
  Find all dij: the alignment distance between xi and xj (high alignment score => short distance).
  Construct a tree (similar to hierarchical clustering; to be discussed later).
  Align nodes in order of decreasing similarity.
  Plus a large number of heuristics.

CLUSTALW example
Sequences:
  S1 ALSK
  S2 TNSD
  S3 NASK
  S4 NTSD
[Figure: distance matrix over s1, s2, s3, s4; entries as extracted: 9 4 7 8 3]
[Figure: guide tree built from the distance matrix, with leaves s1, s3, s2, s4]
Alignment rows recovered from the final slide:
  -ALSK
  -TNSD
  NA-SK

Iterative Refinement
Problems with progressive alignment:
  It depends on pairwise alignments.
  If sequences are very distantly related, errors are much more likely.
  Initial alignments are "frozen" even when new evidence comes.
Example:
  x: GAAGTT
  y: GAC-TT   (frozen!)
  z: GAACTG
  w: GTACTG
Now clear: the correct y should be GA-CTT.

Iterative Refinement
Algorithm (Barton-Sternberg):
  1. Align the most similar xi, xj.
  2. Align the xk most similar to (xi xj).
  3. Repeat 2 until (x1…xN) are aligned.
  4. For j = 1 to N: remove xj and realign it to x1…xj-1 xj+1…xN.
  5. Repeat 4 until convergence.
Steps 1-3 are progressive alignment.

Iterative Refinement (cont'd)
For each sequence y:
  Remove y.
  Realign y (while the rest stay fixed).
[Figure: x and z fixed; y allowed to vary against the projection onto x, z]
Note: guaranteed to converge (why?)
Running time: $O(kNL^2)$, k: number of iterations

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  y: G-ACTTA   (+ 3 matches)
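A single remove-and-realign step on this example can be sketched as a two-dimensional DP of one sequence against the fixed rows. This is an illustration under assumptions, not the Barton-Sternberg implementation: `realign` and `column_gain` are hypothetical names, and each placed symbol is scored by sum-of-pairs against the fixed column (match +1, mismatch or gap -1, gap/gap 0).

```python
def column_gain(ch, column):
    """SP contribution of placing symbol ch against one fixed alignment column."""
    total = 0
    for other in column:
        if ch == "-" and other == "-":
            continue
        total += 1 if ch == other else -1
    return total

def realign(seq, fixed_rows):
    """Align seq to a fixed alignment; returns (new row for seq, fixed rows, score)."""
    cols = list(zip(*fixed_rows))
    n, m = len(seq), len(cols)
    gap_col = ("-",) * len(fixed_rows)   # column inserted when seq needs extra room
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:   # residue of seq against an existing column
                options.append((F[i - 1][j - 1] + column_gain(seq[i - 1], cols[j - 1]), "diag"))
            if j:         # gap in seq against an existing column
                options.append((F[i][j - 1] + column_gain("-", cols[j - 1]), "left"))
            if i:         # residue of seq against a new all-gap column
                options.append((F[i - 1][j] + column_gain(seq[i - 1], gap_col), "up"))
            F[i][j], back[i][j] = max(options, key=lambda o: o[0])
    i, j, new_row, new_cols = n, m, [], []
    while i or j:
        move = back[i][j]
        if move == "diag":
            new_row.append(seq[i - 1]); new_cols.append(cols[j - 1]); i -= 1; j -= 1
        elif move == "left":
            new_row.append("-"); new_cols.append(cols[j - 1]); j -= 1
        else:
            new_row.append(seq[i - 1]); new_cols.append(gap_col); i -= 1
    new_row.reverse(); new_cols.reverse()
    return "".join(new_row), ["".join(r) for r in zip(*new_cols)], F[n][m]

# Remove y (GAC-TTA without its gap) and realign it while x, z, w stay fixed.
row, fixed, score = realign("GACTTA", ["GAAGTTA", "GAACTGA", "GTACTGA"])
print(row, score)  # G-ACTTA 9
```

Under the same scoring the frozen row GAC-TTA contributes only 3, so the realignment step strictly improves the alignment here.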

Iterative Refinement
Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Realigning any single yi changes nothing.

Restricted MDP
Similar to bounded DP in pairwise alignment:
  Construct a progressive multiple alignment m.
  Run MDP, restricted to radius R from m.
[Figure: band of radius R around m in the x, y, z cube]
Running time: $O(2^N R^{N-1} L)$

Restricted MDP
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Within radius 1 of the optimal => restricted MDP will fix it.

Other approaches
Statistical learning methods
  Profile hidden Markov models
Consistency-based methods
  Still rely on pairwise alignment, but consider a third sequence when aligning two sequences.
  If block A in seq x aligns to block B in seq y, and both align to block C in seq z, we have higher confidence that the alignment between A and B is reliable.
  Essentially: change the scoring system according to consistency, then apply DP as in other approaches.
  Pioneered by a tool called T-Coffee.

Multiple alignment tools
Clustal W (Thompson, 1994): most popular
T-Coffee (Notredame, 2000): another popular tool; consistency-based; slower than Clustal W, but generally more accurate for more distantly related sequences
MUSCLE (Edgar, 2004): iterative refinement; more efficient than most others
DIALIGN (Morgenstern, 1998, 1999, 2005): "local"
Align-m (Walle, 2004)
PROBCONS (Do, 2004): probabilistic consistency-based; best accuracy on benchmarks
ProDA (Phuong, 2006): allows repeated and shuffled regions

In summary
Multiple alignment scoring functions
  Sum of pairs
  Other functions exist, but are less used
Multiple alignment algorithms
  MDP: optimal, but too slow; branch & bound doesn't solve the problem entirely
  Heuristic: progressive alignment (clustalW), iterative refinement, restricted MDP, consistency-based