1 Protein Multiple Alignment by Konstantin Davydov

2 Papers MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar. ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou.

3 Outline Introduction Background MUSCLE ProbCons Conclusion

4 Introduction What is multiple protein alignment? Given N sequences of amino acids x_1, x_2, …, x_N: insert gaps in each of the x_i so that – All sequences have the same length – The score of the global map is maximum. Example: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
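
To make "the score of the global map is maximum" concrete, the sketch below scores a candidate multiple alignment column by column with a sum-of-pairs objective. The score values and the sum-of-pairs choice are illustrative assumptions, not the specific objective either paper optimizes.

```python
# Minimal sum-of-pairs scoring of a multiple alignment (toy score values).
MATCH, MISMATCH, GAP = 1, -1, -2  # assumed, illustrative scores

def sp_score(rows):
    """Score an alignment (equal-length gapped strings) column by column,
    summing the score of every pair of rows in every column."""
    assert len({len(r) for r in rows}) == 1, "all rows must have the same length"
    total = 0
    for col in zip(*rows):
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue              # gap paired with gap: conventionally 0
                elif a == '-' or b == '-':
                    total += GAP
                elif a == b:
                    total += MATCH
                else:
                    total += MISMATCH
    return total

print(sp_score(["ACCTGCA--", "AC--TTCAA"]))  # the two-sequence example above
```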

5 Introduction Motivation Phylogenetic tree estimation Secondary structure prediction Identification of critical regions

6 Outline Introduction Background MUSCLE ProbCons Conclusion

7 Background Aligning two sequences: ACCTGCA and ACTTCAA, for example, align as ACCTGCA-- and AC--TTCAA.

8 Background (Figure: extending pairwise alignment to three sequences x, y, z; the example sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC appear on the slide.)

9 Background Unfortunately, this can get very expensive. Aligning N sequences of length L requires a matrix of size L^N, where each cell in the matrix has 2^N − 1 neighbors. This gives a total time complexity of O(2^N L^N).
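
For contrast, the N = 2 base case is the familiar O(L^2) dynamic program. The sketch below is a minimal Needleman-Wunsch-style global aligner with toy match/mismatch/gap scores (assumed values, linear gap penalty).

```python
# Minimal global pairwise alignment (Needleman-Wunsch style) with toy scores.
def global_align(x, y, match=1, mismatch=-1, gap=-2):
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # substitution
                          F[i - 1][j] + gap,       # gap in y
                          F[i][j - 1] + gap)       # gap in x
    # Traceback to recover one optimal alignment.
    ax, ay, i, j = "", "", n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch):
            ax, ay, i, j = x[i - 1] + ax, y[j - 1] + ay, i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax, ay, i = x[i - 1] + ax, "-" + ay, i - 1
        else:
            ax, ay, j = "-" + ax, y[j - 1] + ay, j - 1
    return ax, ay, F[n][m]

print(global_align("ACCTGCA", "ACTTCAA"))
```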

10 Outline Introduction Background MUSCLE ProbCons Conclusion

11 MUSCLE

12 MUSCLE Basic strategy: a progressive alignment is built, to which horizontal refinement is then applied. There are three stages; at the end of each stage a multiple alignment is available, and the algorithm can be terminated early.

13 Three Stages: Draft Progressive, Improved Progressive, Refinement

14 Stage 1: Draft Progressive Similarity measure – calculated by k-mer counting. Example: in ACCATGCGAATGGTCCACAATG, the k-mer ATG occurs 3 times and CCA occurs 2 times.
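
A minimal sketch of k-mer counting for the example above; the similarity function shown (shared k-mers divided by the shorter sequence's k-mer count) is a simplified stand-in for MUSCLE's actual k-mer distance.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(x, y, k=3):
    """Fraction of shared k-mers: a rough, alignment-free similarity."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(cx[m], cy[m]) for m in cx)
    return shared / (min(len(x), len(y)) - k + 1)

seq = "ACCATGCGAATGGTCCACAATG"
print(kmer_counts(seq)["ATG"], kmer_counts(seq)["CCA"])  # 3 and 2, as on the slide
print(kmer_similarity(seq, seq))                          # identical sequences give 1.0
```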

15 Stage 1: Draft Progressive Distance estimate – based on the similarities, construct a triangular distance matrix. (Figure: triangular distance matrix over X_1…X_4, e.g. an entry of 0.6.)

16 Stage 1: Draft Progressive Tree construction – from the distance matrix we construct a tree. (Figure: the distance matrix over X_1…X_4 and the resulting binary guide tree.)
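
The slide does not spell out the clustering method; the sketch below uses plain UPGMA (average linkage) to turn a distance matrix into a nested-tuple guide tree. The distance values are illustrative.

```python
import itertools

def upgma(names, D):
    """Minimal UPGMA clustering.  names: labels; D[(a, b)]: distance for every
    unordered pair of labels.  Returns the guide tree as nested tuples."""
    dist = {frozenset(p): D[p] for p in itertools.combinations(names, 2)}
    clusters = {n: 1 for n in names}          # cluster -> number of leaves
    while len(clusters) > 1:
        # pick the closest pair of clusters
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # average-linkage update of distances to the new cluster
        for c in clusters:
            dac = dist.pop(frozenset((a, c)))
            dbc = dist.pop(frozenset((b, c)))
            dist[frozenset((merged, c))] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        dist.pop(frozenset((a, b)))
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# toy distance matrix over X1..X4 (values are made up)
names = ["X1", "X2", "X3", "X4"]
D = {("X1", "X2"): 0.6, ("X1", "X3"): 0.8, ("X1", "X4"): 0.3,
     ("X2", "X3"): 0.5, ("X2", "X4"): 0.7, ("X3", "X4"): 0.9}
print(upgma(names, D))  # (('X1', 'X4'), ('X2', 'X3'))
```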

17 Stage 1: Draft Progressive

18 Stage 1: Draft Progressive Progressive alignment – a progressive alignment is built by following the branching order of the tree, yielding a multiple alignment of all input sequences at the root. (Figure: pairwise and profile alignments are merged up the guide tree into an alignment of X_1, X_2, X_3, X_4.)
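
A small sketch of how the guide tree dictates the order of work: a post-order traversal lists the profile-profile merges a progressive aligner would perform (the profile alignment itself is omitted here).

```python
def merge_order(tree):
    """Given a guide tree as nested tuples of sequence labels, list the
    profile-profile merges in the order a progressive aligner would do them."""
    merges = []

    def visit(node):
        if isinstance(node, str):          # leaf: a single sequence
            return [node]
        left, right = node
        l, r = visit(left), visit(right)
        merges.append((l, r))              # align the two child profiles
        return l + r

    visit(tree)
    return merges

tree = (("X1", "X4"), ("X2", "X3"))
for left, right in merge_order(tree):
    print("align profile", left, "with profile", right)
```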

19 Stage 2: Improved Progressive Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated. (Figure: the current guide tree and distance matrix over X_1…X_4.)

20 Stage 2: Improved Progressive Similarity measure – similarity is calculated for each pair of sequences using the fractional identity computed from their mutual alignment in the current multiple alignment. Example: the rows TCC--AA and TCA--AA induce a pairwise alignment from the MSA {TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC}.
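
A minimal sketch of fractional identity computed from the induced pairwise alignment of two MSA rows; how gap columns are handled here is an assumption.

```python
def fractional_identity(row_x, row_y):
    """Fraction of identical residues among columns where both rows have a
    residue, using the pairwise alignment induced by the current MSA."""
    pairs = [(a, b) for a, b in zip(row_x, row_y) if a != '-' and b != '-']
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

# two rows taken from the slide's example MSA
print(fractional_identity("TCC--AA", "TCA--AA"))  # 4 of 5 residue pairs identical
```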

21 Stage 2: Improved Progressive Tree construction – a tree is constructed by computing a Kimura distance matrix and applying a clustering method to it. (Figure: the distance matrix over X_1…X_4 and the resulting tree.)
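
A sketch of the Kimura correction that converts an observed fraction of differences D into an approximate evolutionary distance, using the commonly cited form d = -ln(1 - D - D^2/5); take the exact formula as an assumption rather than a quotation of the MUSCLE paper.

```python
import math

def kimura_distance(fractional_id):
    """Kimura-corrected protein distance from fractional identity
    (d = -ln(1 - D - D^2/5), with D the observed fraction of differences)."""
    D = 1.0 - fractional_id
    return -math.log(1.0 - D - D * D / 5.0)

print(round(kimura_distance(0.8), 3))  # distance for 80% identity
```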

22 Stage 2: Improved Progressive Tree comparison – The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed

23 Stage 2: Improved Progressive Progressive alignment – a new progressive alignment is built. (Figure: the new guide tree and the new alignment of X_1…X_4.)

24 Stage 3: Refinement Performs iterative refinement

25 Stage 3: Refinement Choice of bipartition – an edge is removed from the tree, dividing the sequences into two disjoint subsets. (Figure: deleting one edge splits {X_1…X_5} into two subtrees.)

26 Stage 3: Refinement Profile extraction – the multiple alignment of each subset is extracted from the current multiple alignment; columns made up of indels only are removed. Example: from the MSA {TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC}, the subset {TCC--AA, TCA--AA} becomes the profile {TCCAA, TCAAA} once its all-gap columns are dropped.
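
A sketch of profile extraction: take one side of the bipartition and drop the columns that are all gaps within that subset.

```python
def extract_profile(msa, subset_indices):
    """Take the rows of an MSA belonging to one side of the bipartition and
    remove columns that contain only gaps within that subset."""
    rows = [msa[i] for i in subset_indices]
    keep = [c for c in range(len(rows[0])) if any(r[c] != '-' for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]

msa = ["TCC--AA", "TCA--GA", "TCA--AA", "G--ATAC", "T--CTGC"]
print(extract_profile(msa, [0, 2]))  # ['TCCAA', 'TCAAA'], as on the slide
```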

27 Stage 3: Refinement Re-alignment – the two profiles are then realigned with each other using profile-profile alignment. Example: realigning {TCCAA, TCAAA} against {TCA--GA, T--CTGC, G--ATAC} gives {T--CCAA, T--CAAA, TCA--GA, T--CTGC, G--ATAC}.

28 Stage 3: Refinement Accept/Reject – the score of the new alignment is computed; if it is higher than the score of the old alignment, the new alignment is retained, otherwise it is discarded. (Example: the realigned MSA is compared against the previous MSA {TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC}.)
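
The accept/reject step in a few lines, assuming some alignment objective `score` (for instance a sum-of-pairs scorer like the earlier sketch).

```python
def refinement_accept(old_msa, new_msa, score):
    """Keep the realigned MSA only if it scores higher than the previous one.
    `score` is any alignment objective, e.g. a sum-of-pairs scorer."""
    return new_msa if score(new_msa) > score(old_msa) else old_msa
```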

29 MUSCLE Review Performance – for an alignment of N sequences of length L: – Space complexity: O(N^2 + L^2) – Time complexity: O(N^4 + NL^2) – Time complexity without refinement: O(N^3 + NL^2)

30 Outline Introduction Background MUSCLE ProbCons Conclusion

31 Hidden Markov Models (HMMs) (Figure: a pair HMM with a match state M and two insert states I and J; the alignment AGCC-AGC / -GCCCAGT corresponds to the state path IMMMJMMM.)
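
The figure's model is a pair HMM with a match state and two insert states. The sketch below defines a toy version with made-up transition and emission probabilities and scores the slide's example alignment along its state path; it is not ProbCons's trained model.

```python
# A toy 3-state pair HMM: M emits an aligned residue pair, I emits a residue
# of x against a gap, J emits a residue of y against a gap.  All probability
# values below are made-up placeholders.
transitions = {
    "M": {"M": 0.90, "I": 0.05, "J": 0.05},
    "I": {"M": 0.50, "I": 0.50, "J": 0.00},
    "J": {"M": 0.50, "I": 0.00, "J": 0.50},
}

def emission(state, a, b):
    """Toy emission probabilities: M rewards identical residues."""
    if state == "M":
        return 0.05 if a == b else 0.01
    return 0.04  # I emits (a, '-'); J emits ('-', b)

def path_probability(x, y, path):
    """Probability of one alignment, given as a state path over the ungapped
    sequences x and y (start/end probabilities ignored)."""
    i = j = 0
    prob, prev = 1.0, None
    for state in path:
        if prev is not None:
            prob *= transitions[prev][state]
        if state == "M":
            prob *= emission(state, x[i], y[j]); i += 1; j += 1
        elif state == "I":
            prob *= emission(state, x[i], "-"); i += 1
        else:
            prob *= emission(state, "-", y[j]); j += 1
        prev = state
    return prob

# The alignment AGCC-AGC / -GCCCAGT from the slide corresponds to path IMMMJMMM.
print(path_probability("AGCCAGC", "GCCCAGT", "IMMMJMMM"))
```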

32 Pairwise Alignment Viterbi algorithm – picks the single alignment that is most likely under the model – However, the most likely alignment is not necessarily the most accurate one – Alternative: find the alignment of maximum expected accuracy

33 Lazy Teacher Analogy 10 students take a 10-question true/false quiz. How do you make the answer key? Viterbi approach: use the answer sheet of the best student. MEA approach: take a weighted majority vote. (Figure: the students' answers to question 4, weighted by their grades.)
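
The analogy in code: a weighted majority vote over the students' answers to one question, with weights standing in for how trustworthy each student is (all values invented).

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Pick the answer with the largest total weight (the 'MEA approach');
    the 'Viterbi approach' would just copy the single highest-weight student."""
    totals = defaultdict(float)
    for student, answer in answers.items():
        totals[answer] += weights[student]
    return max(totals, key=totals.get)

# toy data: students' answers to question 4 and weights loosely based on grades
answers = {"s1": "F", "s2": "T", "s3": "F", "s4": "T", "s5": "F"}
weights = {"s1": 4.0, "s2": 3.7, "s3": 3.3, "s4": 3.0, "s5": 2.0}
print(weighted_vote(answers, weights))  # 'F' wins the weighted vote
```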

34 Viterbi vs MEA Viterbi – Picks the alignment with the highest chance of being completely correct Maximum Expected Accuracy – Picks the alignment with the highest expected number of correct predictions

35 ProbCons Basic strategy: uses a pair hidden Markov model (HMM) to assign probabilities to alignments, and uses maximum expected accuracy instead of the Viterbi alignment. Five steps.

36 Notation Given N sequences S = {s_1, s_2, …, s_N}; a* denotes the optimal alignment.

37 ProbCons Step 1: Computation of posterior-probability matrices Step 2: Computation of expected accuracies Step 3: Probabilistic consistency transformation Step 4: Computation of guide tree Step 5: Progressive alignment Post-processing step: Iterative refinement

38 Step 1: Computation of posterior-probability matrices For every pair of sequences x, y ∈ S, compute the matrix P_xy, where P_xy(i, j) = P(x_i ~ y_j ∈ a* | x, y) is the probability that x_i and y_j are paired in a*.

39 Step 2: Computation of expected accuracies For a pairwise alignment a between x and y, define the accuracy as: accuracy(a, a*) = (number of correctly predicted matches) / (length of the shorter sequence)

40 Step 2: Computation of expected accuracies (continued) The MEA alignment is found by finding the highest-summing path through the matrix M_xy[i, j] = P(x_i is aligned to y_j | x, y).
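
A sketch of that step: given a posterior matrix, a Needleman-Wunsch-like recurrence finds the highest-summing set of matches (gaps contribute 0). The posterior values below are placeholders, not output of a real pair HMM.

```python
def mea_align(posterior):
    """posterior[i][j] = P(x_i is aligned to y_j).  Returns the maximum
    achievable sum of posteriors over any consistent set of matches."""
    n, m = len(posterior), len(posterior[0])
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + posterior[i - 1][j - 1],  # match x_i ~ y_j
                          A[i - 1][j],                                 # leave x_i unaligned
                          A[i][j - 1])                                 # leave y_j unaligned
    return A[n][m]

# toy 3x3 posterior matrix (placeholder values)
P = [[0.9, 0.1, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.2, 0.8]]
print(mea_align(P))  # 2.4, taking the three diagonal matches
```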

41 Consistency (Figure: consistency through a third sequence z: if x_i aligns to z_k and z_k aligns to y_j', this is evidence for aligning x_i to y_j'.)

42 Step 3: Probabilistic consistency transformation Re-estimate the match quality scores P(x_i ~ y_j ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to other sequences z from S into the x-y comparison: P(x_i ~ y_j ∈ a* | x, y) → P(x_i ~ y_j ∈ a* | x, y, z)

43 Step 3: Probabilistic consistency transformation (continued) (Formula on the slide: the re-estimated matrix is an average over intermediate sequences z of the matrix products P_xz P_zy.)

44 Step 3: Probabilistic consistency transformation (continued) Since most of the entries of P_xz and P_zy will be very small, we ignore all entries whose value is smaller than some threshold w. Use sparse matrix multiplication. The transformation may be repeated.
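
A sketch of one round of the transformation for a single pair (x, y), written with NumPy: each intermediate sequence z contributes the product P_xz P_zy, small entries are zeroed to mimic the sparse-matrix trick, and the result is averaged over |S|. The exact normalization and the treatment of z = x and z = y are assumptions based on a standard reading of the method.

```python
import numpy as np

def consistency_transform_pair(P_xy, P_through_z, n_seqs, threshold=0.01):
    """Re-estimate the posterior matrix for one pair (x, y).
    P_xy: current |x| x |y| posterior matrix.
    P_through_z: list of (P_xz, P_zy) pairs for every other sequence z.
    Entries below `threshold` are zeroed, mimicking sparse multiplication."""
    def sparsify(M):
        return np.where(M >= threshold, M, 0.0)

    # z = x and z = y each contribute P_xy itself (P_xx, P_yy act as identities)
    total = 2.0 * sparsify(P_xy)
    for P_xz, P_zy in P_through_z:
        total += sparsify(P_xz) @ sparsify(P_zy)
    return total / n_seqs

# toy usage with random matrices standing in for real posteriors
rng = np.random.default_rng(0)
P_xy = rng.random((5, 6)) * 0.2
P_xz, P_zy = rng.random((5, 4)) * 0.2, rng.random((4, 6)) * 0.2
print(consistency_transform_pair(P_xy, [(P_xz, P_zy)], n_seqs=3).shape)  # (5, 6)
```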

45 Step 4: Computation of guide tree Use the expected accuracy E(x, y) as the measure of similarity between two sequences. Define the similarity of two clusters by the sum-of-pairs of the pairwise similarities. (Figure: similarity matrix over X_1…X_4 and the resulting guide tree.)

46 Step 5: Progressive alignment Align sequence groups hierarchically in the order specified by the guide tree. Alignments are scored with a sum-of-pairs scoring function: aligned residues are scored by the match quality scores P(x_i ~ y_j ∈ a* | x, y), and gap penalties are set to 0.

47 Post-processing step: iterative refinement Much like in MUSCLE: randomly partition the alignment into two groups of sequences and realign. May be repeated. (Figure: a random bipartition of X_1…X_5.)

48 ProbCons overview ProbCons demonstrated dramatic improvements in alignment accuracy, at the cost of a longer running time. It does not use protein-specific alignment information, so it can also be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.

49 Outline Introduction Background MUSCLE ProbCons Conclusion

50 Conclusion MUSCLE demonstrated lower accuracy but a very short running time. ProbCons demonstrated dramatic improvements in alignment accuracy; however, it is much slower than MUSCLE.

51 Results

52 Reliability Scores

53 Questions?

54 References Robert C Edgar – MUSCLE: a multiple sequence alignment method with reduced time and space complexity Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou – ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment

55 References (continued) Slides on Multiple Sequence Alignment, CS262 Slides on Sequence similarity, CS273 Slides on Protein Multiple Alignment, Marina Sirota