PROTEIN MULTIPLE SEQUENCE ALIGNMENT
Marina Sirota, CS374, October 19, 2004



OUTLINE
Introduction
Alignments
  Pairwise vs. Multiple
  DNA vs. Protein
MUSCLE
ProbCons
Conclusion

INTRODUCTION
1. Sequence Analysis – Look at DNA and protein sequences, searching for clues about structure, function and control
2. Structure Analysis – Examine biological structures, to learn more about structure, function and control
3. Functional Analysis – Understand how the sequences and structures lead to the biological function

PAIRWISE SEQUENCE ALIGNMENT (REVIEW)
The Problem: Given two sequences of letters and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence.
Basic Idea: The score of the best possible alignment that ends at a given pair of positions (i, j) is the score of the best alignment previous to those two positions plus the score for aligning those two positions.
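The recurrence in the Basic Idea can be sketched as a small dynamic program. This is a minimal illustration, not the scoring used by any particular tool: the function name and the match, mismatch, and gap values are assumptions, and a linear gap penalty stands in for the affine penalties mentioned later.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of x and y (linear gap penalty)."""
    rows, cols = len(x) + 1, len(y) + 1
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap          # x[:i] aligned entirely to gaps
    for j in range(1, cols):
        F[0][j] = j * gap          # y[:j] aligned entirely to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + pair,   # align x[i-1] with y[j-1]
                F[i - 1][j] + gap,        # x[i-1] against a gap
                F[i][j - 1] + gap,        # y[j-1] against a gap
            )
    return F[-1][-1]
```

The table has (len(x) + 1) by (len(y) + 1) cells, which is where the quadratic running time comes from.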

PAIRWISE SEQUENCE ALIGNMENT (REVIEW)
[Figure slide; image not preserved in the transcript]

PAIRWISE VS. MULTIPLE
Pairwise:
  Evaluated by adding match or mismatch scores for aligned pairs and affine gap penalties for unaligned pairs
  $O(L^2)$ time and $O(L)$ space via dynamic programming
Multiple:
  Lack of proper objective scoring functions to measure alignment quality
  High computational cost and no efficient algorithm that can be applied
L = sequence length

PROTEIN VS. DNA
Alphabet: DNA has 4 characters; protein has 20 characters
Similarity: DNA – 50% similarity; protein – 20% similarity
Number of sequences: DNA – fewer sequences to compare; protein – many sequences to compare
Length: DNA aligners need to be able to handle long sequences; protein aligners do not

PROTEIN MULTIPLE SEQUENCE ALIGNMENT
Note that areas that are considered very similar don’t necessarily contain the same amino acids.

MOTIVATION
Find similarity between known and unknown sequences
Protein sequence similarity implies divergence from a common ancestor and functional similarity
PROBLEM
Given n sequences and a scoring scheme for evaluating matching letters, find the optimal pairing of letters between the sequences
Can be done using dynamic programming with time and space complexity $O(L^n)$, which is not practical!
Need new algorithms and approaches
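To make the $O(L^n)$ growth concrete, the number of cells in the full dynamic programming table can be counted directly. The helper name and the example lengths below are illustrative assumptions.

```python
def dp_cells(seq_len, num_seqs):
    """Cells in the full DP table for num_seqs sequences of length seq_len."""
    # One axis of size seq_len + 1 per sequence (the extra slot is the empty prefix).
    return (seq_len + 1) ** num_seqs


# Two sequences of length 300 need a table of about 9 * 10^4 cells,
# three need about 2.7 * 10^7, and every further sequence multiplies by 301.
pairwise_cells = dp_cells(300, 2)
three_way_cells = dp_cells(300, 3)
```

Each cell also has to look at up to 2^n - 1 predecessor cells, so the exact method runs out of time and memory after a handful of sequences.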

APPLICATIONS
Evolutionary research
Isolation of most relevant regions
Characterization of protein families

MORE APPLICATIONS
3-dimensional structure prediction
Phylogenetic studies

PAPERS
MUSCLE: a Multiple Sequence Alignment Method with Reduced Time and Space Complexity, by Robert C. Edgar
ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Michael Brudno, and Serafim Batzoglou

MUSCLE – OVERVIEW
Basic Idea: A progressive alignment is built, to which horizontal refinement is applied
3 stages of the algorithm
At the completion of each stage, a multiple alignment is available and the algorithm can be terminated
Significant improvement in accuracy and speed

MUSCLE – THE ALGORITHM
Stage 1: Draft Progressive – Builds a progressive alignment
Similarity of each pair of sequences is computed using k-mer counting, or by constructing a global alignment and determining the fractional identity of the sequences
A tree is constructed and a root is identified
A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment
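The k-mer counting step of Stage 1 can be sketched as follows. This is one common way to define a k-mer similarity (shared k-mers divided by the number of k-mers in the shorter sequence); the function names and the choice k = 3 are assumptions, and details such as alphabet compression are left out.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Shared k-mers (with multiplicity) over the k-mer count of the shorter sequence."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((counts_a & counts_b).values())   # & keeps the minimum count per k-mer
    shortest = min(len(a), len(b)) - k + 1
    return shared / shortest if shortest > 0 else 0.0
```

No alignment is needed to compute this measure, which is why it suits a fast first draft of the tree.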

MUSCLE – PROGRESSIVE ALIGNMENT
[Figure slide; image not preserved in the transcript]

MUSCLE – PROFILE-PROFILE ALIGNMENT
[Figure slide; image not preserved in the transcript]

MUSCLE – THE ALGORITHM
Stage 2: Improved Progressive – Improves the tree
Similarity of each pair of sequences is computed using fractional identity from the mutual alignment
A tree is constructed by applying a clustering method to the distance matrix
The trees are compared; a set of nodes for which the branching order has changed is identified
A new alignment is built; existing alignments are retained wherever the branching order is unchanged
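The fractional identity used in Stage 2 can be read off two rows of an existing multiple alignment. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, columns where either row has a gap are ignored, and the distance is taken as plain 1 − identity without any further correction.

```python
def fractional_identity(row_a, row_b, gap="-"):
    """Identical residue pairs divided by columns where both rows have a residue."""
    aligned = [(p, q) for p, q in zip(row_a, row_b) if p != gap and q != gap]
    if not aligned:
        return 0.0
    return sum(p == q for p, q in aligned) / len(aligned)


def distance_matrix(rows):
    """Pairwise distances (1 - identity) between the rows of one multiple alignment."""
    n = len(rows)
    return [
        [0.0 if i == j else 1.0 - fractional_identity(rows[i], rows[j])
         for j in range(n)]
        for i in range(n)
    ]
```

For the rows `ABRACA-DABRA` and `AB-ACARDI---`, seven columns hold a residue in both rows and six of those match, giving an identity of 6/7.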

MUSCLE – TREE COMPARISON
[Figure slide; image not preserved in the transcript]

MUSCLE – THE ALGORITHM
Stage 3: Refinement – Iterative refinement is performed
An edge is deleted from the tree, dividing the sequences into two disjoint subsets
The profile (multiple alignment) of each subset is extracted
The profiles are re-aligned to each other
The score is computed; if the score has increased, the new alignment is retained, otherwise it is discarded
Algorithm terminates at convergence

MUSCLE – ITERATIVE REFINEMENT
[Figure: a tree with leaves S, T, U, X, Z. One edge is deleted, and the two resulting profiles are realigned to each other.]
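The edge-deletion step can be sketched on a tree written as nested tuples. Everything here is illustrative: the tree shape for the leaves S, T, U, X, Z is a made-up example, and `realign` and `score` are hypothetical callables standing in for profile-profile alignment and the alignment scoring function. The sketch makes a single pass over the edges, whereas the description above repeats until convergence.

```python
def leaves(tree):
    """Leaf names of a tree given as nested 2-tuples with string leaves."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def bipartitions(tree):
    """Yield (subset, rest) for the edge above every node except the root."""
    all_leaves = set(leaves(tree))

    def walk(node):
        if not isinstance(node, str):
            for child in node:
                below = set(leaves(child))
                yield below, all_leaves - below   # deleting the edge above child
                yield from walk(child)

    yield from walk(tree)


def refine(alignment, tree, realign, score):
    """One pass of refinement: keep a realignment only if the score increases."""
    best = score(alignment)
    for subset, rest in bipartitions(tree):
        candidate = realign(alignment, subset, rest)
        candidate_score = score(candidate)
        if candidate_score > best:
            alignment, best = candidate, candidate_score
    return alignment


example_tree = ((("S", "T"), "U"), ("X", "Z"))   # hypothetical topology
```

In a rooted binary tree the two edges leaving the root give the same split, so one of them could be skipped.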

MUSCLE
Results:
$O(N^2 + L^2)$ space and $O(N^4 + NL^2)$ time complexity (N = number of sequences, L = sequence length)
Improvements in selection of heuristics
Close attention paid to implementation details
Enables high-throughput applications to achieve good accuracy

PROBCONS – OVERVIEW
Alignment generation can be directly modeled as a first-order Markov process involving state emissions and transitions
Uses the maximum expected accuracy alignment method
Probabilistic consistency is used as a scoring function
Model parameters are obtained using unsupervised maximum likelihood methods
Incorporates multiple sequence information in scoring pairwise alignments

PROBCONS – HIDDEN MARKOV MODEL
Deletion penalties on Match => Gap transitions
Extension penalties on Gap => Gap transitions
Match/Mismatch penalties on Match emissions

PROBCONS – HIDDEN MARKOV MODEL
[Figure: three-state pair HMM with states MATCH (emits $x_i$ and $y_j$), INSERT X (emits $x_i$ against a gap), and INSERT Y (emits $y_j$ against a gap), shown with the example alignment x = ABRACA-DABRA, y = AB-ACARDI---]
Basic HMM for sequence alignment between two sequences
M emits two letters, one from each sequence
$I_x$ emits a letter from x that aligns to a gap
$I_y$ emits a letter from y that aligns to a gap
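One way to read the three-state model is to score a single gapped alignment by walking its columns. The sketch below is a toy under loud assumptions: the transition values `delta` and `epsilon`, the emission probabilities, the rule that the start behaves like state M, and the absence of direct transitions between the two insert states are all illustrative choices, not trained ProbCons parameters.

```python
import math


def alignment_log_prob(row_x, row_y, delta=0.05, epsilon=0.4):
    """Log joint probability of one gapped alignment under a toy pair HMM."""
    trans = {
        ("M", "M"): 1 - 2 * delta, ("M", "X"): delta, ("M", "Y"): delta,
        ("X", "X"): epsilon, ("X", "M"): 1 - epsilon,
        ("Y", "Y"): epsilon, ("Y", "M"): 1 - epsilon,
    }

    def emit(state, a, b):
        if state == "M":
            # Toy match emission; sums to 1 over all 20 x 20 residue pairs.
            return 0.03 if a == b else 0.4 / 380
        return 1 / 20                      # toy uniform insert emission

    state, logp = "M", 0.0                 # assumption: start behaves like M
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a != "-" and b != "-":
            nxt = "M"                      # both letters emitted together
        elif b == "-":
            nxt = "X"                      # letter of x against a gap
        else:
            nxt = "Y"                      # letter of y against a gap
        p = trans.get((state, nxt), 0.0)
        if p == 0.0:
            return float("-inf")           # X <-> Y is not allowed in this toy model
        logp += math.log(p) + math.log(emit(nxt, a, b))
        state = nxt
    return logp
```

Calling it on the two rows from the figure, `alignment_log_prob("ABRACA-DABRA", "AB-ACARDI---")`, returns a finite log-probability because every insert run there is separated by a match column.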

PROBCONS – MAXIMUM EXPECTED ACCURACY
LAZY TEACHER ANALOGY
10 students take a 10-question true/false quiz
How do you make up the answer key?
1. Use the answers of the single best student (Viterbi algorithm)
2. Use weighted majority rule (maximum expected accuracy)
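The two ways of building the answer key can be written out directly. This is only an illustration of the analogy: the function names are made up, and the weights stand in for how much each student is trusted.

```python
def best_student_key(answers, weights):
    """Copy the whole answer sheet of the single most trusted student."""
    best = max(range(len(answers)), key=lambda s: weights[s])
    return list(answers[best])


def weighted_majority_key(answers, weights):
    """Decide each question separately by weighted vote."""
    half = sum(weights) / 2
    key = []
    for q in range(len(answers[0])):
        weight_true = sum(w for sheet, w in zip(answers, weights) if sheet[q])
        key.append(weight_true > half)
    return key


# Three students, three questions; the first student is trusted most.
sheets = [
    [True, True, False],
    [True, False, True],
    [False, False, True],
]
trust = [4, 3, 3]
```

With these sheets the best-student key is the first sheet unchanged, while the weighted vote overrules that student on the last two questions, where the other two students agree with each other.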

PROBCONS – MAXIMUM EXPECTED ACCURACY
Viterbi
Picks a single alignment with the highest chance of being completely correct (analogous to Needleman-Wunsch)
Mathematically, finds the alignment a which maximizes $E_{a^*}[\mathbf{1}\{a = a^*\}]$ (maximum probability alignment)
Maximum Expected Accuracy
Picks the alignment with the highest expected number of correct predictions
Mathematically, finds the alignment a which maximizes $E_{a^*}[\mathrm{accuracy}(a, a^*)]$

PROBCONS – COMPUTING MEA
Define $\mathrm{accuracy}(a, a^*)$ = the number of correctly aligned pairs of letters divided by the length of the shorter sequence
The MEA alignment is found by finding the highest-summing path through the matrix $M_{xy}[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$
We just need to compute these terms!
Can use dynamic programming
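Given a matrix of posterior match probabilities, the highest-summing path can be found with a gap-penalty-free dynamic program. The sketch below assumes the matrix is supplied as a list of rows, one per letter of x; the function name and the tie-breaking rule (prefer a gap over a zero-probability match) are illustrative choices.

```python
def mea_alignment(P):
    """Return (expected accuracy, matched index pairs) for posterior matrix P."""
    n = len(P)
    m = len(P[0]) if n else 0
    # A[i][j] = best total posterior for aligning x[:i] with y[:j]
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + P[i - 1][j - 1],   # match x[i-1] with y[j-1]
                A[i - 1][j],                         # x[i-1] unmatched
                A[i][j - 1],                         # y[j-1] unmatched
            )
    # Trace back, preferring gaps when they give the same total.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j]:
            i -= 1
        elif A[i][j] == A[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    shorter = min(n, m)
    accuracy = A[n][m] / shorter if shorter else 0.0
    return accuracy, pairs
```

Dividing the path total by the length of the shorter sequence gives the expected accuracy as defined above.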

PROBCONS – PROBABILISTIC CONSISTENCY
[Figure: three sequences x, y, and z, with labeled positions $x_i$, $y_j$, $y_{j'}$, and $z_k$]

PROBCONS – PROBABILISTIC CONSISTENCY
Instead of $P(x_i \text{ is aligned to } y_j \mid x, y)$, compute $P(x_i \text{ is aligned to } y_j \mid x, y, z)$
We can re-estimate $M_{xy}$ as $(M_{xz})(M_{zy})$, where z is a third sequence to which x and y are aligned
$M_{xy}[i, j] = \sum_{k=1}^{n} M_{xz}[i, k] \, M_{zy}[k, j]$, where n is the length of z
We follow the alignment from position i of x to position j of y through all intermediate positions k of a third sequence z
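The re-estimation through a single third sequence z is an ordinary matrix product. The sketch below uses plain lists of rows; the function name and the optional `threshold` argument (which skips small entries, in the spirit of a sparse product) are assumptions made for illustration.

```python
def consistency_transform(Mxz, Mzy, threshold=0.0):
    """Re-estimate Mxy[i][j] as the sum over k of Mxz[i][k] * Mzy[k][j]."""
    n = len(Mzy)                       # length of the intermediate sequence z
    cols = len(Mzy[0]) if n else 0     # length of y
    result = []
    for row in Mxz:                    # one row per position i of x
        new_row = []
        for j in range(cols):
            total = 0.0
            for k in range(n):
                # Entries below the threshold are treated as zero.
                if row[k] >= threshold and Mzy[k][j] >= threshold:
                    total += row[k] * Mzy[k][j]
            new_row.append(total)
        result.append(new_row)
    return result
```

With several candidate sequences z, one such product per z would be computed and the results combined; how they are combined is not specified here.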

PROBCONS – MULTIPLE ALIGNMENT
A straightforward generalization:
  sum-of-pairs
  tree-based progressive alignment
  iterative refinement
[Figure: progressive alignment and iterative refinement illustrated on the sequences ABRACADABRA, ABACARDI, and ABRADABI, which are combined into the multiple alignment
ABRACA-DABRA
AB-ACARDI---
ABRA---DABI-]

PROBCONS – THE ALGORITHM
Step 1: Computation of posterior-probability matrices
For every pair of sequences x, y, compute the probability that letters $x_i$ and $y_j$ are paired in $a^*$, an alignment of x and y that is randomly generated by the model
Step 2: Computation of expected accuracies
Define the expected accuracy of a pairwise alignment $a_{xy}$ to be the expected number of correctly aligned pairs of letters divided by the length of the shorter sequence
Compute the alignment $a_{xy}$ that maximizes expected accuracy E(x, y) using dynamic programming
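Step 1 can be sketched with the forward and backward recursions of a three-state pair HMM. This is a toy under stated assumptions, not the trained ProbCons model: `delta`, `epsilon`, and the emission values are invented, the start is treated like the match state, there is no explicit end state, and probabilities are kept in plain (non-log) space, which only suits short sequences.

```python
def posterior_matrix(x, y, delta=0.05, epsilon=0.4):
    """P[i][j] = posterior probability that x[i] is aligned to y[j]."""
    def p(a, b):                       # toy match emission (sums to 1 over 20 x 20)
        return 0.03 if a == b else 0.4 / 380
    q = 1 / 20                         # toy uniform insert emission
    n, m = len(x), len(y)

    def zeros():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward: f*[i][j] = probability of emitting x[:i], y[:j] and ending in that state.
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0                     # assumption: the start behaves like state M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - epsilon) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + epsilon * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + epsilon * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward: b*[i][j] = probability of emitting the rest, given the state at (i, j).
    bM, bX, bY = zeros(), zeros(), zeros()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            gap_x = q * bX[i + 1][j] if i < n else 0.0
            gap_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * match + delta * (gap_x + gap_y)
            bX[i][j] = (1 - epsilon) * match + epsilon * gap_x
            bY[i][j] = (1 - epsilon) * match + epsilon * gap_y

    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

Each row of the result sums to at most 1, since a letter of x can be matched to at most one letter of y in any single alignment.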

PROBCONS – THE ALGORITHM
Step 3: Probabilistic consistency transformation
Re-estimate the scores with the probabilistic consistency transformation by incorporating similarity of x and y to other sequences into the pairwise comparison of x and y
Computed efficiently using sparse matrix multiplication, ignoring all entries smaller than some threshold
Step 4: Computation of a guide tree
Construct a tree by hierarchical clustering using E(x, y)
Cluster similarity is defined by a weighted average of pairwise similarities between the clusters
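Step 4 can be sketched as greedy agglomerative clustering on the expected accuracies. The details below are assumptions: the function name is hypothetical, the tree is returned as nested tuples, and the weighted average is taken with cluster sizes as weights.

```python
from itertools import combinations


def guide_tree(names, E):
    """Greedy clustering; E maps a pair of names to the expected accuracy E(x, y)."""
    size = {name: 1 for name in names}              # cluster -> number of leaves
    sim = {frozenset(pair): s for pair, s in E.items()}
    while len(size) > 1:
        # Merge the two clusters with the highest similarity.
        a, b = max(combinations(size, 2), key=lambda pr: sim[frozenset(pr)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Size-weighted average of the similarities to the two parts.
                sim[frozenset((merged, c))] = (
                    size[a] * sim[frozenset((a, c))]
                    + size[b] * sim[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))
```

For three sequences with E(x, y) = 0.9, E(x, z) = 0.4, and E(y, z) = 0.5, the pair x, y is joined first and z is attached to that cluster afterwards.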

PROBCONS – THE ALGORITHM
Step 5: Progressive alignment
Align sequence groups hierarchically according to the order specified in the guide tree
Score using a sum-of-pairs function in which the aligned residues are scored according to the match quality scores and the gap penalties are set to 0
Step 6: Iterative refinement
Randomly partition the alignment into two groups of sequences and realign
May be repeated as necessary
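The partitioning half of Step 6 can be sketched as follows; the realignment of the two groups is left out. The function names are hypothetical, and the alignment is assumed to be a dict from sequence name to gapped row.

```python
import random


def random_bipartition(names, rng):
    """Split names into two non-empty groups chosen at random."""
    names = list(names)
    if len(names) < 2:
        raise ValueError("need at least two sequences to partition")
    while True:
        group = {name for name in names if rng.random() < 0.5}
        if 0 < len(group) < len(names):
            return group, set(names) - group


def project(alignment, group):
    """Rows of the chosen group, with columns that became all-gap removed."""
    rows = {name: alignment[name] for name in group}
    width = len(next(iter(rows.values())))
    keep = [c for c in range(width) if any(r[c] != "-" for r in rows.values())]
    return {name: "".join(r[c] for c in keep) for name, r in rows.items()}


example = {
    "x": "ABRACA-DABRA",
    "y": "AB-ACARDI---",
    "z": "ABRA---DABI-",
}
```

Projecting `example` onto the group {x, z} drops the one column where both rows have a gap, giving `ABRACADABRA` and `ABRA--DABI-`; projecting onto {y} alone gives the ungapped `ABACARDI`. The two projected groups would then be realigned to each other.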

PROBCONS
Results:
Best results so far
Longer running time due to the computation of posterior probability matrices (Step 1)
Doesn’t incorporate biological information
Could provide improved accuracy in DNA multiple alignment

PROBCONS
[Figure slide; image not preserved in the transcript]

CONCLUSION
Protein multiple alignment is a current research problem (both papers published in 2004)
Many applications, including evolutionary and phylogenetic studies, protein structure and classification
Currently, there is some collaboration between the authors of MUSCLE and ProbCons to create a new program which will combine the speed of MUSCLE-based tree construction and the accuracy that comes from using MEA and probabilistic consistency

REFERENCES
Do, C.B., Brudno, M., and Batzoglou, S. PROBCONS: Probabilistic Consistency-based Multiple Alignment of Amino Acid Sequences. Submitted.
Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5).
Some of these slides were adapted from Tom Do’s ISMB presentation on PROBCONS.

THANKS!