Multiple Sequence Alignment

Terminology
- Motif: the biological object one attempts to model, such as a functional or structural domain, an active site, or a phosphorylation site.
- Pattern: a qualitative motif description based on a regular-expression-like syntax.
- Profile: a quantitative motif description that assigns a degree of similarity to a potential match.
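
The difference between a pattern and a profile can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the motif sites, the regular expression, the uniform background and the pseudocount are all assumptions made up for the example.

```python
import math
import re

# Hypothetical aligned motif instances (equal length, no gaps), for illustration only.
MOTIF_SITES = ["RRASV", "RRGSL", "KRASV", "RRLSI"]

# Pattern: a qualitative description. A candidate either matches or it does not.
PATTERN = re.compile(r"[RK]R.S[VLI]")


def build_profile(sites, pseudocount=1.0, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return one dict per column mapping residue -> log-odds score
    against a uniform background (an assumption of this sketch)."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*sites):
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in alphabet
        })
    return profile


def profile_score(profile, candidate):
    """Profile: a quantitative description. Sum the per-column scores."""
    return sum(column[aa] for column, aa in zip(profile, candidate))


profile = build_profile(MOTIF_SITES)
for candidate in ["RRASV", "KRGSI", "AAAAA"]:
    matches = bool(PATTERN.fullmatch(candidate))
    print(candidate, matches, round(profile_score(profile, candidate), 2))
```

The pattern gives a yes/no answer, while the profile ranks candidates by how closely they resemble the observed sites.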

Global Alignment
- Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
- Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

What is Multiple Sequence Alignment (MSA)?
Multiple sequence alignment (MSA) can be seen as a generalization of pairwise sequence alignment: instead of aligning two sequences, n sequences are aligned simultaneously, where n > 2.
Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps ("-") into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns, where each column represents a homologous position (each column corresponds to a specific residue in the 'prototypical' protein).
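
The definition can be checked mechanically. The sketch below assumes toy DNA sequences and a hypothetical helper name (`is_valid_msa`); it only tests the structural part of the definition (row count, equal length, gaps as the only additions), not whether the columns are truly homologous.

```python
def is_valid_msa(rows, originals, gap="-"):
    """Check the structural definition: more than two rows, all of the same
    length L, and each row equal to its input sequence once gaps are removed."""
    if len(rows) <= 2 or len(rows) != len(originals):
        return False
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        return False
    return all(row.replace(gap, "") == seq for row, seq in zip(rows, originals))


# Toy input (an assumption for illustration).
originals = ["GATTACA", "GATACA", "GTTACA"]
rows = ["GATTACA",
        "GA-TACA",
        "G-TTACA"]

print(is_valid_msa(rows, originals))

# The n x L matrix read column by column: each tuple is one aligned position.
columns = list(zip(*rows))
print(len(rows), "rows x", len(columns), "columns")
```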

Multiple Sequence Alignment
- MSA applies to both nucleotide and amino acid sequences.
- To construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment.
- This means that multiple alignments typically contain more gaps than any given pair of aligned sequences.

How to optimize alignment algorithms?
- Use structural information:
  - reading frame
  - protein structure
- Sequence elements are not truly independent but related by phylogenetic descent.
- Sequences often contain highly conserved regions.

Optimize alignment algorithms
- Sequences often contain highly conserved regions.
- These regions can be used for an initial alignment.
- By analyzing a number of small, independent fragments, the algorithmic complexity can be drastically reduced!

Pairwise Alignment
- The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

The big-O notation
One of the most important properties of an algorithm is how its execution time increases as the problem is made larger. By a larger problem, we mean more sequences to align, or longer sequences to align. This is the so-called algorithmic (or computational) complexity of the algorithm.
There is a notation to describe the algorithmic complexity, called the big-O notation. If we have a problem size (number of input data points) n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2).

The big-O notation
- It is important to realize that an algorithm that is quick on small problems may be totally useless on large problems if it has a bad O() behavior.
- As a rule of thumb one can use the following characterizations, where n is the size of the problem and c is a constant:

  O(c)      utopian
  O(log n)  excellent
  O(n)      very good
  O(n^2)    not so good
  O(n^3)    pretty bad
  O(c^n)    disaster

The big-O notation
To compute an N-wise alignment, the algorithmic complexity is something like O(c^(2n)), where c is a constant and n is the number of sequences. This is a big-O disaster!

The best solution is Dynamic Programming.

Multiple Sequence Alignment
- In pairwise alignments, you have a two-dimensional matrix with the sequences on each axis.
- The number of operations required to locate the best "path" through the matrix is approximately proportional to the product of the lengths of the two sequences.
- A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamic-programming algorithm in N dimensions.
- Algorithmically, this is not difficult to do.
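
Although the N-dimensional extension is easy to state, its cost grows very quickly. The following back-of-the-envelope sketch assumes all sequences have the same length and counts the cells of the N-dimensional matrix, together with the 2^N - 1 predecessor cells that a full dynamic-programming recurrence inspects per cell. The function names are hypothetical.

```python
def dp_cells(seq_length, n_sequences):
    """Cells in the DP matrix when every sequence has the same length."""
    return (seq_length + 1) ** n_sequences


def dp_cell_updates(seq_length, n_sequences):
    """Cells times the predecessors examined per cell: each of the N sequences
    either contributes a residue or a gap, excluding the all-gap move."""
    return dp_cells(seq_length, n_sequences) * (2 ** n_sequences - 1)


# Sequences of length 100 (an arbitrary choice for illustration).
for n in (2, 3, 5, 10):
    print(n, dp_cells(100, n), dp_cell_updates(100, n))
```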

Dynamic Programming
- Dynamic programming is a very general programming technique.
- It is applicable when a large search space can be structured into a succession of stages, such that:
  - the initial stage contains trivial solutions to sub-problems
  - each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
  - the final stage contains the overall solution
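
As an illustration of these three stages, here is a minimal sketch of global pairwise alignment in the style of Needleman-Wunsch. The scoring values (match, mismatch, linear gap cost) are assumptions chosen for the example, not values from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment
    with a linear gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against gaps only).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Later stages: each cell depends on exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Final stage: the last cell holds the overall score; trace back the path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(row_a)
print(row_b)
```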

Multiple Alignments
- In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible.
- The problem increases exponentially with the number of sequences involved (the product of the sequence lengths).
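
For reference, the two-sequence case named above can be sketched as follows. This is a score-only version of Smith-Waterman local alignment with a linear gap cost; the scoring values and the test sequences are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap cost).
    Only two rows of the matrix are kept, since the score alone is returned."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment start anywhere (local, not global).
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two sequences that share only the segment GATTACA.
print(smith_waterman_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
```

The work is proportional to the product of the two lengths, which is why the same exhaustive approach becomes impractical as more sequences are added.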

Optimal Alignment
- For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
- Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.

Why do we do multiple alignments?
- To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences).
- To determine the consensus sequence of several aligned sequences.
- Consensus sequences can help to develop a sequence "fingerprint" that allows the identification of members of a distantly related protein family (motifs).
- MSA can help reveal biological facts about proteins, for example through analysis of the secondary/tertiary structure.
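
A consensus sequence can be derived column by column from an existing alignment. The sketch below uses a simple majority rule; the threshold, the placeholder character for unresolved columns, and the toy rows are assumptions for illustration.

```python
from collections import Counter


def consensus(rows, threshold=0.5, gap="-", unknown="x"):
    """Return the most common residue of each column, or `unknown` when the
    top symbol is a gap or falls below the frequency threshold."""
    result = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        if residue == gap or count / len(column) < threshold:
            result.append(unknown)
        else:
            result.append(residue)
    return "".join(result)


# Toy aligned rows (an assumption for illustration).
aligned = ["ACGT",
           "ACGA",
           "AC-A",
           "TCGA"]

print(consensus(aligned))
print(consensus(aligned, threshold=1.0))
```

Raising the threshold keeps only fully conserved columns, which is closer to the idea of a "fingerprint".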

Why do we do multiple alignments?
- Crucial for genome sequencing:
  - Random fragments of a large molecule are sequenced, and those that overlap are found by a multiple sequence alignment program.
  - There should be one correct alignment that corresponds to the genomic sequence, rather than a range of possibilities.
  - A sequence may come from either strand of DNA, so the complement of each sequence must also be compared.
  - Sequence fragments will usually overlap, but by an unknown amount, and in some cases one sequence may be included within another.
  - All of the overlapping pairs of sequence fragments must be assembled into a large composite genome sequence.
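
The overlap search described above can be sketched for a single pair of fragments. This is a simplified illustration that assumes error-free reads and exact matching (real assembly programs must tolerate sequencing errors); the function names and the minimum overlap length are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`
    (0 if shorter than min_len)."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def best_overlap(left, right, min_len=3):
    """Try `right` on both strands.
    Returns (overlap_length, orientation, contained)."""
    best = (0, "forward", False)
    for orientation, seq in (("forward", right), ("reverse", reverse_complement(right))):
        if seq in left:
            # One fragment is included entirely within the other.
            return (len(seq), orientation, True)
        size = suffix_prefix_overlap(left, seq, min_len)
        if size > best[0]:
            best = (size, orientation, False)
    return best


print(best_overlap("ACGTTGCA", "TGCAGGT"))
print(best_overlap("ACGTTGCA", reverse_complement("TGCAGGT")))
print(best_overlap("ACGTTGCA", "GTTG"))
```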

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Three Types of Algorithms
- Progressive: ClustalW
- Iterative: MUSCLE
- Consistency-based: T-Coffee and ProbCons

Progressive Multiple Alignment
- The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.
- The principle is that a multiple alignment is achieved by successive application of pairwise methods.

Choosing sequences for alignment
- The more sequences to align, the better.
- Don't include highly similar (>80%) sequences.
- Sub-groups should be pre-aligned separately, and one member of each sub-group should be included in the final multiple alignment.

Progressive Pairwise Methods
- Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.
- This is an approximate or heuristic method!

Multiple Alignment Method
- Compare all sequences pairwise.
- Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
- Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
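
The first two steps (all-against-all comparison, then clustering into an alignment order) can be sketched as below. Two simplifications are assumed: `difflib.SequenceMatcher.ratio()` stands in for a real pairwise alignment score, and the clustering is a plain greedy average-linkage procedure rather than the method of any particular program. The sequences are toy data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for a pairwise alignment score (assumption); value in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_order(seqs):
    """Greedy average-linkage clustering over named sequences.
    Returns the merges in order, each as a pair of name tuples."""
    sim = {frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(seqs, 2)}

    def linkage(x, y):
        # Average similarity between all members of two clusters.
        return sum(sim[frozenset((p, q))] for p in x for q in y) / (len(x) * len(y))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
        merges.append((x, y))
    return merges


toy = {"A": "ACGTACGTAC", "B": "TTGGCCTTGG", "C": "ACGTACGTAA", "D": "TTGGCCTTGA"}
merges = guide_order(toy)
for left, right in merges:
    print(left, "+", right)
```

With this toy input, A pairs with C and B pairs with D before the two groups are joined, mirroring the A, B, C, D example in the text.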

Gap Penalties
- In the MSA scoring scheme, a penalty is subtracted for each gap introduced into an alignment, because a gap increases the uncertainty of the alignment.
- The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment.
- Biologically, it should in general be easier for a sequence to accept a different residue at a position than to have parts of the sequence deleted or inserted. Gaps/insertions should therefore be rarer than point mutations (substitutions).
- In general, the lower the gap penalties, the more gaps and the more identities are detected, but this should be considered in relation to biological significance.
- Most MSA programs allow gap penalties to be adjusted.
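
The effect of a gap penalty on a score can be sketched for one pair of aligned rows. The example assumes an affine scheme, in which the first position of a gap run costs `gap_open` and each further position costs `gap_extend`; all numeric values are illustrative, not defaults of any program.

```python
def score_aligned_pair(row_a, row_b, match=1, mismatch=-1, gap_open=4, gap_extend=1):
    """Score two aligned rows of equal length with an affine gap penalty."""
    score = 0
    gap_side = None  # which row the current gap run is in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a gap-gap column says nothing about this pair
        side = "a" if x == "-" else "b" if y == "-" else None
        if side is None:
            score += match if x == y else mismatch
        else:
            # Extending an existing run is cheaper than opening a new one.
            score -= gap_extend if side == gap_side else gap_open
        gap_side = side
    return score


# Same number of matches and gap positions, but one long gap versus two short ones.
print(score_aligned_pair("GATTACA", "GA--ACA"))
print(score_aligned_pair("GATTACA", "G-T-ACA"))
```

Lowering `gap_open` makes the second, more fragmented alignment competitive, which is the trade-off the last two bullets describe.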

The PILEUP Algorithm
- First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned, and they are clustered into a dendrogram (tree structure).
- Then the most similar pairs of sequences are aligned.
- Averages (similar to consensus sequences) are calculated for the aligned pairs.
- New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.
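
One way to picture the "averages" step is to summarize each aligned cluster as per-column residue frequencies and to score two columns by the average of all residue pairings. This is a generic sketch of averaged column scoring, not PILEUP's actual implementation; the function names and score values are assumptions.

```python
from collections import Counter


def column_profile(rows):
    """Per-column residue frequencies (gaps included) for aligned rows."""
    return [{res: n / len(col) for res, n in Counter(col).items()}
            for col in zip(*rows)]


def column_score(col_x, col_y, match=1.0, mismatch=-1.0, gap=-2.0):
    """Frequency-weighted average of pairwise scores between two columns."""
    total = 0.0
    for res_x, freq_x in col_x.items():
        for res_y, freq_y in col_y.items():
            if res_x == "-" and res_y == "-":
                pair = 0.0
            elif res_x == "-" or res_y == "-":
                pair = gap
            else:
                pair = match if res_x == res_y else mismatch
            total += freq_x * freq_y * pair
    return total


# Two small, already aligned clusters (toy data).
cluster_1 = column_profile(["AC", "AG"])
cluster_2 = column_profile(["AC", "AC"])

print(column_score(cluster_1[0], cluster_2[0]))
print(column_score(cluster_1[1], cluster_2[1]))
```

Such column scores can replace the residue-versus-residue score in a pairwise dynamic-programming recurrence when two clusters are aligned to each other.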

Choosing sequences for MSA
- As far as possible, try to align sequences of similar length.
- PileUp can align sequences of up to 5000 residues, with 2000 gaps (total 7000 characters).
- PileUp is a good program only for similar (close) sequences.

PileUp considerations
- PileUp does global multiple alignment, and therefore is good for a group of similar sequences.
- PileUp will fail to find the best local region of similarity (such as a shared motif) among distantly related sequences.
- PileUp always aligns all of the sequences you specified in the input file, even if they are not related.
- The alignment can be degraded if some of the sequences are only distantly related.

PILEUP Considerations
- Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.
- PILEUP parameters: two gap penalties (gap insert and gap extend) and an amino acid comparison matrix.
- PILEUP will refuse to align sequences that require too many gaps or mismatches.
- PILEUP will take quite a while to align more than about 10 sequences.

CLUSTAL
- CLUSTAL is a stand-alone (i.e. not integrated into GCG) multiple alignment program that is superior in some respects to PILEUP.
- Works by progressive alignment: it aligns a pair of sequences, then aligns the next one onto the first pair.
- The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments.
- Uses alignment scores to produce a phylogenetic tree.

CLUSTAL
- Aligns the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
- Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.
- Is available with a great web interface.
- Also available in Biology Workbench.

Multiple Alignment tools on the Web
- There are a variety of multiple alignment tools available for free on the web.
- CLUSTAL is available from a number of sites (with a variety of restrictions).
- Other algorithms are available too.

Muscle Algorithm: Using The Iteration

Consistency Based Algorithms: T-Coffee
- Gotoh (1990)
  - Iterative strategy using consistency
- Martin Vingron (1991)
  - Dot matrix multiplications
  - Accurate but too stringent
- Dialign (1996, Morgenstern)
  - Consistency
  - Agglomerative assembly
- T-Coffee (2000, Notredame)
  - Consistency
  - Progressive algorithm
- ProbCons (2004, Do)
  - T-Coffee with a Bayesian treatment

T-Coffee and Consistency…

[Figure labels: approximate, fast; accurate, slow]

Some URLs
- EMBL-EBI
- BCM Search Launcher: Multiple Alignment align.html
- Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)

Editing and displaying alignments
- Sequence editors are used for:
  - manual alignment/editing of sequences
  - visualization of data
  - data management
  - import/export of data
  - graphical enhancement of data for presentations

Editing Multiple Alignments
- There are a variety of tools that can be used to modify a multiple alignment.
- These programs can be very useful in formatting and annotating an alignment for publication.
- An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

Displaying a multiple alignment in GCG
- Several programs can display a multiple alignment attractively.
- The Pretty program prints sequences with their columns aligned and can display a consensus for the alignment, allowing you to look at relationships among the sequences.
- The PrettyBox program displays the alignment graphically, with the conserved regions of the alignment as shaded boxes. The output is in PostScript format.

Example of PrettyBox Output

GCG alignment editors
- Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.
- Nicely shaded printouts can be produced with PRETTYBOX.
- GCG's SeqLab X-Windows interface has a superb multiple sequence editor, the best editor of any kind.

Other editors
- The MACAW and SeqVu programs for Macintosh, and GeneDoc and DCSE for PCs, are free and provide excellent editor functionality.
- Many "comprehensive" molecular biology programs include multiple alignment functions:
  - MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all include a built-in version of CLUSTAL

SeqVu

CINEMA
- CINEMA (Colour INteractive Editor for Multiple Alignments)
  - It is an editor created completely in Java (old browsers beware).
  - It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module.

Informative Colors
- By default, the alignment is coloured crudely according to residue type (proline and glycine have special structural properties, particularly in membrane proteins, so are grouped separately; similarly for cysteine, which is often involved in disulphide bond formation):

  Residue type          Residues        Colour
  Polar positive        H, K, R         Blue
  Polar negative        D, E            Red
  Polar neutral         S, T, N, Q      Green
  Non-polar aliphatic   A, V, L, I, M   White
  Non-polar aromatic    F, Y, W         Purple
  Proline, glycine      P, G            Brown
  Cysteine              C               Yellow
  Special characters    B, Z, X, -      Grey
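
The colour scheme above maps naturally onto a lookup table. The sketch below is not CINEMA code; the names are hypothetical, and falling back to Grey for characters outside the table is an assumption.

```python
# Colour groups transcribed from the table above.
RESIDUE_GROUPS = {
    "Blue": "HKR",      # polar positive
    "Red": "DE",        # polar negative
    "Green": "STNQ",    # polar neutral
    "White": "AVLIM",   # non-polar aliphatic
    "Purple": "FYW",    # non-polar aromatic
    "Brown": "PG",      # proline, glycine
    "Yellow": "C",      # cysteine
    "Grey": "BZX-",     # special characters
}

# Invert the groups into a residue -> colour lookup.
COLOUR_OF = {res: colour
             for colour, residues in RESIDUE_GROUPS.items()
             for res in residues}


def colour_row(row, default="Grey"):
    """Colour name for each character of an aligned row.
    Unknown characters fall back to `default` (an assumption of this sketch)."""
    return [COLOUR_OF.get(res.upper(), default) for res in row]


print(colour_row("KD-C"))
```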