© Wiley Publishing. 2007. All Rights Reserved. Building Multiple- Sequence Alignments.

Slides:

Advertisements

Similar presentations

Weighing Evidence in the Absence of a Gold Standard Phil Long Genome Institute of Singapore (joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir.

Advertisements

Motivation “Nothing in biology makes sense except in the light of evolution” Christian Theodosius Dobzhansky.

© Wiley Publishing All Rights Reserved. How Most People Use Bioinformatics.

Multiple Sequence Alignment

BLAST Sequence alignment, E-value & Extreme value distribution.

COFFEE: an objective function for multiple sequence alignments

© Wiley Publishing All Rights Reserved. Analyzing Protein Sequences.

© Wiley Publishing All Rights Reserved. Phylogeny.

BNFO 602 Multiple sequence alignment Usman Roshan.

Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.

Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.

11 Ch6 multiple sequence alignment methods 1 Biologists produce high quality multiple sequence alignment by hand using knowledge of protein sequence evolution.

Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.

Identifying functional residues of proteins from sequence info Using MSA (multiple sequence alignment) - search for remote homologs using HMMs or profiles.

Sequence Alignment III CIS 667 February 10, 2004.

Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

Multiple Sequence Alignments

Recap Don’t forget to – pick a paper and – me See the schedule to see what’s taken –

CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.

Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Sequencing a genome and Basic Sequence Alignment

© Wiley Publishing All Rights Reserved. Searching Sequence Databases.

Chapter 5 Multiple Sequence Alignment.

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

© Wiley Publishing All Rights Reserved.

Multiple sequence alignment

Biology 4900 Biocomputing.

Multiple Sequence Alignment

Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.

Multiple sequence alignment (MSA) Usean sekvenssin rinnastus Petri Törönen Help contributed by: Liisa Holm & Ari Löytynoja.

Protein Sequence Alignment and Database Searching.

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

Sequence analysis: Macromolecular motif recognition Sylvia Nagl.

Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.

© Wiley Publishing All Rights Reserved. Protein 3D Structures.

BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.

Sequencing a genome and Basic Sequence Alignment

Construction of Substitution Matrices

© Wiley Publishing All Rights Reserved. RNA Analysis.

Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.

Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.

Integrating Biological Information In Multiple Sequence Alignments Confronting Bits and Pieces of Information Cédric Notredame CNRS-Marseille, France

Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009

HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Copyright OpenHelix. No use or reproduction without express written consent1.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

Manually Adjusting Multiple Alignments Chris Wilton.

T-Coffee tutorial ACGT Retreat 2012 Jean-François Taly, Ionas Erb and Cedrik Magis.

Basic Local Alignment Search Tool BLAST Why Use BLAST?

Techniques for Protein Sequence Alignment and Database Searching (part2) G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven

Doug Raiford Lesson 5.  Dynamic programming methods  Needleman-Wunsch (global alignment)  Smith-Waterman (local alignment)  BLAST Fixed: best Linear:

Sequence Alignment.

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique CN+LF An introduction to multiple alignments © Cédric Notredame.

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Multiple alignments, PATTERNS, PSI-BLAST.

Construction of Substitution matrices

MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.

1 Multiple Sequence Alignment(MSA). 2 Multiple Alignment Number of sequences >2 Global alignment Seek an alignment that maximizes score.

Clustering [Idea only, Chapter 10.1, 10.2, 10.4].

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

Multiple sequence alignment (msa)

Learning Sequence Motif Models Using Expectation Maximization (EM)

Explore Evolution: Instrument for Analysis

Homology Modeling.

Presentation transcript:

© Wiley Publishing All Rights Reserved. Building Multiple- Sequence Alignments

Learning Objectives Recognize situations where a multiple-sequence alignment may help Build the right kind of multiple-sequence alignments Become able to estimate the biological quality of your multiple alignment Know about the strengths and weaknesses of the progressive algorithm

Outline Why build a multiple sequence alignment? Choosing the right sequences Choosing the right MSA method Computing a progressive alignment Interpreting your alignment Comparing sequences that are difficult to align

What Is a Multiple- Sequence Alignment ? The alignment of more than two sequences MSAs = multiple-sequence alignments The goal of an MSA is twofold: Aligning corresponding regions of the sequences Revealing positions that are conserved The main steps to a useful MSA require Choosing the right sequences Choosing the right MSA method Interpreting the alignment

Evolution in a Nutshell Amino acids mutate randomly Mutations are then selected (accepted) or counter-selected (rejected) If a mutation is harmful, it is counter-selected It disappears from the genome You never see it Mutations of important positions (such as active sites) are almost always harmful You can recognize important positions because they never mutate! MSAs reveal these conserved positions

An Example of Conserved Positions: The Serine Proteases Active Site Active Site

Why Build an MSA ? The main reason to build an MSA is to use it for further applications Most biological modeling methods require an MSA at some point Next 2 slides list the 8 most common applications that require an MSA (there are many more)

4 Ways of Using MSAs...

4 More Ways of Using MSAs

Choosing the Right Sequences When building an alignment, it is your job to select the sequences Two main factors when selecting sequences: Number of sequences Nature of the sequences A reasonable number of sequences: 20 to 50 Ideal for most methods Small alignments are easy to display and analyze Types of sequences Well-selected sequences  informative alignment

Some Guidelines for Choosing the Right Sequences

DNA or Proteins? DNA sequences are harder to align than proteins DNA-comparison models are less sophisticated Most methods work for both DNA and proteins The results are less useful for DNA If your DNA is coding, work on the translated proteins If sequences are homologous... Along their entire length  use progressive alignment methods (next slide) In terms of local similarity  use motif-discovery methods (end of chapter)

Choosing Sequences That Are Different Enough An alignment is useful if... The sequences are correctly aligned It can be used to produce trees, profiles, and structure predictions To obtain this result, the sequences must be Not too similar Not too different Sequences that are very similar... Are easy to align correctly Are not informative  useless trees and profiles, bad predictions Sequences that are very different... Are difficult to align Are very informative  good trees and profiles, good predictions

Gathering Sequences with BLAST The most convenient way to select your sequences is to use a BLAST server Some BLAST servers are integrated with multiple-alignment methods: srs.ebi.ac.uk npsa-pbil.ibcp.fr

Gathering Sequences with BLAST Select some of the top sequences Evenly select some sequences down to the bottom The idea is to have many intermediate sequences

Aligning Your Sequences Aligning sequences correctly is very difficult Remember the Twilight Zone? It’s hard to align protein sequences with less than 25% identity (70% identity for DNA) All methods are approximate Alignment methods use the progressive algorithm Compares the sequences two by two Builds a guide tree Aligns the sequences in the order indicated by the tree

The Progressive Algorithm Sequences are grouped by similarity (guide tree) Sequences are aligned 2 by 2 The intermediate alignments are then aligned 2 by 2 You align 2 sequences by using dynamic programming Dynamic programming using a substitution matrix

The Progressive Algorithm (cont’d.) Its main strength is its speed Its main weakness is its greed Sequences aligned at the beginning are never realigned Early mistakes cannot be corrected Assemble datasets with lots of intermediate sequences Imagine each sequence is part of a stone bridge across a river: Doesn’t matter how wide the river is, if the stones are close enough together Doesn’t matter how diverse your sequences are, if each sequence has a close relative

Selecting a Method Many alternative methods exist for MSAs Most of them use the progressive algorithm They all are approximate methods None is guaranteed to deliver the best alignments All existing methods have pros and cons ClustalW is the most popular (21,000 citations) T-Coffee and ProbCons are more accurate but slower MUSCLE is very fast, ideal for very large datasets

Selecting a Method (cont’d.) It’s impossible to guess in advance which method will do best. Accuracy is merely an average estimation Methods are tested on reference datasets Their accuracy is the average accuracy obtained on the reference The most accurate method can always be outperformed by a less accurate method on a given dataset. An alternative: Use consensus methods such as MCOFFEE

Running Many Methods at Once MCOFFEE is a a meta-method It runs all the individual MSA methods It gathers all the produced MSAs It combines the MSAs into a single MSA MCOFFEE is more accurate than any individual method Its color output lets you estimate the reliability of your MSA MCOFFEE is available on

MCOFFEE Color Output Red and orange residues are probably well aligned Yellow should be treated with caution Green and blue are probably incorrectly aligned

Aligning Your Sequences Correctly It can be difficult to align sequences correctly They evolve too fast For proteins, the best alternative is to use 3D structure 3D Structures change slower than sequences Unfortunately few sequences have a known structure Expresso lets you find the structures that correspond to your sequences and use them to build an MSA

Using the EXPRESSO Web server The EXPRESSO Web server aligns sequences using 3D information The EXPRESSO tool... Looks in PDB for the structure of your sequences Aligns your sequences using structural information Returns a multiple-sequence alignment based on structure If your sequences have a known structure, EXPRESSO is the most accurate MSA method Expresso is available on

When Building MSAs, Remember... Building MSAs is not an exact science Your sequences make up the most important ingredient If your alignment looks bad, change the sequences Assembling a good MSA is an iterative (try-again) process Select a few sequences Align them Evaluate visual the MSA Add or remove sequences Re-align them

Interpreting Your MSA Don’t put blind trust in the output of the servers Specialists always edit their MSAs by hand You must always estimate the biological accuracy of your MSA Use the color code of Tcoffee Use the conservation patterns of ClustalW: ‘*’ Completely conserved position ‘:’ Highly conserved position ‘.’ Conserved position Use experimental knowledge of your proteins

Understanding Conserved Positions

When Sequences Are Hard to Align Most MSA programs assume your sequences are related along their whole length When this assumption is not true, the progressive approach will not work The only alternative is to compare multiple sequences locally

Local Multiple-Comparison Methods Gibbs Sampler Will make a local multiple alignment Will ignore unrelated segments of your sequences Ideal for finding DNA patterns such as promoters Motif discovery methods Will look for motifs conserved in your sequences The sequences do not need to be aligned The most popular motif-discovery methods: TEIRESIAS, MEME, SMILE, PRATT

Going Farther Assembling MSAs is a bit of an art Experience is a key factor Most methods are now available online Make sure you know which method to use: ClustalW-like method to align homologous sequences Motif method to look for conserved regions