Multiple sequence alignment (MSA) (Finnish title: Usean sekvenssin rinnastus). Petri Törönen. Help contributed by: Liisa Holm & Ari Löytynoja.

Presentation transcript:


What is MSA?
MSA is an alignment generated from three or more sequences.
MSA is usually a more global alignment, i.e., the aim is to align homologous residues (nucleotides or amino acids) in columns across the length of the whole sequences.
GA--GTACA
CAC-GTATA
CACGGTAT-
G-CGGTCTA

What is MSA? [Figure: a protein multiple sequence alignment.]

Why MSA
“MSA emphasises signal observed in the pairwise alignment” (Liisa Holm)
Improved alignments!!
Alignment of more distant sequences with help from intermediate sequences
Highlights the conserved regions in sequences

Why MSA
MSA is input to many analysis tasks:
– Detection of active sites
– Generation of sequence profiles
– Detection of protein domains and motifs
– Phylogenetics
– …

Remember
First step of MSA: good selection of sequences for the analysis
Sequences need to be functionally/evolutionarily related
Sometimes it is good to have some variation in the sequences (depends on the analysis task)
Otherwise: rubbish in → rubbish out

MSA methods
Finding the optimal multiple sequence alignment is a computationally hard task
The “correct” answer would always come from extending the dynamic programming algorithm to multiple sequences
In practice, dynamic programming cannot be applied to MSA problems: the computational complexity grows exponentially with the number of sequences
We need approximate solutions (heuristics)
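To make the complexity argument concrete, here is a small sketch (not from the original slides) that counts the work an exact dynamic programming alignment would need. It assumes N sequences of equal length L, so the lattice has (L+1)^N cells and each cell looks at 2^N - 1 predecessor cells. The function names are illustrative only.

```python
def dp_cells(num_sequences: int, length: int) -> int:
    """Cells in the dynamic-programming lattice for N sequences of length L."""
    return (length + 1) ** num_sequences


def dp_predecessors(num_sequences: int) -> int:
    """Predecessor cells inspected per cell (every non-empty subset of
    sequences may advance by one residue)."""
    return 2 ** num_sequences - 1


# Growth of the search space for sequences of length 100.
for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100), dp_predecessors(n))
```

Already at ten short sequences the lattice is far too large to fill, which is why the heuristics on the following slides are needed.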

MSA methods: heuristics
Progressive alignment (not much used)
Iterative alignment (most popular)
Hidden Markov models
Pattern-based methods

Progressive alignment
Divide the unsolvable task into subtasks that can be solved
Align the most similar pairs of sequence sets first
– Sequence sets can have one or many sequences
– At first, each set includes only a single sequence
Move progressively to bigger sets and to more difficult pairs of sets
Always align only two sets (one pair) at a time

Progressive alignment
Produce pairwise alignments between all the sequences you want to align with MSA.
– Dynamic programming, k-tuple (ktup) methods, ...
Produce a “guide tree” on the basis of the pairwise distances calculated from the pairwise alignments.
– UPGMA, neighbor joining
Produce an MSA using the “guide tree”.
– Sequences are aligned in the order that the guide tree instructs.
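As an illustration of the dynamic programming step mentioned for the pairwise alignments, the sketch below computes a global (Needleman-Wunsch style) alignment score for two sequences. The scoring values (match = 1, mismatch = -1, linear gap penalty = -1) and the function name are assumptions chosen for the example, not parameters from the slides.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGTACGTCC", "ACCTACGTCC"))
```

Only the score is returned here; a full implementation would also keep a traceback matrix to recover the alignment itself.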

Set of sequences
All-against-all pairwise alignment (here demonstrated for the 1st sequence)
Get pairwise similarities from the alignments
Create a cluster tree from the similarities
Join sequences in the order obtained from the cluster tree

Guide tree construction: UPGMA Unweighted Pair Group Method with Arithmetic mean One of the fastest tree construction methods

An example: Pairwise alignments

Pairwise distances, based on pairwise alignments
Number of nucleotide differences
Absolute distances, used in Pileup/Clustal
JC-distance
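A minimal sketch of how the number of nucleotide differences could be counted for already aligned sequences. The sequences are the ones listed on the “Constructing MSA” slide; that they are also the ones behind the distance tables of this example is an assumption. The function names are illustrative.

```python
from itertools import combinations

# Aligned sequences copied from the "Constructing MSA" slide.
SEQS = {
    "human": "ACGTACGTCC",
    "chimp": "ACCTACGTCC",
    "gorilla": "ACCACCGTCC",
    "orangutan": "ACCCCCCTCC",
}


def count_differences(a: str, b: str) -> int:
    """Number of aligned positions where the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def p_distance(a: str, b: str) -> float:
    """Observed proportion of differing positions (the D used for JC correction)."""
    return count_differences(a, b) / len(a)


for (name1, seq1), (name2, seq2) in combinations(SEQS.items(), 2):
    print(name1, name2, count_differences(seq1, seq2), p_distance(seq1, seq2))
```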

UPGMA based on JC-distances*
[Figure annotation: 0,107 / 2]
* JC-distances = Jukes-Cantor distances. The observed distances, D, are corrected for multiple substitutions via the correction function d = -(3/4) * ln(1 - (4/3) * D)
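The Jukes-Cantor correction can be written as a short function. This is a sketch of the formula quoted on the slide; the function name and the input check are additions for illustration.

```python
import math


def jukes_cantor(observed: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion of differences D."""
    if not 0 <= observed < 0.75:
        # The logarithm is undefined once D reaches 3/4 (saturation).
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * observed / 3)


# One difference in ten aligned positions, as for human vs. chimp in the example.
print(round(jukes_cantor(0.1), 3))
```

For D = 0,1 the corrected distance is about 0,107, which matches the value shown in the UPGMA figure annotation.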

UPGMA, distance updates
d((human, chimp), gorilla) = [d(human, gorilla) + d(chimp, gorilla)] / 2 = [0,383 + 0,232] / 2 = 0,3075

UPGMA

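Node U joins the (human & chimp) and (gorilla & orangutan) clusters:
d((human & chimp), U) = 0,3923 / 2 = 0,1962
d((gorilla & orangutan), U) = 0,3923 / 2 = 0,1962
Branch from U to the human-chimp node: 0,1962 - 0,0537 = 0,1426
Branch from U to the gorilla-orangutan node: 0,1962 - 0,116 = 0,080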

UPGMA [Figure: the resulting tree with its branch lengths, e.g. 0,0537 and 0,116.]
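The UPGMA steps of this example can be sketched in a few lines. The code below is an illustrative implementation, not the one used by any particular program: it recomputes Jukes-Cantor distances from the four ape sequences on the “Constructing MSA” slide (assumed to be the data behind the example) and repeatedly joins the two clusters with the smallest average leaf-to-leaf distance.

```python
import math
from itertools import combinations

# Aligned sequences copied from the "Constructing MSA" slide.
SEQS = {
    "human": "ACGTACGTCC",
    "chimp": "ACCTACGTCC",
    "gorilla": "ACCACCGTCC",
    "orangutan": "ACCCCCCTCC",
}


def jc_distance(a, b):
    """Jukes-Cantor distance between two aligned sequences."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1 - 4 * p / 3)


def upgma(dist):
    """dist maps frozenset({leaf1, leaf2}) -> distance.
    Returns the merges as (cluster1, cluster2, node_height) in joining order."""
    leaves = set().union(*dist)
    clusters = [(name,) for name in sorted(leaves)]

    def average(c1, c2):
        # Arithmetic mean over all leaf pairs between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2) / 2))  # node height = distance / 2
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


dist = {frozenset((a, b)): jc_distance(SEQS[a], SEQS[b])
        for a, b in combinations(SEQS, 2)}
merges = upgma(dist)
for left, right, height in merges:
    print(left, right, round(height, 4))
```

The node heights come out close to the values on the slides (about 0,054, 0,116 and 0,196); small differences in the last digits are due to rounding of intermediate values.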

Constructing MSA
human      ACGTACGTCC
chimp      ACCTACGTCC
gorilla    ACCACCGTCC
orangutan  ACCCCCCTCC
macaque    CCCCCCCCCC
[Figure: the human, chimp, gorilla and orangutan sequences are shown again in two further panels.]

Alignment score
Columns: 1234
         ACGT
         ACGA
         AGGA
match = 1, mismatch = 0
1: A-A + A-A + A-A = 1+1+1 = 3
2: C-C + C-G + C-G = 1+0+0 = 1
3: G-G + G-G + G-G = 1+1+1 = 3
4: T-A + T-A + A-A = 0+0+1 = 1
S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8
The higher the score, the better the alignment
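The column-by-column scoring of the Alignment score slide (often called a sum-of-pairs score) can be sketched as follows. The function names are illustrative, and gap characters are not treated specially here, which is a simplification.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=0):
    """Sum the match/mismatch scores over all pairs of residues in one column."""
    return sum(match if a == b else mismatch
               for a, b in combinations(column, 2))


def alignment_score(alignment, match=1, mismatch=0):
    """Sum of the column scores over all columns of the alignment."""
    return sum(column_score(col, match, mismatch) for col in zip(*alignment))


alignment = ["ACGT", "ACGA", "AGGA"]
print([column_score(col) for col in zip(*alignment)])  # [3, 1, 3, 1]
print(alignment_score(alignment))                      # 8
```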

Progressive alignment - pros and cons
Pros:
– Fast
Cons:
– Once gaps are opened they can never be closed
– Errors in the alignment of the first few sequences can have catastrophic effects on the whole alignment
– Not much used (to my knowledge)

Iterative alignment
Create a progressive alignment
After obtaining the alignment, calculate a quality score
Repeat the following steps:
– Redo the cluster tree
– Realign the sequences using the new cluster tree
– Calculate a quality score
The loop can be stopped when a maximum number of iterations is reached or when the quality score no longer improves
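The control flow of the iteration loop can be sketched generically. In the code below, `realign` and `score` are hypothetical callables supplied by the caller: `realign` stands for "redo the cluster tree and realign the sequences", and `score` for the quality score. Neither corresponds to a function of a real MSA program.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")  # whatever type is used to represent an alignment


def iterate_alignment(initial: A,
                      realign: Callable[[A], A],
                      score: Callable[[A], float],
                      max_iterations: int = 10) -> Tuple[A, float]:
    """Refine an alignment until the score stops improving or the
    iteration limit is reached. Returns the best alignment and its score."""
    best, best_score = initial, score(initial)
    for _ in range(max_iterations):
        candidate = realign(best)           # new tree + realignment (hypothetical)
        candidate_score = score(candidate)  # quality score of the new alignment
        if candidate_score <= best_score:
            break                           # no improvement: stop iterating
        best, best_score = candidate, candidate_score
    return best, best_score
```

In use, `initial` would be the progressive alignment, and `score` could be something like the sum-of-pairs score from the Alignment score slide.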

Iterative alignment
Allows correction of errors, which was not possible in progressive alignment
Very popular among the MSA methods
Increases the running time of the method

Iterative alignment [Figure: diagram of a typical iterative MSA program workflow, with the iteration loop marked. Figure from Do & Katoh.]

What MSA program(s) to use?
Depends on the application
– Phylogenetic studies
– Structure-based studies
Depends on the size of the data
– Some programs cannot handle large datasets
Remember to evaluate the alignment by eye

What MSA program(s) to use? Collection of MSA programs at EBI

Summary of MSA
MSA is relevant for many analysis tasks
– Improved signal from the alignment
Solving MSA requires heuristics
Selection of an MSA method depends on the application
Results should be evaluated by eye
– And the errors should be corrected with MSA editors

Manual editing of MSAs?
Let’s say that you performed an MSA with a computer. Biologically, however, it has some faults and needs manual editing.
Editors: Jalview and Seaview
Input data can be in any of the most common MSA formats (Mase, Phylip, Clustal, MSF, Fasta, NEXUS, PIR and BCL)