Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)



Why do we care about sequence alignment?
It can tell us something about the evolution of organisms. We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function. Homologous genes (genes that share an evolutionary origin) have similar sequences. Orthologs are genes that are evolutionarily related and have a similar function, but now appear in different species. Paralogs are evolutionarily related (share an origin) but no longer have the same function. You can uncover either orthologs or paralogs through sequence alignment.

Multiple Sequence Alignment
Often applied to proteins. Proteins that are similar in sequence are often similar in structure and function. Sequence changes more rapidly in evolution than do structure and function.

Overview of Methods
- Dynamic programming: too computationally expensive to do a complete search; uses heuristics
- Progressive: starts with pair-wise alignment of the most similar sequences and adds to that
- Iterative: makes an initial alignment of groups of sequences and adds to these (e.g. genetic algorithms)
- Locally conserved patterns
- Statistical and probabilistic methods

Dynamic Programming
Computational complexity: even worse than for pair-wise alignment, because we’re finding all the paths through an n-dimensional hyperspace. (We can picture this in 2 or 3 dimensions.) Can align about 7 relatively short ( ) protein sequences in a reasonable amount of time; not much beyond that.
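To make the growth concrete, here is a rough sketch that counts the cells in a full dynamic programming lattice with one dimension per sequence, and the number of neighbouring cells each cell can be reached from. The function names and the sequence length of 200 are assumptions chosen for illustration; they do not come from the slides.

```python
def dp_lattice_size(lengths):
    """Count cells in the full DP lattice for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # +1 for the leading all-gap boundary in each dimension
    return cells


def predecessors_per_cell(num_sequences):
    """Each interior cell has 2^n - 1 predecessors: in every step, any
    non-empty subset of the n sequences advances by one residue."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for k in (2, 3, 5, 7):
        lengths = [200] * k  # 200 is an assumed, illustrative length
        print(k, dp_lattice_size(lengths), predecessors_per_cell(k))
```

With these assumed lengths, seven sequences already give a lattice on the order of 10^16 cells, which is why a complete search is only practical after the search space has been cut down.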

A Heuristic for Reducing the Search Space in Dynamic Programming
Let’s picture this in 3 dimensions (pp in book). It generalizes to n. Consider the pair-wise alignments of each pair of sequences. Create a phylogenetic tree from these scores. Consider a multiple sequence alignment built from the phylogenetic tree. These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

Phylogenetic Tree
Dynamic programming uses a phylogenetic tree to build a “first-cut” MSA. The tree shows how proteins could have evolved from shared origins over evolutionary time. See page 143 in Bioinformatics by Mount. Chapter 6 goes into detail on this.

Dynamic Programming: MSA
- Create a phylogenetic tree based on pair-wise alignments. (Pairs of sequences that have the best scores are paired first in the tree.)
- Do a “first-cut” MSA by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the tree. The most alike sequences are aligned first.
- Use the pair-wise alignments and the “first-cut” MSA to circumscribe a space within which to do a full MSA that searches through this solution space.
- The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight ε indicating how far the pair-wise score differs from the first-cut MSA alignment score.
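The weighted sum-of-pairs score described above can be sketched as follows. This is a minimal illustration, not the scoring used by any particular program: the match, mismatch, and gap values are assumed, the function names are hypothetical, and the per-pair weights are simply passed in rather than derived from a first-cut MSA.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two rows of an MSA column by column (assumed toy scoring values)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both rows says nothing about this pair
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def weighted_sum_of_pairs(msa, weights=None):
    """Sum the pair-wise scores over all pairs of rows, each multiplied by its weight.

    weights maps (i, j) with i < j to a number; None means every weight is 1.
    """
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * pair_score(msa[i], msa[j])
    return total


if __name__ == "__main__":
    msa = ["ACGT", "A-GT", "ACGA"]
    print(weighted_sum_of_pairs(msa))
    print(weighted_sum_of_pairs(msa, {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 2.0}))
```

A real protein alignment would replace the match and mismatch values with a substitution matrix such as PAM or BLOSUM.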

Heuristic Dynamic Programming Method for MSA
Does not guarantee an optimal alignment of all the sequences in the group. Does get an optimal alignment within the space chosen.

Progressive Methods
Similar to the dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree). Differs from the dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we’ve cut down the search space, it’s still big when we have many sequences to align.)

Progressive Method
Generally proceeds as follows:
- Choose a starting pair of sequences and align them
- Align each next sequence to those already aligned, one at a time
Heuristic method: doesn’t guarantee an optimal alignment.
Details vary in implementation:
- How to choose the first sequence to align?
- Align all subsequent sequences cumulatively or in subfamilies?
- How to score?

ClustalW
- Based on phylogenetic analysis
- A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm
- The most closely related pairs of sequences are aligned using dynamic programming
- Each of the alignments is analyzed and a profile of it is created
- Alignment profiles are aligned progressively for a total alignment
- The W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root of the phylogenetic tree (see p. 154 of Bioinformatics by Mount)
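The first stage of this kind of procedure can be sketched in a few lines: align every pair with dynamic programming, turn each alignment into a distance, and pick the closest pair as the starting point. This is only an illustration under stated assumptions. It is not ClustalW's code, it does not build the tree or the profiles, the scoring values are toy values, and the function names are hypothetical.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty (assumed toy scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def distance(a, b):
    """1 minus the fraction of identical columns in the pairwise alignment."""
    x, y = needleman_wunsch(a, b)
    identical = sum(1 for p, q in zip(x, y) if p == q)
    return 1.0 - identical / len(x)


def closest_pair(seqs):
    """Indices (i, j) of the two sequences with the smallest distance."""
    return min(combinations(range(len(seqs)), 2),
               key=lambda ij: distance(seqs[ij[0]], seqs[ij[1]]))


if __name__ == "__main__":
    seqs = ["GATTACA", "GATTTACA", "CCGTACC"]
    print(closest_pair(seqs))
    print(needleman_wunsch(seqs[0], seqs[1]))
```

A full progressive aligner would go on to build a guide tree from all the pairwise distances and then align profiles rather than single sequences.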

Problems with Progressive Method
Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off. It’s not trivial to come up with a suitable scoring matrix or gap penalties.

Iterative Methods for Multiple Sequence Alignment
Get an alignment. Refine it. Repeat until one MSA doesn’t change significantly from the next. An example is the genetic algorithm approach.

Genetic Algorithms
A general problem-solving method modeled on evolutionary change. Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations. Use survival of the fittest, mutation, and crossover to guide evolution.

Evolutionary Change in Genetic Algorithms
- Survival of the fittest: the best solutions survive and reproduce to the next generation
- Mutation: some solutions mutate in random ways (but they must always remain viable solutions)
- Crossover: solutions “exchange parts”

Laying Out the Problem
What would a candidate solution look like in a multiple sequence alignment program? (An MSA of ~20 proteins.) How many candidate solutions should there be? (~100)

Evolving to a Next Generation
Which candidate solutions should survive to the next generation?
- First, take the top half based on best sum-of-pairs scores
- Then randomly select the second half, giving each MSA a chance of being selected in proportion to how good its score is
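The two-part selection rule can be sketched as below. Several details are assumptions, since the slide does not specify them: the second half is drawn from the whole population with replacement, scores are shifted so that negative sum-of-pairs scores still give positive selection weights, and the function name is hypothetical.

```python
import random


def select_survivors(population, scores, rng=None):
    """Pick the next generation: top half by score, then a score-weighted random half."""
    rng = rng or random.Random()
    n = len(population)
    half = n // 2

    # Part 1: keep the best-scoring half outright.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    survivors = [population[i] for i in ranked[:half]]

    # Part 2: fill the rest by chance, favouring better scores.
    # Assumption: shift scores so every weight is positive, because
    # sum-of-pairs scores can be negative.
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]
    survivors += rng.choices(population, weights=weights, k=n - half)
    return survivors


if __name__ == "__main__":
    population = ["msa_a", "msa_b", "msa_c", "msa_d"]  # stand-ins for real alignments
    scores = [40, -5, 31, 12]
    print(select_survivors(population, scores, random.Random(0)))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing parameter choices.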

How would mutation work?
You can’t change a sequence in the MSA; otherwise you would be creating a solution that isn’t really a solution. You can only insert or rearrange gaps.
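One way to respect that constraint is a mutation that only moves an existing gap within a row. The sketch below is an assumed, simplified operator, not the one described in Mount; `mutate_gap` and `ungapped` are hypothetical names. Because it only relocates a gap character, every row keeps its residues in order and its length.

```python
import random


def ungapped(row):
    """The underlying sequence of an alignment row, with gaps removed."""
    return row.replace("-", "")


def mutate_gap(msa, rng=None):
    """Return a copy of the MSA in which one gap in one row has been moved."""
    rng = rng or random.Random()
    rows_with_gaps = [i for i, row in enumerate(msa) if "-" in row]
    if not rows_with_gaps:
        return list(msa)  # nothing to rearrange
    r = rng.choice(rows_with_gaps)
    row = list(msa[r])
    gap_positions = [k for k, ch in enumerate(row) if ch == "-"]
    row.pop(rng.choice(gap_positions))           # remove one gap ...
    row.insert(rng.randrange(len(row) + 1), "-")  # ... and reinsert it elsewhere
    mutated = list(msa)
    mutated[r] = "".join(row)
    return mutated


if __name__ == "__main__":
    parent = ["AC-GT", "A-CGT", "ACGGT"]
    child = mutate_gap(parent, random.Random(3))
    print(child)
    # The sequences themselves are untouched; only gap placement differs.
    assert [ungapped(r) for r in child] == [ungapped(r) for r in parent]
```

This operator can leave a column that is all gaps; a fuller implementation would probably strip such columns after mutating.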

How would crossover work?
See page 160 in Bioinformatics by Mount.

Profiles and Motifs
A sequence motif is a relatively short pattern that appears consistently within a family of proteins. (Motifs can also appear in families of DNA or RNA molecules.) Frequently, motif-based analysis is used to detect patterns of amino acids in proteins that correspond to structural or functional features. Motifs are generated during multiple sequence alignment. They can be displayed as patterns of amino acids, as sequence logos, or as profile scoring matrices.
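As a small illustration of how a profile relates to an alignment, the sketch below computes per-column residue frequencies and a simple consensus string. It is a simplification under stated assumptions: real profile scoring matrices usually use log-odds scores and pseudocounts rather than raw frequencies, the example rows are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter


def frequency_profile(msa):
    """For each column, map each residue to its frequency among non-gap characters."""
    profile = []
    for column in zip(*msa):
        counts = Counter(ch for ch in column if ch != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def consensus(msa):
    """Most frequent residue per column; ties go to the alphabetically first residue."""
    out = []
    for freqs in frequency_profile(msa):
        out.append(max(sorted(freqs), key=freqs.get) if freqs else "-")
    return "".join(out)


if __name__ == "__main__":
    msa = ["VTISCTG", "VTISCSG", "LRLSCSG"]  # illustrative rows, not real data
    print(consensus(msa))
    for position, freqs in enumerate(frequency_profile(msa), start=1):
        print(position, freqs)
```

Columns where one residue has a frequency near 1 are the conserved positions that a motif pattern or sequence logo would highlight.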