Multiple Sequence Alignment BMI/CS 576 Colin Dewey Fall 2010.



Multiple Sequence Alignment: Task Definition
Given:
– a set of more than 2 sequences
– a method for scoring an alignment
Do:
– determine the correspondences between the sequences such that the alignment score is maximized

Motivation for MSA
establish input data for phylogenetic analyses
determine evolutionary history of a set of sequences
– At what point in history did certain mutations occur?
discover a common motif in a set of sequences (e.g. DNA sequences that bind the same protein)
characterize a set of sequences (e.g. a protein family)
build profiles for sequence-database searching
– PSI-BLAST generalizes a query sequence into a profile to search for remote relatives

Multiple Alignment of SH3 Domain
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

Scoring a Multiple Alignment
key issue: how do we assess the quality of a multiple sequence alignment?
usually, the assumption is made that the individual columns of an alignment are independent:
$$\mathrm{Score}(m) = G + \sum_i S(m_i)$$
where $G$ is the gap function and $S(m_i)$ is the score of the $i$th column
we’ll discuss two methods
– sum of pairs (SP)
– minimum entropy

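Scoring an Alignment: Sum of Pairs
compute the sum of the pairwise scores within each column:
$$S(m_i) = \sum_{k < l} s(m_i^k, m_i^l)$$
where $m_i^k$ is the character of the $k$th sequence in the $i$th column and $s$ is the substitution matrix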
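
Sum-of-pairs scoring can be sketched in a few lines. This is only an illustration under stated assumptions: `simple_s` is a toy +1/-1 stand-in for a real substitution matrix, gaps are charged a fixed per-pair `gap_score` instead of going through a separate gap function, and gap-gap pairs contribute nothing. All function names here are hypothetical.

```python
from itertools import combinations

def simple_s(a, b):
    """Toy stand-in for a substitution matrix (assumption): +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0

def sp_column_score(column, s=simple_s, gap_score=-2.0):
    """Sum the pairwise scores over every pair of characters in one column."""
    total = 0.0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                 # gap-gap pairs score 0 (assumption)
        if a == "-" or b == "-":
            total += gap_score       # character-vs-gap pair (assumption)
        else:
            total += s(a, b)
    return total

def sp_alignment_score(rows, s=simple_s, gap_score=-2.0):
    """Columns are treated as independent, so the alignment score is a sum over columns."""
    return sum(sp_column_score(col, s, gap_score) for col in zip(*rows))
```

For example, `sp_alignment_score(["AC-", "ACG", "A-G"])` scores the three columns separately and adds them up.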

Scoring an Alignment: Minimum Entropy
basic idea: try to minimize the entropy of each column
another way of thinking about it: columns that can be communicated using few bits are good
information theory tells us that an optimal code uses $-\log_2 p$ bits to encode a message of probability $p$

Scoring an Alignment: Minimum Entropy
the messages in this case are the characters in a given column
the entropy of a column is given by:
$$S(m_i) = -\sum_a c_{ia} \log_2 p_{ia}$$
where $m_i$ is the $i$th column of an alignment $m$, $c_{ia}$ is the count of character $a$ in column $i$, and $p_{ia}$ is the probability of character $a$ in column $i$
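
A minimal sketch of the column entropy score, assuming the probability of each character is estimated as its relative frequency in the column (no pseudocounts) and that a gap is counted like any other character. Both choices are assumptions for illustration, and `column_entropy_score` is a hypothetical name.

```python
import math
from collections import Counter

def column_entropy_score(column):
    """Return -sum_a c_a * log2(p_a) for one column; lower means more conserved."""
    counts = Counter(column)   # c_a: count of character a in this column
    n = len(column)
    # p_a is estimated as c_a / n (maximum-likelihood estimate, an assumption)
    return -sum(c * math.log2(c / n) for c in counts.values())
```

A fully conserved column such as `"AAAA"` scores 0, while a column split evenly between two characters such as `"AACC"` costs one bit per character.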

Dynamic Programming Approach
can find optimal alignments using dynamic programming
generalization of methods for pairwise alignment
– consider a k-dimensional matrix for k sequences (instead of a 2-dimensional matrix)
– each matrix element represents the alignment score for k subsequences (instead of 2 subsequences)
given k sequences of length n
– space complexity is $O(n^k)$

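Dynamic Programming Approach
$\alpha_{i_1, \ldots, i_k}$ = max score of alignment for subsequences $x^1_{1 \ldots i_1}, \ldots, x^k_{1 \ldots i_k}$
$$\alpha_{i_1, \ldots, i_k} = \max \begin{cases} \alpha_{i_1 - 1, i_2 - 1, \ldots, i_k - 1} + S(x^1_{i_1}, x^2_{i_2}, \ldots, x^k_{i_k}) \\ \alpha_{i_1, i_2 - 1, \ldots, i_k - 1} + S(-, x^2_{i_2}, \ldots, x^k_{i_k}) \\ \alpha_{i_1 - 1, i_2, \ldots, i_k - 1} + S(x^1_{i_1}, -, \ldots, x^k_{i_k}) \\ \quad \vdots \end{cases}$$
with one case for each of the $2^k - 1$ ways to place characters or gaps in the last column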

Dynamic Programming Approach
given k sequences of length n
– time complexity is $O(k^2 2^k n^k)$ if we use sum of pairs
– $O(k 2^k n^k)$ if column scores can be computed in $O(k)$, as with entropy
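
The k-dimensional dynamic program can be sketched with memoized recursion over index tuples. This is an illustration only, practical for a handful of very short sequences: it assumes sum-of-pairs column scores with toy match/mismatch/gap values, returns only the optimal score (no traceback), and uses the hypothetical name `msa_dp_score`.

```python
from functools import lru_cache
from itertools import combinations, product

def msa_dp_score(seqs, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score over all multiple alignments of seqs."""
    k = len(seqs)

    def column_score(chars):
        total = 0
        for a, b in combinations(chars, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs score 0 (assumption)
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
        return total

    # Each move says which sequences contribute a character to the last column;
    # there are 2^k - 1 such moves (all-gap columns are excluded).
    moves = [d for d in product((0, 1), repeat=k) if any(d)]

    @lru_cache(maxsize=None)
    def alpha(idx):
        if not any(idx):
            return 0  # all prefixes are empty
        best = None
        for d in moves:
            if any(i - step < 0 for i, step in zip(idx, d)):
                continue  # this move would run off the start of a sequence
            prev = tuple(i - step for i, step in zip(idx, d))
            chars = [seqs[j][idx[j] - 1] if d[j] else "-" for j in range(k)]
            cand = alpha(prev) + column_score(chars)
            if best is None or cand > best:
                best = cand
        return best

    return alpha(tuple(len(s) for s in seqs))
```

The memo table has one entry per index tuple, which is the source of the exponential space cost, and each entry tries every move, which contributes the 2^k factor in the running time.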

Heuristic Alignment Methods
since the time complexity of the DP approach is exponential in the number of sequences, heuristic methods are usually used
progressive alignment: construct a succession of pairwise alignments
– star approach
– tree approaches, like CLUSTALW
– etc.
iterative refinement
– given a multiple alignment (say from a progressive method)
– remove a sequence, realign it to the profile of the other sequences
– repeat until convergence

Star Alignment Approach
given: k sequences to be aligned, $x_1, \ldots, x_k$
– pick one sequence $x_c$ as the “center”
– for each $x_i \neq x_c$, determine an optimal alignment between $x_i$ and $x_c$
– merge pairwise alignments
return: multiple alignment resulting from aggregate

Star Alignments: Approaches to Picking the Center
Two possible approaches:
1. try each sequence as the center, return the best multiple alignment
2. compute all pairwise alignments and select the string $x_c$ that maximizes:
$$\sum_{i \neq c} \mathrm{sim}(x_i, x_c)$$
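
The second approach, picking the sequence with the largest total similarity to all the others, might look like the sketch below. It assumes similarity is the global (Needleman-Wunsch) alignment score under toy match/mismatch/linear-gap values; `nw_score` and `pick_center` are hypothetical names.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score, keeping only two rows of the DP matrix."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pick_center(seqs):
    """Index of the sequence maximizing the summed similarity to all others."""
    totals = [
        sum(nw_score(x, y) for j, y in enumerate(seqs) if j != i)
        for i, x in enumerate(seqs)
    ]
    # Ties go to the earliest sequence (an arbitrary choice)
    return max(range(len(seqs)), key=totals.__getitem__)
```

With `["AAAA", "AAAT", "TTTT"]`, the middle sequence is chosen because it is the closest to both of the others.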

Star Alignments: Aggregating Pairwise Alignments
“once a gap, always a gap”
shift entire columns when incorporating gaps

Star Alignment Example
Given:
ATTGCCATT
ATGGCCATT
ATCCAATTTT
ATCTTCTT
ATTGCCGATT
Pairwise alignments with ATTGCCATT:
ATTGCCATT    ATTGCCATT--    ATTGCCATT    ATTGCC-ATT
ATGGCCATT    ATC-CAATTTT    ATCTTC-TT    ATTGCCGATT

Star Alignment Example
merging pairwise alignments
1. present pair:    alignment:
   ATTGCCATT        ATTGCCATT
   ATGGCCATT        ATGGCCATT
2. present pair:    alignment:
   ATTGCCATT--      ATTGCCATT--
   ATC-CAATTTT      ATGGCCATT--
                    ATC-CAATTTT

Star Alignment Example
3. present pair:    alignment:
   ATTGCCATT        ATTGCCATT--
   ATCTTC-TT        ATGGCCATT--
                    ATC-CAATTTT
                    ATCTTC-TT--
4. present pair:    alignment:
   ATTGCC-ATT       ATTGCC-ATT--
   ATTGCCGATT       ATGGCC-ATT--
                    ATC-CA-ATTTT
                    ATCTTC--TT--
                    ATTGCCGATT--
shift entire columns when incorporating a gap
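
The merging step in the example above can be sketched as follows. The input is a list of already-computed pairwise alignments, each given as (aligned center, aligned other sequence), and all of them must share the same ungapped center. The function name `merge_star` is hypothetical. Reusing an existing gap column when both the growing alignment and the new pair have a gap in the center at the same point is one possible convention, assumed here.

```python
def merge_star(pairwise):
    """Merge (aligned_center, aligned_other) pairs; row 0 of the result is the center."""
    msa = [pairwise[0][0], pairwise[0][1]]
    for cen, oth in pairwise[1:]:
        merged = [[] for _ in msa]
        new_row = []
        cur = msa[0]          # center row of the alignment built so far
        i = j = 0             # i: column in msa, j: column in the new pair
        while i < len(cur) or j < len(cen):
            msa_gap = i < len(cur) and cur[i] == "-"
            pair_gap = j < len(cen) and cen[j] == "-"
            if msa_gap and not pair_gap:
                # Once a gap, always a gap: keep the column, new sequence gets a gap
                for r, row in enumerate(msa):
                    merged[r].append(row[i])
                new_row.append("-")
                i += 1
            elif pair_gap and not msa_gap:
                # New gap in the center: shift entire columns by inserting a gap column
                for r in range(len(msa)):
                    merged[r].append("-")
                new_row.append(oth[j])
                j += 1
            else:
                # Same center character (or a shared gap) in both: copy the column
                for r, row in enumerate(msa):
                    merged[r].append(row[i])
                new_row.append(oth[j])
                i += 1
                j += 1
        msa = ["".join(r) for r in merged] + ["".join(new_row)]
    return msa
```

Feeding in the four pairwise alignments from the example, in order, gives the five-row alignment shown in step 4.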

Tree Alignments
basic idea: organize multiple sequence alignment using a guide tree
– leaves represent sequences
– internal nodes represent alignments
determine alignments from bottom of tree upward
– return multiple alignment represented at the root of the tree
one common variant: the CLUSTALW algorithm [Thompson et al. 1994]

Doing the Progressive Alignment in CLUSTALW
depending on the internal node in the tree, we may have to align
– a sequence with a sequence
– a sequence with a profile (partial alignment)
– a profile with a profile
in all cases we can use dynamic programming
– for the profile cases, use SP scoring

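Tree Alignment Example
leaves: TGTAAC, TGTAC, ATGTC, ATGTGGC, TGTTAAC
internal node (TGTAAC, TGTAC):
TGTAAC
TGT-AC
internal node (ATGTC, ATGTGGC):
ATGT--C
ATGTGGC
internal node joining the two alignments above:
-TGTAAC
-TGT-AC
ATGT--C
ATGTGGC
root (adding TGTTAAC):
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC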

Aligning Profiles
aligning sequences/profiles to profiles is essentially pairwise alignment
– shift entire columns when incorporating gaps
profile:     sequence:    result:
-TGT AAC     TGTTAAC      -TGTTAAC
-TGT -AC                  -TGT-AAC
ATGT --C                  -TGT--AC
ATGT GGC                  ATGT---C
                          ATGT-GGC

Profile HMMs
profile HMMs are commonly used to model families of sequences
[Figure: profile HMM with start and end states, match states m1 to m3, insert states i0 to i3, and delete states d1 to d3]
Match states represent key conserved positions
Insert states account for extra characters in some sequences
Delete states are silent; they account for characters missing in some sequences
Insert and match states have emission distributions over sequence characters
example emission distribution (excerpt): A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01

Multiple Alignment with Profile HMMs
[Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences; labels: match states, insert states, delete states (silent)]

Multiple Alignment with Profile HMMs
given a set of sequences to be aligned
– use Baum-Welch to learn parameters of model
– may also adjust length of profile HMM during training
to compute a multiple alignment given the profile HMM
– run the Viterbi algorithm on each sequence
– Viterbi paths indicate correspondences among sequences
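
How Viterbi paths turn into a multiple alignment can be sketched as below. The code does not run Viterbi or train a model: it assumes the state path for each sequence is already available as a list of (kind, index) pairs with kind in "M", "I", "D". The output convention (lowercase for insert-state characters, "." as padding, "-" for delete states) and the name `paths_to_alignment` are assumptions made for illustration.

```python
def paths_to_alignment(seqs, paths, n_match):
    """Build alignment rows from per-sequence profile-HMM state paths."""
    parsed = []
    for seq, path in zip(seqs, paths):
        match = ["-"] * n_match                      # '-' stays where a delete state was used
        inserts = [[] for _ in range(n_match + 1)]   # insert state k sits after match state k
        pos = 0
        for kind, k in path:
            if kind == "M":
                match[k - 1] = seq[pos]
                pos += 1
            elif kind == "I":
                inserts[k].append(seq[pos].lower())
                pos += 1
            # kind == "D": silent state, emits nothing
        parsed.append((match, inserts))

    # Each insert region is padded to the longest insertion seen at that position
    widths = [max(len(ins[k]) for _, ins in parsed) for k in range(n_match + 1)]
    rows = []
    for match, inserts in parsed:
        row = []
        for k in range(n_match + 1):
            if k > 0:
                row.append(match[k - 1])
            row.append("".join(inserts[k]).ljust(widths[k], "."))
        rows.append("".join(row))
    return rows
```

Characters emitted by the same match state land in the same column, which is the sense in which the paths indicate correspondences among sequences.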

Multiple Alignment with Profile HMMs

Classification w/ Profile HMMs
To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest
Given a sequence x, use Bayes’ rule to make the classification
[Figure: one profile HMM per family: β-galactosidase, β-glucanase, β-amylase, α-amylase]
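
The Bayes' rule step can be sketched as follows, assuming each family's profile HMM has already produced a log-likelihood log P(x | family), for instance from the forward algorithm, and that prior family probabilities are supplied. The function name `family_posteriors` and the numbers in the usage note are hypothetical.

```python
import math

def family_posteriors(log_likelihoods, priors):
    """Posterior P(family | x) from per-family log P(x | family) and P(family)."""
    joint = {c: ll + math.log(priors[c]) for c, ll in log_likelihoods.items()}
    # Log-sum-exp keeps the normalization stable for very small likelihoods
    m = max(joint.values())
    log_evidence = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in joint.items()}
```

Calling `family_posteriors({"alpha-amylase": -120.0, "beta-amylase": -125.0}, {"alpha-amylase": 0.5, "beta-amylase": 0.5})` returns a dictionary of posteriors that sum to 1; the sequence is assigned to the family with the largest value.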

Multiple Alignment Case Study: The Cystic Fibrosis Gene
cystic fibrosis (CF)
– recessive genetic disease caused by a defect in a single gene
– causes the body to produce abnormally thick mucus that clogs the lungs and the pancreas
the cystic fibrosis transmembrane conductance regulator (CFTR) gene
– gene and its role in CF identified in 1989 [Riordan et al., Science]
– most common mutation is called ΔF508: a deletion of a phenylalanine (F) at position 508 in the CFTR protein
– the CFTR protein controls the movement of salt and water into and out of cells; mutations in CFTR block this movement, causing the mucus problem

So What Does CFTR Do? A CFTR Multiple Alignment
Figure from Riordan et al, Science 245: , 1989.

Multiple Alignment Case Study: the Cystic Fibrosis Gene
two key features of the protein made apparent in multiple sequence alignment (and other analyses)
– membrane-spanning domains
– ATP-binding motifs
these features indicated that CFTR is likely to be involved in transporting ions across the cell membrane
Figure from Riordan et al, Science 245: , 1989.

Notes on Multiple Alignment
as with pairwise alignment, can compute local and global multiple alignments
dynamic programming is not feasible for most cases; heuristic methods are usually used instead

Summary: Some Methods for Multiple Sequence Alignment