Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics.

Presentation transcript:

SATCHMO: sequence alignment and tree construction using hidden Markov models
Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):
CECS Bioinformatics Journal Club
Eric Rouchka, D.Sc., University of Louisville
September 10, 2003

What is Multiple Sequence Alignment (MSA)?
Taking more than two sequences and aligning them based on similarity

Globin Example
>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTF
AQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD
LHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTF
ATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTF
SQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFA
KLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTF
AQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEI
KPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLH
ACQLRVDPASFQLLGHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYI
LRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR

Globin Multiple Alignment

Why do MSA?
Homology searching
– Important regions conserved across (or within) species
– Genic regions
– Regulatory elements
Phylogenetic classification
Subfamily classification
Identification of critical residues

MSA Approaches
All columns alignable across all sequences
– MSA
– ClustalW
Columns alignable throughout all sequences singled out (Profile HMM)
– HMMER
– SAM

MSA
N-dimensional dynamic programming
Time consuming
High memory usage
Guaranteed to yield the maximum-scoring alignment
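The cost of N-dimensional dynamic programming can be illustrated with the two-sequence case. The sketch below is not the MSA program itself: it is a plain Needleman-Wunsch global alignment score with illustrative scoring values (`match`, `mismatch` and `gap` are assumptions), plus a hypothetical helper `dp_cells` that counts how many table cells a full N-dimensional table would need.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of two sequences (the N = 2 case)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


def dp_cells(lengths) -> int:
    """Number of cells in a full dynamic programming table, one axis per sequence."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


print(nw_score("GATTACA", "GATTACA"))
print(dp_cells([100, 100]), dp_cells([100, 100, 100, 100]))
```

The table gains one axis per sequence, so the cell count grows exponentially with the number of sequences, which is where the time and memory costs listed above come from.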

ClustalW
Progressive alignment
– Sequences aligned in pair-wise fashion
– Alignment scores produce phylogenetic tree
– Enhanced dynamic programming approach

Hidden Markov Models
Match state, insert state, delete state

HMMs
Model conserved regions
Successful at detecting and aligning critical motifs and conserved core structure
Difficulty in aligning sequence outside of these regions

SATCHMO
Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
armstrong.htm

SATCHMO
Progressive alignment
– Built iteratively in pairs
– Profile HMMs used
Alignments of the same sequences are not the same at each node
Number of columns predicted alignable gets smaller as structures diverge
Output is not represented by a single matrix

Why HMMs?
Homologs ranked through scoring
Accurate profiles from small numbers of sequences
Accurately combine two alignments having low sequence similarity

Bits saved relative to background
$k = 1..M$: HMM node number
$a$: amino acid type
$P_k(a)$: emission probability of $a$ in the $k$th match state
$P_0(a)$: approximation of the background probability of $a$
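The formula on this slide did not survive in the transcript. The symbol definitions are consistent with an average relative entropy per match state, b = (1/M) Σ_k Σ_a P_k(a) log2(P_k(a) / P_0(a)); that form is an assumption here, not a quotation from the slide. Under that assumption, a minimal sketch of the calculation (the function name `bits_saved` and the dictionary layout are illustrative):

```python
import math


def bits_saved(match_emissions, background):
    """Average relative entropy (in bits) of match-state emissions vs. background.

    match_emissions: list of M dicts mapping residue -> P_k(a)
    background: dict mapping residue -> P_0(a)
    """
    total = 0.0
    for p_k in match_emissions:
        for residue, p in p_k.items():
            if p > 0:  # a zero-probability residue contributes nothing
                total += p * math.log2(p / background[residue])
    return total / len(match_emissions)


# Toy two-letter alphabet with a uniform background.
background = {"A": 0.5, "B": 0.5}
states = [
    {"A": 1.0, "B": 0.0},  # fully conserved column: 1 bit
    {"A": 0.5, "B": 0.5},  # identical to background: 0 bits
]
print(bits_saved(states, background))
```

A column that matches the background distribution saves nothing, while a fully conserved column saves the most, so the average rises as the match states become more informative.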

Sequence weights
Sequences are weighted such that b converges on a desired value
Weights compensate for correlation in sequences

HMM Construction
Profile HMM constructed from multiple alignment
Some columns alignable; others not

HMM Construction
Given an alignment a, a profile HMM is generated
Each column in a is assigned to an emitter state
– Transition probabilities are calculated based on observed amino acids
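As an illustration of turning alignment columns into emitter states, the sketch below estimates match-state emission probabilities from a small alignment. It is not SATCHMO's procedure: the gap-fraction rule for choosing match columns (`max_gap_fraction`), the add-one `pseudocount`, and the function name are all assumptions made for the example, and transition probabilities are left out.

```python
from collections import Counter


def match_emissions(alignment, alphabet, max_gap_fraction=0.5, pseudocount=1.0):
    """Estimate one emission distribution per match column of an alignment.

    alignment: equal-length strings over `alphabet`, with '-' for gaps.
    Columns with too many gaps are skipped (treated as insert columns).
    """
    n_seqs = len(alignment)
    emissions = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if (n_seqs - len(residues)) / n_seqs > max_gap_fraction:
            continue  # mostly gaps: not modeled as a match state
        counts = Counter(residues)
        total = len(residues) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions


toy_alignment = ["AC-", "AC-", "AGA"]
for state in match_emissions(toy_alignment, "ACG"):
    print(state)
```

The pseudocount keeps unseen residues from getting probability zero, which matters when a profile is built from only a few sequences.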

Transition Probabilities
If we have a total of five match states, the probabilities can be stored in the following table:
[table not included in the transcript]

HMM Terminology
$\pi$: path through an HMM to produce a sequence $s$
$P(A \mid \pi) = \prod_{s} P(s \mid \pi_s)$
$\pi^{+}$: maximum probability path through the HMM

Aligning Two Alignments
One alignment is converted to an HMM
Second alignment is aligned to the HMM
– Some columns remain alignable
– Affinities (relative match scores) calculated
New MSA results
HMM constructed from new MSA

Aligning Two Alignments

SATCHMO Algorithm
Step 1:
– Create a cluster for each input sequence and construct an HMM from the sequence
Step 2:
– Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
– Align the target and template to produce a new node

SATCHMO Algorithm
Repeat Step 2 until:
– All sequences are assigned to a single cluster
– Highest similarity between clusters is below a threshold
– No alignable positions are predicted
Output: a set of binary trees
– Leaf nodes are sequences
– Each node contains an HMM aligning the sequences in the subtree
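The two steps and three stopping conditions above can be sketched as a generic agglomerative loop. This is an outline under stated assumptions, not the published implementation: `similarity` and `merge` are hypothetical callables supplied by the caller (in SATCHMO they would involve profile HMM scoring and alignment), trees are plain nested tuples, and the per-node HMM is omitted.

```python
from itertools import combinations


def build_forest(sequences, similarity, merge, min_similarity):
    """Greedy pairwise merging; returns a list of trees (a forest)."""
    clusters = list(sequences)  # Step 1: one cluster per input sequence
    while len(clusters) > 1:    # stop when everything is in a single cluster
        # Step 2: find the most similar pair of clusters.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]),
        )
        if similarity(clusters[i], clusters[j]) < min_similarity:
            break               # best pair is below the threshold
        node = merge(clusters[i], clusters[j])
        if node is None:
            break               # merge reports no alignable positions
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [node]
    return clusters


# Toy stand-ins, purely for demonstration.
def representative(node):
    """First leaf sequence under a node."""
    return node if isinstance(node, str) else representative(node[0])


def toy_similarity(x, y):
    """Fraction of identical positions between the two representatives."""
    a, b = representative(x), representative(y)
    return sum(p == q for p, q in zip(a, b)) / len(a)


def toy_merge(x, y):
    return (x, y)


print(build_forest(["AAAA", "AAAT", "CCCC"], toy_similarity, toy_merge, 0.5))
```

With a threshold of 0.5 the toy run stops with two trees, which shows why the output is a set of binary trees rather than a single tree.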

Graphical Interface for SATCHMO

Demonstration of SATCHMO

Validation Set
BAliBASE benchmark alignment set used
– Ref1: equidistant sequences
– Ref2: distantly related sequences
– Ref3: subgroups of sequences; < 25% similarity between groups
– Ref4: alignments with long extensions on the ends
– Ref5: alignments with long insertions

Comparison of Results
SATCHMO compared to:
– ClustalW (progressive pairwise alignment)
– SAM (HMM)

Discussion
SATCHMO is effective in identifying protein domains
Comparison to T-Coffee and PRRP would be useful
– Time and sensitivity
Tree representation is unique, modeling structural similarity