Multiple sequence alignment: Lesson 3


1 Multiple sequence alignment Lesson 3

2 1. What is a multiple sequence alignment?

3 Multiple sequence alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Similar to pairwise alignment, BUT n sequences are aligned instead of just 2.

4 Multiple sequence alignment
MSA = Multiple Sequence Alignment
Each row represents an individual sequence.
Each column represents the ‘same’ position.
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

5 Homo sapiens, Pan troglodytes, Mus musculus, Canis familiaris, Gallus gallus, Anopheles gambiae, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Rattus norvegicus

6 Histone H4 protein

7 Multiple sequence alignment
[Figure: alignments of NADH dehydrogenase subunit 4 and of Histone H4 protein]
► Which is better: a pairwise alignment, or a pair of rows taken from an MSA?

8 2. How MSAs are computed

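9 Alignment – Dynamic Programming
There is a dynamic programming algorithm for n sequences, similar to the one used for pairwise alignment.
Complexity: O(L^n), where L is the sequence length and n is the number of sequences, i.e. exponential in the number of sequences.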

10 Alignment methods
This complexity is not practical, so heuristics are used:
- Progressive/hierarchical alignment (Clustal)
- Iterative alignment (MAFFT, MUSCLE)

11 Progressive alignment
First step: compute the pairwise alignments for all against all (for the five sequences A to E, that is 10 pairwise alignments). The similarities are converted to distances and stored in a table.
[Figure: sequences A to E and their pairwise distance table]
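The first step can be sketched in Python as follows. This is a minimal illustration, not Clustal's procedure: the five sequences are hypothetical toy strings, and plain edit distance stands in for the distance that a real tool would derive from pairwise alignment scores.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]

def distance_table(seqs: dict) -> dict:
    """All-against-all distances, one entry per unordered pair of names."""
    return {(x, y): edit_distance(seqs[x], seqs[y])
            for x, y in combinations(sorted(seqs), 2)}

# Hypothetical toy sequences (assumption: not the data behind the slide).
seqs = {"A": "VTISCTGSS", "B": "VTISCTGTS", "C": "LRLSCSSSG",
        "D": "LSLTCTVSG", "E": "PEVTCVVVD"}
table = distance_table(seqs)
print(len(table))  # number of pairwise comparisons for 5 sequences
```

For n sequences the table has n(n-1)/2 entries, which is why this step grows quadratically with the number of sequences.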

12 Second step:
Cluster the sequences to create a tree (guide tree):
- it represents the order in which pairs of sequences are to be aligned
- similar sequences are neighbors in the tree
- distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
[Figure: distance table for A to E and the resulting guide tree with leaves A, D, C, B, E]
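A guide tree can be sketched with simple average-linkage (UPGMA-style) clustering. This is an assumption for illustration: the slides do not say which clustering method is used, and every distance below is hypothetical.

```python
def guide_tree(names, dist):
    """Repeatedly merge the two closest clusters; return a nested tuple."""
    clusters = [(name, [name]) for name in names]  # (subtree, member names)

    def avg_dist(m1, m2):
        # Average of all pairwise distances between the two member lists.
        total = sum(dist[frozenset((a, b))] for a in m1 for b in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: avg_dist(clusters[p[0]][1],
                                                 clusters[p[1]][1]))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [((t1, t2), m1 + m2)]
    return clusters[0][0]

# Hypothetical distances between five sequences A to E.
raw = {("A", "B"): 8, ("A", "C"): 17, ("B", "C"): 15, ("A", "D"): 22,
       ("B", "D"): 21, ("C", "D"): 12, ("A", "E"): 30, ("B", "E"): 29,
       ("C", "E"): 28, ("D", "E"): 27}
dist = {frozenset(pair): d for pair, d in raw.items()}

tree = guide_tree(list("ABCDE"), dist)
print(tree)  # ('E', (('A', 'B'), ('C', 'D')))
```

The nesting of the tuple gives the alignment order: the innermost pairs are aligned first, and the outer levels are added later.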

13 Third step:
1. Align the most similar (neighboring) pairs.
[Figure: guide tree with leaves A, D, C, B, E; each leaf is a sequence]

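14 Third step:
2. Align pairs of pairs.
[Figure: guide tree with leaves A, D, C, B, E; aligned pairs form profiles, which are then aligned to sequences or to other profiles]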

15 Third step:
Main disadvantages:
- Sub-optimal tree topology
- Misalignments resulting from globally aligning pairs of sequences
[Figure: guide tree with leaves A, D, C, B, E, showing sequences and profiles]

16 Iterative alignment
Pairwise distance table → Guide tree → MSA → (back to the pairwise distance table)
Iterate until the MSA does not change (convergence).
[Figure: the cycle above, drawn for sequences A to E]

17 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

18 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

19 Consensus sequence
TGTTCTA
TGTTCAA
TCTTCAA
TGTTCAA
A consensus sequence holds the most frequent character of each column of the alignment.

20 Consensus sequence – an example
The -10 region of six promoters:
TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
There are many variants to the “consensus”.

21 Consensus sequence – an example
1. Strict majority. (* In case of equal frequencies, choose one according to alphabetical order.)
Alignment: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
Consensus: TAATAT
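The strict-majority rule, including the alphabetical tie-break, can be sketched in a few lines of Python. The function name `consensus` is illustrative, not part of any library.

```python
from collections import Counter

def consensus(alignment):
    """Most frequent character per column; ties go to the first letter alphabetically."""
    out = []
    for column in zip(*alignment):
        counts = Counter(column)
        best = max(counts.values())
        out.append(min(c for c, n in counts.items() if n == best))
    return "".join(out)

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]
print(consensus(promoters))  # TAATAT
```

Column 3 holds three G and three A, so the tie-break picks A, which is how the consensus ends up as TAATAT rather than TAGTAT.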

22 Consensus sequence – an example
Alignment: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
Consensus: TAATAT
Had we searched the region upstream of genes for this consensus, we would have identified only 2 out of the 6 sequences, so we would miss many cases. By chance, we expect a “hit” every 4,096 bp (4^6).

23 Consensus sequence – an example
Alignment: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
Consensus: TAATAT
We can search while allowing 1 mismatch. We would then have identified 3 out of the 6 sequences, so we would miss fewer cases. By chance, we expect a “hit” every ~200 bp → more “noise”.

24 Consensus sequence – an example
Alignment: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
Consensus: TAATAT
We can search while allowing 2 mismatches. We would then have identified all 6 sequences, so we would not miss any. By chance, we expect a “hit” every ~30 bp → A LOT OF “noise”.
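The hit counts and chance frequencies quoted on the last three slides can be checked with a short Python sketch. It assumes the four bases are equally likely and independent, which is a simplification of real genomes.

```python
from math import comb

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def hits(consensus, seqs, k):
    """Sequences within k mismatches of the consensus."""
    return [s for s in seqs if mismatches(consensus, s) <= k]

def expected_spacing(length, k, alphabet=4):
    """Average distance between chance hits under uniform, independent bases."""
    matching = sum(comb(length, i) * (alphabet - 1) ** i for i in range(k + 1))
    return alphabet ** length / matching

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]
for k in (0, 1, 2):
    found = len(hits("TAATAT", promoters, k))
    print(k, found, round(expected_spacing(6, k)))
```

With 0, 1 and 2 mismatches the number of matching 6-mers is 1, 19 and 154, giving one chance hit per 4096, about 216 and about 27 bp, in line with the rounded figures on the slides.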

25 Consensus sequence – an example
2. Majority only when it is a clear case. In the remaining cases, use wildcards.
Y = Pyrimidine, R = Purine, N = Any nucleotide
Alignment: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
Consensus: TNRTAT

26 Reminder: Purines & Pyrimidines
Y = Pyrimidine (C or T)
R = Purine (A or G)
N = Any nucleotide

27 Consensus sequence – an example
Alignment: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
Consensus: TNRTAT
Had we searched the region upstream of genes with the redundant consensus, we would have identified 4/6 sequences. By chance, we expect a “hit” every ~500 bp.
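Matching against a consensus with wildcards can be sketched as below. Only the three ambiguity codes used on the slides (R, Y, N) are included, and the chance frequency again assumes uniform, independent bases.

```python
# Subset of the IUPAC nucleotide codes: just the ones used in this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

def matches(consensus, seq):
    """True if every base of seq is allowed by the code at the same position."""
    return len(seq) == len(consensus) and all(
        base in IUPAC[code] for code, base in zip(consensus, seq))

def expected_spacing(consensus):
    """Average distance between chance hits under uniform, independent bases."""
    p = 1.0
    for code in consensus:
        p *= len(IUPAC[code]) / 4
    return 1 / p

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]
print(sum(matches("TNRTAT", s) for s in promoters))  # 4
print(expected_spacing("TNRTAT"))                    # 512.0
```

TNRTAT accepts 4 x 2 = 8 different 6-mers, so a chance hit is expected every 4096 / 8 = 512 bp.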

28 Consensus sequence – an example
There is always a tradeoff between sensitivity and specificity.
Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN).
Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP).
Permissive consensus: TNRTAT. Strict consensus: TAATAT.

29 Consensus sequence – an example
Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN).
Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP).
Permissive consensus: higher sensitivity, lower specificity (more true positives, more false positives ↔ fewer true negatives, fewer false negatives).
Nonpermissive consensus: higher specificity, lower sensitivity (fewer true positives, fewer false positives ↔ more true negatives, more false negatives).
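The tradeoff can be made concrete with the standard confusion-matrix definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). In the sketch below the true-positive and false-negative counts come from the six promoters; the 1000 non-promoter windows and their false-positive counts are hypothetical numbers chosen only for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)

# Strict consensus (exact TAATAT): finds 2 of the 6 promoters.
# Assumed: 1 false positive among 1000 non-promoter windows.
strict = (sensitivity(tp=2, fn=4), specificity(tn=999, fp=1))

# Permissive search (2 mismatches allowed): finds all 6 promoters.
# Assumed: 40 false positives among the same 1000 windows.
permissive = (sensitivity(tp=6, fn=0), specificity(tn=960, fp=40))

print(strict, permissive)
```

Moving from the strict to the permissive search raises sensitivity and lowers specificity, which is the direction of the tradeoff described above.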

30 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

31 Patterns
TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT
T-[CTA]-[GA]-[TC]-A-[TG]
Patterns are more informative than consensus sequences. A pattern specifies, for each position, the characters that are possible at that position.
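Deriving such a pattern from an alignment is mechanical: collect the set of characters seen in each column, reading the columns left to right. A minimal sketch, with characters inside brackets sorted alphabetically (an arbitrary choice made here for reproducibility):

```python
def pattern(alignment):
    """One element per column: a single letter, or a bracketed set of letters."""
    parts = []
    for column in zip(*alignment):
        chars = sorted(set(column))
        parts.append(chars[0] if len(chars) == 1 else "[" + "".join(chars) + "]")
    return "-".join(parts)

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]
print(pattern(promoters))  # T-[ACT]-[AG]-[CT]-A-[GT]
```

Unlike a profile, the pattern records only which characters occur, not how often, so a character seen once counts the same as one seen five times.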

32 Patterns - syntax
The standard IUPAC one-letter codes are used.
‘x’ : any amino acid.
‘[]’ : residues allowed at the position.
‘{}’ : residues forbidden at the position.
‘()’ : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions.
‘-’ : separates pattern elements.
‘<’ : indicates an N-terminal restriction of the pattern.
‘>’ : indicates a C-terminal restriction of the pattern.
‘.’ : the period ends the pattern.

33 Patterns
W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]
x(9,11): any amino acid, repeated 9 to 11 times. [FYV]: F or Y or V.
WOPLASDFGYVWPPPLAWS: match
ROPLASDFGYVWPPPLAWS: no match
WOPLASDFGYVWPPPLSQQQ: no match
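Patterns in this syntax map almost one-to-one onto regular expressions. The converter below is a rough sketch covering only the elements listed in the syntax slide, with the terminal anchors written as `<` and `>` as in PROSITE; `prosite_to_regex` is a hypothetical helper name, not a function from an existing package.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    out = []
    for elem in pattern.rstrip(".").split("-"):
        prefix = "^" if elem.startswith("<") else ""
        suffix = "$" if elem.endswith(">") else ""
        # Split "x(9,11)" into the core "x" and the optional repeat bounds.
        core, lo, hi = re.fullmatch(
            r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", elem.strip("<>")).groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        rep = "" if lo is None else "{" + lo + ("," + hi if hi else "") + "}"
        out.append(prefix + core + rep + suffix)
    return "".join(out)

regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]")
print(regex)  # W.{9,11}[FYV][FYW].{6,7}[GSTNE]
for seq in ("WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"):
    print(seq, bool(re.search(regex, seq)))
```

Only the first of the three example sequences matches: the second lacks the leading W, and the third has no residue from [GSTNE] at the required distance.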

34 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

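35 Profile = PSSM = Position Specific Score Matrix
AACCCA GGCCAA TTCCAA
[Figure: the aligned sequences and the resulting matrix, with one row per nucleotide (A, C, G, T) and one column per alignment position]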

36 Profiles / PSSMs
P(AACCAA) = 1 × 0.67 × 1 × 1 × 0.33 × 0.33 ≈ 0.073
P(GACCAA) = 0
Sequences with higher probabilities → higher chance of being related to the PSSM.
[Figure: the PSSM, with rows A, C, G, T]
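A frequency-based PSSM and the probability of a sequence under it can be sketched as follows. The matrix on the slide did not survive transcription, so the three-sequence alignment below is hypothetical, chosen only so that the factors of P(AACCAA) come out as 1, 0.67, 1, 1, 0.33, 0.33.

```python
def build_pssm(alignment, alphabet="ACGT"):
    """One dict per column, mapping each character to its relative frequency."""
    n = len(alignment)
    return [{c: column.count(c) / n for c in alphabet}
            for column in zip(*alignment)]

def probability(pssm, kmer):
    """Product of the per-position frequencies of the k-mer's characters."""
    p = 1.0
    for column, char in zip(pssm, kmer):
        p *= column[char]
    return p

# Hypothetical alignment (assumption: not the one shown on the slide).
pssm = build_pssm(["AACCCA", "AACCAC", "ACCCCC"])
print(round(probability(pssm, "AACCAA"), 3))  # 0.074
print(probability(pssm, "GACCAA"))            # 0.0
```

A single zero entry forces the whole product to zero, which is why practical profiles usually add pseudocounts; that refinement is beyond what the slides cover.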

37 Searching with PSSM
Each n-mer of the searched sequence is compared to the profile and its probability is computed. n-mers with probabilities > threshold are considered hits.
Example: in GACGGTACGTAGCGGAGCGACCAA, start by computing the probability of the first 6-mer, GACGGT.
[Figure: the PSSM, with rows A, C, G, T]

38 Searching with PSSM
6-mers with probabilities > threshold are considered hits.
[Figure: successive 6-mer windows of GACGGTACGTAGCGGAGCGACCAA, each with its own probability (P, P2, P3, P4), next to the PSSM]
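The sliding-window search can be sketched as below. Both the PSSM (built from a hypothetical three-sequence alignment) and the searched text are assumptions made for illustration, as is the threshold of 0.05.

```python
def build_pssm(alignment, alphabet="ACGT"):
    """One dict per column, mapping each character to its relative frequency."""
    n = len(alignment)
    return [{c: column.count(c) / n for c in alphabet}
            for column in zip(*alignment)]

def scan(pssm, text, threshold):
    """Score every window of len(pssm) characters; keep those above threshold."""
    k = len(pssm)
    found = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        p = 1.0
        for column, char in zip(pssm, window):
            p *= column.get(char, 0.0)
        if p > threshold:
            found.append((i, window, p))
    return found

# Hypothetical profile and hypothetical text to search.
pssm = build_pssm(["AACCCA", "AACCAC", "ACCCCC"])
for start, window, p in scan(pssm, "TTAACCAAGGACCCACTT", threshold=0.05):
    print(start, window, round(p, 3))
```

A text of length L contains L - 6 + 1 windows of length 6, so the cost of the scan is linear in the length of the searched sequence.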

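39 Profile-pattern-consensus
Multiple alignment: GTTCAA GCTGAA CTTCAC
Consensus: GTTCAA (strict majority) or NNTNAN (with wildcards)
Pattern: [GC]-[TC]-T-[GC]-A-[AC]
Profile: [Figure: frequency matrix with rows A, C, G, T and one column per position]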

40 4. HMM: Hidden Markov Models

41 Definitions & Uses
A probabilistic model which deals with sequences of symbols.
Uses: inferring hidden states.
Originally used in speech recognition (the symbols being phonemes).
Useful in biology, where the sequence of symbols is DNA or protein.

42 Markov Chains
A sequence of random variables X_1, X_2, … where each present state depends only on the previous state.
Weather example: the weather on day x depends only on day x-1.
We can easily compute the probability of: Sunny → Sunny → Rainy → Sunny → Sunny

43 Markov Chains
Similarly, we can assume a DNA sequence is Markovian: A → C → G → G → T → A… (vertical or horizontal!)
These conditional probabilities can be illustrated as follows (in DNA).
Each arrow has a transition probability: P_CA = P(x_i = A | x_{i-1} = C)
Thus the probability of a sequence x = x_1 … x_L will be: P(x) = P(x_1) × P(x_2 | x_1) × … × P(x_L | x_{L-1})
[Figure: the four states A, T, C, G connected by transition arrows]
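For a first-order Markov chain, the probability of a path is the probability of its first state multiplied by one transition probability per step. A minimal sketch on the weather example follows; the transition and initial probabilities are hypothetical, since the values from the original figure are not in the transcript.

```python
def chain_probability(path, initial, transition):
    """P(x) = P(x_1) * product over i of P(x_i | x_{i-1})."""
    p = initial[path[0]]
    for prev, curr in zip(path, path[1:]):
        p *= transition[prev][curr]
    return p

# Hypothetical parameters (assumptions for illustration only).
initial = {"Sunny": 1.0, "Rainy": 0.0}  # we condition on starting with a sunny day
transition = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
              "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}

path = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]
print(chain_probability(path, initial, transition))  # 0.8 * 0.2 * 0.4 * 0.8
```

The same function works for DNA by replacing the two weather states with the four nucleotides and supplying a 4 x 4 transition table.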

44 Hidden Markov Models
The state sequence itself follows a simple Markov chain. But in an HMM it is no longer possible to know the state by looking at the symbols: the state is hidden.
[Figure: chain of hidden states S_1 … S_{i-1}, S_i, S_{i+1} … S_n linked by transition probabilities P; each state S_i emits an observation K_i with emission probability B]

45 The weather HMM example In this weather example only the actions are observable and the weather is hidden:

46 HMM formalities
{S, K, Π, P, B}
S : {s_1 … s_N} are the values for the hidden states
K : {k_1 … k_M} are the values for the observations
The hidden states emit/generate the symbols (observations).
Π = {Π_i} are the initial state probabilities
P = {P_ij} are the state transition probabilities
B = {b_ik} are the emission probabilities
[Figure: chain of hidden states S_1 … S_n linked by P, each emitting an observation K_i with probability B]

47 Another HMM example – the dishonest casino
In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM:
FAIR: 1: 1/6 2: 1/6 3: 1/6 4: 1/6 5: 1/6 6: 1/6
UNFAIR: 1: 1/10 2: 1/10 3: 1/10 4: 1/10 5: 1/10 6: 1/2
[Figure: the two states FAIR and UNFAIR, with transition arrows between them]

48 Dishonest casino - continued
The symbols (observations) are the sequence of rolls: …
What is hidden? Whether the die is fair or unfair: f f f f u u u f f
The hidden states form a Markov chain. In addition, we have emission probabilities: given a state, there are 6 possible symbols, each with an emission probability.
FAIR: 1: 1/6 2: 1/6 3: 1/6 4: 1/6 5: 1/6 6: 1/6
UNFAIR: 1: 1/10 2: 1/10 3: 1/10 4: 1/10 5: 1/10 6: 1/2
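One standard way to infer the hidden fair/unfair sequence from the rolls is the Viterbi algorithm, which finds the most probable state path. The slides do not name a decoding method, so this is a sketch of one option. The emission probabilities are the ones given above; the initial and transition probabilities are assumptions, because their values are missing from the transcript.

```python
from math import log

STATES = ("f", "u")                       # fair, unfair
START = {"f": 0.5, "u": 0.5}              # assumed initial probabilities
TRANS = {"f": {"f": 0.95, "u": 0.05},     # assumed transition probabilities
         "u": {"f": 0.10, "u": 0.90}}
EMIT = {"f": {r: 1 / 6 for r in range(1, 7)},
        "u": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

def viterbi(rolls):
    """Most probable hidden state path for a list of die rolls (log space)."""
    score = {s: log(START[s]) + log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        pointer, new_score = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda q: score[q] + log(TRANS[q][s]))
            pointer[s] = best
            new_score[s] = score[best] + log(TRANS[best][s]) + log(EMIT[s][roll])
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]))  # uuuuuuuuuu
print(viterbi([1, 2, 3, 4, 5, 1, 2, 3, 4, 5]))  # ffffffffff
```

Working with log probabilities avoids numerical underflow on long roll sequences, since the raw products shrink geometrically with the number of rolls.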

49 HMM of MSA MSA can be represented by an HMM – Insertion of A/C/G/T – Match or Mismatch – Deletion

50 HMM of MSA MSA can be represented by an HMM – Insertion of A/C/G/T – Match or Mismatch – Deletion

51 HMM of MSA can get more complex…

52 Questions where HMMs are used:
- Does this sequence belong to a particular family?
- Can we identify regions in a sequence (for instance, alpha helices or beta sheets)?
- Pairwise/multiple sequence alignment
- Searching databases for protein families (building profiles)