Hidden Markov Models

Presentation transcript:

Hidden Markov Models
[Figure: trellis with K states (1, 2, …, K) at each position, over the observations x1, x2, x3, …, xK]

CS262 Lecture 7, Win07, Batzoglou Variants of HMMs

Higher-order HMMs
How do we model “memory” larger than one time point?
First order: $P(\pi_{i+1} = l \mid \pi_i = k) = a_{kl}$
Second order: $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j) = a_{jkl}$
…
A second-order HMM with K states is equivalent to a first-order HMM with $K^2$ states
[Figure: two-state model (state H, state T) with transitions a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T), redrawn as a four-state model (state HH, state HT, state TH, state TT) with transitions a_HHT, a_TTH, a_HTT, a_THH, a_THT, a_HTH]
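The equivalence between a second-order HMM and a first-order HMM over pairs of states can be sketched in a few lines of Python. This is only an illustration: the function name `lift_to_first_order` and the coin probabilities are made up here, not taken from the lecture.

```python
from itertools import product

def lift_to_first_order(states, a2):
    """Turn second-order transitions a2[(j, k, l)] = P(next = l | current = k, previous = j)
    into first-order transitions over pair states (previous, current)."""
    pair_states = list(product(states, repeat=2))
    a1 = {}
    for (j, k) in pair_states:
        for (k2, l) in pair_states:
            # (j, k) can only move to (k, l): the shared middle state must agree.
            a1[(j, k), (k2, l)] = a2[(j, k, l)] if k == k2 else 0.0
    return pair_states, a1

# Hypothetical second-order coin model; the numbers are invented for illustration.
a2 = {
    ("H", "H", "H"): 0.9, ("H", "H", "T"): 0.1,
    ("T", "H", "H"): 0.6, ("T", "H", "T"): 0.4,
    ("H", "T", "H"): 0.3, ("H", "T", "T"): 0.7,
    ("T", "T", "H"): 0.2, ("T", "T", "T"): 0.8,
}
pair_states, a1 = lift_to_first_order(["H", "T"], a2)
```

With K = 2 this yields the four pair states HH, HT, TH, TT, and each row of `a1` still sums to 1 because half of the entries are forced to zero.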

Similar Algorithms to 1st Order
$P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j)$
- $V_{lk}(i) = \max_j \{ V_{kj}(i-1) + \dots \}$
- Time? Space?

Modeling the Duration of States
Length distribution of region X: $E[l_X] = 1/(1-p)$
Geometric distribution, with mean 1/(1-p)
This is a significant disadvantage of HMMs
Several solutions exist for modeling different length distributions
[Figure: two states X and Y; X has self-loop p and exit 1-p, Y has self-loop q and exit 1-q]
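To see why a self-loop forces a geometric length distribution, the following minimal sketch evaluates the duration probabilities directly. It assumes a state with self-loop probability p, so a stay of length m means m - 1 self-transitions followed by one exit; the function names are illustrative only.

```python
def geometric_duration_pmf(p, m):
    """P(l_X = m) for a state with self-loop probability p: stay m - 1 times, then leave."""
    return p ** (m - 1) * (1 - p)

def mean_duration(p, max_len=10_000):
    """Numerically approximate E[l_X] by truncating the sum at max_len."""
    return sum(m * geometric_duration_pmf(p, m) for m in range(1, max_len + 1))

# With p = 0.9 the expected stay is 1 / (1 - p) = 10 positions.
approx_mean = mean_duration(0.9)
```

The most likely duration is always m = 1, whatever p is, which is why this shape fits poorly to regions with a typical length well above 1.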

Example: exon lengths in genes

Solution 1: Chain several states
[Figure: a chain of copies of state X followed by state Y; labels p, 1-p, q, 1-q]
Disadvantage: still very inflexible
$l_X = C + \text{geometric with mean } 1/(1-p)$

Solution 2: Negative binomial distribution
Duration in X: m turns, where
- During the first m – 1 turns, exactly n – 1 arrows to the next state are followed
- During the m-th turn, an arrow to the next state is followed
$P(l_X = m) = \binom{m-1}{n-1} (1-p)^{n-1+1} p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^{n} p^{m-n}$
[Figure: chain X(1), X(2), …, X(n) followed by Y; each copy of X has self-loop p and forward arrow 1 – p]
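The negative binomial formula can be checked numerically with a short sketch. The function name is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def negbin_duration_pmf(m, n, p):
    """P(l_X = m) for n chained copies of X, each with self-loop probability p."""
    if m < n:
        return 0.0  # at least one turn is spent in each of the n copies
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

# Sanity checks for n = 3, p = 0.9 (truncated sums over m).
total = sum(negbin_duration_pmf(m, 3, 0.9) for m in range(1, 5001))
mean = sum(m * negbin_duration_pmf(m, 3, 0.9) for m in range(1, 5001))
```

For n = 1 the formula reduces to the geometric case; for larger n the mode moves away from the minimum length, and the mean is n/(1 - p).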

Example: genes in prokaryotes
EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)
Negative binomial with n = 3

Solution 3: Duration modeling
Upon entering a state:
1. Choose duration d, according to probability distribution
2. Generate d letters according to emission probs
3. Take a transition to next state according to transition probs
Disadvantage: increase in complexity of Viterbi:
Time: O(D)
Space: O(1)
where D = maximum duration of state
Warning: Rabiner’s tutorial claims $O(D^2)$ and $O(D)$ increases
[Figure: state F with duration d < D_f drawn from P_f, emitting x_i … x_{i+d-1}]

Viterbi with duration modeling
Recall original iteration:
$V_l(i) = \max_k V_k(i-1)\, a_{kl} \times e_l(x_i)$
New iteration:
$V_l(i) = \max_k \max_{d = 1 \dots D_l} V_k(i-d) \times P_l(d) \times a_{kl} \times \prod_{j = i-d+1 \dots i} e_l(x_j)$
Precompute cumulative values
[Figure: states F and L joined by transitions; F emits x_i … x_{i+d-1} with d < D_f drawn from P_f, L emits x_j … x_{j+d-1} with d < D_l drawn from P_l]
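A minimal log-space sketch of the new iteration follows. It is not the lecture's implementation: the function name, the dictionary layout of the parameters, and the handling of the first segment through `log_init` are assumptions made for this example, and it assumes all emission probabilities are nonzero so that the cumulative sums stay finite.

```python
import math

def duration_viterbi_score(x, states, log_init, log_a, log_e, log_dur, D):
    """Best log-probability of segmenting x under an explicit-duration HMM.

    log_a[k][l]   : log transition probability k -> l
    log_e[l][c]   : log emission probability of letter c in state l (assumed finite)
    log_dur[l][d] : log probability that a stay in l lasts d letters, 1 <= d <= D
    """
    n = len(x)
    NEG = -math.inf
    # "Precompute cumulative values": prefix[l][i] = sum of log e_l over x[0:i].
    prefix = {l: [0.0] * (n + 1) for l in states}
    for l in states:
        for i in range(1, n + 1):
            prefix[l][i] = prefix[l][i - 1] + log_e[l][x[i - 1]]

    V = {l: [NEG] * (n + 1) for l in states}
    for i in range(1, n + 1):
        for l in states:
            best = NEG
            for d in range(1, min(D, i) + 1):
                emit = prefix[l][i] - prefix[l][i - d]  # emissions of x[i-d:i]
                dur = log_dur[l].get(d, NEG)
                if d == i:
                    prev = log_init[l]  # this segment starts the sequence
                else:
                    prev = max(V[k][i - d] + log_a[k][l] for k in states)
                best = max(best, prev + dur + emit)
            V[l][i] = best
    return max(V[l][n] for l in states)
```

The inner loop over d is what adds the factor of D to the running time, while the prefix sums keep each segment's emission term at constant cost.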

Proteins, Pair HMMs, and Alignment

A state model for alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
States: M (+1,+1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J

Let’s score the transitions
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
States: M (+1,+1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
Transition scores on the state diagram: $s(x_i, y_j)$, $-d$, $-e$

Alignment with affine gaps – state version
Dynamic Programming:
M(i, j): Optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in M
I(i, j): Optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in I
J(i, j): Optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in J
The score is additive, therefore we can apply DP recurrence formulas

Alignment with affine gaps – state version
Initialization:
$M(0,0) = 0$; $M(i,0) = M(0,j) = -\infty$, for $i, j > 0$
$I(i,0) = d + i \times e$; $J(0,j) = d + j \times e$
Iteration:
$M(i,j) = s(x_i, y_j) + \max \{ M(i-1,j-1),\ I(i-1,j-1),\ J(i-1,j-1) \}$
$I(i,j) = \max \{ e + I(i-1,j),\ d + M(i-1,j) \}$
$J(i,j) = \max \{ e + J(i,j-1),\ d + M(i,j-1) \}$
Termination:
Optimal alignment given by $\max \{ M(m,n),\ I(m,n),\ J(m,n) \}$
Here d and e are added to the score, so they stand for the (negative) gap initiation and gap extension scores, written –d and –e on the state diagram.
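The three-matrix recurrence can be sketched in Python as follows. The function name and the scoring function `s` are illustrative, and `d` and `e` are passed as negative scores so that they can be added as in the iteration above. One assumption differs slightly from the initialization as written: boundary gaps are scored d + (i - 1)·e so that they agree with the interior recurrence, where a gap of length 1 costs d.

```python
import math

def affine_gap_score(x, y, s, d, e):
    """Global alignment score with affine gaps, using the three states M, I, J.

    s(a, b): substitution score; d, e: gap-open and gap-extend scores (negative numbers).
    """
    m, n = len(x), len(y)
    NEG = -math.inf
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    J = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = d + (i - 1) * e  # x_1..x_i aligned to gaps (assumed boundary convention)
    for j in range(1, n + 1):
        J[0][j] = d + (j - 1) * e  # y_1..y_j aligned to gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = s(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            I[i][j] = max(e + I[i - 1][j], d + M[i - 1][j])
            J[i][j] = max(e + J[i][j - 1], d + M[i][j - 1])
    return max(M[m][n], I[m][n], J[m][n])

def simple_score(a, b):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1
```

For example, `affine_gap_score("ACGT", "AGT", simple_score, -3, -1)` aligns three matching letters around one opened gap, giving 3 - 3 = 0.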

Brief introduction to the evolution of proteins
- Protein sequence and structure
- Protein classification
- Phylogeny trees
- Substitution matrices

Muscle cells and contraction

Actin and myosin during muscle movement

Actin structure

Actin sequence
Actin is ancient and abundant
- Most abundant protein in cells
- 1-2 actin genes in bacteria, yeasts, amoebas
- Humans: 6 actin genes
- α-actin in muscles; β-actin, γ-actin in non-muscle cells
~4 amino acids different between each version
MUSCLE ACTIN Amino Acid Sequence
  1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF

A related protein in bacteria

Relation between sequence and structure

Protein Phylogenies
Proteins evolve by both duplication and species divergence

Protein Phylogenies – Example

Structure Determines Function
What determines structure?
- Energy
- Kinematics
How can we determine structure?
- Experimental methods
- Computational predictions
The Protein Folding Problem

Primary Structure: Sequence
The primary structure of a protein is the amino acid sequence

Primary Structure: Sequence
Twenty different amino acids have distinct shapes and properties

Primary Structure: Sequence
A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"

Secondary Structure: α, β, & loops
α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms

Tertiary Structure: A Protein Fold

PDB Growth
New PDB structures

Only a few folds are found in nature

Protein classification
Number of protein sequences grows exponentially
Number of solved structures grows exponentially
Number of new folds identified very small (and close to constant)
Protein classification can
- Generate overview of structure types
- Detect similarities (evolutionary relationships) between protein sequences
- Help predict 3D structure of new protein sequences
SCOP release 1.71: classification of 27,599 protein structures in PDB
Table columns: Class, # folds, # superfamilies, # families
Table rows: All alpha proteins; All beta proteins; Alpha and beta proteins (a/b); Alpha and beta proteins (a+b); Multi-domain proteins (48, 64); Membrane & cell surface; Small proteins; Total

Protein structure classification
[Figure: hierarchy with levels Protein world, Protein fold, Protein superfamily, Protein family]
Morten Nielsen, CBS, BioCentrum, DTU

Structure Classification Databases
SCOP
- Manual classification (A. Murzin)
- scop.berkeley.edu
CATH
- Semi-manual classification (C. Orengo)
FSSP
- Automatic classification (L. Holm)

Major classes in SCOP
Classes
- All α proteins
- All β proteins
- α and β proteins (α/β)
- α and β proteins (α+β)
- Multi-domain proteins
- Membrane and cell surface proteins
- Small proteins
- Coiled coil proteins

All α: Hemoglobin (1bab)

All β: Immunoglobulin (8fab)

α/β: Triosephosphate isomerase (1hti)

α+β: Lysozyme (1jsf)

Families
Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
- The same protein may be found in several species
[Figure: classification hierarchy with levels Fold, Superfamily, Family, Proteins]

Superfamilies
Proteins which are (remotely) evolutionarily related
- Sequence similarity low
- Share function
- Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Figure: classification hierarchy with levels Fold, Superfamily, Family, Proteins]

Folds
>~50% secondary structure elements arranged in the same order in sequence and in 3D
No evolutionary relation
[Figure: classification hierarchy with levels Fold, Superfamily, Family, Proteins]

Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences!

Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate $A_{ab}$: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate $P(a) = \left(\sum_b A_{ab}\right) / \left(\sum_{c \le d} A_{cd}\right)$; $P(a, b) = A_{ab} / \left(\sum_{c \le d} A_{cd}\right)$

Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution
Goal:
- Produce the most likely alignment
- Assert the likelihood that the sequences are indeed related

A Pair HMM for alignments
Model M
[Figure: states M (emits $P(x_i, y_j)$), I (emits $P(x_i)$), J (emits $P(y_j)$); transition labels 1 – 2δ, 1 – ε, δ, ε; one element of the diagram is marked "optional"]
This model generates two sequences simultaneously
Match/Mismatch state M: P(x, y) reflects substitution frequencies between pairs of amino acids
Insertion states I, J: P(x), P(y) reflect frequencies of each amino acid
δ: set so that 1/(2δ) is avg. length before next gap
ε: set so that 1/(1 – ε) is avg. length of a gap

A Pair HMM for unaligned sequences
Model R
[Figure: states I (emits $P(x_i)$) and J (emits $P(y_j)$), with transition probabilities 1]
The two sequences are generated independently of one another
$P(x, y \mid R) = P(x_1) \dots P(x_m)\, P(y_1) \dots P(y_n) = \prod_i P(x_i) \prod_j P(y_j)$

To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
- M: $(1 - 2\delta)\, P(x_i, y_j)$ when matched; $\varepsilon\, P(x_i)\, P(y_j)$ when gapped
- R: $P(x_i)\, P(y_j)$ in random model
Focus on comparison of $P(x_i, y_j)$ vs. $P(x_i)\, P(y_j)$
[Figure: Model M (states M, I, J with transition labels 1 – 2δ, 1 – ε, δ, ε) next to Model R (states I, J with transition probabilities 1)]

To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
- M: $(1 - 2\delta)\, P(x_i, y_j)$ when matched; $\varepsilon\, P(x_i)\, P(y_j)$ when gapped
- R: $P(x_i)\, P(y_j)$ in random model
Focus on comparison of $P(x_i, y_j)$ vs. $P(x_i)\, P(y_j)$
[Figure: Model M redrawn with re-scaled transition labels, annotated "Equivalent!"]

To compare ALIGNMENT vs. RANDOM hypothesis
Idea: We will divide alignment score by the random score, and take logarithms
Let
$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)} + \log(1 - 2\delta)$ (definition: substitution score)
$d = -\log \frac{\delta (1 - \varepsilon)\, P(x_i)}{(1 - 2\delta)\, P(x_i)}$ (definition: gap initiation penalty)
$e = -\log \frac{\varepsilon\, P(x_i)}{P(x_i)}$ (definition: gap extension penalty)
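As a rough illustration of these definitions, the sketch below turns pair-HMM parameters into alignment scores. The function name, the dictionary layout, and the parameter values in the usage note are assumptions for this example. The P(x_i) factors in d and e cancel, so they do not appear in the code.

```python
import math

def pair_hmm_scores(p_joint, p_single, delta, epsilon):
    """Derive substitution scores and affine gap penalties from pair-HMM parameters.

    p_joint[(a, b)] : P(a, b) emitted by the match state M
    p_single[a]     : background frequency P(a)
    delta, epsilon  : gap-open and gap-extend transition probabilities
    """
    s = {
        (a, b): math.log(p_ab / (p_single[a] * p_single[b])) + math.log(1 - 2 * delta)
        for (a, b), p_ab in p_joint.items()
    }
    d = -math.log(delta * (1 - epsilon) / (1 - 2 * delta))  # gap initiation penalty
    e = -math.log(epsilon)                                  # gap extension penalty
    return s, d, e
```

With hypothetical values delta = 0.05 and epsilon = 0.5, this gives d = log 36 ≈ 3.58 and e = log 2 ≈ 0.69 (natural logarithms), so opening a gap costs far more than extending one.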

The meaning of alignment scores
The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps
$V_M(i,j) = \max \{ V_M(i-1,j-1),\ V_I(i-1,j-1),\ V_J(i-1,j-1) \} + s(x_i, y_j)$
$V_I(i,j) = \max \{ V_M(i-1,j) - d,\ V_I(i-1,j) - e \}$
$V_J(i,j) = \max \{ V_M(i,j-1) - d,\ V_J(i,j-1) - e \}$
where
- s(.,.) and (1 – 2δ) ~ how often a pair of letters substitute one another
- ε ~ 1/mean length of next gap
- δ(1 – ε)/(1 – 2δ) ~ 1/mean arrival time of next gap

The meaning of alignment scores
Match/mismatch scores:
$s(a, b) \approx \log \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)}$ (ignore $\log(1 - 2\delta)$ for the moment)
Example: DNA regions between human and mouse genes have average conservation of 80%
1. What is the substitution score for a match?
P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8, so P(x, x) = 0.2
P(a) = P(c) = P(g) = P(t) = 0.25
$s(x, x) = \log [ 0.2 / 0.25^2 ] = 1.163$ (natural logarithm)
2. What is the substitution score for a mismatch?
P(a, c) + … + P(t, g) = 0.2, so P(x, y≠x) = 0.2/12 = 0.0167
$s(x, y \ne x) = \log [ 0.0167 / 0.25^2 ] = -1.322$
3. What ratio matches/(matches + mism.) gives score 0?
x(#match) – y(#mism) = 1.163(#match) – 1.322(#mism) = 0
#match = 1.137(#mism)
matches = 53.2%
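The arithmetic of this example can be reproduced with a short script. It assumes natural logarithms and a uniform background of 0.25 per nucleotide, as in the example; the function name is illustrative.

```python
import math

def match_mismatch_scores(conservation, background=0.25):
    """Log-odds scores for DNA given the fraction of conserved (matching) positions."""
    p_match = conservation / 4            # four identical pairs share the conserved mass
    p_mismatch = (1 - conservation) / 12  # twelve ordered mismatching pairs share the rest
    s_match = math.log(p_match / background ** 2)
    s_mismatch = math.log(p_mismatch / background ** 2)
    # Fraction of matches at which the total score is exactly 0.
    break_even = -s_mismatch / (s_match - s_mismatch)
    return s_match, s_mismatch, break_even

s_match, s_mismatch, break_even = match_mismatch_scores(0.8)
```

For 80% conservation this gives roughly 1.163 for a match, -1.322 for a mismatch, and a break-even identity of about 53.2%, in line with the example.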

BLOSUM matrices
[Figure: BLOSUM 50 and BLOSUM 62 matrices]
(The two are scaled differently)