Biological Sequences and Hidden Markov Models. CBPS7711, Sept 9, 2010. Sonia Leach, PhD, Assistant Professor, Center for Genes, Environment, and Health, National Jewish Health. Slides created from David Pollock's slides from last year's 7711 and the current reading list on the CBPS7711 website.

Introduction [Portrait: Andrey Markov]
Despite their complex 3-D structure, biological molecules have a primary linear sequence (DNA, RNA, protein) or a linear sequence of features (CpG islands, exons, introns, regulatory regions, genes).
Hidden Markov Models (HMMs) are probabilistic models for processes that transition through a discrete set of states, each emitting a symbol (a probabilistic finite state machine).
HMMs exhibit the 'Markov property': the conditional probability distribution of future states depends only on the present state (memory-less).
The linear sequence of molecules/features is modelled as a path through the states of the HMM, which emit the sequence of molecules/features.
The actual state is hidden and is observed only through the output symbols.

Hidden Markov Model
– Finite set of N states X
– Finite set of M observation symbols O
– Parameter set λ = (A, B, π)
– Initial state distribution π_i = Pr(X_1 = i)
– Transition probability a_ij = Pr(X_t = j | X_{t-1} = i)
– Emission probability b_ik = Pr(O_t = k | X_t = i)
Example (from the slide figure): N = 3 states, M = 2 symbols, π = (0.25, 0.55, 0.2); the transition matrix A and emission matrix B were given in the figure.

Hidden Markov Model (HMM), shown as a graphical model: a hidden state chain X_{t-1} → X_t with emissions X_t → O_t (same definitions and N = 3, M = 2 example as the previous slide).
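To make the parameterization concrete, here is a minimal Python/NumPy sketch of a parameter set λ = (A, B, π) for the N = 3, M = 2 example above. The initial distribution π is the one on the slide; the A and B values are hypothetical placeholders, since the actual matrices appeared only in the slide figure.

```python
import numpy as np

# lambda = (A, B, pi) for an HMM with N=3 states and M=2 observation symbols.
# pi is from the slide; A and B are hypothetical placeholder values.
pi = np.array([0.25, 0.55, 0.20])          # pi[i]   = Pr(X_1 = i)

A = np.array([[0.6, 0.3, 0.1],             # A[i, j] = Pr(X_t = j | X_{t-1} = i)
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

B = np.array([[0.7, 0.3],                  # B[i, k] = Pr(O_t = k | X_t = i)
              [0.4, 0.6],
              [0.1, 0.9]])

# Sanity checks: each distribution sums to 1.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```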

Probabilistic Graphical Models, organized by time, observability, and utility:
– States fully observable, no utility: Markov Process (MP)
– States hidden (partially observable), no utility: Hidden Markov Model (HMM)
– States fully observable, with utility/actions: Markov Decision Process (MDP)
– States hidden, with utility/actions: Partially Observable Markov Decision Process (POMDP)
(The HMM graphical model: hidden chain X_{t-1} → X_t with emissions X_t → O_t.)

Three basic problems of HMMs
1. Given the observation sequence O = O_1, O_2, …, O_n, how do we compute Pr(O | λ)?
2. Given the observation sequence, how do we choose the corresponding state sequence X = X_1, X_2, …, X_n that is optimal?
3. How do we adjust the model parameters λ to maximize Pr(O | λ)?

Probability of O is a sum over all state sequences:
Pr(O | λ) = Σ_all X Pr(O | X, λ) Pr(X | λ) = Σ_all X π_x1 b_x1,o1 a_x1,x2 b_x2,o2 … a_x(n-1),xn b_xn,on
An efficient dynamic programming algorithm does this: the Forward algorithm (Baum and Welch).
(Recall: π_i = Pr(X_1 = i), a_ij = Pr(X_t = j | X_{t-1} = i), b_ik = Pr(O_t = k | X_t = i); example N = 3, M = 2, π = (0.25, 0.55, 0.2) as before.)

A Simple HMM: CpG Islands. In one state there is a much higher probability of emitting C or G.
Two states: CpG and Non-CpG.
Emission probabilities: CpG emits A .2, C .3, G .3, T .2; Non-CpG emits A .4, C .1, G .1, T .4.
(The worked example on the following slides uses transition probabilities CpG→CpG = .8, CpG→Non = .2, Non→CpG = .1, Non→Non = .9, and initial probabilities .5/.5.)
From David Pollock
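Encoded in Python, the two-state model looks like this minimal sketch. The emission probabilities are from the slide; the transition probabilities (.8/.2 out of CpG, .1/.9 out of Non-CpG) and the uniform .5/.5 initial distribution are read off the worked forward calculations on the following slides.

```python
# Two-state CpG-island HMM from the slide (transition and initial values
# inferred from the worked example on the next slides).
states = ["CpG", "Non"]

init = {"CpG": 0.5, "Non": 0.5}                       # pi_i

trans = {"CpG": {"CpG": 0.8, "Non": 0.2},             # a_ij
         "Non": {"CpG": 0.1, "Non": 0.9}}

emit = {"CpG": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # b_ik
        "Non": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
```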

The Forward Algorithm: the probability of a sequence is the sum of all paths that can produce it (CpG model above; adapted from David Pollock's).
O = G: Pr(G | λ) = π_C b_C,G + π_N b_N,G = .5·.3 + .5·.1. For convenience, drop the 0.5 factors for now and add them back at the end, leaving forward values .3 (CpG) and .1 (Non-CpG).

The Forward Algorithm, O = GC: there are 4 possible state sequences (CC, NC, CN, NN). Forward values at position 2: CpG: (.3·.8 + .1·.1)·.3 = .075; Non-CpG: (.3·.2 + .1·.9)·.1 = .015.

The Forward Algorithm, O = GCG: there are 8 possible state sequences (CCC, CCN, NCC, NCN, NNC, NNN, CNC, CNN). Forward values at position 3: CpG: (.075·.8 + .015·.1)·.3 = .0185; Non-CpG: (.075·.2 + .015·.9)·.1 = .0029.


The Forward Algorithm, O = GCGA and O = GCGAA. Position 4: CpG: (.0185·.8 + .0029·.1)·.2 = .003; Non-CpG: (.0185·.2 + .0029·.9)·.4 = .0025. Position 5: CpG: (.003·.8 + .0025·.1)·.2 = .0005; Non-CpG: (.003·.2 + .0025·.9)·.4 = .0011. Adapted from David Pollock's.

The Forward Algorithm, summary. Problem 1 (computing Pr(O | λ)): adding back the deferred 0.5 initial factors, Pr(GCGAA | λ) = 0.5·.0005 + 0.5·.0011 ≈ 8e-4.
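A minimal forward-algorithm sketch that reproduces this number, using the CpG model encoded earlier (here the 0.5 initial probabilities are kept in from the start rather than added at the end):

```python
def forward(obs, states, init, trans, emit):
    """Return Pr(O | lambda) by the forward algorithm (sum over all paths)."""
    # alpha[i] = Pr(O_1..t, X_t = i | lambda)
    alpha = {i: init[i] * emit[i][obs[0]] for i in states}
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states) * emit[j][o]
                 for j in states}
    return sum(alpha.values())

states = ["CpG", "Non"]
init  = {"CpG": 0.5, "Non": 0.5}
trans = {"CpG": {"CpG": 0.8, "Non": 0.2}, "Non": {"CpG": 0.1, "Non": 0.9}}
emit  = {"CpG": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "Non": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(forward("GCGAA", states, init, trans, emit))   # ~8e-4, as on the slide
```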

The Forward Algorithm, same table of forward values. Problem 2: what is the optimal state sequence? (Addressed by the Viterbi algorithm on the next slides.)


The Viterbi Algorithm: the most likely path (use max instead of sum). Adapted from David Pollock's (note the error in the formulas on his slides).
O = G: .3 (CpG), .1 (Non-CpG), again deferring the 0.5 initial factors.
O = GC: CpG: max(.3·.8, .1·.1)·.3 = .072; Non-CpG: max(.3·.2, .1·.9)·.1 = .009.
O = GCG: CpG: max(.072·.8, .009·.1)·.3 = .0173; Non-CpG: max(.072·.2, .009·.9)·.1 = .0014.
O = GCGA: CpG: max(.0173·.8, .0014·.1)·.2 = .0028; Non-CpG: max(.0173·.2, .0014·.9)·.4 = .0014.
O = GCGAA: CpG: max(.0028·.8, .0014·.1)·.2 ≈ .0004; Non-CpG: max(.0028·.2, .0014·.9)·.4 = .0005.

The Viterbi Algorithm: most likely path by backtracking. The best final value is .0005, in the Non-CpG state; following the back-pointers (each max above records which previous state achieved it) recovers the most likely state sequence CpG, CpG, CpG, Non-CpG, Non-CpG for GCGAA. Adapted from David Pollock's.
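A corresponding Viterbi sketch for the same toy model: replacing the sum with a max and keeping back-pointers recovers the most likely state path.

```python
def viterbi(obs, states, init, trans, emit):
    """Return (best path probability, most likely state sequence)."""
    # v[i] = probability of the best path ending in state i at the current position
    v = {i: init[i] * emit[i][obs[0]] for i in states}
    ptrs = []                                   # back-pointers, one dict per position
    for o in obs[1:]:
        ptr, new_v = {}, {}
        for j in states:
            best = max(states, key=lambda i: v[i] * trans[i][j])
            ptr[j] = best
            new_v[j] = v[best] * trans[best][j] * emit[j][o]
        ptrs.append(ptr)
        v = new_v
    last = max(states, key=lambda i: v[i])      # best final state
    path = [last]
    for ptr in reversed(ptrs):                  # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return v[last], list(reversed(path))

states = ["CpG", "Non"]
init  = {"CpG": 0.5, "Non": 0.5}
trans = {"CpG": {"CpG": 0.8, "Non": 0.2}, "Non": {"CpG": 0.1, "Non": 0.9}}
emit  = {"CpG": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "Non": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

prob, path = viterbi("GCGAA", states, init, trans, emit)
print(prob, path)   # path CpG,CpG,CpG,Non,Non; probability is half the slide's .0005
                    # because the 0.5 initial factor is included here from the start
```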

Forward-backward algorithm. Problem 3: how do we learn the model? The forward algorithm calculated Pr(O_1..t, X_t = i | λ).

Parameter estimation by Baum-Welch (Forward-Backward algorithm). Forward variable α_t(i) = Pr(O_1..t, X_t = i | λ); backward variable β_t(i) = Pr(O_t+1..n | X_t = i, λ). (Rabiner 1989)
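A sketch of the backward recursion that complements the forward pass, again on the toy CpG model; the posterior γ_t(i) computed at the end is the quantity Baum-Welch re-estimation is built from.

```python
def forward_vars(obs, states, init, trans, emit):
    """alpha[t][i] = Pr(O_1..t, X_t = i | lambda), one dict per position."""
    alpha = [{i: init[i] * emit[i][obs[0]] for i in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
                      for j in states})
    return alpha

def backward_vars(obs, states, trans, emit):
    """beta[t][i] = Pr(O_{t+1..n} | X_t = i, lambda)."""
    n = len(obs)
    beta = [{i: 1.0 for i in states}]                 # beta_n(i) = 1
    for t in range(n - 1, 0, -1):                     # fill beta_{n-1} .. beta_1
        nxt = beta[0]
        beta.insert(0, {i: sum(trans[i][j] * emit[j][obs[t]] * nxt[j]
                               for j in states) for i in states})
    return beta

states = ["CpG", "Non"]
init  = {"CpG": 0.5, "Non": 0.5}
trans = {"CpG": {"CpG": 0.8, "Non": 0.2}, "Non": {"CpG": 0.1, "Non": 0.9}}
emit  = {"CpG": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "Non": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

obs = "GCGAA"
alpha = forward_vars(obs, states, init, trans, emit)
beta = backward_vars(obs, states, trans, emit)
prob = sum(alpha[-1].values())                        # Pr(O | lambda)
# Posterior state probabilities gamma_t(i) = alpha_t(i) * beta_t(i) / Pr(O | lambda):
# the expected state occupancies Baum-Welch uses to re-estimate A and B.
gamma = [{i: a[i] * b[i] / prob for i in states} for a, b in zip(alpha, beta)]
print(gamma[0])
```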

Homology HMM
Gene recognition; classify sequences to identify distant homologs of a common ancestral sequence.
– Parameter set λ = (A, B, π), strict left-to-right model
– Specially defined set of states: start, stop, match, insert, delete
– For the initial state distribution π, use the 'start' state
– For the transition matrix A, use global transition probabilities
– For the emission matrix B: match states have site-specific emission probabilities; insert states (relative to the ancestor) have global emission probabilities; delete states emit nothing
Built from Multiple Sequence Alignments. Adapted from David Pollock's.

Homology HMM [Diagram: profile HMM architecture; a start state, then for each profile position a match state with associated insert and delete states, ending in an end state]. Adapted from David Pollock's.
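As a concrete illustration of this architecture (not any particular package's implementation), here is a small sketch that enumerates the states and one common set of allowed transitions for a profile HMM with a given number of match positions; exact conventions, such as whether a begin-to-delete transition is allowed, differ between implementations.

```python
def profile_states(length):
    """State names for a profile HMM with `length` match positions."""
    matches = [f"M{k}" for k in range(1, length + 1)]
    inserts = [f"I{k}" for k in range(0, length + 1)]   # I0 sits before the first match
    deletes = [f"D{k}" for k in range(1, length + 1)]
    return ["start"] + matches + inserts + deletes + ["end"]

def allowed_transitions(length):
    """One common set of allowed transitions (a sketch; conventions vary)."""
    edges = {("start", "M1"), ("start", "I0"), ("start", "D1")}
    for k in range(0, length + 1):
        nxt_m = f"M{k + 1}" if k < length else "end"
        edges.add((f"I{k}", f"I{k}"))            # insert states loop on themselves
        edges.add((f"I{k}", nxt_m))
        if k >= 1:
            edges.add((f"M{k}", f"I{k}"))        # match -> insert after position k
            edges.add((f"M{k}", nxt_m))          # match -> next match (or end)
            edges.add((f"D{k}", nxt_m))          # delete -> next match (or end)
            if k < length:
                edges.add((f"M{k}", f"D{k + 1}"))    # match -> next delete
                edges.add((f"D{k}", f"D{k + 1}"))    # delete -> next delete
    return sorted(edges)

print(profile_states(3))
print(len(allowed_transitions(3)), "allowed transitions")
```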

Homology HMM Example [Diagram: a three-position profile with match, insert, and delete states; each match state has its own emission distribution, e.g. match 1: A .1, C .05, D .2, E .08, F .01; match 2: A .04, C .1, D .01, E .2, F .02; match 3: A .2, C .01, D .05, E .1, F .06].

Profile HMM architectures (Eddy, 1998): ungapped blocks; ungapped blocks where insertion states model the intervening sequence between blocks; insert/delete states allowed anywhere; architectures that allow multiple domains and sequence fragments.

Homology HMM Uses
– Find homologs to the profile HMM in a database: score sequences for their match to the HMM; not always Pr(O | λ), since some regions may be highly diverged; sometimes use the 'highest-scoring subsequence'; the goal is to find homologs in the database
– Classify a sequence using a library of profile HMMs: compare alternative models
– Alignment of additional sequences
– Structural alignment: when the alphabet is secondary-structure symbols, can do fold recognition, etc.
Adapted from David Pollock's

Why Hidden Markov Models for MSA? Multiple sequence alignment as consensus: – May have substitutions; not all amino acids are equal – Could use regular expressions, but how to handle indels? – What about variable-length members of the family?
FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK  VPTVTAISTSPDLQWLVQPTLISSVAPSQNRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110

Why Hidden Markov Models? Rather than a consensus sequence, which describes only the most common amino acid per position, HMMs allow more than one amino acid to appear at each position. Rather than profiles as position-specific scoring matrices (PSSMs), which assign a probability to each amino acid in each position of the domain and slide a fixed-length profile along a longer sequence to calculate a score, HMMs model the probability of variable-length sequences. Rather than regular expressions, which can capture variable-length sequences yet specify only a limited subset of amino acids per position, HMMs quantify the difference between using different amino acids at each position.

Model Comparison. Based on the likelihood Pr(O | λ). – For maximum likelihood (ML), take the λ that maximizes Pr(O | λ); usually work with log Pr(O | λ) to avoid numeric error. – For heuristics, the 'score' is typically a log-odds ratio of the model likelihood against a null model. – For Bayesian comparison, calculate the posterior Pr(λ | O) ∝ Pr(O | λ) Pr(λ), which uses 'prior' information on the parameters. Adapted from David Pollock's.
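A sketch of the heuristic 'score' idea in code: compare log Pr(O | family model) to log Pr(O | null model), i.e., a log-odds score. The forward function and toy CpG model repeat the earlier sketches; the i.i.d. uniform background null is an assumption for illustration, and for genuinely long sequences the forward recursion itself would need scaling or log-space arithmetic rather than taking the log only at the end.

```python
import math

def forward(obs, states, init, trans, emit):
    """Pr(O | lambda) by the forward algorithm."""
    alpha = {i: init[i] * emit[i][obs[0]] for i in states}
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states) * emit[j][o]
                 for j in states}
    return sum(alpha.values())

# Toy CpG model from earlier; the null model is an i.i.d. uniform background (an assumption).
states = ["CpG", "Non"]
init  = {"CpG": 0.5, "Non": 0.5}
trans = {"CpG": {"CpG": 0.8, "Non": 0.2}, "Non": {"CpG": 0.1, "Non": 0.9}}
emit  = {"CpG": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "Non": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
null  = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

obs = "GCGAA"
log_odds = (math.log(forward(obs, states, init, trans, emit))
            - sum(math.log(null[o]) for o in obs))
print(log_odds)   # positive values favor the HMM over the background model
```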

Parameters. Types of parameters: – Amino acid distributions for positions (match states) – Global AA distributions for insert states – Order of match states – Transition probabilities – Phylogenetic tree topology and branch lengths – Hidden states (integrate out or augment). Fitting wanders the parameter space (search): maximize, or move according to the posterior probability (Bayes). Adapted from David Pollock's.

Expectation Maximization (EM). A classic algorithm for fitting the parameters of a probabilistic model with unobservable (hidden) states. Two stages: – Maximization: if the hidden variables (states) are known, maximize the model parameters with respect to that knowledge. – Expectation: if the model parameters are known, find the expected values of the hidden variables (states). Works well; used even with, e.g., Bayesian approaches to find the near-equilibrium region of parameter space. Adapted from David Pollock's.

Homology HMM via EM. Start with a heuristic MSA (e.g., ClustalW). Maximization: match states are the columns where residues are aligned in most sequences; emission probabilities come from the amino acid frequencies observed in those columns. Expectation: realign all the sequences given the model. Repeat until convergence. Problems: this is local, not global, optimization; use procedures to check how well it worked. Adapted from David Pollock's.
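A sketch of the maximization step for the emission parameters: given a heuristic alignment, treat well-occupied columns as match states and estimate each match state's amino-acid distribution from the observed column frequencies plus pseudocounts. The 50% occupancy rule, the add-one pseudocounts, and the toy alignment below are illustrative assumptions, not the specific choices from the slides.

```python
from collections import Counter

def match_emissions(msa, alphabet="ACDEFGHIKLMNPQRSTVWY",
                    min_occupancy=0.5, pseudo=1.0):
    """Estimate match-state emission probabilities from MSA columns.

    Columns with at least `min_occupancy` non-gap characters become match states;
    each match state's distribution is (observed counts + pseudocounts), normalized.
    """
    nseq = len(msa)
    emissions = []
    for col in zip(*msa):                          # iterate over alignment columns
        residues = [c for c in col if c != "-"]
        if len(residues) / nseq < min_occupancy:   # mostly gaps: treat as insert column
            continue
        counts = Counter(residues)
        total = len(residues) + pseudo * len(alphabet)
        emissions.append({aa: (counts.get(aa, 0) + pseudo) / total for aa in alphabet})
    return emissions

# Tiny toy alignment (hypothetical sequences, not the FOS example above)
msa = ["ACD-E",
       "ACD-E",
       "AC-GE",
       "TCDGE"]
for k, dist in enumerate(match_emissions(msa), start=1):
    print(f"match state {k}: A={dist['A']:.2f} C={dist['C']:.2f} T={dist['T']:.2f}")
```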

Model Comparison. Determining significance depends on comparing two models (family vs. non-family). – Usually a null model, H_0, and a test model, H_1. – The models are nested if H_0 is a subset of H_1. – If not nested: Akaike Information Criterion (AIC) [similar to empirical Bayes] or Bayes Factor (BF) [but be careful]. Generating a null distribution of the statistic: Z-factor, bootstrapping, parametric bootstrapping, posterior predictive. Adapted from David Pollock's.

Z Test Method. Use a database of known negative controls, e.g., non-homologous (NH) sequences. Assume the NH scores are approximately normally distributed, i.e., you are modeling the known NH sequence scores as a normal distribution. Set an appropriate significance level for multiple comparisons (more below). Problems: – Is homology certain? – Is it the appropriate null model? A normal distribution is often not a good approximation. – Parameter control is hard: e.g., the length distribution. Adapted from David Pollock's.
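A sketch of the Z-test idea: fit a normal distribution to the HMM scores of known non-homologous sequences and convert a query's score to a z-score and one-sided p-value. All score values below are made up; in practice they would come from scoring the negative-control database with the profile HMM.

```python
import statistics
from math import erf, sqrt

# Hypothetical HMM scores for known non-homologous (negative control) sequences
null_scores = [4.1, 3.2, 5.0, 2.8, 4.4, 3.9, 3.5, 4.7, 3.1, 4.0]
mu, sigma = statistics.mean(null_scores), statistics.stdev(null_scores)

query_score = 9.3                                  # hypothetical score of a candidate homolog
z = (query_score - mu) / sigma
p_value = 0.5 * (1 - erf(z / sqrt(2)))             # one-sided, assuming normality
print(f"z = {z:.2f}, p = {p_value:.2g}")
# Adjust the significance threshold for the number of database comparisons
# (multiple testing), as the slide notes.
```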

Bootstrapping and Parametric Models. Random sequences are sampled from the same set of emission probability distributions. – Matching the same length is easy. – Bootstrapping is re-sampling columns. – Parametric sampling uses the estimated frequencies and may include variance, the tree, etc.; it is more flexible and can express a more complex null. – Use pseudocounts of global frequencies if data are limited. – Insertions are relatively hard to model: what frequencies for the insert states? Global? Adapted from David Pollock's.
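A sketch of column bootstrapping: resample alignment columns with replacement to build pseudo-alignments of the same length, whose scores against the model would form a null distribution. The toy alignment is hypothetical and the scoring step is left abstract.

```python
import random

def bootstrap_columns(msa, n_replicates=3, seed=0):
    """Resample MSA columns with replacement; return a list of pseudo-alignments."""
    rng = random.Random(seed)
    columns = list(zip(*msa))                            # alignment as a list of columns
    replicates = []
    for _ in range(n_replicates):
        sampled = rng.choices(columns, k=len(columns))   # same length, columns resampled
        replicates.append(["".join(row) for row in zip(*sampled)])
    return replicates

# Toy alignment (hypothetical); each replicate keeps the row/column structure.
msa = ["ACD-E", "ACD-E", "AC-GE", "TCDGE"]
for rep in bootstrap_columns(msa):
    print(rep)
# Scoring each replicate with the profile HMM would give a null score distribution.
```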

Homology HMM Resources UCSC (Haussler) –SAM: align, secondary structure predictions, HMM parameters, etc. WUSTL/Janelia (Eddy) –Pfam: database of pre-computed HMM alignments for various proteins –HMMer: program for building HMMs Adapted from David Pollock’s


Why Hidden Markov Models? Multiple sequence alignment as consensus (same FOS/FOSB alignment as the earlier slide): – May have substitutions; not all amino acids are equal – Could use regular expressions, but how to handle indels? – What about variable-length members of the family? – But don't accept everything: typically introduce a gap penalty.


Acknowledgements