1 Chapter 5 Profile HMMs for Sequence Families
2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG islands), or on pairwise alignment of sequences. Functional biological sequences typically come in families. Many of the most powerful sequence analysis methods are based on identifying the relationship of an individual sequence to a sequence family.
3 Sequence Families Sequences in a family have diverged from each other in their primary sequence during evolution, but normally maintain the same or a related function. Thus, identifying that a sequence belongs to a family, and aligning it to the other members, often tells us about its function.
4 Sequence Family If we have a set of sequences belonging to a family, we can perform a database search for more members using pairwise alignment with one of the known family members as the query sequence. We could even search with the known members one by one. However, pairwise searching with any one of the members may not find sequences distantly related to the ones we already have.
5 Sequence Family An alternative approach is to use statistical features of the whole set of sequences in the search. Similarly, even when family membership is clear, alignment accuracy can often be improved significantly by concentrating on features that are conserved in the whole family.
6 Sequence Family A pairwise alignment captures much of the relationship between two sequences. A multiple alignment can show how the sequences in a family relate to each other.
7 Sequences From A Globin Family
8 Globin Family It is clear that some positions in the globin alignment are more conserved than others. The helices are more conserved than the loop regions between them. Certain residues are particularly strongly conserved (for example, two conserved histidines, H). When identifying a new sequence as a globin, it would be desirable to concentrate on checking that these more conserved features are present.
9 HMM Profile –Consensus modelling of the family using a probabilistic model –Built from a given multiple alignment (assumed to be correct!) –We develop a particular type of hidden Markov model well suited to modelling multiple alignments; we call these profile HMMs
10 HMM Profile We will assume that we are given a correct multiple alignment, from which we will build a model that can be used to find and score potential matches to new sequences.
11 Ungapped Score Matrices A natural probabilistic model for a conserved region is to specify independent probabilities $e_i(a)$ of observing amino acid $a$ in position $i$. The probability of a new sequence $x$ of length $L$ according to this model is $P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$.
12 Log-odds ratio We are interested in the ratio of this probability to the probability of $x$ under the random model, $S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$. The values $\log(e_i(a)/q_a)$ behave like elements in a score matrix $s(a,b)$, where the second index is the position $i$ rather than an amino acid $b$. Such a matrix is called a position specific score matrix (PSSM).
13 Adding Insert and Delete States Although a PSSM captures some conservation information, it is clearly an inadequate representation of all the information in a multiple alignment of a protein family. We have to find some way to take account of gaps.
14 Components of profile HMMs Consideration of gaps –Henikoff & Henikoff [1991]: combining multiple ungapped block models –Allowing gaps at each position using the same gap score γ(g) at each position (this ignores where gaps are more or less likely) –Profile HMMs: a repetitive structure of states, different probabilities in each position, and a full probabilistic model for sequences in the sequence family
15 Components of profile HMMs The PSSM can be viewed as a trivial HMM with a series of identical states that we will call match states, separated by transitions of probability 1. Alignment is trivial since there is no choice of transitions. Match states –Emission probabilities $e_{M_j}(a)$ [Figure: Begin → ... → M_j → ... → End]
16 Ungapped profiles and the corresponding HMMs [Figure: Beg → ... → M_j → ... → End] Example alignment: AGAAACT, AGGAATT, TGAATCT. Each match state (a blue square in the figure) "emits" each letter a with a probability $e_j(a)$, defined by the frequency of a at position j:
       1    2    3    4    5    6    7
  A   2/3   0   2/3   1   2/3   0    0
  T   1/3   0    0    0   1/3  1/3   1
  C    0    0    0    0    0   2/3   0
  G    0    1   1/3   0    0    0    0
P(AGAAACT) = 16/81 and P(TGGATTT) = 1/81. Typically, pseudocounts are added in HMMs to avoid zero probabilities.
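The two probabilities in the example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `emission_table` and `sequence_probability` are illustrative, and no pseudocounts are applied, so unseen letters get probability zero.

```python
from collections import Counter
from fractions import Fraction

# The three aligned sequences from the example slide.
ALIGNMENT = ["AGAAACT", "AGGAATT", "TGAATCT"]

def emission_table(seqs):
    """Per-column letter frequencies e_j(a), as exact fractions."""
    table = []
    for column in zip(*seqs):
        counts = Counter(column)
        table.append({a: Fraction(n, len(column)) for a, n in counts.items()})
    return table

def sequence_probability(x, table):
    """P(x | M) = product over positions of e_j(x_j); zero for unseen letters."""
    p = Fraction(1)
    for residue, column in zip(x, table):
        p *= column.get(residue, Fraction(0))
    return p

table = emission_table(ALIGNMENT)
print(sequence_probability("AGAAACT", table))  # 16/81
print(sequence_probability("TGGATTT", table))  # 1/81
```

Exact fractions are used so that the result can be compared directly with the values on the slide.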
17 Insertions and deletions in profile HMMs [Figure: insert state I_j attached to the chain Beg → ... → M_j → ... → End] Insert states emit symbols just like the match states; however, their emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores. Transitions I_j → I_j are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).
18 Components of profile HMMs Insert states –Emission probabilities: usually the background distribution $q_a$ –Transition probabilities: M_i to I_i, I_i to itself, I_i to M_{i+1} –Log-odds score of a gap of length k (no log-odds contribution from emissions): $\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$
19 Insertions and deletions in profile HMMs [Figure: delete state D_j attached to the chain Beg → ... → M_j → ... → End] Deletions are represented by silent states, which do not emit any letters. A sequence of deletions (with D → D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string). The total cost of a deletion is the sum of the costs of the individual transitions (M → D, D → D, D → M) that define this deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
20 Components of profile HMMs Delete states –No emission probabilities –Cost of a deletion: the sum of the M→D, D→D and D→M transition costs –Each D→D transition in a run of deletions may have a different cost, because each connects a different pair of delete states, whereas every I→I step in a run of insertions uses the same self-loop and so has the same cost
21 Gap penalties: evolutionary and computational considerations Linear gap penalties: $\gamma(k) = -kd$ for a gap of length k and constant d. Affine gap penalties: $\gamma(k) = -[d + (k-1)e]$, where d is the gap-opening penalty and e is the gap-extension penalty.
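The link between these penalties and the HMM transitions can be made concrete. The sketch below (illustrative names, arbitrary example probabilities) shows that the log-probability of an insertion of length k through a single insert state has exactly the affine form, assuming the opening cost is d = -(log a_MI + log a_IM) and the extension cost is e = -log a_II.

```python
import math

def linear_gap(k, d):
    """Linear penalty: gamma(k) = -k * d."""
    return -k * d

def affine_gap(k, d, e):
    """Affine penalty: gamma(k) = -(d + (k - 1) * e)."""
    return -(d + (k - 1) * e)

def insert_log_score(k, a_mi, a_ii, a_im):
    """Log-probability of the transitions for k residues emitted by one insert state.

    One M->I transition, k - 1 self-loops I->I, and one I->M transition.
    """
    return math.log(a_mi) + (k - 1) * math.log(a_ii) + math.log(a_im)

# Hypothetical transition probabilities, chosen only for illustration.
a_mi, a_ii, a_im = 0.1, 0.4, 0.6
d = -(math.log(a_mi) + math.log(a_im))  # implied gap-open penalty
e = -math.log(a_ii)                     # implied gap-extension penalty
for k in (1, 2, 5):
    print(k, insert_log_score(k, a_mi, a_ii, a_im), affine_gap(k, d, e))
```

Both columns of the printed output agree, which is why an insert state with a self-loop behaves like a position-specific affine gap penalty.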
22 Components of profile HMMs Combining all parts: [Figure 5.2: The transition structure of a profile HMM, with match (M_j), insert (I_j) and delete (D_j) states between Begin and End.]
23 Profile HMMs as a model for multiple alignments [Figure: Beg, M_j, I_j, D_j, End] Example (starred columns are match columns):
  AG---C
  A-AG-C
  AG-AA-
  --AAAC
  AG---C
  **   *
24
25 Profile HMMs generalize pairwise alignment A profile HMM built from a single sequence y corresponds to the pair HMM used for pairwise alignment: M_j ↔ M, I_j ↔ X (insertion), D_j ↔ Y (deletion). The match emissions are $e_{M_i}(a) = p_{y_i a}/q_{y_i}$, the conditional probability of seeing a aligned to a given $y_i$ in a pairwise alignment. The transition probabilities are the same at every position: $a_{M_i I_i} = a_{M_i D_{i+1}} = \delta$ and $a_{I_i I_i} = a_{D_i D_{i+1}} = \varepsilon$.
26 Deriving Profile HMMs From Multiple Alignments The key idea behind profile HMMs is that we can use the same structure as shown in Figure 5.2, but set the transition and emission probabilities to capture specific information about each position in the multiple alignment of the whole family. Essentially, we want to build a model representing the consensus sequence for a family, rather than the sequence of any particular member. Topics: non-probabilistic profiles and profile HMMs.
27 Non-probabilistic Profiles A model similar to the profile HMM was first introduced by Gribskov, McLachlan & Eisenberg [1987]. It has no underlying probabilistic model; instead, position-specific scores are assigned directly for each match state and gap penalty. The score for each consensus position is set to the average of the standard substitution scores from all the residues in the corresponding multiple sequence alignment column.
28 HMMs from multiple alignments Key idea behind profile HMMs –Model representing the consensus for the family –Not the sequence of any particular member
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
Figure 5.3 Ten columns from the multiple alignment of seven globin protein sequences shown in Figure 5.1. The starred columns are the ones that will be treated as 'matches' in the profile HMM.
29 Non-probabilistic Profiles $s(a,b)$: standard substitution matrix. Column 1 of Figure 5.3 contains five V, one F and one I, so the score for residue a in column 1 is $\frac{5}{7}s(\mathrm{V},a) + \frac{1}{7}s(\mathrm{F},a) + \frac{1}{7}s(\mathrm{I},a)$.
30 HMMs from multiple alignments Non-probabilistic profiles –Gribskov, McLachlan & Eisenberg [1987] –Score for residue a in column 1: the average of $s(a,b)$ over the residues b observed in that column –Disadvantages: strongly conserved regions are not scored any more sharply than weakly supported ones; the intuition about likelihood is not maintained; the scores for gaps do not behave as expected.
31 Non-probabilistic Profiles They also set gap penalties for each column using a heuristic equation that decreases the cost of a gap according to the length of the longest gap observed in the multiple alignment spanning the column.
32 Problem With The Approach If we had an alignment with 100 sequences, all with a cysteine (C) at some position, the probability distribution for that column for an "average" profile would be exactly the same as would be derived from a single sequence. This doesn't correspond to our expectation that the likelihood of a cysteine should go up as we see more confirming examples.
33 Similar Problem With Gaps Scores for a deletion in columns 2 and 4 would be set to the same value. It is more reasonable to set the probability of a new gap opening to be higher in column 4.
34 Basic Profile HMM Parameterization Profile HMMs have emission and transition probabilities. Assuming that these probabilities are non-zero, a profile HMM can model any possible sequence of residues from the given alphabet. A profile HMM defines a probability distribution over the whole space of sequences. The aim of parameterization is to make this distribution peak around members of the family. Parameters: the probabilities and the length of the model (both control the shape of the distribution).
35 Model Length The choice of length of the model corresponds more precisely to a decision on which multiple alignment columns to assign to match states, and which to assign to insert states. The consensus sequence in Figure 5.3 should have only 8 residues, and the two non-starred residues in GLB1_GLYDI should be treated as an insertion with respect to the consensus. A simple rule that works well in practice is that columns that are more than half gap characters should be modelled by inserts.
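The half-gap rule is easy to apply mechanically. The following is a minimal sketch on the ten columns of Figure 5.3; the function name `match_columns` is illustrative, and columns are numbered from 0.

```python
# Ten alignment columns from Figure 5.3 (seven globin sequences).
FIGURE_5_3 = [
    "VGA--HAGEY",  # HBA_HUMAN
    "V----NVDEV",  # HBB_HUMAN
    "VEA--DVAGH",  # MYG_PHYCA
    "VKG------D",  # GLB3_CHITP
    "VYS--TYETS",  # GLB5_PETMA
    "FNA--NIPKH",  # LGB2_LUPLU
    "IAGADNGAGV",  # GLB1_GLYDI
]

def match_columns(seqs, gap="-"):
    """Indices of columns kept as match states.

    A column is modelled by an insert state when more than half of its
    characters are gaps; every other column becomes a match column.
    """
    n = len(seqs)
    return [j for j, column in enumerate(zip(*seqs))
            if 2 * column.count(gap) <= n]

print(match_columns(FIGURE_5_3))  # [0, 1, 2, 5, 6, 7, 8, 9]
```

The eight selected columns are the starred ones in the figure, and the two remaining columns (six gaps out of seven) are the insertion in GLB1_GLYDI.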
36 Probability Values k, l: indices over states. $a_{kl}$ and $e_k$: transition and emission probabilities. $A_{kl}$ and $E_k$: transition and emission frequencies (counts). The maximum likelihood estimates are $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(a) = E_k(a) / \sum_{a'} E_k(a')$.
37 Problem With The Approach Transitions and emissions that don't appear in the training dataset would acquire zero probability (they would never be allowed). Solution: add pseudocounts to the observed frequencies. The simplest pseudocount method is Laplace's rule: add one to each frequency.
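As an illustration of Laplace's rule, the sketch below estimates match emissions for the first column of Figure 5.3 (five V, one F, one I). The function name `laplace_emissions` and the 20-letter alphabet string are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def laplace_emissions(column, alphabet=AMINO_ACIDS):
    """Emission estimates with one pseudocount added to every residue count."""
    counts = Counter(r for r in column if r != "-")  # gaps are not emissions
    total = sum(counts.values()) + len(alphabet)
    return {a: Fraction(counts[a] + 1, total) for a in alphabet}

e = laplace_emissions("VVVVVFI")
print(e["V"], e["F"], e["A"])  # 2/9 (= 6/27), 2/27, 1/27
```

With 7 observations and 20 pseudocounts the denominator is 27, so every residue, including those never observed in the column, receives a non-zero probability.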
38 Example
39 Example: Full Profile HMM
40 Searching With Profile HMMs One of the main purposes of developing profile HMMs is to use them to detect potential membership in a family by obtaining significant matches of a sequence to the profile HMM. We assume that we are looking for global matches. We can use either the Viterbi algorithm to get the most probable alignment or the forward algorithm to calculate the full probability of the sequence summed over all possible paths.
41 Searching with profile HMMs Scores are maintained as log-odds ratios relative to the random model.
42 Viterbi Equations $V_j^M(i)$ is the log-odds score of the best path matching subsequence $x_{1 \ldots i}$ to the submodel up to state j, ending with $x_i$ being emitted by state $M_j$. $V_j^I(i)$ is the score of the best path ending in $x_i$ being emitted by $I_j$, and $V_j^D(i)$ is the score of the best path ending in state $D_j$.
43 Viterbi Algorithm
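In the notation defined for $V_j^M(i)$, $V_j^I(i)$ and $V_j^D(i)$, the recurrences take the following standard log-odds form. This is a sketch of the usual textbook formulation, written under the assumption that all nine transition types between M, I and D states are allowed; $q_{x_i}$ is the background probability of residue $x_i$.

```latex
\begin{aligned}
V_j^{M}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_{j-1}^{M}(i-1) + \log a_{M_{j-1}M_j},\\
V_{j-1}^{I}(i-1) + \log a_{I_{j-1}M_j},\\
V_{j-1}^{D}(i-1) + \log a_{D_{j-1}M_j};
\end{cases}\\[4pt]
V_j^{I}(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_{j}^{M}(i-1) + \log a_{M_{j}I_j},\\
V_{j}^{I}(i-1) + \log a_{I_{j}I_j},\\
V_{j}^{D}(i-1) + \log a_{D_{j}I_j};
\end{cases}\\[4pt]
V_j^{D}(i) &= \max
\begin{cases}
V_{j-1}^{M}(i) + \log a_{M_{j-1}D_j},\\
V_{j-1}^{I}(i) + \log a_{I_{j-1}D_j},\\
V_{j-1}^{D}(i) + \log a_{D_{j-1}D_j}.
\end{cases}
\end{aligned}
```

Match and insert states consume one residue, so they look back to position $i-1$; delete states are silent, so $V_j^D(i)$ looks back to the same position $i$.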
44 Viterbi Equations In a typical case, there is no emission score $e_{I_j}(x_i)$ in the equation for $V_j^I(i)$, since we assume that the emission distribution from the insert state $I_j$ is the same as the background distribution, so the probabilities cancel in the log-odds form. Also, the D→I and I→D transition terms may not be present.
45 Forward Algorithm
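The forward recurrences have the same structure as the Viterbi ones, with each maximum replaced by a sum of probabilities. The block below is a sketch of the standard log-odds formulation, where $F_j^M(i)$, $F_j^I(i)$ and $F_j^D(i)$ are assumed to denote the log-odds of the summed probability over all paths ending in the corresponding state.

```latex
\begin{aligned}
F_j^{M}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\Big[
a_{M_{j-1}M_j}\, e^{F_{j-1}^{M}(i-1)} +
a_{I_{j-1}M_j}\, e^{F_{j-1}^{I}(i-1)} +
a_{D_{j-1}M_j}\, e^{F_{j-1}^{D}(i-1)} \Big],\\[4pt]
F_j^{I}(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\Big[
a_{M_{j}I_j}\, e^{F_{j}^{M}(i-1)} +
a_{I_{j}I_j}\, e^{F_{j}^{I}(i-1)} +
a_{D_{j}I_j}\, e^{F_{j}^{D}(i-1)} \Big],\\[4pt]
F_j^{D}(i) &= \log\Big[
a_{M_{j-1}D_j}\, e^{F_{j-1}^{M}(i)} +
a_{I_{j-1}D_j}\, e^{F_{j-1}^{I}(i)} +
a_{D_{j-1}D_j}\, e^{F_{j-1}^{D}(i)} \Big].
\end{aligned}
```

In practice the bracketed sums are usually evaluated with a log-sum-exp operation to avoid numerical underflow.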
46 Variants for non-global alignments We can generalize profile HMMs to local, repeat and overlap alignments. For local alignment, we specify a new model for the complete sequence x, which incorporates the original profile HMM together with one or more copies of a simple self-looping model that accounts for the regions of unaligned sequence.
47 Variants for non-global alignments As well as specifying the emission probabilities of the new states, which will normally be $q_a$, we must specify a number of new transition probabilities. The looping probability on the flanking states should be close to 1, since they must account for long stretches of sequence. Let us set these to $(1-\eta)$.
48 Variants for non-global alignments For the transition probabilities from the left flanking state to different start points in the model, we can give them equal probabilities, $\eta/L$. Another option is to assign more probability to starting at the beginning of the model. That is the option used in the HMMER package (Eddy 1996).
49 Variants for non-global alignments Local alignments (flanking model) –Emission probabilities in the flanking states use the background values $q_a$. –Looping probability close to 1, e.g. $(1-\eta)$ for some small $\eta$. [Figure: profile HMM (M_j, I_j, D_j) between Begin and End, with flanking states Q on either side.]
50 Variants for non-global alignments Overlap alignments –Only transitions to the first model state are allowed. –Used when the match is expected to be either present as a whole or absent. –A transition to the first delete state allows for a missing first residue. [Figure: profile HMM (M_j, I_j, D_j) between Begin and End, with flanking states Q.]
51 Variants for non-global alignments Repeat alignments –A transition from the right flanking state back to the random model –Can find multiple matching segments in the query string [Figure: profile HMM (M_j, I_j, D_j) between Begin and End, with a single looping flanking state Q.]
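52 More on estimation of prob. Maximum likelihood (ML) estimation –Given the observed frequency $c_{ja}$ of residue a in position j: $e_{M_j}(a) = c_{ja} / \sum_{a'} c_{ja'}$ Problem of ML estimation –What if a residue is never observed in a column? It gets probability zero. –This is especially serious when there are only a few observed examples.
53 More on estimation of prob. Simple pseudocounts: $e_{M_j}(a) = \dfrac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A}$ –$q_a$: background distribution –A: weight factor –Laplace's rule: $A q_a = 1$ Bayesian framework –Dirichlet prior with $\alpha_a = A q_a$ ($\theta$ are the model probabilities)
54 Dirichlet Distribution The Dirichlet distribution of order $K \ge 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a pdf on $\mathbb{R}^{K-1}$ given by $P(\theta \mid \alpha) = \dfrac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$, for $\theta_i \ge 0$ and $\sum_i \theta_i = 1$.
55 More on estimation of prob. Dirichlet mixtures –A mixture of Dirichlet priors works better than a single Dirichlet prior –With K pseudocount priors $\alpha^1, \ldots, \alpha^K$: $e_{M_j}(a) = \sum_{k} P(k \mid c_j)\, \dfrac{c_{ja} + \alpha_a^k}{\sum_{a'} \left(c_{ja'} + \alpha_{a'}^k\right)}$, where $P(k \mid c_j)$ is the posterior probability of mixture component k given the column counts $c_j$.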
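A Dirichlet-mixture estimate can be sketched as follows. The names `log_evidence` and `mixture_estimate`, the four-letter alphabet and the two example components are assumptions for illustration; real protein mixtures have 20-dimensional components fitted to alignment data. The multinomial coefficient in $P(c_j \mid k)$ is omitted because it is identical for every component and cancels in the posterior.

```python
from math import exp, lgamma, log

def log_evidence(counts, alpha):
    """log P(counts | Dirichlet(alpha)), up to a term shared by all components."""
    out = lgamma(sum(alpha)) - lgamma(sum(counts) + sum(alpha))
    for c, a in zip(counts, alpha):
        out += lgamma(c + a) - lgamma(a)
    return out

def mixture_estimate(counts, components):
    """Posterior mean emissions under a mixture of Dirichlet priors.

    components: list of (mixture coefficient, alpha vector) pairs.
    """
    logs = [log(p) + log_evidence(counts, alpha) for p, alpha in components]
    top = max(logs)                          # subtract the max for stability
    post = [exp(v - top) for v in logs]
    norm = sum(post)
    post = [v / norm for v in post]          # P(k | counts)
    n = sum(counts)
    estimate = [0.0] * len(counts)
    for pk, (_, alpha) in zip(post, components):
        denom = n + sum(alpha)
        for i, (c, a) in enumerate(zip(counts, alpha)):
            estimate[i] += pk * (c + a) / denom
    return estimate

# Hypothetical 4-letter example: one component favours the first letter.
components = [(0.5, [5.0, 0.5, 0.5, 0.5]), (0.5, [1.0, 1.0, 1.0, 1.0])]
print(mixture_estimate([3, 0, 0, 0], components))
```

With a single component the function reduces to the simple pseudocount formula $(c_{ja} + \alpha_a) / (\sum_{a'} c_{ja'} + \sum_{a'} \alpha_{a'})$.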
56 Optimal model construction In order to estimate the probability parameters of a profile HMM, we need to consider –Which columns to assign to insert states and which to match states (this is called model construction) –If the marked multiple alignment has no errors, the optimal model can be constructed –In the profile HMM formalism, it is assumed that an aligned column of symbols corresponds either to emissions from the same match state or to emissions from the same insert state
57 Optimal model construction It therefore suffices to mark which columns come from match states to specify a profile HMM architecture and the state paths for all the sequences in the alignment, as shown in Figure 5.7. In a marked column, symbols are assigned to match states and gaps are assigned to delete states. In an unmarked column, symbols are assigned to insert states and gaps are ignored. State transition and symbol emission counts are obtained from the state paths, and these counts can be used to estimate probability parameters.
58 Optimal model construction There are $2^L$ combinations for markings of L columns, and hence $2^L$ different profile HMMs to choose from. A manual construction is shown in Figure 5.7. Other ways to determine the marking include heuristic construction and a maximum a posteriori (MAP) choice.
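59 Optimal model construction
(a) Multiple alignment (x marks the columns assigned to match states):
        x x . . . x
  bat   A G - - - C
  rat   A - A G - C
  cat   A G - A A -
  gnat  - - A A A C
  goat  A G - - - C
(b) Profile-HMM architecture: beg, three match states (M), four insert states (I), three delete states (D), end.
(c) Observed emission/transition counts: match emissions and insert emissions (rows A, C, G, T) and state transitions (rows M-M, M-D, M-I, I-M, I-D, I-I, D-M, D-D, D-I). [Count values not shown.]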
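The counting procedure for a marked alignment can be sketched in Python. The function name `count_from_marking` and the state labels (`M1`, `I2`, `D3`, `Begin`, `End`) are illustrative choices; the alignment and the marking of columns 1, 2 and 6 are those of the example.

```python
from collections import Counter

ALIGNMENT = {
    "bat":  "AG---C",
    "rat":  "A-AG-C",
    "cat":  "AG-AA-",
    "gnat": "--AAAC",
    "goat": "AG---C",
}
MATCH_COLUMNS = {0, 1, 5}  # 0-based indices of the marked columns

def count_from_marking(seqs, match_columns):
    """Emission and transition counts implied by a match-column marking."""
    emissions = Counter()    # keys: (state, residue)
    transitions = Counter()  # keys: (from_state, to_state)
    for seq in seqs:
        prev, j = "Begin", 0          # j = number of match columns passed
        for col, ch in enumerate(seq):
            if col in match_columns:
                j += 1
                # Marked column: symbol -> match state, gap -> delete state.
                state = f"M{j}" if ch != "-" else f"D{j}"
            elif ch != "-":
                state = f"I{j}"       # unmarked column: symbol -> insert state
            else:
                continue              # gap in an unmarked column is ignored
            if not state.startswith("D"):
                emissions[(state, ch)] += 1
            transitions[(prev, state)] += 1
            prev = state
        transitions[(prev, "End")] += 1
    return emissions, transitions

emissions, transitions = count_from_marking(ALIGNMENT.values(), MATCH_COLUMNS)
print(emissions[("I2", "A")], transitions[("I2", "I2")])  # 6 4
```

The resulting counters can then be turned into probabilities with the maximum likelihood formulas, preferably after adding pseudocounts.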
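60 Optimal Model Construction MAP match-insert assignment –Recursive calculation of a number $S_j$ –$S_j$: log probability of the optimal model for the alignment up to and including column j, assuming column j is marked –$S_j$ is calculated from $S_i$ and the summed log probability between i and j –$T_{ij} = \sum_{x,y \in \{M,D,I\}} c_{xy} \log a_{xy}$: summed log probability of all the state transitions between marked columns i and j –The counts $c_{xy}$ are obtained from the partial state paths implied by marking i and j
61 Optimal Model Construction Algorithm: MAP model construction –Initialization: $S_0 = 0$, $M_{L+1} = 0$. –Recurrence: for $j = 1, \ldots, L+1$: $S_j = \max_{0 \le i < j} \left[ S_i + T_{ij} + M_j + I_{i+1,j-1} + \lambda \right]$, $\sigma_j = \arg\max_{0 \le i < j} \left[ S_i + T_{ij} + M_j + I_{i+1,j-1} + \lambda \right]$, where $M_j$ is the log probability of the match emissions in column j, $I_{i+1,j-1}$ is the log probability of the insert emissions in columns i+1 to j-1, and $\lambda$ is a penalty on the number of match states. –Traceback: from $j = \sigma_{L+1}$, while $j > 0$: mark column j as a match column; $j = \sigma_j$.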
62 Weighting Training Sequences One issue that we have avoided completely so far is that of weighting sequences when estimating parameters. In a typical alignment, there are often some sequences that are very closely related to each other. Intuitively, some of the information from these sequences is shared, so we should not give them each the same influence in the estimation process as a single sequence that is more highly diverged from all the others.
63 Weighting Training Sequences In the extreme case that two sequences are identical, it makes sense that they should each get half the weight of other sequences. There have been a large number of proposals for different ways to assign weights to sequences.
64 Weighting Training Sequences Do you have a good random sample? The assumption that all examples are independent samples might be incorrect. Solutions –Weight sequences based on similarity
65 Simple Weighting Scheme Derived from a Tree Many weighting approaches are based on building a tree relating the sequences. Since sequences in a family are related by an evolutionary tree, a very natural approach is to try to reconstruct this tree and use it when estimating the independent contribution of each of the observed sequences, down-weighting sequences that have only recently diverged.
66 Simple Weighting Scheme Derived from a Tree Here we assume that we are given a tree connecting the sequences, with branch lengths indicating the relative degrees of divergence for each edge in the tree.
67 Weighting Training Sequences Simple weighting schemes derived from a tree –Phylogenetic tree is given. [Thompson, Higgins & Gibson 1994b] –Kirchhoff's law [Gerstein, Sonnhammer & Chothia 1994]
68 Weighting Training Sequences We are given a tree made of a conducting wire of constant thickness and apply a voltage V to the root. All the leaves are set to zero potential, and the currents flowing from them are measured and taken to be the weights. Clearly, the currents will be smaller in the highly divided parts of the tree, so these weights have the right qualitative properties.
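69 Weighting Training Sequences [Figure: tree with leaves 1 to 4 and internal nodes 5, 6 and 7 (the root), at voltages V5, V6 and V7. Edge lengths: t1 = 2, t2 = 2, t3 = 5, t4 = 8, t5 = 3, t6 = (missing). Leaf currents I1, I2, I3, I4; the current I1 + I2 flows through the edge above node 5 and I1 + I2 + I3 through the edge above node 6.] Results: I1 : I2 : I3 : I4 = 20 : 20 : 32 : 47 and w1 : w2 : w3 : w4 = 35 : 35 : 50 : 64.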
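Both sets of ratios can be reproduced with a short calculation. The sketch below makes two assumptions that are not stated on the slide: the missing edge length is taken as t6 = 3, and the second scheme is taken to be an edge-sharing rule in which each leaf starts with its own edge length and every internal edge of length t_n is divided among the leaves below it in proportion to their current weights. The function names are illustrative.

```python
from fractions import Fraction

# child -> (parent, edge length); node 7 is the root. t6 = 3 is an ASSUMPTION.
TREE = {1: (5, 2), 2: (5, 2), 3: (6, 5), 4: (7, 8), 5: (6, 3), 6: (7, 3)}
ROOT = 7

def children(node):
    return [c for c, (parent, _) in TREE.items() if parent == node]

def leaves_below(node):
    kids = children(node)
    return [node] if not kids else [l for c in kids for l in leaves_below(c)]

def resistance(node):
    """Resistance of the edge above `node` plus the network below it."""
    t = Fraction(TREE[node][1])
    kids = children(node)
    if not kids:
        return t                     # leaf is held at zero potential
    return t + 1 / sum(1 / resistance(c) for c in kids)  # parallel branches

def currents(node=ROOT, incoming=Fraction(1)):
    """Split a unit current among the leaves in proportion to conductance."""
    kids = children(node)
    if not kids:
        return {node: incoming}
    conductance = {c: 1 / resistance(c) for c in kids}
    total = sum(conductance.values())
    out = {}
    for c in kids:
        out.update(currents(c, incoming * conductance[c] / total))
    return out

def edge_sharing_weights():
    """Assumed rule: share each internal edge among the leaves below it."""
    w = {n: Fraction(TREE[n][1]) for n in TREE if not children(n)}
    internal = sorted((n for n in TREE if children(n)),
                      key=lambda n: len(leaves_below(n)))  # children first
    for n in internal:
        below = leaves_below(n)
        total = sum(w[l] for l in below)
        for l in below:
            w[l] += TREE[n][1] * w[l] / total
    return w

print(currents())              # proportional to 20 : 20 : 32 : 47
print(edge_sharing_weights())  # proportional to 35 : 35 : 50 : 64
```

Under these assumptions the currents come out as 20/119, 20/119, 32/119 and 47/119, and the edge-sharing weights as 35/8, 35/8, 50/8 and 64/8, matching both ratios given in the example.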
70 Weighting Training Sequences Root weights from Gaussian parameters –Influence of the leaves on the root distribution –Altschul-Carroll-Lipman weights: model the root with a Gaussian distribution whose mean is a linear combination of the leaf values $x_i$; the combination weights represent the influences of the leaves.
71 Weighting Training Sequences [Figure: tree with leaves x1, x2 and x3, internal nodes 4 and 5, and edge lengths t1, t2 and t3.]
72 Weighting Training Sequences Voronoi weights –Proportional to the volume of empty space around each sequence –The sequence family is viewed as a set of points in sequence space –Algorithm: draw random sample sequences by choosing the residue at the kth position uniformly from the set of residues occurring at the kth position in the family; let $n_i$ be the count of samples closest to the ith family member; the ith weight is $w_i = n_i / \sum_k n_k$
73 Weighting Training Sequences Maximum discrimination weights –Focus: the decision on whether sequences are members of the family or not –Discrimination: $D = \prod_k P(M \mid x^k)$ –Weight: $1 - P(M \mid x^i)$ –Effect: members that are difficult to recognize are given large weights
74 Weighting Training Sequences Maximum entropy weights (1) –Intuition: make the weighted residue distribution in each column as uniform as possible –$k_{ia}$: number of residues of type a in column i of a multiple alignment –$m_i$: number of different types of residues in column i –Weight for sequence k from column i: $w_k = \dfrac{1}{m_i\, k_{i x_i^k}}$ –ML estimation under these weights: $p_{ia} = 1/m_i$ –Averaging over all columns [Henikoff 1994]
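The column-averaged weights can be computed directly from the formula. This is a minimal sketch for an ungapped alignment (gap handling is not addressed); the function name `henikoff_weights` is illustrative, and the three sequences are the ones used as the example for maximum entropy weights.

```python
from collections import Counter
from fractions import Fraction

def henikoff_weights(seqs):
    """Average over columns of 1 / (m_i * k_{i,a}) for each sequence's residue a."""
    weights = [Fraction(0)] * len(seqs)
    for column in zip(*seqs):
        counts = Counter(column)   # k_{ia}: occurrences of each residue type
        m = len(counts)            # m_i: number of distinct residue types
        for k, residue in enumerate(column):
            weights[k] += Fraction(1, m * counts[residue])
    return [w / len(seqs[0]) for w in weights]

print(henikoff_weights(["AFA", "AAC", "DAC"]))  # 5/12, 1/4, 1/3
```

Within each column the contributions sum to one, so the averaged weights also sum to one. The sequence AAC shares a residue with another sequence in every column and therefore receives the smallest weight.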
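75 Weighting Training Sequences Maximum entropy weights (2) –Entropy: a measure of the 'uniformity' of a distribution [Krogh & Mitchison 1995] –Maximize $\sum_i H_i(w)$, where $H_i(w) = -\sum_a p_{ia} \log p_{ia}$ and $p_{ia}$ is the weighted frequency of residue a in column i, subject to $\sum_k w_k = 1$ –Example: $x^1$ = AFA, $x^2$ = AAC, $x^3$ = DAC gives $w_1 = w_3 = 0.5$, $w_2 = 0$ (under the sum-to-one constraint)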
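The stated optimum can be checked numerically. The sketch below is an illustration rather than the optimization method of Krogh & Mitchison: it evaluates the summed column entropy on a coarse grid of weight vectors that sum to one and returns the best one. The function names and the grid resolution are assumptions.

```python
from math import log

SEQS = ["AFA", "AAC", "DAC"]

def total_entropy(weights, seqs=SEQS):
    """Sum over columns of the entropy of the weighted residue distribution."""
    h = 0.0
    for column in zip(*seqs):
        p = {}
        for residue, w in zip(column, weights):
            p[residue] = p.get(residue, 0.0) + w
        h -= sum(v * log(v) for v in p.values() if v > 0)
    return h

def best_weights(steps=20):
    """Exhaustive search over weights (i, j, k) / steps with i + j + k = steps."""
    best_score, best_w = None, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            score = total_entropy(w)
            if best_score is None or score > best_score:
                best_score, best_w = score, w
    return best_w

print(best_weights())  # (0.5, 0.0, 0.5)
```

Each column contains only two residue types, so its entropy is at most log 2. That bound is reached in all three columns only when $w_1 + w_2 = w_3$ and $w_1 = w_2 + w_3$, which forces $w_2 = 0$ and $w_1 = w_3$.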