Ab Initio Profile HMM Generation
Sam Gross
Profile HMMs
(Stolen from Batzoglou lecture)
[Figure: state diagram of protein profile H, running from BEGIN to END through match (M), insert (I0…Im), and delete (D1…Dm) states]
- Each M state has a position-specific, pre-computed substitution table
- Each I and D state has position-specific gap penalties
- The profile is a generative model: a sequence X that is aligned to H is thought of as “generated by” H
- Therefore, H parametrizes a conditional distribution P(X | H)
Ab Initio Profile Generation
- Given N related protein sequences x1…xN
- Construct a profile HMM H such that $\prod_i P(x_i \mid H)$ is maximized
Easier Said Than Done
- Profile HMM length is unknown
  - Use average sequence length
- Alignment is unknown
- HMM parameters are unknown
Not A New Problem
- Instance of the general problem of HMM parameter estimation using unlabelled outputs
- Instance of the even more general problem of MLE with partially missing data
- We want: $\arg\max_{\theta} P(D_{obs} \mid \theta)$
- We know: $P(D_{obs}, D_{hid} \mid \theta)$
The Expectation Maximization (EM) Algorithm
- Start with an initial guess for the parameters
- Iterate until convergence:
  - E-step: calculate expectations for the missing data
  - M-step: treating the expectations as observations, calculate the MLE for the parameters
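In symbols, one common way to write the two steps is sketched below. Here $\theta$ denotes the parameters, $D_{obs}$ the observed data, $D_{hid}$ the missing data, and $\theta^{(t)}$ the current guess. The name $Q$ is the conventional textbook one and is an assumption here, not notation taken from these slides.

```latex
\begin{align*}
\text{E-step:}\quad & Q(\theta \mid \theta^{(t)})
  = \mathbb{E}_{D_{hid} \sim P(\cdot \mid D_{obs},\, \theta^{(t)})}
    \big[ \log P(D_{obs}, D_{hid} \mid \theta) \big] \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta \mid \theta^{(t)})
\end{align*}
```

Each iteration cannot decrease $P(D_{obs} \mid \theta)$, but the procedure may converge to a local rather than a global maximum, so the initial guess matters.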
Baum-Welch: EM For HMMs
- Start with an initial guess of the HMM parameters
- Iterate until convergence:
  - E-step: forward-backward algorithm
  - M-step: MLE using forward-backward posterior probabilities
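The loop above can be sketched for a small, generic discrete HMM. This is a minimal illustration, not a profile HMM and not the implementation behind these slides: the two-state model, the random binary observations, and all names (`forward_backward`, `baum_welch`, `pi`, `A`, `B`) are assumptions made for the example.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass (the E-step).

    Returns state posteriors gamma (T x K), expected transition counts
    xi (K x K, summed over time), and the log-likelihood of obs.
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    scale = np.zeros(T)

    # Forward pass, normalising at each step to avoid underflow.
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scaling factors.
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta  # each row sums to 1 under this scaling
    xi = np.zeros((K, K))
    for t in range(T - 1):
        xi += (alpha[t][:, None] * A
               * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / scale[t + 1]
    return gamma, xi, np.log(scale).sum()

def baum_welch(obs, pi, A, B, n_iter=20):
    """Alternate forward-backward (E-step) with re-estimation (M-step)."""
    obs = np.asarray(obs)
    loglik = []
    for _ in range(n_iter):
        gamma, xi, ll = forward_backward(obs, pi, A, B)
        loglik.append(ll)
        # M-step: treat the posterior expectations as if they were counts.
        pi = gamma[0]
        A = xi / xi.sum(axis=1, keepdims=True)
        B = np.stack([gamma[obs == k].sum(axis=0)
                      for k in range(B.shape[1])], axis=1)
        B = B / B.sum(axis=1, keepdims=True)
    return pi, A, B, loglik

# Hypothetical toy data and initial guess, for illustration only.
rng = np.random.default_rng(0)
obs = rng.integers(0, 2, size=200)
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.9, 0.1], [0.2, 0.8]])
B0 = np.array([[0.6, 0.4], [0.3, 0.7]])
pi, A, B, loglik = baum_welch(obs, pi0, A0, B0)
print(loglik[0], loglik[-1])
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is a convenient sanity check when adapting the sketch to a profile HMM topology.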
Incorporating Prior Knowledge
- We know in advance that certain types of residues tend to align together
- Use a Dirichlet mixture prior over outputs for match states
- Each distribution in the mixture corresponds to a different “alignment environment”
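To make the role of the mixture concrete, here is a minimal sketch of how observed residue counts for one match state can be combined with a Dirichlet mixture prior using the standard posterior-mean estimate. The three-letter alphabet, the two components, and their parameters are hypothetical values chosen for illustration; a real protein prior would cover 20 amino acids and use fitted components.

```python
from math import exp, lgamma, log

def log_beta(v):
    """Log of the multivariate Beta function B(v)."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))

def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior-mean emission probabilities under a Dirichlet mixture prior.

    counts  : observed (possibly fractional) counts per letter
    weights : prior mixture weight of each component
    alphas  : Dirichlet parameters of each component
    Returns (probs, resp), where resp[k] is the posterior probability
    that the counts came from component k.
    """
    # log of weight_k * B(counts + alpha_k) / B(alpha_k) for each component
    logs = []
    for q, alpha in zip(weights, alphas):
        post = [n + a for n, a in zip(counts, alpha)]
        logs.append(log(q) + log_beta(post) - log_beta(alpha))
    m = max(logs)  # subtract the max before exponentiating, for stability
    resp = [exp(l - m) for l in logs]
    z = sum(resp)
    resp = [r / z for r in resp]

    # Mix the per-component posterior means by their responsibilities.
    total = sum(counts)
    probs = [0.0] * len(counts)
    for r, alpha in zip(resp, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += r * (n + a) / denom
    return probs, resp

# Hypothetical prior: component 0 favours letter 0, component 1 favours letter 1.
weights = [0.5, 0.5]
alphas = [[5.0, 0.5, 0.5], [0.5, 5.0, 0.5]]
counts = [4, 0, 1]
probs, resp = dirichlet_mixture_estimate(counts, weights, alphas)
print(probs, resp)
```

With these counts the first component receives nearly all of the responsibility, so the estimate is pulled toward that component's preferred letter while still leaving nonzero probability for letters that were never observed.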
Coin Flips Example
- Two trick coins are used to generate a sequence of heads and tails
- You see only the sequence, and must determine the probability of heads for each coin
[Figure: Coin A, Coin B]
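10,000 Coin Flips
- Real coins: PA(heads) = 0.4, PB(heads) = 0.8
- Initial guess: PA(heads) = 0.51, PB(heads) = 0.49
- Learned model: PA(heads) = 0.801, PB(heads) = 0.413
  (the labels A and B come out swapped relative to the real coins; the likelihood does not depend on which coin is called A)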
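A small EM sketch in the spirit of this experiment follows. The slides do not say how the coin was chosen for each flip, so the setup is an assumption: the 10,000 flips are grouped into 1,000 sets of 10, one coin is picked with probability 0.5 per set, and that mixing weight is held fixed rather than learned. The function names and the random seed are likewise made up for the example, and the numbers it prints are not expected to match the slide exactly.

```python
import random

def simulate(p_a, p_b, n_sets, flips_per_set, rng):
    """Return the number of heads in each set; which coin was used stays hidden."""
    heads = []
    for _ in range(n_sets):
        p = p_a if rng.random() < 0.5 else p_b
        heads.append(sum(rng.random() < p for _ in range(flips_per_set)))
    return heads

def em_two_coins(heads, flips_per_set, p_a, p_b, n_iter=100):
    """EM for two coins, assuming each set was flipped with a single coin."""
    for _ in range(n_iter):
        heads_a = tails_a = heads_b = tails_b = 0.0
        # E-step: expected heads/tails attributed to each coin.
        for h in heads:
            t = flips_per_set - h
            like_a = p_a ** h * (1 - p_a) ** t
            like_b = p_b ** h * (1 - p_b) ** t
            w_a = like_a / (like_a + like_b)  # P(coin A | this set)
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += (1 - w_a) * h
            tails_b += (1 - w_a) * t
        # M-step: MLE treating the expected counts as observed counts.
        p_a = heads_a / (heads_a + tails_a)
        p_b = heads_b / (heads_b + tails_b)
    return p_a, p_b

rng = random.Random(0)
heads = simulate(0.4, 0.8, n_sets=1000, flips_per_set=10, rng=rng)
p_a, p_b = em_two_coins(heads, 10, p_a=0.51, p_b=0.49)
print(f"PA(heads) = {p_a:.3f}, PB(heads) = {p_b:.3f}")
```

Nothing in the data ties the label A to a particular coin, so which estimate lands near 0.8 and which near 0.4 is determined by the initial guess rather than by the true labelling.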
Toy Profile Example
- Create a profile for the following sequences:
  ADACGIH
  ADAGIH
  ADACGH
  AACQH
  ADAYGIH
- Use the profile to align the sequences
Results
Alignment:
  ADACGIH
  ADA-GIH
  ADACG-H
  A-ACQ-H
  ADAYGIH
Match state emissions:
  Match1: A 100%
  Match2: D 100%
  Match3: A 100%
  Match4: C 75%, Y 25%
  Match5: G 80%, Q 20%
  Match6: I 62%, H 38%
  Match7: H 100%
Clustering With A Mixture Of Profiles
- Given N protein sequences x1…xN
- Construct M profile HMMs H1…HM and a mapping F: x → H such that $\prod_i P(x_i \mid F(x_i))$ is maximized
- F is a natural clustering of the protein sequences into M groups