Sequential Pattern Discovery under a Markov Assumption
Darya Chudova, Padhraic Smyth
Introduction
Problem: identify recurrent patterns in large data sets of categorical sequences (e.g., motifs in a DNA sequence).
As an example, consider the pattern ADDABB embedded in a background process (note that the second occurrence, ADDACB, contains a substitution error):
…BACADBADBBC[ADDABB]BACDBDBA[ADDACB]DAC…
Models for the Patterns
- Model 1, True Model: the true model for generating patterns and background is known. Corresponds to Bayes-optimal classification and the Bayes error.
- Model 2, Supervised Training: the general form of the model and the locations of the patterns are known.
- Model 3, Unsupervised Training: only the general form of the model is known; all parameters and pattern locations are unknown.
Contributions of the paper
- Provide an accurate approximate expression for the Bayes error under the Markov assumption.
- Illustrate how alphabet size, pattern length, pattern frequency, and pattern autocorrelation affect the Bayes error rate.
- Empirically investigate several well-known algorithms in the Markov context.
- Apply the theoretical framework to motif-finding problems and show how it helps.
Hidden Markov Model
[Figure: an example HMM with background state B and pattern states P1–P4. B emits A, B, C, D uniformly (0.25 each); each pattern state emits its own symbol (here B, B, B, D) with probability 0.9. Transitions: B→B with probability 0.99, B→P1 with 0.01, and Pi→Pi+1 and P4→B with probability 1.0.]
- The background state B can only transition to itself or to the first pattern state P1.
- Each pattern state Pi can only transition to state Pi+1, for 1 ≤ i < L.
- The last pattern state PL can only transition back to the background state B.
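As a concrete sketch of this structure (variable names and the numpy layout are my own; the numbers match the figure), the transition and emission matrices for the four-state example can be built as follows:

```python
# Minimal sketch of the example HMM above; illustrative, not from the paper.
import numpy as np

n_A = 4                      # alphabet {A, B, C, D} -> indices 0..3
pattern = [1, 1, 1, 3]       # the figure's pattern states emit B, B, B, D
L = len(pattern)

# States: 0 = background B, 1..L = pattern states P1..PL.
T = np.zeros((L + 1, L + 1))
T[0, 0], T[0, 1] = 0.99, 0.01        # B -> B, B -> P1
for i in range(1, L):
    T[i, i + 1] = 1.0                # Pi -> Pi+1
T[L, 0] = 1.0                        # PL -> B

E = np.full((L + 1, n_A), 1.0 / n_A)  # background emits uniformly (0.25 each)
for i, sym in enumerate(pattern, start=1):
    E[i, :] = 0.1 / (n_A - 1)         # substitution mass on the other symbols
    E[i, sym] = 0.9                   # emit the pattern symbol with prob 0.9
```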
The parameters
- n_A: size of the observable alphabet.
- L: length of the pattern.
- ε: probability of a substitution error in each of the pattern positions.
- n_s: expected number of substitutions in a pattern, n_s = L·ε.
- F: frequency of pattern occurrence, so that the expected number of patterns in a sequence of length N is F·N.
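As an illustration of how these parameters interact, here is a generator that embeds noisy copies of a pattern in a uniform background. It is my own construction, not the paper's code; the entry probability λ = F/(1 − F·L) is an assumption chosen so that a length-N sequence contains roughly F·N pattern occurrences:

```python
# Illustrative generator: noisy pattern copies at frequency ~F in an
# iid uniform background. Not the paper's code.
import random

def generate(pattern, n_A, eps, F, N, seed=0):
    rng = random.Random(seed)
    alphabet = [chr(ord('A') + k) for k in range(n_A)]
    L = len(pattern)
    lam = F / (1.0 - F * L)   # P(enter pattern); gives ~F*N occurrences
    out = []
    while len(out) < N:
        if rng.random() < lam:
            for sym in pattern:            # emit one noisy pattern copy
                if rng.random() < eps:     # substitution with probability eps
                    sym = rng.choice([a for a in alphabet if a != sym])
                out.append(sym)
        else:
            out.append(rng.choice(alphabet))  # one background symbol
    return "".join(out[:N])

print(generate("ADDABB", n_A=4, eps=0.1, F=0.005, N=200))
```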
Bayes Error Rate
Under the iid assumption:
$$P_e^* = \sum_{o} \min\{\, p(h = B \mid o),\; p(h \in \{P_1, \dots, P_L\} \mid o) \,\}\, p(o)$$
For the Markov case:
$$P_e^* = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \min\{\, p(h_i = B \mid O),\; p(h_i \in \{P_1, \dots, P_L\} \mid O) \,\}$$
Here h_i is the hidden state at position i and O is the full observed sequence. The Bayes rule assigns each position to the class with the larger posterior probability, so the min term is exactly the probability that this optimal decision is wrong at that position.
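The Markov-case quantity can be estimated by Monte Carlo: sample a long sequence from the HMM, compute per-position posteriors with the forward-backward recursions, and average the min term. This is an illustrative sketch under assumed parameter values (including the λ = F/(1 − F·L) entry probability), not the paper's code:

```python
# Monte Carlo estimate of the Markov-case Bayes error via forward-backward.
import numpy as np

def make_hmm(pattern, n_A, eps, F):
    L = len(pattern)
    lam = F / (1.0 - F * L)              # B -> P1 rate giving ~F*N occurrences
    A = np.zeros((L + 1, L + 1))
    A[0, 0], A[0, 1] = 1 - lam, lam      # background self-loop / pattern entry
    for i in range(1, L):
        A[i, i + 1] = 1.0                # Pi -> Pi+1
    A[L, 0] = 1.0                        # PL -> B
    E = np.full((L + 1, n_A), 1.0 / n_A) # background emits uniformly
    for i, s in enumerate(pattern, start=1):
        E[i] = eps / (n_A - 1)
        E[i, s] = 1 - eps                # pattern symbol with prob 1 - eps
    return A, E

def sample(A, E, N, rng):
    S, n_A = E.shape
    o, s = np.empty(N, dtype=int), 0
    for t in range(N):
        o[t] = rng.choice(n_A, p=E[s])
        s = rng.choice(S, p=A[s])
    return o

def min_posterior_mean(A, E, o):
    S, N = A.shape[0], len(o)
    alpha = np.zeros((N, S)); beta = np.ones((N, S))
    alpha[0, 0] = 1.0                        # assume the chain starts in B
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ A) * E[:, o[t]]
        alpha[t] /= alpha[t].sum()           # rescale to avoid underflow
    for t in range(N - 2, -1, -1):
        beta[t] = A @ (E[:, o[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    p_bg = gamma[:, 0]                       # posterior of the background state
    return np.minimum(p_bg, 1 - p_bg).mean() # average min-posterior = error

rng = np.random.default_rng(0)
A, E = make_hmm(pattern=[1, 1, 1, 3], n_A=4, eps=0.1, F=0.005)  # B, B, B, D
o = sample(A, E, N=100_000, rng=rng)
print("estimated Bayes error:", min_posterior_mean(A, E, o))
```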
Analytical Expressions for the Bayes Error Rate
- A closed-form expression is difficult to obtain.
- Use an iid approximation instead: each position i is classified independently from the window o_i, …, o_{i+L−1} (the symbol itself plus the next L − 1 symbols), giving $P_e^* \approx P_e^{IID}$.
- The expression for $P_e^{IID}$ is still very complex to evaluate or interpret.
IID-pure
- Make the further simplifying assumption that each substring of length L starting at position i is generated either by a run of L background states or by a run of L pattern states.
- The associated error is
$$\hat{P}_e^{IID} = \sum_{l=0}^{L} \binom{L}{l} (n_A - 1)^l \min\left\{ (1 - \epsilon)^{L - l} \left( \frac{\epsilon}{n_A - 1} \right)^{l} F,\ \left( \frac{1}{n_A} \right)^{L} (1 - F) \right\}$$
with $P_e^* \approx P_e^{IID} \approx \hat{P}_e^{IID}$.
- Here l counts the substituted positions in the window: there are $\binom{L}{l}(n_A - 1)^l$ strings at Hamming distance l from the pattern, the first term in the min is the probability of such a string arising from a pattern run, the second from a background run, and the min is the probability that the optimal decision is wrong.
- Normalized Bayes error rate: $P_e^{N*} = P_e^* / F$.
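A sketch of this formula in code (my own; symbol names follow the slide), together with a brute-force check that enumerates all n_A^L windows under the IID-pure model:

```python
# Closed-form IID-pure Bayes error plus a brute-force verification.
from itertools import product
from math import comb

def p_e_iid_pure(L, n_A, eps, F):
    total = 0.0
    for l in range(L + 1):                   # l = substituted positions
        pattern_term = (1 - eps) ** (L - l) * (eps / (n_A - 1)) ** l * F
        background_term = (1.0 / n_A) ** L * (1 - F)
        total += comb(L, l) * (n_A - 1) ** l * min(pattern_term, background_term)
    return total

def p_e_brute(pattern, n_A, eps, F):
    L = len(pattern)
    total = 0.0
    for o in product(range(n_A), repeat=L):  # every possible length-L window
        l = sum(a != b for a, b in zip(o, pattern))
        total += min((1 - eps) ** (L - l) * (eps / (n_A - 1)) ** l * F,
                     (1.0 / n_A) ** L * (1 - F))
    return total

L, n_A, eps, F = 5, 4, 0.1, 0.005
print(p_e_iid_pure(L, n_A, eps, F))          # closed form
print(p_e_brute([0] * L, n_A, eps, F))       # identical by construction
print(p_e_iid_pure(L, n_A, eps, F) / F)      # normalized error, P*_e / F
```

The two computations agree exactly because every window lies at a unique Hamming distance l from the pattern, and the closed form simply groups windows by that distance.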
Normalized Error
- The normalized error rate increases with the expected number of substitutions n_s.
- The normalized error rate decreases with increasing pattern frequency F.
- The normalized Bayes error rate decreases with increasing pattern length if the metric of Sze et al. (2002), who define the expected percentage of pattern symbols with substitution errors, is held constant.
[Figures 2 and 3: normalized error curves and the comparison with Sze et al. (2002).]
Insights from the analytical expression
- Even as the substitution error ε → 0, the trial case L = 5, n_A = 4, F = 0.005 leaves about 20% normalized error: background substrings that exactly match the pattern are frequent enough to be confused with true occurrences.
- For fixed pattern length and pattern frequency, if ε > 0.28 all pattern symbols are classified as background.
- For fixed L and ε, if F is less than about 3 in 1,000, all patterns are classified as background.
- Introducing insertions increases the Bayes error rate.
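A standalone numeric check of these claims under the IID-pure approximation (my own snippet; the crossover values are approximate):

```python
# Check the quoted insights with the IID-pure formula; illustrative only.
from math import comb

def norm_err(L, n_A, eps, F):
    bg = (1.0 / n_A) ** L * (1 - F)
    tot = sum(comb(L, l) * (n_A - 1) ** l *
              min((1 - eps) ** (L - l) * (eps / (n_A - 1)) ** l * F, bg)
              for l in range(L + 1))
    return tot / F

# eps -> 0 with L=5, n_A=4, F=0.005: normalized error stays near 0.2
print(norm_err(5, 4, 1e-9, 0.005))
# sweeping eps: the error saturates at 1 (all pattern symbols classified
# as background) around the quoted threshold eps ~ 0.28
for eps in (0.1, 0.2, 0.28, 0.3):
    print(eps, norm_err(5, 4, eps, 0.005))
```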
The effect of pattern structure
- The Bayes error is higher for structured patterns: the error for BBBBBBBBBB is higher than for BCBCBCBCBC, which in turn is higher than for BCCBCCBCCB.
- More generally, the Bayes error is higher for highly autocorrelated patterns, i.e., patterns that match shifted copies of themselves.
[Figures: Bayes error as a function of pattern structure.]
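One simple, hypothetical way to quantify this ordering is to count self-overlaps, the shifts at which a pattern matches a copy of itself; the paper's analysis works through the Bayes error, but this measure ranks the three example patterns the same way:

```python
# Count shifts k at which the pattern's prefix matches its suffix.
def self_overlap(p):
    return sum(p[k:] == p[:-k] for k in range(1, len(p)))

for p in ["BBBBBBBBBB", "BCBCBCBCBC", "BCCBCCBCCB"]:
    print(p, self_overlap(p))
# -> 9, 4, 3: most to least autocorrelated, matching the error ordering
```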
Three Pattern Discovery Algorithms
- The Motif Sampler of Liu et al. (1995) (IID-Gibbs).
- The MEME algorithm of Bailey and Elkan (1995) (IID-EM).
- An HMM-based algorithm (HMM-EM).
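As a sketch of the first of these, here is a minimal Gibbs motif sampler in the spirit of Liu et al. (1995). This is my own simplification (one motif occurrence per sequence, uniform background, alphabet letters starting at 'A'), not the paper's implementation:

```python
# Minimal IID-Gibbs motif sampler sketch; illustrative only.
import numpy as np

def gibbs_motif(seqs, W, n_A, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    X = [np.array([ord(c) - ord('A') for c in s]) for s in seqs]
    pos = [int(rng.integers(0, len(x) - W + 1)) for x in X]
    for _ in range(iters):
        for i in range(len(X)):
            # PWM with Laplace smoothing from all sequences except i
            counts = np.ones((W, n_A))
            for j, x in enumerate(X):
                if j != i:
                    counts[np.arange(W), x[pos[j]:pos[j] + W]] += 1
            pwm = counts / counts.sum(axis=1, keepdims=True)
            # likelihood ratio of each start position vs uniform background
            x = X[i]
            scores = np.array([
                np.prod(pwm[np.arange(W), x[s:s + W]]) * n_A ** W
                for s in range(len(x) - W + 1)
            ])
            pos[i] = int(rng.choice(len(scores), p=scores / scores.sum()))
    return pos
```

Run on sequences that each contain one noisy copy of a width-W motif, the sampled positions should concentrate around the true locations after a few hundred sweeps.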
Comparing the three algorithms
- With a strong prior on the pattern frequency, all three algorithms perform equally well.
- With a weak prior, IID-Gibbs remains unaffected, but the other algorithms worsen, with HMM-EM performing worst.
- Asymptotically, all three appear to converge to the true error rate.
[Figures 8 and 9: performance of the three algorithms.]
Component-Wise Breakdown of Error
- The basic Bayes error.
- Additional error due to noise in the parameter estimates.
- Further additional error from not knowing where the patterns are located.
High accuracy is obtained when all of the above errors are small.
Finding Real Motifs
- An HMM was fitted to data from E. coli DNA-binding protein families.
- The estimated Bayes error rate appears to be independent of the training sample size.