Sequential Pattern Discovery under a Markov Assumption


1 Sequential Pattern Discovery under a Markov Assumption
Darya Chudova, Padhraic Smyth.

2 Introduction
Problem: identify recurrent patterns, in the form of categorical sequences, in large data sets (e.g., motifs in a DNA sequence).
As an example, consider the pattern ADDABB embedded in a background process:
…BACADBADBBC[ADDABB]BACDBDBA[ADDACB]DAC…
(Note that the second bracketed occurrence, ADDACB, contains a substitution error.)

3 Models for the Patterns
Model 1 (True Model): the true model generating the patterns and the background is known. This corresponds to Bayes-optimal classification and the Bayes error.
Model 2 (Supervised Training): the general form of the model and the locations of the patterns are known.
Model 3 (Unsupervised Training): only the general form of the model is known; all parameters and pattern locations are unknown.

4 Contributions of the paper
Provide an accurate approximate expression for the Bayes error under a Markov assumption.
Illustrate how alphabet size, pattern length, pattern frequency, and pattern autocorrelation affect the Bayes error rate.
Empirically investigate several well-known algorithms in the Markov context.
Apply the theoretical framework to motif-finding problems and show how it helps in practice.

5 Hidden Markov Model
(Figure: a background state B plus pattern states P1 through P4. B emits each of A, B, C, D with probability 0.25; each pattern state emits its consensus symbol with probability 0.9. B loops to itself with probability 0.99 and enters P1 with probability 0.01; each Pi moves to Pi+1 with probability 1.0, and P4 returns to B with probability 1.0.)
The background state B can only transition to itself or to the first pattern state P1. Each pattern state Pi can only transition to state Pi+1, for 0 < i < L. The last pattern state PL can only transition back to the background state B.
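A minimal sketch of this HMM in Python; the transition structure follows the slide, while the consensus pattern ("ADDA"), the noise level, and the sequence length are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

nA, L, eps = 4, 4, 0.1          # alphabet size, pattern length, noise (assumed)
pattern = [0, 3, 3, 0]          # consensus symbols for P1..PL (assumed: "ADDA")
n_states = 1 + L                # state 0 = background B, states 1..L = P1..PL

# Transitions: B -> B (0.99) or B -> P1 (0.01); Pi -> Pi+1; PL -> B.
A = np.zeros((n_states, n_states))
A[0, 0], A[0, 1] = 0.99, 0.01
for i in range(1, L):
    A[i, i + 1] = 1.0
A[L, 0] = 1.0

# Emissions: B is uniform over the alphabet; Pi emits its consensus symbol
# with probability 1 - eps, spreading eps over the other nA - 1 symbols.
E = np.full((n_states, nA), 1.0 / nA)
for i, c in enumerate(pattern, start=1):
    E[i] = eps / (nA - 1)
    E[i, c] = 1.0 - eps

# Sample a short sequence from the model.
h, obs = 0, []
for _ in range(200):
    obs.append(rng.choice(nA, p=E[h]))
    h = rng.choice(n_states, p=A[h])
print("".join("ABCD"[o] for o in obs))
```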

6 The parameters
nA: size of the observable alphabet.
L: length of the pattern.
ε: probability of a substitution error in each pattern position.
ns: expected number of substitutions in a pattern, ns = L · ε.
F: frequency of pattern occurrence, so that the expected number of patterns in a sequence of length N is F · N.
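As a small illustration of how these parameters relate (the class and method names here are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class PatternParams:
    nA: int       # alphabet size
    L: int        # pattern length
    eps: float    # per-position substitution probability
    F: float      # pattern frequency

    def expected_substitutions(self) -> float:
        return self.L * self.eps      # ns = L * eps

    def expected_pattern_count(self, N: int) -> float:
        return self.F * N             # expected patterns in a length-N sequence

print(PatternParams(nA=4, L=5, eps=0.1, F=0.005).expected_substitutions())  # 0.5
```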

7 Bayes Error Rate
Under the iid assumption:
Pe* = Σ_o min_h { p(h = B | o), p(h = P1…L | o) } p(o)
For the Markov case:
Pe* = lim_{N→∞} (1/N) Σ_i min { p(hi = B | O), p(hi = P1…L | O) }
The Bayes-optimal rule classifies each position as whichever hypothesis (background or pattern) has the larger posterior, so the min term is exactly the probability of error at that position.
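In practice Pe* can be estimated from per-position posteriors p(hi = B | O), obtained for instance via the forward-backward algorithm; a minimal sketch, assuming the posteriors are already available:

```python
import numpy as np

def empirical_bayes_error(post_background: np.ndarray) -> float:
    """Average, over positions, of min{p(h_i = B | O), p(h_i = pattern | O)}."""
    return float(np.mean(np.minimum(post_background, 1.0 - post_background)))

# Hypothetical posteriors for a 5-symbol sequence:
print(empirical_bayes_error(np.array([0.99, 0.95, 0.40, 0.10, 0.98])))
```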

8 Analytical Expressions for Bayes Error Rate
A closed-form expression is difficult to obtain, so an iid approximation is used: each position i is classified independently based on the window of L symbols oi, …, oi+L−1.
Pe* ≈ PeIID
The expression for PeIID is still very complex to evaluate or interpret.

9 IID-pure
A further simplifying assumption: each substring of length L starting at position i is generated either by a run of L background states or by a run of L pattern states. The associated error is
Pe^IID = Σ_{l=0}^{L} C(L, l) (nA − 1)^l · min { (1 − ε)^{L−l} (ε/(nA − 1))^l · F, (1/nA)^L · (1 − F) }
so that Pe* ≈ PeIID ≈ Pe^IID.
Here l counts the substituted positions: C(L, l)(nA − 1)^l is the number of strings at Hamming distance l from the consensus pattern, and the min compares the probability that such a string was generated by the pattern model (weighted by F) against the background model (weighted by 1 − F).
Normalized Bayes error rate: PNe* = Pe*/F.
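This expression is straightforward to evaluate numerically; a minimal sketch:

```python
import math

def pe_iid_pure(nA: int, L: int, eps: float, F: float) -> float:
    """Evaluate the IID-pure approximation Pe^IID term by term."""
    total = 0.0
    for l in range(L + 1):
        n_strings = math.comb(L, l) * (nA - 1) ** l   # strings at Hamming distance l
        p_pat = (1 - eps) ** (L - l) * (eps / (nA - 1)) ** l * F
        p_bg = (1.0 / nA) ** L * (1 - F)
        total += n_strings * min(p_pat, p_bg)
    return total

nA, L, eps, F = 4, 5, 0.1, 0.005
print(pe_iid_pure(nA, L, eps, F) / F)   # normalized Bayes error estimate
```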

10 Normalized Error
The normalized error rate increases with increasing expected substitutions ns.
The normalized error rate decreases with increasing pattern frequency F.
The normalized Bayes error rate decreases with increasing pattern length if the metric of Sze et al. (2002), the expected percentage of pattern symbols with substitution errors, is held constant.
(Figures 2 and 3 in the paper plot these trends.)

11 Insights from the analytical expression
As the substitution error ε → 0 in the trial case with L = 5, nA = 4 and F = 0.005, about 20% of patterns are still misclassified, because the background occasionally generates the consensus string by chance.
For fixed pattern length and pattern frequency, if ε > 0.28, all pattern symbols are classified as background.
For fixed L and ε, if F is less than 3 in 1000, all patterns are classified as background.
Introducing insertions increases the Bayes error rate.
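The first point can be sanity-checked from the IID-pure expression on slide 9: at ε = 0 the only surviving term is l = 0, where the min picks the background side, giving a normalized error of (1/nA)^L (1 − F)/F.

```python
# Limit of the normalized IID-pure error as eps -> 0 (L = 5, nA = 4, F = 0.005):
# the background emits the exact consensus 5-gram with probability (1/nA)**L,
# and those chance occurrences are misclassified as pattern.
nA, L, F = 4, 5, 0.005
print((1.0 / nA) ** L * (1 - F) / F)   # ~0.194, i.e. roughly 20%
```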

12 The effect of pattern structure
Bayes error is higher for structured (self-similar) patterns: the error for BBBBBBBBBB is higher than for BCBCBCBCBC, which in turn is higher than for BCCBCCBCCB.
In general, the Bayes error is higher for highly autocorrelated patterns.
(The paper's graphs illustrate this ordering.)
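One concrete way to see the ordering is to count the shifts at which each pattern overlaps itself; this is a rough stand-in for the autocorrelation the slide refers to, not the paper's exact measure:

```python
def self_overlap_shifts(p: str) -> list[int]:
    """Shifts k (1 <= k < len(p)) at which the pattern matches a copy of itself."""
    return [k for k in range(1, len(p)) if p[: len(p) - k] == p[k:]]

for pat in ["BBBBBBBBBB", "BCBCBCBCBC", "BCCBCCBCCB"]:
    print(pat, len(self_overlap_shifts(pat)))   # 9, 4, 3 overlapping shifts
```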

13 Three Pattern Discovery Algorithms
The Motif Sampler of Liu et al. (1995) (IID-Gibbs).
The MEME algorithm of Bailey and Elkan (1995) (IID-EM).
An HMM-based algorithm (HMM-EM).
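For concreteness, a minimal sketch of the site-sampler idea behind IID-Gibbs, assuming one motif occurrence per sequence and a uniform background; this illustrates the general approach, not the authors' implementation:

```python
import random
import numpy as np

ALPHABET = "ABCD"   # assumed 4-letter alphabet, matching the paper's examples

def gibbs_site_sampler(seqs, L, iters=500, pseudo=0.5, seed=0):
    """Minimal Gibbs site sampler: one motif start position per sequence."""
    rng = random.Random(seed)
    idx = {c: i for i, c in enumerate(ALPHABET)}
    starts = [rng.randrange(len(s) - L + 1) for s in seqs]
    for _ in range(iters):
        for z in range(len(seqs)):
            # Build a position weight matrix from every sequence except z.
            counts = np.full((L, len(ALPHABET)), pseudo)
            for j, s in enumerate(seqs):
                if j != z:
                    for pos in range(L):
                        counts[pos, idx[s[starts[j] + pos]]] += 1
            pwm = counts / counts.sum(axis=1, keepdims=True)
            # Resample sequence z's start, weighting each window by its
            # motif-vs-uniform-background likelihood ratio.
            s = seqs[z]
            ratios = [
                np.prod([pwm[pos, idx[s[a + pos]]] * len(ALPHABET) for pos in range(L)])
                for a in range(len(s) - L + 1)
            ]
            starts[z] = rng.choices(range(len(ratios)), weights=ratios)[0]
    return starts
```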

14 Comparing the three algorithms
All three algorithms perform equally well when a strong prior is placed on the pattern frequency.
With a weak prior, IID-Gibbs remains unaffected but the other algorithms degrade, with HMM-EM performing worst.
Asymptotically, all three appear to converge to the true error rate.
(Figures 8 and 9 in the paper show these comparisons.)

15 Component-Wise Breakdown of Error
The total error decomposes into:
1. the basic Bayes error;
2. additional error due to noise in the parameter estimates;
3. further error from not knowing where the patterns are located.
High accuracy is obtained only if all three components are small.

16 Finding Real Motifs
An HMM is fitted to data from E. coli DNA-binding protein families.
The estimated Bayes error rate appears to be largely independent of the training sample size.


