
Probabilistic Suffix Trees Maria Cutumisu CMPUT 606 October 13, 2004

2 Goal • Provide efficient prediction for protein families • Probabilistic Suffix Trees (PSTs) are variable-length Markov models (VMMs)

3 Conceptual Map (diagram relating Suffix Trees, Variable-Length Markov Models, Probabilistic Suffix Trees, bPST, and ePST)

4 Background • PSTs were introduced by Ron, Singer, and Tishby • Bejerano and Yona made further improvements (bPST) • Poulin – efficient PSTs (ePSTs) • PSTs are also known as prediction suffix trees

5 Higher-Order Markov Models • A k-order Markov chain conditions each symbol on a history of length k • Storage requirements grow exponentially with k • As the order of the chain increases, more training data is needed to estimate the parameters accurately
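To make the storage cost concrete (a worked example added here, not on the original slide): a fixed k-order chain over an alphabet Σ stores one conditional distribution per length-k context, i.e. on the order of |Σ|^k (|Σ| − 1) free parameters; for amino-acid sequences (|Σ| = 20) and k = 3 that is already 20^3 · 19 = 152,000 parameters.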

6 Variable-Length Markov Models (VMMs) • Space- and parameter-estimation efficient: the length of the history used for prediction varies, and only the parameters that are actually needed are stored • Can be created from less training data • (Slide figure: a test sequence ">T1 AHGSGYMNAB" is compared against a set of training sequences: is T1 in the training set?)

7 VMMs • P(sequence) is the product of the probabilities of each amino acid given those that precede it • The conditional probability of each amino acid depends on its context • A context function k(·) can select the history length based on the context x_1 ... x_{i−1} x_i • VMMs were first introduced as PSTs
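Written out (a reconstruction of the factorization the slide describes; k(i) is the history length the context function selects at position i):

P(x_1 ... x_n) = ∏_{i=1}^{n} P(x_i | x_{i−k(i)} ... x_{i−1})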

8 PSTs • VMMs for efficient prediction • Pruned during training to contain only the required parameters • bPST: represents histories • ePST: represents sequences

9 bPST • Used to represent the histories for prediction instead of the training sequences • The possible histories are the reversed strings of all the substrings of the training sequences

10 Prediction with bPSTs • The conditional probabilities P(x_i | x_{i−1} ...) are obtained for each position by tracing a path from the root that matches the preceding residues

11 Construction bPST • We add the histories of the training data • Nodes hold parameters that estimate the conditional probabilities: γ_history(a) = P(a | history) • Prediction falls back to shorter contexts: P_bPST(x_i | x_{i−1}, ..., x_1) = γ_{x_1...x_{i−1}}(x_i) if that context is in the bPST; else γ_{x_2...x_{i−1}}(x_i) if in the bPST; and so on, down to γ(x_i) at the root
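A minimal sketch of this fallback rule (added for illustration; the dict-of-contexts representation and the function names are my own assumptions, and a real bPST stores the contexts reversed along root-to-node paths rather than in a flat dict):

def gamma_lookup(model, history, symbol):
    """Return gamma_context(symbol) for the longest suffix of `history`
    kept in the pruned model; the empty context "" plays the role of the root."""
    for start in range(len(history) + 1):   # drop the oldest symbols first
        context = history[start:]
        if context in model and symbol in model[context]:
            return model[context][symbol]
    raise KeyError("no distribution found for symbol %r" % symbol)

def sequence_probability(model, sequence):
    """P(sequence) as the product of the per-position conditional probabilities."""
    prob = 1.0
    for i, symbol in enumerate(sequence):
        prob *= gamma_lookup(model, sequence[:i], symbol)
    return prob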

12 bPST created and pruned (worked example; figure: Brett Poulin) P(01001) = P(0) P(1|0) P(0|01) P(0|010) P(1|0100) = γ(0) γ_0(1) γ_01(0) γ_{0*}(0) γ_{00*}(1) = (13/27)(8/13)(5/8)(5/13)(4/5) = 10400/182520 ≈ 0.057
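With the sketch above, the slide's worked example can be reproduced; the toy model below contains only the γ values the slide actually uses (all other entries are left out):

model = {
    "":   {"0": 13/27},
    "0":  {"0": 5/13, "1": 8/13},
    "01": {"0": 5/8},
    "00": {"1": 4/5},
}
print(sequence_probability(model, "01001"))   # about 0.0570 (= 20/351)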

13 Complexity bPST • Building the bPST requires O(Ln²) time, where L is the length limit of the tree and n is the total length of the training set • Building requires all training sequences at once (to obtain all the reversed substrings) and cannot be done online (the bPST cannot be built as the training data is encountered) • Prediction: O(mL), where m is the sequence length

14 Improved bPST • Idea: build the tree over the training sequences themselves • n = total length of the training sequences, m = length of the tested sequence • Result (theoretical): linear-time building O(n) and linear-time prediction O(m)

15 Efficient PST (ePST) • Used for predicting protein function • An ePST represents sequences • Linear construction and prediction

16 Example ePST (figure: Brett Poulin)

17 Prediction with ePSTs • The probabilities for a substring are obtained for each position by tracing the path that represents the sequence from the root • If the entire sequence is not found in the tree, suffix links are followed

18 Construction ePST • ePSTs gain efficiency by representing the training sequences in the PST • Nodes store counts of the subsequence occurrences in the training data (with respect to the complete tree) • Conditional probabilities derived from the counts are stored as well
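A rough sketch of the counting idea behind slides 17 and 18 (my own simplification, not Poulin's actual construction: the real ePST uses suffix-tree machinery and suffix links to reach linear time, while this version naively counts substrings up to a length limit L):

from collections import defaultdict

def count_contexts(training_sequences, L):
    """Count every context of length <= L and every (context, next symbol) pair."""
    context_counts = defaultdict(int)
    emission_counts = defaultdict(lambda: defaultdict(int))
    for seq in training_sequences:
        for i in range(len(seq)):
            for k in range(min(L, i) + 1):        # contexts of length 0..L
                context = seq[i - k:i]
                context_counts[context] += 1
                emission_counts[context][seq[i]] += 1
    return context_counts, emission_counts

def conditional(context_counts, emission_counts, context, symbol):
    """Empirical P(symbol | context) derived from the stored counts."""
    return emission_counts[context][symbol] / context_counts[context]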

19 Example ePST for the sequence AYYYA (figure: Brett Poulin)

20 Complexity ePST • Linear time and space with respect to the combined length n of the training sequences: O(n) • Linear prediction time: O(m)

21 Advantages and Disadvantages • Avoid the exponential space requirements and parameter-estimation problems of higher-order Markov chains • Pruned during training to contain only the required parameters • bPSTs for local predictions: more accurate prediction than global • Some loss in classification performance on the Pfam and SCOP benchmarks

22 Conclusions • PSTs require less training and prediction time than HMMs • Despite some loss in classification performance, PSTs compete with HMMs because of their reduced resource demands • PSTs take advantage of the higher-order correlations that VMMs capture

23 References • Brett Poulin, Sequence-based Protein Function Prediction, M.Sc. thesis, University of Alberta, 2004 • G. Bejerano and G. Yona, Modeling protein families using probabilistic suffix trees, RECOMB '99 • G. Bejerano, Algorithms for variable length Markov chain modeling, Bioinformatics Applications Note, 20(5):788–789, 2004

24 PSTs and HMMs • "HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions." [1] • PSTs are variable-length Markov models for efficient prediction. The prediction uses the longest available context matching the history of the current amino acid. • For protein prediction in general, "the main advantage of PSTs over HMMs is that the training and prediction time requirements of PSTs are much less than for the equivalent HMMs." [1]

25 Suffix Trees (ST) (figure: Brett Poulin)

26 bPST • Histories added to the tree must occur more frequently than a threshold P_min • The substrings are added in order of length, from smallest to largest

27 bPST vs ST • The string s is only added to the tree if the resulting conditional probability at the node to be created will be greater than the minimum prediction probability γ_min + α, and the probability for the prefix of the string differs (by some ratio r) from the probability assigned to the next-shortest substring suf(s), which is already in the tree • After all the substrings are added to the tree, the probabilities are smoothed according to the parameter γ_min • The smoothing (see the equation below) prevents any probability from being less than γ_min
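The equation itself did not survive the transcript. For reference, the smoothing rule commonly used for PSTs (Ron, Singer and Tishby; Bejerano and Yona), which is presumably what the slide showed, is

γ_s(a) = (1 − |Σ| · γ_min) · P̂(a | s) + γ_min

where P̂(a | s) is the empirical conditional probability at node s and |Σ| is the alphabet size; because every symbol receives at least the γ_min term, no smoothed probability can fall below γ_min.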
