10. Decision Trees and Markov Chains for Gene Finding.



2 Introduction  Decision trees for the analysis of DNA sequence data  How decision trees are used in the gene-finding system MORGAN

3 1. Decision Trees (1)  Basic structure: Chapter 2  Assumption: all tests are binary or “yes-no” questions  Problems: feature selection and the decision tree induction algorithm

4 1. Decision Trees (2)

5 1.1 Induction of decision trees (1)  Algorithms: ID3, C4.5, OC1, CART  S: a set of non-overlapping DNA subsequences  An example: a collection of feature values (e.g., GC content) and a class label (exon, intron, …)

6 1.1 Induction of decision trees (2)  Algorithm Build-Tree(S)  Find a test that splits the set S into two (or more) subsets.  Score the subsets (using a splitting rule).  If a subset is pure, make a leaf node for it and stop.  Otherwise, call Build-Tree recursively on any subsets that are not pure.
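To make the recursion concrete, here is a minimal Python sketch of the Build-Tree procedure. The find_best_test helper, the (feature_vector, label) encoding of examples, and the tuple-based tree representation are illustrative assumptions, not MORGAN's actual data structures.

from collections import Counter

def is_pure(examples):
    # A subset is pure when every example carries the same class label.
    return len({label for _, label in examples}) <= 1

def majority_label(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, find_best_test):
    # examples: a non-empty list of (feature_vector, label) pairs.
    if is_pure(examples):
        return ("leaf", examples[0][1])
    test = find_best_test(examples)           # e.g., a univariate threshold test
    left = [ex for ex in examples if test(ex[0])]
    right = [ex for ex in examples if not test(ex[0])]
    if not left or not right:                 # no test separates the examples
        return ("leaf", majority_label(examples))
    return ("node", test,
            build_tree(left, find_best_test),
            build_tree(right, find_best_test))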

7 1.2 Splitting rules (1)  Univariate test  Common approach: consider each feature in turn and choose a threshold for the test  At most N-1 different tests for any set of N examples  Tests occur at the midpoints between successive feature values

8 1.2 Splitting rules (2)  With D features, D(N-1) possible tests  Score: an impurity measure  Simplest way: count the number of examples that would be misclassified by the test  Information gain: entropy  Statistical measures
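As an illustration of the threshold search and the entropy-based score, here is a small sketch for a single feature. It assumes at least two distinct feature values and uses the textbook definition of information gain, not any particular package's implementation; repeating best_threshold over each of the D features yields the D(N-1) candidate tests mentioned above.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    # Candidate thresholds are the midpoints between successive distinct
    # feature values: at most N-1 tests for N examples, as the slide notes.
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))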

9 1.2 Splitting rules (3)  Oblique split  Linear discriminant  Requires much more computation than a univariate test  Efficient methods exist, e.g., OC1
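For contrast with the univariate case, a sketch of what an oblique test looks like: a threshold on a linear combination of features. The weights and numbers below are made up for illustration; the hard part, which OC1 addresses with an efficient randomized search, is finding good weights, and this sketch does not implement that search.

def oblique_test(weights, threshold):
    # Returns a test that fires when the weighted sum w · x falls
    # at or below the threshold.
    def test(x):
        return sum(w * xi for w, xi in zip(weights, x)) <= threshold
    return test

# Example over two hypothetical features (say, GC content and a hexamer score):
t = oblique_test([0.7, -1.3], 0.25)
print(t([0.5, 0.1]))   # 0.7*0.5 - 1.3*0.1 = 0.22 <= 0.25, so True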

10 1.3 Pruning rules (1)  Problem: overfitting  Cost-complexity pruning (OC1)  Separate the data randomly into two sets, the training set (T) and the pruning set (P)  Build a complete tree using T  Measure the cost complexity of each non-leaf node: the number of examples that would be misclassified if that node were made into a leaf, together with the size of the subtree rooted at that node

11 1.3 Pruning rules (2)  The node whose cost complexity is greatest is pruned  Pruning stops when the tree has been pruned down to a single node  Examine each of the smaller trees using P  The tree with the highest accuracy on P becomes the output of the system
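A schematic sketch of this prune-then-select loop, with the tree operations left abstract: nonleaf_nodes, cost_complexity, collapse, and accuracy_on_P are hypothetical helpers standing in for whatever tree representation is in use.

def prune_sequence(tree, nonleaf_nodes, cost_complexity, collapse):
    # Generate successively smaller trees until only the root remains.
    trees = [tree]
    while nonleaf_nodes(trees[-1]):
        # Per the slide: prune the node whose cost complexity is greatest.
        node = max(nonleaf_nodes(trees[-1]), key=cost_complexity)
        trees.append(collapse(trees[-1], node))
    return trees

def select_tree(trees, accuracy_on_P):
    # The pruned tree with the highest accuracy on the pruning set P
    # becomes the output of the system.
    return max(trees, key=accuracy_on_P)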

12 1.4 Internet resources for decision tree software  OC1: http://www.cs.jhu.edu/labs/compbio/home.html  C4.5: ftp://ftp.cs.su.oz.au/pub/ml  IND (CART, C4.5): http://www.ultimode.com/~wray

13 2. Decision Trees to Classify Sequences  Convert each sequence into a set of features  MORGAN  Considered most of the 21 coding measures  Settled on a small subset of features by experimentation

14 Coding measure  In-frame hexamer statistic for distinguishing coding and non-coding DNA  A value greater than zero indicates that the sequence looks more like coding DNA than non-coding DNA
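One way to compute such a statistic is as a log-likelihood ratio over in-frame 6-mers, as sketched below. The hexamer frequency tables coding_freq and noncoding_freq are assumed inputs, estimated elsewhere from training data; MORGAN's exact formulation may differ.

import math

def hexamer_score(seq, coding_freq, noncoding_freq, pseudo=1e-6):
    # Sum log-odds over in-frame hexamers, stepping one codon at a time.
    # A positive total suggests the sequence looks more like coding DNA.
    score = 0.0
    for i in range(0, len(seq) - 5, 3):
        hexamer = seq[i:i + 6]
        p_cod = coding_freq.get(hexamer, pseudo)
        p_non = noncoding_freq.get(hexamer, pseudo)
        score += math.log(p_cod / p_non)
    return score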

15 Position asymmetry statistic  Counts the frequency of each base in each of the three codon positions  Four separate feature values  Scoring the donor and acceptor sites using a second-order Markov chain
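Sketches of both ideas, under stated assumptions: position_asymmetry measures, for each base, how unevenly it is spread over the three codon positions, here via a simple variance (the normalization MORGAN actually uses may differ); the Markov-chain scorer conditions each base on the two preceding bases, with transition probabilities trained separately on known donor or acceptor sites.

import math
from collections import Counter

def position_asymmetry(seq):
    # One feature value per base, hence four values in total.
    counts = {b: [0, 0, 0] for b in "ACGT"}
    for i, base in enumerate(seq):
        if base in counts:
            counts[base][i % 3] += 1
    features = {}
    for base, per_pos in counts.items():
        total = sum(per_pos)
        if total == 0:
            features[base] = 0.0
            continue
        freqs = [c / total for c in per_pos]
        mean = sum(freqs) / 3
        features[base] = sum((f - mean) ** 2 for f in freqs) / 3
    return features

def train_markov2(sites):
    # Estimate second-order transition probabilities from example sites.
    pair, context = Counter(), Counter()
    for s in sites:
        for i in range(2, len(s)):
            pair[(s[i - 2:i], s[i])] += 1
            context[s[i - 2:i]] += 1
    return {k: pair[k] / context[k[0]] for k in pair}

def markov2_score(window, probs, pseudo=1e-6):
    # Log-probability of a candidate site under the trained chain.
    return sum(math.log(probs.get((window[i - 2:i], window[i]), pseudo))
               for i in range(2, len(window)))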

16 3. Decision Trees as probability estimators  A tree can report not only a class label but also a measure of confidence in that prediction

17 OC1  Builds ten different trees (from the same training set)  Probability estimate: the average of the ten trees' class distributions
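A minimal sketch of the averaging step, assuming a hypothetical classify helper that returns the class distribution at the leaf a given example reaches in a given tree.

def ensemble_probability(trees, example, classify):
    # Average the per-tree leaf distributions into one probability estimate.
    totals = {}
    for tree in trees:
        for label, p in classify(tree, example).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: p / len(trees) for label, p in totals.items()}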

