10. Decision Trees and Markov Chains for Gene Finding
Introduction
- Decision trees for the analysis of DNA sequence data
- How decision trees are used in the gene-finding system MORGAN
1. Decision Trees (1)
- Basic structure: Chapter 2
- Assumption: all tests are binary ("yes-no") questions
- Problems: feature selection; the decision tree induction algorithm
1. Decision Trees (2)
1.1 Induction of decision trees (1)
- Algorithms: ID3, C4.5, OC1, CART
- S: a set of non-overlapping DNA subsequences
- An example: a collection of feature values (e.g. GC content) and a class label (exon, intron, ...)
1.1 Induction of decision trees (2)
Algorithm Build-Tree(S):
1. Find a test that splits the set S into two (or more) subsets.
2. Score the subsets (using a scoring rule).
3. If a subset is pure, make a leaf node for it and stop.
4. Else, call Build-Tree recursively on each subset that is not pure.
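The Build-Tree procedure above can be sketched as a short recursive program. This is an illustrative toy, not MORGAN's implementation: the tuple-based tree representation, the helper names, and the use of misclassification count as the scoring rule (the "simplest way" from section 1.2) are all assumptions.

```python
from collections import Counter

def is_pure(examples):
    # examples: list of (feature_vector, class_label) pairs
    return len({label for _, label in examples}) <= 1

def majority_label(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def misclassified(examples):
    # Number of examples not in the subset's majority class
    if not examples:
        return 0
    return len(examples) - Counter(l for _, l in examples).most_common(1)[0][1]

def best_split(examples):
    # Try every feature and every midpoint threshold (see section 1.2);
    # score each binary test by the number of misclassified examples.
    n_features = len(examples[0][0])
    best = None
    for f in range(n_features):
        values = sorted({x[f] for x, _ in examples})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [e for e in examples if e[0][f] <= t]
            right = [e for e in examples if e[0][f] > t]
            err = misclassified(left) + misclassified(right)
            if best is None or err < best[0]:
                best = (err, f, t)
    return best  # (error, feature, threshold), or None if unsplittable

def build_tree(examples):
    if is_pure(examples):
        return ('leaf', examples[0][1])
    split = best_split(examples)
    if split is None:                      # identical feature vectors
        return ('leaf', majority_label(examples))
    _, f, t = split
    left = [e for e in examples if e[0][f] <= t]
    right = [e for e in examples if e[0][f] > t]
    return ('node', f, t, build_tree(left), build_tree(right))

def classify(tree, x):
    if tree[0] == 'leaf':
        return tree[1]
    _, f, t, left, right = tree
    return classify(left if x[f] <= t else right, x)
```

For example, training on four one-feature examples (two high-GC "exon", two low-GC "intron") yields a single split near GC = 0.475 with a pure leaf on each side.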
1.2 Splitting rules (1)
- Univariate test: the common approach
- Consider each feature in turn and choose a threshold for the test
- For any set of N examples there are at most N-1 different tests: the candidate thresholds are the midpoints between successive feature values
1.2 Splitting rules (2)
- With D features, there are D(N-1) possible tests
- Score: an impurity measure
  - Simplest way: count the number of examples that would be misclassified by the test
  - Information gain, based on entropy
  - Statistical measures
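The entropy-based score works as follows: a candidate test is good if the two subsets it produces are more homogeneous than the parent set. A minimal sketch (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set's class labels, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy of the parent set minus the size-weighted entropy of the
    # two subsets produced by a candidate test.
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```

A test that perfectly separates three exons from three introns gains one full bit: the parent's entropy is 1.0 and both subsets are pure.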
1.2 Splitting rules (3)
- Oblique split: a linear discriminant test
- Requires much more computation than a univariate test
- Efficient methods exist, e.g. OC1
1.3 Pruning rules (1)
Problem: overfitting
Cost-complexity pruning (OC1):
1. Separate the data randomly into two sets, a training set (T) and a pruning set (P).
2. Build a complete tree using T.
3. Measure the cost complexity of each non-leaf node, based on:
   - the number of examples that would be misclassified if that node were made into a leaf
   - the size of the subtree rooted at that node
4. Prune the node whose cost complexity is greatest.
5. Repeat until the tree is pruned down to a single node.
6. Examine each of the smaller trees using P; the tree with the highest accuracy on P becomes the output of the system.
1.3 Pruning rules (2)
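The two ingredients above (extra misclassifications if a node becomes a leaf, and subtree size) are usually combined into a single per-node number. The formula below is the standard CART-style measure, given here as an assumption about how the combination is done; OC1 may weight the terms differently:

```python
def cost_complexity(err_as_leaf, err_subtree, n_leaves):
    # Per-leaf increase in training error if the subtree rooted at this
    # node were collapsed into a single leaf:
    #   err_as_leaf  - examples misclassified if the node became a leaf
    #   err_subtree  - examples misclassified by the full subtree
    #   n_leaves     - number of leaves in the subtree (its "size")
    return (err_as_leaf - err_subtree) / (n_leaves - 1)
```

For instance, collapsing a 3-leaf subtree that makes 1 error into a leaf that would make 6 errors gives (6 - 1) / (3 - 1) = 2.5 extra errors per leaf removed.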
1.4 Internet resources for decision tree software
- OC1: http://www.cs.jhu.edu/labs/compbio/home.html
- C4.5: ftp://ftp.cs.su.oz.au/pub/ml
- IND (CART, C4.5): http://www.ultimode.com/~wray
2. Decision Trees to Classify Sequences
- Convert each sequence into a set of features
- MORGAN considered most of the 21 coding measures, then settled on a small subset of features by experimentation
Coding measure
- In-frame hexamer statistic for distinguishing coding from non-coding DNA
- A value greater than zero indicates that the sequence looks more like coding DNA than non-coding DNA
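A minimal sketch of an in-frame hexamer statistic: sum the log-likelihood ratios of the hexamers that start at each codon boundary. The frequency tables would be estimated from training data; the tiny tables in the example below are made up for illustration, and the function name is an assumption.

```python
import math

def hexamer_score(seq, coding_freq, noncoding_freq):
    # Sum of log-likelihood ratios over the in-frame hexamers, i.e. the
    # six-base windows starting at positions 0, 3, 6, ... (one per codon).
    # A score above zero means the sequence looks more like coding DNA.
    score = 0.0
    for i in range(0, len(seq) - 5, 3):
        h = seq[i:i + 6]
        score += math.log(coding_freq[h] / noncoding_freq[h])
    return score
```

With illustrative frequencies where ATGGCC is twice as common in coding DNA, a sequence rich in such hexamers scores positive.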
Position asymmetry statistic
- Counts the frequency of each base in each of the three codon positions
- Gives four separate feature values
- The donor and acceptor sites are scored using a second-order Markov chain
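A second-order Markov chain conditions each base on the two bases before it. The sketch below trains such a chain on example site sequences and scores a candidate against a uniform background; it is a simplified, homogeneous version (MORGAN's actual tables are position-specific and trained on real splice-site data), and the pseudocount smoothing and function names are assumptions.

```python
import math
from collections import defaultdict

def train_second_order(seqs, pseudocount=1.0):
    # Estimate P(base | previous two bases) from example site sequences,
    # with add-one style smoothing so unseen transitions get nonzero mass.
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(2, len(s)):
            counts[s[i - 2:i]][s[i]] += 1
    probs = {}
    for ctx, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        probs[ctx] = {b: (c.get(b, 0) + pseudocount) / total for b in 'ACGT'}
    return probs

def score_site(seq, probs, background=0.25):
    # Log-likelihood ratio of the sequence under the chain versus a
    # uniform background; higher scores look more site-like. Contexts
    # never seen in training fall back to the background probability.
    ll = 0.0
    for i in range(2, len(seq)):
        p = probs.get(seq[i - 2:i], {}).get(seq[i], background)
        ll += math.log(p / background)
    return ll
```

Trained on a few GT-rich donor-like windows, the chain scores a similar window well above an unrelated sequence.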
3. Decision Trees as Probability Estimators
- Decision trees are able to measure the confidence of a classification
OC1
- Builds ten different trees for the same training set
- Probability estimate: the average of these trees' class distributions
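Averaging the per-tree distributions can be sketched in a few lines. Each tree returns the class distribution of the leaf an example falls into; the combined estimate is their mean (the function name is illustrative, and the two-tree example stands in for the ten trees the system builds).

```python
def average_distribution(tree_distributions):
    # tree_distributions: one class-probability dict per tree for the
    # same example; the system's estimate is their elementwise average.
    n = len(tree_distributions)
    classes = tree_distributions[0].keys()
    return {c: sum(d[c] for d in tree_distributions) / n for c in classes}
```

For example, averaging a confident tree with an uncertain one yields an intermediate probability rather than a hard yes/no label, which is what makes confidence measurement possible.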