Predicting E. Coli Promoters Using SVM
DIEP MAI
Course: CS/ECE 539 – Fall 2008
Instructor: Prof. Yu Hen Hu
Purpose
Build and train an SVM system to predict E. Coli promoters from the given gene sequences.
Example: Given the gene sequence
aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc
is it an E. Coli promoter?
For more theoretical information about E. Coli promoters:
Dataset
Data file is obtained from
Dataset information:
- Number of instances: 106
- Attributes:
  - Number of attributes: 57
  - Type: Non-numeric nominal values (A, C, G, or T)
- Classes:
  - Number of classes: 2
  - Type: Positive (+1) or Negative (-1)
Data preprocessing
- Randomly partition the dataset into TRAINSET and TESTSET
  Ratio = TESTSET / (TRAINSET + TESTSET)
- Encode the non-numeric attributes as decimal values
  A = 1, C = 2, G = 4, T = 8
- Scale each feature to [-1, 1] so that features with large values do not dominate those with small values.
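The three preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the project: the function names, the fixed random seed, and the choice to compute the scaling range over all rows passed in are assumptions.

```python
import random

# Encoding from the slide: A = 1, C = 2, G = 4, T = 8 (decimal).
NUCLEOTIDE_CODE = {"a": 1, "c": 2, "g": 4, "t": 8}


def encode(sequence):
    """Map each nucleotide of a sequence to its numeric code."""
    return [NUCLEOTIDE_CODE[base] for base in sequence.lower()]


def scale_features(rows, lo=-1.0, hi=1.0):
    """Linearly rescale every column of `rows` to the interval [lo, hi]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, cmin, cmax in zip(row, mins, maxs):
            if cmax == cmin:
                # Constant column: place it at the midpoint (an assumption).
                new_row.append((lo + hi) / 2)
            else:
                new_row.append(lo + (hi - lo) * (value - cmin) / (cmax - cmin))
        scaled.append(new_row)
    return scaled


def partition(samples, ratio=0.2, seed=0):
    """Shuffle, then split so that len(test) / len(samples) is about `ratio`."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (TRAINSET, TESTSET)
```

With 106 instances and a ratio of 1/5, `partition` yields 85 training and 21 test samples.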
Approach
An RBF kernel is used, so “good” C (cost) and G (gamma) parameters must be found.
Parameter scanning:
- Set the range of C to [2^-15, 2^5] and G to [2^-15, 2^2]
- For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates
- This process is repeated a few times; the pair that "often" produces high accuracy rates is more likely to be selected.
Training/Testing:
- Use the selected parameters and the whole TRAINSET to train the system.
- Use the trained system to predict the TRAINSET (preferred accuracy rate = 100%).
- Use the trained system to predict the TESTSET.
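The scanning loop can be sketched as below. The SVM itself is left out: `cv_accuracy` is a hypothetical callback standing in for "train an RBF-kernel SVM with (C, G) and return its cross-validated accuracy". The exponent step and the `top_k` cutoff are also assumptions, since the slides do not state how the range was sampled or how many pairs count as "high accuracy".

```python
from collections import Counter


def power_of_two_grid(min_exp, max_exp, step=1.0):
    """Return 2**e for exponents e from min_exp to max_exp, spaced by `step`."""
    count = int(round((max_exp - min_exp) / step)) + 1
    return [2.0 ** (min_exp + i * step) for i in range(count)]


def scan_parameters(cv_accuracy, c_values, g_values, repetitions=10, top_k=3):
    """Count how often each (C, G) pair ranks among the top_k of a repetition.

    `cv_accuracy(C, G, repetition)` is a hypothetical callback that returns
    the cross-validated accuracy of an RBF-kernel SVM for that pair.
    """
    wins = Counter()
    for rep in range(repetitions):
        scores = {
            (c, g): cv_accuracy(c, g, rep)
            for c in c_values
            for g in g_values
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        wins.update(ranked[:top_k])
    # Pairs that "often" score well come first.
    return wins.most_common()
```

Calling `scan_parameters(my_scorer, power_of_two_grid(-15, 5), power_of_two_grid(-15, 2))` covers the ranges given above; the first entry of the result is the pair that most often reached the top.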
Results
Configuration:
- Ratio of partitioning dataset = 1/5
  Split the dataset into 5 roughly equal sets; one is preserved as TESTSET
- K-fold = 15 (15 folds in total)
- Number of repetitions to select parameters = 10
After running the system several times:
Training result
- Accuracy rate (Avg., Best): 84.35%, 84.23%, 88.82%, 83.52%, 85.88%, 90.59%
- “Best” (C, G):
  C: 0.7071, 1.1892, 1.0000
  G: 0.0371, 0.0313, 0.0743, 0.0625
Accuracy rate of the testing process
- Occurrence frequency: Often, Sometimes, Rare
- TRAINSET: 85/85 = 100%
- TESTSET: 19/21 = 90.48%, 18/21 = 85.71%, 20/21 = 95.24%, 21/21 = 100%, 15/21 = 71.43%
Observation/Conclusion
SVM:
- For this dataset the number of attributes is not large, so using an RBF kernel to map the features to a higher dimension seems appropriate.
- Scanning (C, G) takes a large amount of time. One approach to speed up this process:
  1. Split the range into “large” equal intervals
  2. Pick the interval that yields high accuracy rates
  3. Divide this interval into smaller equal intervals
  4. Repeat
K-fold method:
- The larger the number of folds, the more time the process requires.
- For this dataset the number of instances is not large, so large numbers of folds seem to work well.
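The interval-splitting speed-up might look like the sketch below, which works on the exponents of C and G (log2 scale). It assumes the accuracy surface has a single broad peak, so that the best-scoring cell contains the optimum. The `score` callback, the number of intervals, and the number of rounds are hypothetical.

```python
def coarse_to_fine(score, c_range, g_range, intervals=4, rounds=3):
    """Repeatedly shrink the (log2 C, log2 G) search box around its best cell.

    `score(C, G)` is a hypothetical callback returning an accuracy estimate.
    `c_range` and `g_range` are (low, high) exponent bounds, e.g. (-15, 5).
    """
    for _ in range(rounds):
        c_lo, c_hi = c_range
        g_lo, g_hi = g_range
        c_width = (c_hi - c_lo) / intervals
        g_width = (g_hi - g_lo) / intervals
        # Represent every cell of the grid by its midpoint.
        cells = [
            (c_lo + (i + 0.5) * c_width, g_lo + (j + 0.5) * g_width)
            for i in range(intervals)
            for j in range(intervals)
        ]
        best_c, best_g = max(cells, key=lambda p: score(2.0 ** p[0], 2.0 ** p[1]))
        # Keep only the winning cell and subdivide it in the next round.
        c_range = (best_c - c_width / 2, best_c + c_width / 2)
        g_range = (best_g - g_width / 2, best_g + g_width / 2)
    return c_range, g_range
```

Each round costs `intervals ** 2` evaluations, so three rounds of a 4 x 4 grid need 48 evaluations while narrowing each range by a factor of 64.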