Slide 1
Feature Selection and Error Tolerance for the Logical Analysis of Data
Craig Bowles, Cornell University
Kathryn Davidson, University of Pennsylvania
Mentor: Endre Boros, RUTCOR
Slide 2: Our Goals
- Train a computer to tell us which attributes in a medical data set are important.
- Have the computer suggest possible formulas for distinguishing healthy and sick patients.
- Achieve these goals with as much tolerance for data error as possible.
Slide 3: Training Data Set
- Wisconsin Breast Cancer Database from the University of Wisconsin Hospitals, Madison (Dr. William H. Wolberg).
- Sample patient vector: 1000025,5,1,1,1,2,1,3,1,1,2 (an ID number, 9 test results, and a class label).
- There are 699 patients in total (458 benign, class "2"; 241 malignant, class "4").
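A minimal Python sketch of reading one such record, assuming the comma-separated form shown above; the function name parse_record is just illustrative, and real copies of the database also mark missing values with "?", which this sketch does not handle:

    def parse_record(line):
        """Split 'ID, 9 test results, class' into its three parts."""
        fields = line.strip().split(",")
        patient_id = fields[0]
        attributes = [int(x) for x in fields[1:10]]  # the 9 test results, each 1-10
        label = int(fields[10])                      # 2 = benign, 4 = malignant
        return patient_id, attributes, label

    # The sample patient vector from this slide:
    pid, attrs, label = parse_record("1000025,5,1,1,1,2,1,3,1,1,2")
    print(pid, attrs, label)  # 1000025 [5, 1, 1, 1, 2, 1, 3, 1, 1] 2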
Slide 4: Minimal Difference Vectors → Dualization → Maximal Error
- Difference vector: |(5 4 4 5 7 10 3 2 1) - (6 8 8 1 3 4 3 7 1)| = (1 4 4 4 4 6 0 5 0).
- Note: an error of (1 4 4 4 4 6 0 5 0) would not distinguish (5 4 4 5 7 10 3 2 1) from (6 8 8 1 3 4 3 7 1).
- Collect the set of all difference vectors, one per benign/malignant pair (over 90,000 for the WBCD).
- Minimal difference vectors: those difference vectors for which no other difference vector is less than or equal to them in every coordinate.
- Next step: input the set of minimal difference vectors into the dualization algorithm.
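A Python sketch of these two steps, assuming the attribute vectors have already been separated into benign and malignant lists; the quadratic minimality filter below only illustrates the definition and is not the authors' implementation:

    from itertools import product

    def difference_vectors(benign, malignant):
        """One componentwise absolute difference per benign/malignant pair."""
        return {tuple(abs(a - b) for a, b in zip(x, y))
                for x, y in product(benign, malignant)}

    def minimal_vectors(diffs):
        """Keep v only if no other difference vector is <= v in every coordinate."""
        vs = list(diffs)
        return [v for v in vs
                if not any(w != v and all(wi <= vi for wi, vi in zip(w, v))
                           for w in vs)]

    # The pair from this slide:
    print(difference_vectors([(5, 4, 4, 5, 7, 10, 3, 2, 1)],
                             [(6, 8, 8, 1, 3, 4, 3, 7, 1)]))
    # {(1, 4, 4, 4, 4, 6, 0, 5, 0)}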
Slide 5: Minimal Difference Vectors → Dualization → Maximal Error Vectors
- A two-dimensional toy example (the red and blue points refer to a figure on the slide).
- Input: the minimal vectors (the red points).
- Output of the dualization code: (5,0), (3,2), (2,3).
- To find the blue points (what we want), subtract each output coordinate from the dimension of the grid, here 5: this gives (0,5), (2,3), (3,2).
- Dualization code: http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html
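A tiny Python sketch of the coordinate flip in this toy example, assuming the grid runs from 0 to 5 in each coordinate (the full-scale version on the next slide uses 10):

    def flip(vectors, top):
        """Replace each coordinate c by top - c."""
        return [tuple(top - c for c in v) for v in vectors]

    # Dualization output from this slide, flipped to recover the "blue" points:
    print(flip([(5, 0), (3, 2), (2, 3)], top=5))  # [(0, 5), (2, 3), (3, 2)]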
Slide 6: Minimal Difference Vectors → Dualization → Maximal Error Vectors
- The output of the dualization algorithm is another set of vectors.
- For each coordinate in these vectors, take the complement (i.e., 10 minus the coordinate).
- Divide each vector by 2 to obtain a maximal error vector.
- Sort these error vectors by greatest sum, most 5's, maximal minimum element, etc.
- Choose an epsilon from the sorted lists that looks good.
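A Python sketch of these steps, assuming the dualization output has been read into a list of 9-dimensional integer vectors; the example vector below is made up so that its result matches the error tolerance chosen on the next slide:

    def candidate_errors(dual_output, scale=10):
        """Complement each coordinate (scale - c), then divide by 2."""
        return [tuple((scale - c) / 2 for c in v) for v in dual_output]

    def ranked(errors):
        """Sort the candidates by the criteria listed on this slide."""
        return {
            "greatest sum":        sorted(errors, key=sum, reverse=True),
            "most 5's":            sorted(errors, key=lambda e: e.count(5.0), reverse=True),
            "maximal min element": sorted(errors, key=min, reverse=True),
        }

    # A made-up dualization output vector:
    print(candidate_errors([(9, 0, 10, 0, 10, 10, 0, 0, 0)]))
    # [(0.5, 5.0, 0.0, 5.0, 0.0, 0.0, 5.0, 5.0, 5.0)]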
Slide 7: Binarization
- For a good error vector, we binarize the original data.
- Example error: (0.5, 5, 0, 5, 0, 0, 5, 5, 5).
- Thresholds: Col 1 = 4, Col 3 = 7, Col 5 = 5, Col 6 = 8.
- For each patient vector we test the thresholded columns: a value more than the error above the threshold becomes 1, a value more than the error below becomes 0, and a value within the error becomes *.
- Patient (5, 1, 1, 1, 2, 1, 3, 1, 1) binarizes to 1 0 0 0.
- Patient (4, 1, 1, 1, 2, 1, 3, 1, 1) binarizes to * 0 0 0 (its first value, 4, is within the 0.5 error of the threshold 4).
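A Python sketch of this binarization rule, assuming a value more than the error above the threshold maps to 1, more than the error below maps to 0, and anything within the error maps to *:

    def binarize(patient, thresholds, error):
        """thresholds maps a 0-based column index to its threshold value."""
        bits = []
        for col, t in sorted(thresholds.items()):
            v, eps = patient[col], error[col]
            if v > t + eps:
                bits.append("1")
            elif v < t - eps:
                bits.append("0")
            else:
                bits.append("*")  # within the error tolerance
        return " ".join(bits)

    error = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
    thresholds = {0: 4, 2: 7, 4: 5, 5: 8}  # 1-based columns 1, 3, 5, 6
    print(binarize((5, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, error))  # 1 0 0 0
    print(binarize((4, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, error))  # * 0 0 0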
Slide 8: Results for Wisconsin Breast Cancer Data
Error tolerance: (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
Attributes and corresponding thresholds: 1 : 4, 3 : 7, 5 : 5, 6 : 8
Total positive/negative entries per binarized pattern:
1 0 0 0 : 96, 21 (162, 23)
1 0 0 1 : 1, 37 (1, 41)
0 1 1 1 : 0, 6 (0, 6)
1 0 1 0 : 2, 16 (3, 17)
1 0 1 1 : 1, 27 (1, 28)
1 1 0 0 : 1, 12 (1, 14)
1 1 0 1 : 0, 19 (0, 21)
1 1 1 0 : 0, 24 (0, 24)
1 1 1 1 : 2, 52 (2, 52)
0 0 0 0 : 268, 2 (334, 4)
0 0 0 1 : 1, 3 (1, 7)
0 0 1 0 : 5, 2 (6, 3)
0 0 1 1 : 0, 4 (0, 5)
0 1 0 1 : 0, 2 (0, 4)
Slide 9: Formula for the WBCD
Let P = (Col 1 > 4), Q = (Col 3 > 7), R = (Col 5 > 5), S = (Col 6 > 8).
Then we can characterize most (432/444) positives with the term -Q -R -S (i.e., ¬Q ∧ ¬R ∧ ¬S).
Some example patient vectors:
Negatives: (8,7,5,10,7,9,5,5,4), (7,4,6,4,6,1,4,3,1)
Positives: (4,1,1,1,2,1,2,1,1), (4,1,1,1,2,1,3,1,1)
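A Python sketch that evaluates this term on the example vectors above, assuming Q, R, and S are the "value exceeds its threshold" tests defined on this slide; the error tolerance is ignored here for simplicity:

    def term(patient):
        """not-Q and not-R and not-S: Col 3 <= 7, Col 5 <= 5, Col 6 <= 8."""
        q = patient[2] > 7  # Q: Col 3 exceeds its threshold
        r = patient[4] > 5  # R: Col 5 exceeds its threshold
        s = patient[5] > 8  # S: Col 6 exceeds its threshold
        return not q and not r and not s

    negatives = [(8, 7, 5, 10, 7, 9, 5, 5, 4), (7, 4, 6, 4, 6, 1, 4, 3, 1)]
    positives = [(4, 1, 1, 1, 2, 1, 2, 1, 1), (4, 1, 1, 1, 2, 1, 3, 1, 1)]
    print([term(p) for p in negatives])  # [False, False]
    print([term(p) for p in positives])  # [True, True]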
Slide 10: More To Do
- Test our procedure on different databases.
- Study heuristic methods for threshold selection.
- In general, explore ways to use more flexible error vectors and/or thresholds.

References:
[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A. "Logical Analysis of Numerical Data," Math. Programming, 79 (1997), 163-190.
[2] Boros, E., Ibaraki, T., and Makino, K. "Variations on Extending Partially Defined Boolean Functions with Missing Bits," June 6, 2000.
[3] Boros, E. Dualization code, http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html, July 1, 2005.
[4] Mangasarian, O.L. and Wolberg, W.H. "Cancer diagnosis via linear programming," SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.
[5] Rudell, Richard. Espresso Boolean Minimization, http://www.csc.uvic.ca/~csc485c/espresso/instructions.html, July 18, 2005.