Pfizer HTS Machine Learning Algorithms: November 2002
Paul Hsiung (hsiung+@cs.cmu.edu)
Paul Komarek (komarek@cs.cmu.edu)
Ting Liu (tingliu@cs.cmu.edu)
Andrew W. Moore (awm@cs.cmu.edu)
Auton Lab, Carnegie Mellon University School of Computer Science
www.autonlab.org
Datasets

| Our Name | Num. Records | Num. Attributes | Num. non-zero input cells | Num. positive outputs | Description |
|---|---|---|---|---|---|
| train1 | 26,733 | 6,348 | 3.7M | 804 | The original dataset sent to CMU in Feb 2002. |
| test1 | 1,456 | 6,121 | 0.2M | 878 | The test set associated with the above training set. |
| jun-3-1 | 88,358 | 1,143,054 | 30M | 423 | The large “TEST3” dataset sent to us in May 2002. The “-1” at the end denotes that we were using the first of the four activation columns. |
| combined | | | | 211 | Combining the “TEST3” datasets. The activation in combined is positive if and only if at least two of the four original activations were positive. |
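The rule for the combined activation can be written as a short check. This is only a minimal sketch: the function name `combined_label`, the `min_positive` parameter, and the sample rows are hypothetical and do not come from the datasets.

```python
def combined_label(activations, min_positive=2):
    """Return 1 if at least `min_positive` of the activation flags are positive."""
    positives = sum(1 for a in activations if a)
    return 1 if positives >= min_positive else 0


# Hypothetical rows, each holding the four original activation columns.
rows = [
    (1, 0, 0, 0),  # one positive activation: combined is negative
    (1, 1, 0, 0),  # two positive activations: combined is positive
    (0, 0, 0, 0),
    (1, 1, 1, 1),
]
labels = [combined_label(r) for r in rows]
print(labels)
```

Requiring agreement between at least two columns is consistent with the table, where combined has fewer positives (211) than the first activation column alone (423).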
Projections

| Name given to original | Name given to 100-dimensional projection | Name given to 10-dimensional projection |
|---|---|---|
| train1 | train100 | train10 |
| test1 | test100 | test10 |
| | train-pls-100 | train-pls-10 |
| | test-pls-100 | test-pls-10 |
| jun-3-1 | n/a | |
| combined | | |
Previous Algorithms

BC (Bayes Classifier): On the original data, a naïve categorical classifier was used. On the real-valued projected data, a naïve Gaussian classifier was used.
Dtree (Decision Tree): This technique is also known as Recursive Partitioning and CART. It was only implemented for the original data.
SVM (Support Vector Machine): Except where stated otherwise, a linear SVM was used. We could not find a significant performance difference between the linear SVM and a Radial Basis Function (RBF) SVM with a variety of RBF parameters.
k-NN (k-nearest neighbor): Except where stated otherwise, k=9 neighbors were used. Only implemented for projected data.
LR (Logistic Regression): Except where stated otherwise, Conjugate Gradient was used to perform the intermediate weighted regressions, using a newly developed technique.
New Algorithms

new-KNN (Tractable high-dimensional k-nearest neighbor): Can work on the 1,000,000-dimensional “June” data.
EFP (Explicit False Positive Logistic Regression): Logistic regression that accounts for the high false positive rate.
SMod (Super Model): Automatically combines the predictions from multiple algorithms with a “meta-level” of logistic regression.
PLS-proj (Partial Least Squares Projection): Uses PLS instead of PCA to project down the data.
PLS (Partial Least Squares Prediction): Uses the PLS algorithm as a predictor.
Explicit False Positive Model
Example in 2 dimensions (figure sequence; TP = true positives, TN = true negatives, FP = false positives):
Decision boundary
100 true positives
100 true positives and 100 true negatives
100 TP, 100 TN, 10 FP
Using regular logistic regression
Using the EFP model

Example with 10000 points per class (figure sequence):
10000 true positives
10000 true positives, 10000 true negatives
10000 TP, 10000 TN, 1000 FP
Using regular logistic regression
Using the EFP model
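The equations on the EFP slides did not survive in this text, so the following is only a sketch of one plausible formulation, not the authors' model. It assumes that an inactive compound is still recorded as positive with a fixed probability `fp_rate`, so that P(observed positive | x) = fp_rate + (1 - fp_rate) * sigmoid(w·x). The names `observed_positive_prob`, `fit`, `FP_RATE`, the cluster positions, and the plain gradient-ascent optimizer are all assumptions made for illustration. The synthetic data mirrors the 100 TP, 100 TN, 10 FP example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def observed_positive_prob(z, fp_rate):
    # Assumed EFP form: inactive points still read positive with probability fp_rate.
    return fp_rate + (1.0 - fp_rate) * sigmoid(z)


def log_likelihood(weights, xs, ys, fp_rate):
    # Mean log-likelihood of the observed labels under the assumed model.
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(w * v for w, v in zip(weights, x))
        p = min(max(observed_positive_prob(z, fp_rate), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(xs)


def fit(xs, ys, fp_rate, steps=500, step_size=0.1):
    # Gradient ascent on the mean log-likelihood.
    # With fp_rate = 0 this reduces to ordinary logistic regression.
    weights = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            z = sum(w * v for w, v in zip(weights, x))
            s = sigmoid(z)
            p = min(max(observed_positive_prob(z, fp_rate), 1e-12), 1.0 - 1e-12)
            # Chain rule: dL/dz = (y/p - (1-y)/(1-p)) * dp/dz, with dp/dz = (1-fp_rate)*s*(1-s).
            dz = (y / p - (1 - y) / (1.0 - p)) * (1.0 - fp_rate) * s * (1.0 - s)
            for j, v in enumerate(x):
                grad[j] += dz * v
        weights = [w + step_size * g / len(xs) for w, g in zip(weights, grad)]
    return weights


def cloud(cx, cy, n):
    # n points around (cx, cy); the leading 1.0 is the bias feature.
    return [[1.0, random.gauss(cx, 0.5), random.gauss(cy, 0.5)] for _ in range(n)]


random.seed(0)
xs = cloud(1.0, 1.0, 100) + cloud(-1.0, -1.0, 100) + cloud(-1.0, -1.0, 10)
ys = [1] * 100 + [0] * 100 + [1] * 10  # last 10 are inactive points recorded as positive
FP_RATE = 10.0 / 110.0  # assumed known here; it could also be estimated

w_lr = fit(xs, ys, fp_rate=0.0)
w_efp = fit(xs, ys, fp_rate=FP_RATE)
print("regular LR weights:", w_lr)
print("EFP weights:       ", w_efp)
```

In this sketch the false positive rate is a fixed input. Treating it as an additional learned parameter is a natural variation, but the slides do not say which approach was used.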
EFP Model Real Data Results: K-fold (figure)
EFP Effect: very impressive on Train1 / Test1 (figure, also shown with a log X-axis)
EFP Effect: unimpressive on jun31 / jun32 (figure)
Super Model
1. Divide the training set into Compartment A and Compartment B.
2. Learn each of N models on Compartment A.
3. Predict with each of the N models on Compartment B.
4. Learn the best weighting of opinions with a logistic regression of the predictions on Compartment B.
5. Apply the models and their weights to the test data.
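The Super Model procedure can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the lab's implementation: the base learners here are deliberately trivial single-feature threshold models standing in for BC, SVM, k-NN and so on, and the names `train_threshold_model`, `train_super_model`, and `fit_logistic` are hypothetical. The even split into compartments and the gradient-ascent logistic regression are also assumptions.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, steps=300, step_size=0.1):
    # Gradient-ascent logistic regression; each row starts with a 1.0 bias term.
    w = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi + step_size * g / len(rows) for wi, g in zip(w, grad)]
    return w


def train_threshold_model(xs, ys, feature):
    # Toy base learner: score = feature value minus the midpoint of the class means.
    pos = [x[feature] for x, y in zip(xs, ys) if y == 1]
    neg = [x[feature] for x, y in zip(xs, ys) if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x[feature] - mid


def train_super_model(train_x, train_y, trainers):
    # Divide the training set into Compartment A and Compartment B.
    half = len(train_x) // 2
    a_x, a_y = train_x[:half], train_y[:half]
    b_x, b_y = train_x[half:], train_y[half:]
    # Learn each model on Compartment A.
    models = [trainer(a_x, a_y) for trainer in trainers]
    # Predict with each model on Compartment B.
    b_preds = [[1.0] + [m(x) for m in models] for x in b_x]
    # Learn the weighting of opinions by logistic regression on those predictions.
    weights = fit_logistic(b_preds, b_y)

    def predict(x):
        # Apply the models and their weights to a new point.
        opinions = [1.0] + [m(x) for m in models]
        return sigmoid(sum(w * v for w, v in zip(weights, opinions)))

    return predict


random.seed(1)
data = [([random.gauss(1.0, 0.5), random.gauss(1.0, 0.5)], 1) for _ in range(100)]
data += [([random.gauss(-1.0, 0.5), random.gauss(-1.0, 0.5)], 0) for _ in range(100)]
random.shuffle(data)
train_x = [x for x, _ in data]
train_y = [y for _, y in data]

trainers = [
    lambda xs, ys: train_threshold_model(xs, ys, 0),
    lambda xs, ys: train_threshold_model(xs, ys, 1),
]
predict = train_super_model(train_x, train_y, trainers)
print(predict([2.0, 2.0]), predict([-2.0, -2.0]))
```

Fitting the meta-level weights on Compartment B rather than Compartment A means the weights are learned from predictions on data the base models did not see during training.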
Comparison (figure, also shown with a log X-axis scale)
Comparison on 100-dims (figure, also shown with a log X-axis)
Comparison on 10 dims (figure, also shown with a log X-axis)
NewKNN summary of results and timings
PLS summary of results
PLS projections did not do so well.
However, PLS as a predictor performed well, especially under train100/test100.
PLS is fast: the runtime varies from 1 to 10 minutes.
But PLS takes large amounts of memory. It is impossible to use with a sparse representation (this is due to the update on each iteration).
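One plausible reading of "the update on each iteration", assumed here rather than stated in the slides, is the deflation step of a NIPALS-style PLS, where the data matrix X is replaced by X - t pᵀ after each component. The sketch below runs a single uncentered PLS1 step on a tiny hypothetical matrix and counts non-zero cells before and after. The function names and the toy data are illustrative only.

```python
import math


def first_pls_component(X, y):
    """One PLS1 step without centering: returns weights, scores, loadings, deflated X."""
    n_rows, n_cols = len(X), len(X[0])
    # Weight vector w is proportional to X^T y, scaled to unit length.
    w = [sum(X[i][j] * y[i] for i in range(n_rows)) for j in range(n_cols)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores t = X w.
    t = [sum(X[i][j] * w[j] for j in range(n_cols)) for i in range(n_rows)]
    tt = sum(v * v for v in t)
    # Loadings p = X^T t / (t^T t).
    p = [sum(X[i][j] * t[i] for i in range(n_rows)) / tt for j in range(n_cols)]
    # Deflation: X <- X - t p^T. This rank-one update fills in cells that were zero.
    deflated = [[X[i][j] - t[i] * p[j] for j in range(n_cols)] for i in range(n_rows)]
    return w, t, p, deflated


def count_nonzero(M, tol=1e-12):
    return sum(1 for row in M for v in row if abs(v) > tol)


# Hypothetical sparse binary data: 4 records, 3 attributes.
X = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
y = [1.0, 0.0, 0.0, 1.0]

w, t, p, X_deflated = first_pls_component(X, y)
print("non-zero cells before deflation:", count_nonzero(X))
print("non-zero cells after deflation: ", count_nonzero(X_deflated))
```

Mean-centering the columns, which PLS implementations commonly do first, would also turn zero cells into non-zero ones, so the same memory concern applies before the first iteration.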
Summary of results
SVM is best early on in Train1; LR is better in the long haul.
Projecting to 10-d is always a disaster.
Projecting to 100-d is often indistinguishable from behavior with the original data (and much cheaper).
Naïve Gaussian Bayes Classifier is best on JUN-3-1 (k-NN is better for the long haul).
Naïve Gaussian Bayes Classifier is best on combined.
Non-linear SVM never seems distinguishable from linear SVM.
All methods have won in at least one context, except Dtree.
Some AUC Results

| Experiment | Algorithm | AUC |
|---|---|---|
| Train on Train1 then test on Test1 | Linear SVM | 0.876* |
| | Best non-linear SVM | 0.875* |
| | BC | 0.867* |
| | LR | 0.71 |
| | KNN | 0.872* |
| | DTree | 0.70 |
| Combined | SVM | 0.638 |
| | | 0.700 |
| | | 0.606 |
| | | 0.603 |

* = Not statistically significantly different
Some AUC Results

| Experiment | Algorithm | AUC |
|---|---|---|
| 10-fold cross-validation on Train1 | Linear SVM | 0.919 |
| | BC | 0.885 |
| | LR | 0.933 |
| | DTree | 0.894 |
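For readers who want to reproduce numbers of this kind, the sketch below computes AUC from classifier scores, assuming AUC here means the area under the ROC curve. It uses the equivalent pairwise definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function name `auc` and the sample scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores from some classifier, with the true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # 8 of the 9 positive/negative pairs are ordered correctly
```

A score that ranks every positive above every negative gives 1.0, and a constant score gives 0.5.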