MMS Software Deliverables: Year 1. Paul Kantor, Dave Lewis. Presentation for MMS Site Visit, DIMACS, Rutgers Univ., 26-Feb-2004.
Outline. Overview of software deliverables. Software deliverables: Adaptive Filtering (Rocchio, Centroid, kNN); BBR (Bayesian logistic regression); libAML (cPCA, tuned SVM); Homotopy; Fusion.
MMS Deliverables. Research software (focus of this talk). Source code. Experimenter-oriented documentation. Experimental results. Insights.
Research Software: Design. Flexibly parameterized. Easy to experiment with many variations, including ones we found didn’t work. Assumes all data provided at start of run. Often incorporates evaluation code. Looks at test data judgments only after run. Simulates processing a stream of incoming data; can’t accept real-time input.
Research Software: Robustness. Alpha and beta testing by sophisticated nondevelopers. Test cases and regression testing not systematic. Abnormal conditions (missing files, etc.) not always handled gracefully. Assumes data has been converted to appropriate format.
Research Software: Usability. Provided as source code (C++). Some shell, awk, and Java code for running experiments and preprocessing data. Documentation assumes a sophisticated, experimentation-oriented user. I/O formats vary among packages (driven by data), according to the needs of different experiments. Uses libraries and other code with license conditions. [Speaker note: say something about data sets (driving I/O formats).]
“Components” of Filtering. MMS project oriented around five abstract filtering “components”: Compression, Representation, Matching, Learning, Fusion.
One Program Sometimes Handles Several Components. One algorithm may accomplish goals of multiple “components”. Component processing may be needed in conditional, multiple, or iterative fashion. Running large numbers of experiments efficiently sometimes requires incorporating several components in the same program.
1. Adaptive Filtering Software (AFS). End-to-end system. Compression: simple feature selection (in Rocchio and Centroid). Representation: classic term weighting. Matching: efficient finding of nearest neighbors. Learning: Rocchio, Centroid, kNN. Fusion: none. Note to MMS people: Throughout this presentation I’m discussing feature selection as a “compression” technique, rather than a representation technique. This is completely arbitrary.
AFS Software. Available as source code (C++). CMU Lemur toolkit used for preprocessing and storage of documents. Can simulate batch and adaptive filtering environments. All documents must be provided at start of run. Incorporates evaluation code. [Speaker note: waiting to hear whether Vladimir is providing a Linux executable as well.]
AFS: Rocchio. Classic learning algorithm for text classification. Implementation designed for testing of many design choices discussed in literature: pseudofeedback (handling of unjudged data); thresholding (adaptive and batch); feature selection; term weighting.
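As a rough illustration of the Rocchio idea (not the AFS implementation, which is C++ and far more configurable), the sketch below builds a prototype vector from judged positive and negative documents and scores new documents against it. The function names, the `beta` and `gamma` weights, the acceptance threshold, and the toy documents are all assumptions made for this example.

```python
from collections import defaultdict


def rocchio_prototype(positives, negatives, beta=1.0, gamma=0.25):
    """Build a Rocchio prototype from sparse {term: weight} documents.

    beta and gamma are illustrative defaults, not values taken from AFS.
    """
    proto = defaultdict(float)
    for doc in positives:
        for term, w in doc.items():
            proto[term] += beta * w / len(positives)
    for doc in negatives:
        for term, w in doc.items():
            proto[term] -= gamma * w / len(negatives)
    # Common convention: keep only terms with positive weight.
    return {t: w for t, w in proto.items() if w > 0}


def score(proto, doc):
    """Dot product between the prototype and a sparse document."""
    return sum(w * doc.get(t, 0.0) for t, w in proto.items())


# Toy judged data (hypothetical).
pos = [{"svm": 1.0, "kernel": 1.0}, {"svm": 1.0, "margin": 1.0}]
neg = [{"football": 1.0, "margin": 1.0}]

proto = rocchio_prototype(pos, neg)
on_topic = score(proto, {"svm": 1.0, "kernel": 1.0})
off_topic = score(proto, {"football": 1.0})
accept = on_topic >= 0.5  # the threshold value is hypothetical
```

Design choices such as feature selection and term weighting would plug in before `rocchio_prototype`, and thresholding would replace the fixed cutoff on the last line.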
AFS: Centroid. Similar to Rocchio, but emphasizing contrast between positive and negative examples. Same design choices can be explored as for Rocchio.
AFS: kNN. Classic pattern recognition algorithm: put test document in same categories as most similar training documents. Experiments focused on Matching (reducing space/time to find neighbors) and Learning (adjusting neighborhood size, weighting, thresholds).
AFS kNN Matching Capabilities. Exact scoring with inverted index (already faster than exhaustive matching). Approximate scoring with inverted index: training document pruning prior to indexing; test document pruning; classification-time inverted list pruning. Random projections: several variants of theory-motivated method.
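To make "exact scoring with inverted index" concrete, here is a minimal sketch, assuming documents are sparse {term: weight} dictionaries and similarity is a dot product. Only training documents that share a term with the query are ever touched, which is where the speedup over exhaustive matching comes from. The names `build_index`, `nearest_neighbors`, and `knn_classify`, the similarity-weighted vote, and the toy data are assumptions for this example, not the AFS code.

```python
from collections import defaultdict
import heapq


def build_index(train_docs):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(train_docs):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index


def nearest_neighbors(index, query, k):
    """Exact dot-product scores, accumulated only over query terms."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, dw in index.get(term, ()):
            scores[doc_id] += qw * dw
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


def knn_classify(index, labels, query, k):
    """Similarity-weighted vote among the k nearest training documents."""
    votes = defaultdict(float)
    for doc_id, sim in nearest_neighbors(index, query, k):
        votes[labels[doc_id]] += sim
    return max(votes, key=votes.get) if votes else None


# Toy training set (hypothetical).
train = [
    {"svm": 1.0, "kernel": 1.0},
    {"svm": 1.0, "margin": 1.0},
    {"goal": 1.0, "match": 1.0},
]
labels = ["ml", "ml", "sport"]
index = build_index(train)

neighbors = nearest_neighbors(index, {"svm": 1.0, "kernel": 1.0}, k=2)
predicted = knn_classify(index, labels, {"goal": 1.0}, k=2)
```

The approximate variants listed on the slide would prune `train_docs`, the query, or the posting lists before or during the accumulation loop.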
AFS kNN Learning Capabilities. Classic and two weighted kNN variants. Cross-validation based thresholding, optimizing user-specified effectiveness measure. Adaptive filtering (incorporating training data as seen). Pseudofeedback not yet supported.
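The core step of thresholding for an effectiveness measure can be sketched as follows, assuming held-out classifier scores with 0/1 relevance judgments and using F-beta as the example measure. A cross-validated version would repeat this on each fold and combine the chosen cutoffs; that part is omitted. `f_beta`, `best_threshold`, and the toy scores are hypothetical names and data for this sketch.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def best_threshold(scores, labels, measure=f_beta):
    """Return (cutoff, value): accept documents with score >= cutoff.

    labels are 0/1 judgments. A cutoff of infinity means "accept nothing".
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    best_value, best_cut = measure(0, 0, total_pos), float("inf")
    tp = fp = 0
    for i, (s, y) in enumerate(ranked):
        tp += y
        fp += 1 - y
        # Only place a cutoff between distinct score values.
        if i + 1 < len(ranked) and ranked[i + 1][0] == s:
            continue
        value = measure(tp, fp, total_pos - tp)
        if value > best_value:
            best_value, best_cut = value, s
    return best_cut, best_value


# Toy held-out scores and judgments (hypothetical).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
cut, value = best_threshold(scores, labels)
```

Passing a different `measure` function is how a user-specified effectiveness measure would be swapped in.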
2. BBR (Bayesian Binary Regression). End-to-end system. Compression: feature selection, sparseness-inducing Bayesian priors. Representation: some classic term weighting. Matching: only linear classifier application. Learning: Bayesian logistic regression, thresholding. Fusion: none (though logistic regression is a technique that can be used for fusion).
BBR Software. C++ source. Two programs: (1) train classifier on judged data, produce classifier; (2) apply classifier to judged data, produce classified data and evaluate classification accuracy. Assumes inputs formatted as sparse vectors. Uses code with GNU GPL license. Zhang-Oles patent may apply.
BBR Algorithms. Logistic regression: best (tied w/ several) supervised learning algorithm for text classification. Bayesian priors help avoid overfitting. Gaussian: favors dense classifier. Laplace: favors sparse classifier (few nonzero weights). Value of prior chosen by user or by cross-validation. Thresholding for user-specified effectiveness measure. Optional feature selection.
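The contrast between the two priors can be seen in a small sketch of MAP logistic regression, where a Gaussian prior corresponds to an L2 penalty and a Laplace prior to an L1 penalty. This is not BBR's optimizer: it uses plain (proximal) gradient descent on dense lists for brevity, and `fit_map_logistic`, `strength`, `lr`, `epochs`, and the toy data are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Shrink w toward zero by t; values within t of zero become zero."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_map_logistic(X, y, prior="laplace", strength=0.15, lr=0.1, epochs=500):
    """MAP logistic regression: mean log-loss plus a prior penalty.

    prior="gaussian" adds (strength / 2) * sum(w^2)   (L2 penalty);
    prior="laplace"  adds  strength * sum(|w|)        (L1 penalty),
    handled with a proximal (soft-thresholding) step.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / n
        if prior == "gaussian":
            w = [wj - lr * (gj + strength * wj) for wj, gj in zip(w, grad)]
        else:
            w = [soft_threshold(wj - lr * gj, lr * strength)
                 for wj, gj in zip(w, grad)]
    return w


# Toy data (hypothetical): feature 0 determines the label, feature 1 does not.
X = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [1, 1, 0, 0]

w_laplace = fit_map_logistic(X, y, prior="laplace")
w_gaussian = fit_map_logistic(X, y, prior="gaussian")
```

On this toy data the Laplace fit leaves the weight of the weakly useful second feature at exactly zero, while the Gaussian fit gives it a small nonzero weight, which is the dense-versus-sparse behavior the slide describes.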
3. libAML. Library plus programs based on it. Compression: cPCA. Representation: classic term weighting. Matching: only linear classifier application. Learning: aiSVM. Fusion: none.
libAML Software. C++ source. Two programs that use the libAML library: dataFilterAndFeatureSelector (term weighting; shrinks vectors using cPCA feature selection) and aiSVM (train, apply, evaluate classifier). Assumes sparse vector input. Utility provided to convert from Lemur format. Uses SVM_Light (noncommercial use only).
libAML Algorithms. cPCA: select high quality subset of features by simultaneous clustering of documents and features. aiSVM: SVM approach produces highly effective linear text classifiers; aiSVM allows tuning to user effectiveness needs.
4. Homotopy. End-to-end system. Compression: simple feature selection. Representation: classic term weighting. Matching: linear classifier application. Learning: Rocchio (explores variations in parameter settings of Rocchio). Fusion: none.
Homotopy Software and Algorithm. Built on early version of AFS and behaves similarly, but only includes Rocchio. Purpose is to investigate alternate parameterizations and variants of Rocchio. Separate program for evaluating classification results.
5. Fusion Code. Collection of scripts. Compression: none. Representation: none. Matching: none. Learning: none. Fusion: techniques for combining outputs of multiple classifiers for same task.
Fusion Software and Algorithms. Collection of scripts in shell, awk, GNU Octave, and R. Input is list of scores/class labels assigned by other classifiers to documents. Several fusion algorithms: affine, linear, logistic, centroid.
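The slides do not define the individual fusion algorithms, so the sketch below only shows the general shape of score fusion under one assumed reading: a weighted sum of per-classifier scores (linear), optionally with an intercept (affine). The function name `fuse_linear`, the weights, and the toy scores are hypothetical; the actual MMS scripts are written in shell, awk, GNU Octave, and R.

```python
def fuse_linear(score_lists, weights, intercept=0.0):
    """Combine classifier scores per document.

    score_lists[i][d] is classifier i's score for document d.
    A nonzero intercept gives an affine rather than purely linear rule.
    """
    n_docs = len(score_lists[0])
    return [
        intercept + sum(w * scores[d] for w, scores in zip(weights, score_lists))
        for d in range(n_docs)
    ]


# Toy scores from two upstream classifiers on three documents (hypothetical).
rocchio_scores = [0.9, 0.2, 0.6]
knn_scores = [0.7, 0.4, 0.1]

fused = fuse_linear([rocchio_scores, knn_scores], weights=[0.5, 0.5])
# Document indices ordered from highest to lowest fused score.
ranking = sorted(range(len(fused)), key=lambda d: fused[d], reverse=True)
```

In practice the weights would be fit on judged data; fitting them with logistic regression is one way to obtain a logistic fusion rule.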