MMS Software Deliverables: Year 1


Paul Kantor, Dave Lewis
Presentation for MMS Site Visit, DIMACS, Rutgers Univ., 26-Feb-2004

Outline
- Overview of software deliverables
- Software deliverables:
  - Adaptive Filtering (Rocchio, Centroid, kNN)
  - BBR (Bayesian logistic regression)
  - libAML (cPCA, tuned SVM)
  - Homotopy
  - Fusion

MMS Deliverables
- Research software (focus of this talk)
- Source code
- Experimenter-oriented documentation
- Experimental results
- Insights

Research Software: Design
- Flexibly parameterized: easy to experiment with many variations, including ones we found didn't work
- Assumes all data is provided at the start of a run; simulates processing a stream of incoming data, but can't accept real-time input
- Often incorporates evaluation code; looks at test data judgments only after the run

Research Software: Robustness
- Alpha and beta testing by sophisticated nondevelopers
- Test cases and regression testing not systematic
- Abnormal conditions (missing files, etc.) not always handled gracefully
- Assumes data has been converted to the appropriate format

Research Software: Usability
- Provided as source code (C++), with some shell, awk, and Java code for running experiments and preprocessing data
- Documentation assumes a sophisticated, experimentation-oriented user
- I/O formats vary among packages, according to the needs of different experiments and data sets
- Uses libraries and other code with license conditions

“Components” of Filtering
The MMS project is oriented around five abstract filtering “components”:
- Compression
- Representation
- Matching
- Learning
- Fusion

One Program Sometimes Handles Several Components
- One algorithm may accomplish the goals of multiple “components”
- Component processing may be needed in a conditional, multiple, or iterative fashion
- Running large numbers of experiments efficiently sometimes requires incorporating several components in the same program

1. Adaptive Filtering Software (AFS)
End-to-end system:
- Compression: simple feature selection (in Rocchio and Centroid)
- Representation: classic term weighting
- Matching: efficient finding of nearest neighbors
- Learning: Rocchio, Centroid, kNN
- Fusion: none
Note to MMS people: throughout this presentation, feature selection is discussed as a “compression” technique rather than a representation technique. This choice is completely arbitrary.

AFS Software
- Available as source code (C++)
- CMU Lemur toolkit used for preprocessing and storage of documents
- Can simulate batch and adaptive filtering environments
- All documents must be provided at the start of a run
- Incorporates evaluation code
- (Pending: confirm whether Vladimir is providing a Linux executable as well)

AFS: Rocchio
- Classic learning algorithm for text classification
- Implementation designed for testing many design choices discussed in the literature:
  - Pseudofeedback (handling of unjudged data)
  - Thresholding (adaptive and batch)
  - Feature selection
  - Term weighting
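
The Rocchio idea can be sketched in a few lines. This is a minimal, hypothetical illustration (not the AFS implementation): a class prototype is built as a weighted positive centroid minus a weighted negative centroid, and a test document is scored by cosine similarity to it. The `beta`/`gamma` values are illustrative defaults, not parameters from the MMS code.

```python
# Hypothetical Rocchio-style classifier sketch, not the AFS code.
from collections import Counter
from math import sqrt

def centroid(docs):
    """Mean term-frequency vector of a list of token lists."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return {t: c / len(docs) for t, c in total.items()}

def rocchio_prototype(pos, neg, beta=16.0, gamma=4.0):
    # Prototype = beta * centroid(pos) - gamma * centroid(neg)
    proto = {t: beta * w for t, w in centroid(pos).items()}
    for t, w in centroid(neg).items():
        proto[t] = proto.get(t, 0.0) - gamma * w
    return proto

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

pos = [["stocks", "market", "trade"], ["market", "shares", "trade"]]
neg = [["football", "match", "goal"]]
proto = rocchio_prototype(pos, neg)
# A finance-like document scores positive, a sports-like one negative.
print(cosine(Counter(["market", "trade"]), proto) > 0)   # True
print(cosine(Counter(["football", "goal"]), proto) < 0)  # True
```

Thresholding the cosine score (the design choice listed above) then turns the score into a yes/no classification decision.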

AFS: Centroid
- Similar to Rocchio, but emphasizes the contrast between positive and negative examples
- The same design choices can be explored as for Rocchio

AFS: kNN
- Classic pattern recognition algorithm: put the test document in the same categories as the most similar training documents
- Experiments focused on:
  - Matching: reducing the space/time needed to find neighbors
  - Learning: adjusting neighborhood size, weighting, thresholds
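
The basic kNN decision rule can be sketched as follows. This is a hypothetical illustration, not the AFS implementation: the test document receives the majority label among its `k` most cosine-similar training documents.

```python
# Hypothetical kNN text-classification sketch, not the AFS code.
from collections import Counter
from math import sqrt

def cos(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = sqrt(sum(c * c for c in a.values()))
    nb = sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_doc, train, k=3):
    """train: list of (token_counts, label); returns the majority label
    among the k training documents most similar to test_doc."""
    test = Counter(test_doc)
    neighbors = sorted(train, key=lambda d: cos(test, d[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [(Counter(["ball", "goal", "team"]), "sport"),
         (Counter(["goal", "match"]), "sport"),
         (Counter(["stocks", "market"]), "finance")]
print(knn_classify(["goal", "team"], train, k=1))  # sport
```

The weighted-kNN variants mentioned below would replace the majority vote with a similarity-weighted vote.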

AFS kNN Matching Capabilities
- Exact scoring with an inverted index (already faster than exhaustive matching)
- Approximate scoring with an inverted index:
  - Training document pruning prior to indexing
  - Test document pruning
  - Classification-time inverted list pruning
- Random projections: several variants of a theory-motivated method
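
Why inverted-index scoring beats exhaustive matching can be seen in a short sketch (hypothetical illustration, not the AFS code): only training documents that share at least one term with the test document are ever touched, which is a large saving on sparse text vectors.

```python
# Hypothetical inverted-index dot-product scoring sketch, not the AFS code.
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def score(index, query):
    """Accumulate dot products only over postings for the query's terms."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return dict(scores)

docs = {"d1": {"goal": 2, "team": 1}, "d2": {"market": 3}}
index = build_index(docs)
print(score(index, {"goal": 1}))  # {'d1': 2.0}
```

The pruning variants listed above trade exactness for speed by shortening the postings lists this loop walks over.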

AFS kNN Learning Capabilities
- Classic kNN and two weighted kNN variants
- Cross-validation-based thresholding, optimizing a user-specified effectiveness measure
- Adaptive filtering (incorporating training data as it is seen)
- (Pseudofeedback not yet supported)

2. BBR (Bayesian Binary Regression)
End-to-end system:
- Compression: feature selection; sparseness-inducing Bayesian priors
- Representation: some classic term weighting
- Matching: only linear classifier application
- Learning: Bayesian logistic regression, thresholding
- Fusion: none (though logistic regression is a technique that can be used for fusion)

BBR Software
- C++ source
- Two programs:
  - Train a classifier on judged data, producing a classifier
  - Apply a classifier to judged data, producing classified data and evaluating classification accuracy
- Assumes inputs are formatted as sparse vectors
- Uses code with a GNU GPL license
- Zhang-Oles patent may apply

BBR Algorithms
- Logistic regression: among the best (tied with several others) supervised learning algorithms for text classification
- Bayesian priors help avoid overfitting:
  - Gaussian: favors a dense classifier
  - Laplace: favors a sparse classifier (few nonzero weights)
- Value of the prior chosen by the user, or by cross-validation
- Thresholding for a user-specified effectiveness measure
- Optional feature selection
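
The role of the prior can be made concrete with a tiny MAP trainer. This is a hypothetical sketch, not the BBR code: a Gaussian prior on the weights is equivalent to an L2 penalty on the log-likelihood, so the gradient simply gains a weight-shrinking term; a Laplace prior would instead contribute an L1 penalty, driving many weights to exactly zero (the sparse classifier mentioned above). The learning rate and epoch count here are illustrative.

```python
# Hypothetical MAP logistic regression with a Gaussian prior (L2),
# not the BBR implementation.
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z)) if z > -30 else 0.0

def train(X, y, lam=0.1, lr=0.5, epochs=200):
    """Batch gradient ascent on log-likelihood minus (lam/2)*||w||^2."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [-lam * wj for wj in w]  # gradient of the Gaussian prior term
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            grad = [g + err * xj for g, xj in zip(grad, xi)]
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

X = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 0, 0]]  # column 0 is a bias term
y = [1, 1, 0, 0]
w = train(X, y)
predict = lambda x: sigmoid(sum(wj * xj for wj, xj in zip(w, x))) > 0.5
print([predict(x) for x in X])  # [True, True, False, False]
```

With a Laplace prior the `-lam * wj` term would become `-lam * sign(wj)`, whose constant pull toward zero is what produces exact zeros in the learned weights.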

3. libAML
Library plus programs based on it:
- Compression: cPCA
- Representation: classic term weighting
- Matching: only linear classifier application
- Learning: aiSVM
- Fusion: none

libAML Software
- C++ source
- Two programs that use the libAML library:
  - dataFilterAndFeatureSelector: term weighting; shrinks vectors using feature selection (cPCA)
  - aiSVM: trains, applies, and evaluates a classifier
- Assumes sparse vector input; a utility is provided to convert from Lemur format
- Uses SVM_Light (noncommercial use only)

libAML Algorithms
- cPCA: selects a high-quality subset of features by simultaneous clustering of documents and features
- aiSVM: an SVM approach that produces highly effective linear text classifiers, and allows tuning to user effectiveness needs

4. Homotopy
End-to-end system:
- Compression: simple feature selection
- Representation: classic term weighting
- Matching: linear classifier application
- Learning: Rocchio (explores variations in parameter settings of Rocchio)
- Fusion: none

Homotopy Software and Algorithm
- Built on an early version of AFS and behaves similarly, but includes only Rocchio
- Purpose is to investigate alternate parameterizations and variants of Rocchio
- Separate program for evaluating classification results

5. Fusion Code
Collection of scripts:
- Compression: none
- Representation: none
- Matching: none
- Learning: none
- Fusion: techniques for combining the outputs of multiple classifiers for the same task

Fusion Software and Algorithms
- Collection of scripts in shell, awk, GNU Octave, and R
- Input is a list of scores/class labels assigned by other classifiers to documents
- Several fusion algorithms: affine, linear, logistic, centroid
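
The linear variant is the simplest to sketch. This is a hypothetical illustration (in Python rather than the project's shell/awk/Octave/R scripts): each document's fused score is a weighted sum of the scores the individual classifiers assigned to it. Affine fusion would add a constant offset, and logistic fusion would fit a logistic regression on the component scores; the classifier names and weights below are made up for the example.

```python
# Hypothetical linear score-fusion sketch, not the MMS scripts.
def linear_fusion(score_lists, weights):
    """score_lists: one list of per-document scores per classifier;
    returns the weighted sum of the classifiers' scores per document."""
    return [sum(w * s for w, s in zip(weights, doc_scores))
            for doc_scores in zip(*score_lists)]

rocchio_scores = [3, 1, 2]   # illustrative per-document scores
knn_scores = [1, 2, 0]
fused = linear_fusion([rocchio_scores, knn_scores], weights=[1, 2])
print(fused)  # [5, 5, 2]
```

Choosing the weights (e.g., on held-out data) is where the different fusion algorithms listed above diverge.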