Mapping Between Taxonomies Elena Eneva 30 Oct 2001 Advanced IR Seminar
Idea Review
– By country: German, French
– By industry: Textile, Automobile
Learning Algorithms
– 2 separate learners for the documents
  – Old doc category -> new doc category
  – Doc contents -> new category
– Put together
  – Weighted average based on confidence
  – Final result determined by a decision tree
– One combined learner: uses both old category and contents as features
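The weighted-average option can be sketched as below. This is a minimal illustration, not the system's actual code: it assumes each learner returns a dictionary of class probabilities plus one confidence number, and the function name `combine_weighted` and the example values are hypothetical.

```python
def combine_weighted(pred_a, conf_a, pred_b, conf_b):
    """Blend two class-probability dicts, weighting each by its learner's confidence."""
    total = conf_a + conf_b
    if total == 0:
        # Neither learner reports any confidence: fall back to equal weights.
        w_a = w_b = 0.5
    else:
        w_a, w_b = conf_a / total, conf_b / total
    labels = set(pred_a) | set(pred_b)
    blended = {
        label: w_a * pred_a.get(label, 0.0) + w_b * pred_b.get(label, 0.0)
        for label in labels
    }
    best = max(blended, key=blended.get)
    return best, blended


# Hypothetical outputs for one document (values invented for illustration).
from_old_category = {"Textile": 0.7, "Automobile": 0.3}
from_contents = {"Textile": 0.2, "Automobile": 0.8}
label, scores = combine_weighted(from_old_category, 0.9, from_contents, 0.3)
```

Here the category-based learner is three times as confident as the content-based one, so its preferred label wins even though the content-based learner disagrees.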
Data Sets
Hoovers
– 4285 documents
– 28 categories
– 255 categories
Reuters 2001
– documents
– Topics
– Industry categories
Current System
– Simple Decision Tree (C4.5): learns probabilities of new categories based on old categories (doesn’t know about documents/words)
– Naïve Bayes (rainbow): word-based classification into the new categories (doesn’t know about old categories)
– Combination (Decision Tree): takes the outputs and confidences of the two, predicts new category
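To make the first component concrete, here is a rough sketch of learning new-category probabilities from old categories alone. It is not C4.5: a plain frequency table stands in for the decision tree, and the function names and the training pairs are hypothetical.

```python
from collections import Counter, defaultdict


def fit_category_map(pairs):
    """Estimate P(new category | old category) from (old, new) training pairs."""
    counts = defaultdict(Counter)
    for old, new in pairs:
        counts[old][new] += 1
    return {
        old: {new: n / sum(c.values()) for new, n in c.items()}
        for old, c in counts.items()
    }


def predict_new(model, old):
    """Return (most likely new category, its probability); (None, 0.0) if old is unseen."""
    dist = model.get(old)
    if not dist:
        return None, 0.0
    best = max(dist, key=dist.get)
    return best, dist[best]


# Hypothetical training pairs: old = by-country label, new = by-industry label.
train = [
    ("German", "Automobile"),
    ("German", "Automobile"),
    ("German", "Textile"),
    ("French", "Textile"),
]
model = fit_category_map(train)
guess, confidence = predict_new(model, "German")
```

The returned probability can serve as the confidence that the combination stage consumes alongside the Naïve Bayes output.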
Current Results
[Chart: Accuracy (%) for NB, DT, and Comb on training (tr) and test (te) sets, for the 28- and 255-category label sets; five-fold cross-validation. Plotted values not recoverable.]
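Five-fold cross-validation can be sketched as follows. This is a generic illustration under stated assumptions, not the evaluation script behind the chart: the helper names, the fixed shuffle seed, and the `fit`/`predict` callables are all hypothetical, and the data set must contain at least `k` items.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each item lands in exactly one test fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(items, labels, fit, predict, k=5):
    """Return the mean test accuracy (in percent) over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = fit([items[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(100.0 * correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# 4285 is the Hoovers document count given on the Data Sets slide.
splits = list(k_fold_indices(4285))
```

With 4285 documents, each of the five test folds holds 857 documents and the matching training set holds the other 3428.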
Work in Progress
– Naïve Bayes for 255 -> 28 (expect higher accuracies)
– Use one classifier only (taking both kinds of features: words & old categories): NB
– An additional single simple classifier: KNN (and SVM-Light, if there is time in the end)
– Run everything on Reuters 2001 (in addition to Hoovers)
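One way the single-classifier idea might look is sketched below: a small multinomial Naïve Bayes with add-one smoothing, where the old category enters as one extra pseudo-token next to the words. This is an assumption about how the features could be merged, not the planned implementation; the `OLD=` token scheme, the function names, and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def features(words, old_category):
    """Represent a document as its words plus one pseudo-token for the old category."""
    return list(words) + ["OLD=" + old_category]


def fit_nb(docs, labels):
    """Train multinomial Naive Bayes; each doc is a list of feature tokens."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest add-one-smoothed log posterior."""
    class_counts, token_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(token_counts[label].values())
        score = math.log(n / n_docs)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
docs = [
    features(["engine", "car"], "German"),
    features(["fabric", "cotton"], "French"),
    features(["car", "wheel"], "German"),
    features(["cotton", "loom"], "French"),
]
labels = ["Automobile", "Textile", "Automobile", "Textile"]
model = fit_nb(docs, labels)
prediction = predict_nb(model, features(["car"], "German"))
```

A single pseudo-token gives the old category the same weight as one word, so on long documents it may be swamped by the text; repeating or separately weighting that token is one possible adjustment.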
Comments? The end.