Mapping Between Taxonomies Elena Eneva 11 Dec 2001 Advanced IR Seminar
Mapping Between Taxonomies: taxonomies are formal systems of orderly classification of knowledge, designed for a specific purpose. Companies organize information in various ways (e.g. one taxonomy for marketing, another for product development).
Approach
[Figure sequence, built up over several slides: documents organized under two taxonomies, by country (German, French) and by industry (Textile, Automobile); later builds add the label "abc".]
Datasets
Two classification schemes:
- Reuters 2001 ( docs): Topics (127), Industry categories (871), Regions (376)
- Hoovers-255 and Hoovers-28 (4286 docs): industry categories (255) and industry categories (28)
Learning
Two separate methods of learning for the documents:
- Old doc category -> new doc category
- Doc contents -> new category
Combined method:
- Weighted average based on confidence
- Final result determined by a decision tree
One combined learner: uses both the old category and the contents as features
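The "weighted average based on confidence" item can be sketched as follows. This is only an illustration under assumptions: the slides do not say how confidences are defined or normalized, so the function name `combine_weighted`, the dictionary representation, and the example numbers are all hypothetical.

```python
def combine_weighted(label_probs, content_probs, label_conf, content_conf):
    """Confidence-weighted average of two distributions over new categories.

    label_probs / content_probs: dicts mapping new category -> probability,
    one from the label-based learner and one from the content-based learner.
    label_conf / content_conf: non-negative weights (assumed, not from the slides).
    """
    total = label_conf + content_conf
    if total == 0:
        raise ValueError("at least one confidence must be positive")
    categories = set(label_probs) | set(content_probs)
    combined = {
        c: (label_conf * label_probs.get(c, 0.0)
            + content_conf * content_probs.get(c, 0.0)) / total
        for c in categories
    }
    # The predicted new category is the one with the highest combined score.
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical outputs of the two simple learners for one document.
label_probs = {"Textile": 0.7, "Automobile": 0.3}
content_probs = {"Textile": 0.4, "Automobile": 0.6}
best, combined = combine_weighted(label_probs, content_probs, 0.9, 0.3)
```

With these made-up numbers the label-based learner is trusted three times as much as the content-based one, so its preference for "Textile" wins.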
Simple Learners
Simple Decision Tree (C4.5): learns probabilities of new categories based on one kind of feature:
- Old categories (doesn’t know about documents/words)
- Word-based classification (doesn’t know about old categories)
Naïve Bayes (rainbow):
- Old categories (doesn’t know about documents/words)
- Word-based classification (doesn’t know about old categories)
Support Vector Machine (SVM-Light):
- Word-based classification (doesn’t know about old categories), linear kernel [results will be reported in the final paper]
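To make the "old categories only" learner concrete, here is a minimal sketch that estimates the probability of each new category given the old one. It is not C4.5: a plain conditional frequency table stands in, on the assumption that a tree splitting once on a single nominal feature would hold similar class distributions at its leaves. The function name and the training pairs are hypothetical.

```python
from collections import Counter, defaultdict


def fit_category_mapper(pairs):
    """Estimate P(new category | old category) from (old, new) training pairs."""
    counts = defaultdict(Counter)
    for old, new in pairs:
        counts[old][new] += 1
    # Normalize the counts for each old category into a probability distribution.
    return {
        old: {new: n / sum(c.values()) for new, n in c.items()}
        for old, c in counts.items()
    }


# Made-up training data: documents labeled by country (old) and industry (new).
training_pairs = [
    ("German", "Automobile"),
    ("German", "Automobile"),
    ("German", "Textile"),
    ("French", "Textile"),
]
mapper = fit_category_mapper(training_pairs)
```

This learner never looks at the document text, which matches the "doesn’t know about documents/words" note above.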
Learning
[Figure: two panels, "Using the document content" and "Using the document labels"; diagram labels: abc; DT, NB, SVM]
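The content-based path (predicting the new category from the words alone) can be illustrated with a small multinomial Naive Bayes. This is a sketch, not the rainbow implementation used in the experiments; the add-one smoothing, the function names `train_nb` and `predict_nb`, and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (list_of_words, new_category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in docs:
        class_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, words):
    """Return the most likely category and the log-score of every category."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for cat, n in class_counts.items():
        total = sum(word_counts[cat].values())
        score = math.log(n / n_docs)  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing, an assumption of this sketch.
                score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get), scores


# Toy training documents (made up for illustration).
docs = [
    (["engine", "car", "factory"], "Automobile"),
    (["car", "wheel"], "Automobile"),
    (["fabric", "cotton", "factory"], "Textile"),
    (["cotton", "loom"], "Textile"),
]
model = train_nb(docs)
predicted, scores = predict_nb(model, ["car", "engine"])
```

Note that this learner never sees the old category, only the words.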
Combined Learners
- Weighted average
- Voting scheme
- Combination decision tree: takes the outputs and confidences of two of the simple learners and predicts the new category
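A voting scheme over the simple learners might look like the sketch below. The slides do not describe how votes are counted or how ties are resolved, so the majority rule, the tie-break on summed confidence, and the name `vote` are assumptions made for illustration.

```python
from collections import defaultdict


def vote(predictions):
    """Majority vote over (category, confidence) pairs from several learners.

    Ties on the vote count are broken by the summed confidence
    (an assumption of this sketch).
    """
    votes = defaultdict(int)
    conf_sum = defaultdict(float)
    for category, confidence in predictions:
        votes[category] += 1
        conf_sum[category] += confidence
    return max(votes, key=lambda c: (votes[c], conf_sum[c]))


# Hypothetical outputs of three learners for one document.
winner = vote([("Textile", 0.6), ("Automobile", 0.9), ("Textile", 0.55)])

# Two learners that disagree: the more confident one decides.
tie = vote([("Textile", 0.6), ("Automobile", 0.9)])
```

In the first call two learners out of three agree, so their category wins even though the dissenting learner is more confident.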
Learning
[Figure: two panels, "Using both the content and the label" and "Combining the two outputs"; diagram labels: abc; DT; DT, NB, SVM; voting; 3rd classifier]
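For the first panel, where a single learner sees both the content and the label, one possible feature representation is a bag of words with the old category added as one extra feature. The encoding below (string keys with `word=` and `old_category=` prefixes, and the name `combined_features`) is hypothetical; the slides do not specify how the features were encoded.

```python
from collections import Counter


def combined_features(words, old_category):
    """Bag-of-words counts plus one indicator feature for the old category."""
    features = Counter(f"word={w}" for w in words)
    # The prefix keeps the old label from colliding with an identical word.
    features[f"old_category={old_category}"] = 1
    return dict(features)


features = combined_features(["car", "car", "engine"], "German")
```

Any learner that accepts sparse feature dictionaries could then be trained on these rows with the new category as the target.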
Results Words Only 5-fold cross validation
Results Categories Only 5-fold cross validation
Results Combination 5-fold cross validation
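The 5-fold cross validation used for these results can be sketched as follows. The slides do not state how the folds were drawn or which metric was reported, so the interleaved fold assignment, the use of accuracy, and the majority-class baseline are assumptions; the sketch also assumes at least `k` items so that no fold is empty.

```python
from collections import Counter


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for each of k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


def cross_validate(items, labels, train_fn, predict_fn, k=5):
    """Mean accuracy over k folds; each item is tested exactly once."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Majority-class baseline, used here only to exercise the loop.
def train_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


items = list(range(10))
labels = ["Textile"] * 8 + ["Automobile"] * 2
mean_acc = cross_validate(items, labels, train_majority, predict_majority, k=5)
```

Any of the learners sketched earlier could be plugged in through `train_fn` and `predict_fn` in place of the baseline.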
Results
Remarks
- The hierarchy (old classes) is usually ignored; shown here that it helps
- The learners are not the issue
- Better way of understanding: the old label (or hierarchy path) is metadata
Remaining Work
- SVM results (running even as we speak)
- Repeat experiments on Reuters-2001
- Internal hierarchies
- Missing labels
- Less correlated types of classes
- Results in standard evaluation format
Future Work
- Try with a web dataset (Google and Yahoo! hierarchies)
- Hierarchies with more levels
- Metadata (for non-text sources)
Related Literature
- Y. Yang, S. Slattery, R. Ghani. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002 (to appear).
- A. Doan, P. Domingos, A. Levy. Learning Mappings between Data Schemas. Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, Austin, TX.
Questions and Suggestions The end.
Taxonomies
- Formal systems of orderly classification of knowledge, designed for a specific purpose
- Change of purpose, change of taxonomies
- Businesses often need and keep the information in several structures
- Important to be able to automatically map between taxonomies
Useful Mappings
- Companies organizing information in various ways (e.g. one for marketing, another for product development)
- Personal online bookmark classification
- Search engines (e.g. Google, Yahoo)
- EU Committee for Standardization: “detailed overview of the existing taxonomies officially used in the EU, in order to derive general concepts such as: information organisation, properties, multilinguality, keywords, etc. and, last but not least, the mapping between.”