Classifying Tags Using Open Content Resources
Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol
WSDM ’09
Motivation
- Classify Flickr tags into broad categories such as what, where, when and who
- Easier indexing and navigation
- WordNet is usually used for classification but has limited coverage
Example
The ClassTag System
Classifying Wikipedia Articles
- Using only metadata (i.e. Categories and Templates): high scalability
- Supervised classifier
  - Articles as objects
  - WordNet noun semantic categories as classification classes
  - Categories and Templates as features
  - Support Vector Machine (SVM) as classifier
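The setup above (articles as objects, Categories and Templates as features, WordNet noun categories as classes) could be arranged roughly as follows before training a classifier. This is a minimal sketch, not the authors' code: the `Article` type, the `cat:`/`tpl:` prefixes, the binary feature values and the two example articles are all assumptions made for illustration, and the SVM training step itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """A Wikipedia article reduced to its metadata (hypothetical structure)."""
    title: str
    categories: list = field(default_factory=list)
    templates: list = field(default_factory=list)


def feature_names(article):
    # Prefix features so a category and a template with the same name stay distinct.
    return ([f"cat:{c}" for c in article.categories] +
            [f"tpl:{t}" for t in article.templates])


def build_vocabulary(articles):
    """Assign a column index to every feature seen in the training articles."""
    vocab = {}
    for article in articles:
        for name in feature_names(article):
            vocab.setdefault(name, len(vocab))
    return vocab


def to_sparse_vector(article, vocab):
    """Binary sparse vector {feature index: 1.0}; unseen features are dropped."""
    return {vocab[n]: 1.0 for n in feature_names(article) if n in vocab}


# Illustrative training data: labels are WordNet noun semantic categories.
articles = [
    Article("Paris", ["Capitals in Europe"], ["Infobox settlement"]),
    Article("Albert Einstein", ["German physicists"], ["Infobox scientist"]),
]
labels = ["location", "person"]

vocab = build_vocabulary(articles)
training_set = [(to_sparse_vector(a, vocab), y) for a, y in zip(articles, labels)]
```

The resulting `(vector, label)` pairs are in the shape most SVM libraries accept after conversion to their own sparse format.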
Categories and Templates
Supervised Classification
- Ground truth
  - All Wikipedia articles that match WordNet nouns
- Data sparsity
  - WordNet categories underrepresented (10 out of 25)
  - Articles have very few features
Reducing Data Sparsity
- Using category and template network transclusion
- … but noise is added
System Optimization
- Number of arcs traversed in
  - Category network
  - Template network
- Choice of weighting function
  - Term Frequency (tf)
  - Term Frequency – Inverse Document Frequency (tf-idf)
  - Term Frequency – Inverse Layer (tf-il)
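The two tuning knobs above, the number of arcs traversed and the weighting function, can be illustrated with a bounded walk up a category network. This is a sketch under stated assumptions: the slides do not define tf-il, so it is taken here to mean the term frequency divided by the layer (number of arcs) at which a feature is first reached; the function names and the toy network are hypothetical.

```python
import math
from collections import Counter


def collect_features(direct, parents, max_arcs):
    """Walk up the network for at most max_arcs arcs.

    Returns (counts, first_layer): how often each node was reached, and the
    smallest number of arcs needed to reach it from the article.
    """
    counts = Counter()
    first_layer = {}
    frontier = list(direct)  # layer 1: categories assigned directly to the article
    for layer in range(1, max_arcs + 1):
        next_frontier = []
        for node in frontier:
            counts[node] += 1
            first_layer.setdefault(node, layer)
            next_frontier.extend(parents.get(node, []))
        frontier = next_frontier
    return counts, first_layer


def tf(count, layer, doc_freq, n_docs):
    return float(count)


def tf_idf(count, layer, doc_freq, n_docs):
    return count * math.log(n_docs / doc_freq)


def tf_il(count, layer, doc_freq, n_docs):
    # Assumed definition: features found further from the article count less.
    return count / layer


def weigh(direct, parents, max_arcs, scheme, doc_freq=None, n_docs=1):
    counts, first_layer = collect_features(direct, parents, max_arcs)
    doc_freq = doc_freq or {}
    return {f: scheme(counts[f], first_layer[f], doc_freq.get(f, 1), n_docs)
            for f in counts}


# Toy category network (child -> parent categories), for illustration only.
parents = {
    "Capitals in Europe": ["Capitals", "Cities in Europe"],
    "Cities in France": ["Cities in Europe"],
    "Capitals": ["Cities"],
    "Cities in Europe": ["Cities"],
}
direct = ["Capitals in Europe", "Cities in France"]
weights = weigh(direct, parents, max_arcs=2, scheme=tf_il)
```

With two arcs, "Cities in Europe" is reached twice at layer 2 and so receives the same tf-il weight as a directly assigned category, while "Capitals" is reached once and receives half of that. Raising `max_arcs` adds more distant features, which reduces sparsity but also adds noise.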
Example
Fine Tuning
- Partitioned the ground truth into training and test sets
- Criteria
  - At least 80% precision
  - Maximum possible recall
- Resulting optimal values
  - Category arcs: 3, Template arcs: 3, weighting: TF-IL
  - Precision: 87%, F1-measure: 0.696
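The selection rule (at least 80% precision, then maximum recall) can be sketched as a filter followed by a maximum. Only the 87% precision and 0.696 F1 figures come from the slide; a recall of about 0.58 follows from them via R = F1·P / (2P − F1). The other two rows of the grid below are invented purely to exercise the rule, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(precision, f1):
    """Invert the F1 formula to recover recall from precision and F1."""
    return f1 * precision / (2 * precision - f1)


def select_configuration(results, min_precision=0.80):
    """Among configurations meeting the precision floor, pick the highest recall."""
    eligible = [r for r in results if r["precision"] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["recall"])


# Hypothetical grid: only the middle row reflects figures reported on the slide.
results = [
    {"category_arcs": 1, "template_arcs": 1, "weighting": "tf",
     "precision": 0.91, "recall": 0.40},
    {"category_arcs": 3, "template_arcs": 3, "weighting": "tf-il",
     "precision": 0.87, "recall": recall_from_f1(0.87, 0.696)},
    {"category_arcs": 5, "template_arcs": 5, "weighting": "tf",
     "precision": 0.74, "recall": 0.66},
]
best = select_configuration(results)
```

The third row has the highest recall but is rejected by the precision floor, which is why the constraint is applied before maximising.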
SVM Threshold
- The SVM outputs the confidence with which an article is correctly classified as a member of a category
- Training experiment with 250 Wikipedia articles (1 assessor)
SVM Threshold
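Thresholding the SVM confidence trades coverage for precision: a higher threshold keeps fewer, more reliable classifications. The sketch below assumes each prediction is a (title, category, confidence) triple and that assessor judgements are available as a title-to-category mapping; the data and function names are made up for illustration.

```python
def apply_threshold(predictions, threshold):
    """Keep only predictions whose confidence is at least the threshold."""
    return [(title, category)
            for title, category, confidence in predictions
            if confidence >= threshold]


def precision_and_coverage(predictions, gold, threshold):
    """Precision over the kept predictions and the share of articles kept."""
    kept = apply_threshold(predictions, threshold)
    coverage = len(kept) / len(predictions) if predictions else 0.0
    correct = sum(1 for title, category in kept if gold.get(title) == category)
    precision = correct / len(kept) if kept else 0.0
    return precision, coverage


# Hypothetical classifier output and assessor judgements.
predictions = [
    ("Paris", "location", 0.9),
    ("Apple Inc.", "food", 0.2),
    ("Albert Einstein", "person", 0.8),
    ("Java", "location", 0.4),
]
gold = {
    "Paris": "location",
    "Apple Inc.": "group",
    "Albert Einstein": "person",
    "Java": "location",
}

low = precision_and_coverage(predictions, gold, threshold=0.0)
high = precision_and_coverage(predictions, gold, threshold=0.5)
```

In this toy data the higher threshold removes the one wrong prediction but also a correct low-confidence one, which mirrors the recall-oriented and precision-oriented operating points.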
Summary
- Optimised for recall (ClassTag)
  - 39% of articles classified
  - 664,770 Wikipedia articles
- Optimised for precision (ClassTag+)
  - 21% of articles classified
  - 338,061 Wikipedia articles
Comparison with DBpedia
- Experimental setup
  - 300 pooled articles
  - 3 assessors
  - Blind assessments
  - 50 articles overlap
- Partial agreement: 86%
- Total agreement: 78%
Results
Classification of Flickr Tags
- Tag -> Anchor Text
  - String matching
- Anchor Text -> Wikipedia Article
  - Number of times an anchor refers to a Wikipedia article
- Wikipedia Article -> Category
  - Output of SVM decision
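The three-step mapping (tag to anchor text, anchor text to article, article to category) can be sketched as two dictionary lookups. The normalisation rule (lower-casing and removing whitespace), the choice of the most frequently linked article, and all counts below are assumptions for illustration, not details confirmed by the slides.

```python
from collections import Counter


def normalise(text):
    """Lower-case and drop whitespace so 'George Bush' matches 'georgebush'."""
    return "".join(text.lower().split())


def build_anchor_index(anchor_counts):
    """anchor_counts maps (anchor text, article title) to a link count."""
    index = {}
    for (anchor, article), count in anchor_counts.items():
        index.setdefault(normalise(anchor), Counter())[article] += count
    return index


def classify_tag(tag, anchor_index, article_category):
    """Tag -> anchor text -> most frequently linked article -> category."""
    candidates = anchor_index.get(normalise(tag))
    if not candidates:
        return None
    article, _ = candidates.most_common(1)[0]
    return article_category.get(article)


# Made-up link counts and classifier output.
anchor_counts = {
    ("George Bush", "George W. Bush"): 70,
    ("George Bush", "George H. W. Bush"): 30,
    ("Paris", "Paris"): 500,
}
article_category = {
    "George W. Bush": "person",
    "George H. W. Bush": "person",
    "Paris": "location",
}
index = build_anchor_index(anchor_counts)
category = classify_tag("georgebush", index, article_category)
```

Because the lookup key is normalised on both sides, the Flickr-style tag `georgebush` and the anchor text "George Bush" resolve to the same entry.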
Ambiguity
- Tag -> Anchor Text
  - Some ambiguity because tags are often lower case with no white space
- Anchor Text -> Wikipedia Article
  - 13.4% of Anchor Text -> Wikipedia Article mappings are ambiguous
  - 4% of Anchor Text -> Category mappings are ambiguous
  - Example
    - George Bush -> George W. Bush, George Bush Senior
    - George Bush -> Person
- Wikipedia Article -> Category
  - 5.7% of classified articles result in multiple classifications
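The gap between article-level and category-level ambiguity arises because several candidate articles often share one category, as in the George Bush example. One possible way to measure both rates is sketched below; counting per anchor text is an assumption (the slide does not say how the percentages were computed), and the data is a toy example.

```python
def ambiguity_rates(anchor_to_articles, article_category):
    """Return (article-level, category-level) ambiguity as fractions of anchors."""
    article_ambiguous = 0
    category_ambiguous = 0
    for articles in anchor_to_articles.values():
        if len(set(articles)) > 1:
            article_ambiguous += 1
        categories = {article_category[a] for a in articles if a in article_category}
        if len(categories) > 1:
            category_ambiguous += 1
    total = len(anchor_to_articles)
    if total == 0:
        return 0.0, 0.0
    return article_ambiguous / total, category_ambiguous / total


# Toy data: two of three anchors point at several articles,
# but only one of them spans more than one category.
anchor_to_articles = {
    "George Bush": ["George W. Bush", "George H. W. Bush"],
    "Turkey": ["Turkey", "Turkey (bird)"],
    "Paris": ["Paris"],
}
article_category = {
    "George W. Bush": "person",
    "George H. W. Bush": "person",
    "Turkey": "location",
    "Turkey (bird)": "animal",
    "Paris": "location",
}
article_rate, category_rate = ambiguity_rates(anchor_to_articles, article_category)
```

Here "George Bush" is ambiguous between two articles yet unambiguous as a category, so it counts towards the first rate only.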
Example
Evaluation
- ClassTag extended the vocabulary coverage of WordNet classification by 115%
- Taking tag frequency into account
  - ClassTag classified 69.2% of Flickr tags
  - 22% more than the WordNet baseline
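The evaluation distinguishes coverage of the tag vocabulary (distinct tags) from coverage once tag frequency is taken into account (tag occurrences). A minimal sketch of the two measures follows; the function names and the tag counts are invented for illustration.

```python
def vocabulary_coverage(tag_counts, classified):
    """Fraction of distinct tags that received a classification."""
    if not tag_counts:
        return 0.0
    return sum(1 for tag in tag_counts if tag in classified) / len(tag_counts)


def volume_coverage(tag_counts, classified):
    """Fraction of tag occurrences that received a classification."""
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for tag, n in tag_counts.items() if tag in classified)
    return covered / total


# Made-up tag frequencies: a few popular tags dominate the volume.
tag_counts = {"london": 50, "2008": 30, "me": 15, "xyzzy": 5}
classified = {"london", "2008"}

by_vocabulary = vocabulary_coverage(tag_counts, classified)
by_volume = volume_coverage(tag_counts, classified)
```

Because popular tags are more likely to be classified, the frequency-weighted figure is typically higher than the vocabulary figure, as it is in this toy data.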
Tag distribution
Multilanguage Classification
- 80% of tags in English, 7% in German and 6% in Dutch
  - A portion of the unclassified tags may fall into the non-English group
- Possible alternate language classification
  - Run ClassTag using an alternate-language Wikipedia and a corresponding lexicon
  - Translate the English classification using Wikipedia’s interlanguage links
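The second option, translating the English classification through interlanguage links, amounts to carrying each article's category over to its counterpart title in another language. The sketch below assumes the links are already available as a nested dictionary; the data structure and function name are hypothetical.

```python
def translate_classification(english_classes, interlanguage_links, lang):
    """Copy each English article's category to its counterpart in `lang`."""
    translated = {}
    for en_title, category in english_classes.items():
        foreign_title = interlanguage_links.get(en_title, {}).get(lang)
        if foreign_title:
            translated[foreign_title] = category
    return translated


# Small illustrative sample of classified English articles and their links.
english_classes = {"Munich": "location", "Germany": "location", "Dog": "animal"}
interlanguage_links = {
    "Munich": {"de": "München", "nl": "München"},
    "Germany": {"de": "Deutschland", "nl": "Duitsland"},
    # "Dog" deliberately has no links here, so it is skipped.
}

german = translate_classification(english_classes, interlanguage_links, "de")
dutch = translate_classification(english_classes, interlanguage_links, "nl")
```

Articles without a link in the target language are simply left unclassified, so coverage in the other language can only be as good as the link coverage.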
Contributions
- Classifying open content resources using their structural patterns
- Presenting ClassTag, a system for classifying tags
- ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia
Conclusion
- Tuneable system for classifying Wikipedia pages
  - ClassTag: nearly 40% of articles classified with a precision of 72%
  - ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement)
- Nearly 70% of Flickr tags matched to WordNet categories