Multi-label Classification
Yusuke Miyao
  N. Ghamrawi, A. McCallum. Collective multi-label classification. CIKM
  S. Godbole, S. Sarawagi. Discriminative methods for multi-labeled classification. PAKDD
  G. Tsoumakas, I. Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. ECML
  G. Tsoumakas, I. Katakis. Multi-label classification: An overview. Journal of Data Warehousing and Mining
  A. Fujino, H. Isozaki. Multi-label text categorization with model combination based on F1-score maximization. IJCNLP 2008.
Machine Learning Template Library
  Separate data structures from learning algorithms
  Allow for any combination of structures and algorithms
  [Diagram: data structures and learning algorithms connected through an interface (decode, expectation, diff). Labels in the diagram: Perceptron, 1-best MIRA, n-best MIRA, Log-linear model, Max-margin, EM algorithm, Naïve Bayes, n-best, Classifier, Markov chain, Dep. tree, Semi-Markov, Multi-label, Reranking, Feature forest]
Target Problem
  Choose multiple labels from a fixed set of labels
  Ex. Keyword assignment (text categorization): select appropriate keywords for the text
  [Figure: a text and a keyword set (Politics, Sports, Entertainment, Life, Food, Recipe, Comedy, Drama, Travel, Tech, Health, Video, Book, Animation, Music)]
Applications
  Keyword assignment (text categorization)
    – Benchmark data: Reuters-21578, OHSUMED, etc.
  Medical diagnosis
  Protein function classification
    – Benchmark data: Yeast, Genbase, etc.
  Music/scene categorization
  Non-contiguous, overlapping segmentation [McDonald et al., 2005]
Formulation
  x : object, L : label set, y ⊆ L : labels assigned to x
  y = argmax_{y' ⊆ L} f(x, y')
  [Figure: an object x is mapped to a subset y of the label set L (Politics, Sports, Entertainment, Life, Food, Recipe, Comedy, Drama, Travel, Tech, Health, Video, Book, Animation, Music)]
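The argmax above can be read literally as a search over every subset of L. A minimal sketch follows, assuming a hypothetical scoring function `toy_score` (not from the slides) that rewards labels whose names occur in the text. It mainly illustrates that exhaustive search visits 2^|L| candidates.

```python
from itertools import combinations

def all_subsets(labels):
    """Yield every subset of `labels` as a frozenset (2^|L| subsets in total)."""
    labels = list(labels)
    for size in range(len(labels) + 1):
        for combo in combinations(labels, size):
            yield frozenset(combo)

def predict(x, labels, score):
    """Return the subset y of `labels` that maximizes score(x, y), by enumeration."""
    return max(all_subsets(labels), key=lambda y: score(x, y))

def toy_score(x, y):
    """Hypothetical f(x, y): +1 per label whose name occurs in x, -1 otherwise."""
    text = x.lower()
    return sum(1 if label.lower() in text else -1 for label in y)

LABELS = ["Politics", "Sports", "Food", "Recipe"]
best = predict("A quick recipe for healthy food", LABELS, toy_score)
print(sorted(best))
```

With four labels the search covers 16 subsets; each additional label doubles that number, which is why the later slides turn to approximations.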
Popular Approaches
  Subsets as atomic labels
    – Each subset is considered as an atomic label
    – Tractable only when |L| is small
  A set of binary classifications
    – One-vs-all
    – Each label is independently assigned
  Label ranking
    – A ranking function is induced from multi-labeled data (BoosTexter [Schapire et al., 2000], Rank-SVM [Elisseeff et al., 2002], large-margin [Crammer et al., 2003])
  Probabilistic generative models [McCallum 1999; Ueda et al., 2003; Sato et al., 2007]
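The first two approaches amount to different ways of re-encoding the training labels before handing them to a standard classifier. The sketch below shows only that re-encoding; the function names and the toy data are illustrative assumptions, and no classifier is trained.

```python
def to_atomic_labels(label_sets):
    """Subsets as atomic labels: map each distinct label set to one class id."""
    class_ids = {}
    encoded = []
    for y in label_sets:
        key = frozenset(y)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        encoded.append(class_ids[key])
    return encoded, class_ids

def to_binary_problems(label_sets, labels):
    """One-vs-all: build one independent 0/1 target vector per label."""
    return {label: [1 if label in y else 0 for y in label_sets] for label in labels}

train_y = [{"Food", "Recipe"}, {"Sports"}, {"Food", "Recipe"}, {"Food"}]
atomic, class_ids = to_atomic_labels(train_y)
binary = to_binary_problems(train_y, ["Food", "Recipe", "Sports"])
print(atomic)
print(binary)
```

The atomic encoding keeps label co-occurrence but can need up to 2^|L| classes; the binary encoding needs only |L| problems but discards co-occurrence.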
Issues on Multi-Label Classification
  How to reduce training/running cost
    – The number of targets (i.e. subsets) grows exponentially with the size of the label set
  How to model correlation of labels
    – Independent binary classifiers cannot use features defined on multiple labels
  Classification vs. ranking
  Hierarchical multi-label classification (ex. MeSH terms) [Cesa-Bianchi et al., 2006; Rousu et al., 2006]
Collective Multi-Label Classification
  A CRF is applied to multi-label classification
  Features are defined on pairs of labels
  Notation:
    – y_i = 1 if the i-th label ∈ y
    – y_i = 0 otherwise
Accounting for Multiple Labels
  Binary model: f_b(x, y) : y_i given x
  Collective Multi-Label (CML) model: f_ml(x, y) : y_i and y_j
  Collective Multi-Label with Features (CMLF) model: f_mlf(x, y) : y_i and y_j given x
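One way to picture the three feature types is as tuples extracted from (x, y) and scored linearly. The sketch below is an assumption-laden illustration: the tuple layouts, the bag-of-words input, and the hand-set weights are hypothetical and not taken from the original model definition.

```python
from collections import Counter
from itertools import combinations

def binary_features(words, y):
    """f_b: each label value y_i paired with each word of x."""
    return Counter(("b", i, yi, w) for i, yi in enumerate(y) for w in words)

def pair_features(y):
    """f_ml: the values of each label pair (y_i, y_j), independent of x."""
    return Counter(("ml", i, j, y[i], y[j])
                   for i, j in combinations(range(len(y)), 2))

def pair_word_features(words, y):
    """f_mlf: the values of each label pair (y_i, y_j) together with each word of x."""
    return Counter(("mlf", i, j, y[i], y[j], w)
                   for i, j in combinations(range(len(y)), 2) for w in words)

def score(weights, features):
    """Linear score: sum of weight * count over the active features."""
    return sum(weights.get(f, 0.0) * c for f, c in features.items())

words = ["pasta", "recipe"]
y = (1, 1, 0)  # y_i = 1 if the i-th label is assigned
feats = binary_features(words, y) + pair_features(y)
# Hypothetical weights: one word-label feature and one label-pair feature.
weights = {("b", 0, 1, "pasta"): 0.5, ("ml", 0, 1, 1, 1): 1.5}
total = score(weights, feats)
print(total)
```

The pair features are what let the model reward or penalize label co-occurrence, which independent binary classifiers cannot express.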
Parameter Estimation
  Enumeration of y is intractable in general
  Two approximations:
    – Supported combinations: consider only the label combinations that occur in the training data
    – Binary pruned inference: first apply the binary model, then consider only the labels whose probability is above a threshold t
  No dynamic programming
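Both approximations can be viewed as shrinking the candidate set before the argmax. The sketch below assumes that binary pruned inference enumerates all subsets of the surviving labels (the slide does not spell this out), and it uses hypothetical probabilities and a hypothetical scoring function.

```python
from itertools import combinations

def supported_combinations(train_label_sets):
    """Candidates = only the label combinations observed in the training data."""
    return {frozenset(y) for y in train_label_sets}

def binary_pruned_candidates(label_probs, t):
    """Candidates = all subsets of the labels whose binary-model probability exceeds t."""
    kept = sorted(label for label, p in label_probs.items() if p > t)
    return {frozenset(c)
            for k in range(len(kept) + 1)
            for c in combinations(kept, k)}

def best_candidate(candidates, score):
    """Argmax restricted to the candidate set."""
    return max(candidates, key=score)

train_y = [{"Food", "Recipe"}, {"Sports"}, {"Food"}]
probs = {"Food": 0.9, "Recipe": 0.6, "Sports": 0.05, "Politics": 0.01}  # hypothetical

supported = supported_combinations(train_y)
pruned = binary_pruned_candidates(probs, t=0.1)

def toy_score(y):
    """Hypothetical score: reward labels the binary model finds likely."""
    return sum(probs[label] - 0.5 for label in y)

print(sorted(best_candidate(pruned, toy_score)))
```

Here pruning leaves two labels and therefore four candidates instead of sixteen; the supported set has one candidate per distinct training combination.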
Experiments
  Reuters, Modified Apte (ModApte) split
    – 90 labels
    – Training: 9,603 docs, Test: 3,299 docs
    – 8.7% of the documents have multiple labels
  OHSUMED “Heart Disease” documents
    – 40 labels assigned to training documents
    – 16 labels assigned to 75 or more training documents
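Results: Reuters
  [The macro-F, micro-F, and exact match values are not recoverable; only the classification times remain. The two tables are matched to the two inference methods by their order on the slide.]
  Supported combinations
                            Binary    CML       CMLF
    classification time     1.4 ms    48 ms     78 ms
  Binary pruned
                            Binary    CML       CMLF
    classification time     1.4 ms    4.6 ms    4.7 ms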
Results: OHSUMED
  [Tables: macro-F, micro-F, and exact match for Binary, CML, and CMLF under supported combinations and binary pruned inference; values not recoverable]
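For reference, the three accuracy measures named in the result tables can be computed as below. This is a sketch under common conventions (a label with no true positives, false positives, or false negatives gets F1 = 0), which may differ in detail from the evaluation used in the experiments; the toy data is made up.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when the denominator is zero (a convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def evaluate(gold, pred, labels):
    """Return (macro-F, micro-F, exact match) for parallel lists of label sets."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Macro-F averages per-label F1; micro-F pools the counts over all labels.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    # Exact match counts documents whose whole label set is predicted correctly.
    exact = sum(set(g) == set(p) for g, p in zip(gold, pred)) / len(gold)
    return macro, micro, exact

gold = [{"A", "B"}, {"A"}, {"C"}]
pred = [{"A"}, {"A"}, {"B", "C"}]
macro, micro, exact = evaluate(gold, pred, ["A", "B", "C"])
print(macro, micro, exact)
```

Macro-F weights rare labels as heavily as frequent ones, micro-F is dominated by frequent labels, and exact match is the strictest of the three.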
Similar Methods
  H. Kazawa et al. Maximal margin labeling for multi-topic text categorization. NIPS
    – All subsets are considered as atomic labels
    – Approximation by considering only neighbor subsets (subsets that differ from the gold subset in a single label)
  S. Zhu et al. Multi-labelled classification using maximum entropy method. SIGIR
    – Simply enumerate all subsets, and use f_ml
    – Only evaluated with small label sets (|L| ≤ 10)
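The neighbor-subset approximation replaces the 2^|L| competing subsets with the |L| subsets obtained by flipping one label of the gold set. A minimal sketch of that neighborhood, with an illustrative function name and toy labels:

```python
def neighbor_subsets(gold, labels):
    """All label sets that differ from `gold` in exactly one label."""
    gold = frozenset(gold)
    # Symmetric difference with a singleton adds the label if absent, removes it if present.
    return [gold ^ {label} for label in labels]

LABELS = ["Food", "Recipe", "Sports", "Politics"]
neighbors = neighbor_subsets({"Food", "Recipe"}, LABELS)
for n in neighbors:
    print(sorted(n))
```

Each gold set has exactly |L| neighbors, so the number of competitors per training instance grows linearly rather than exponentially.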
Discriminative Methods for Multi-Labeled Classification
  Cascade binary classifiers (SVM)
  Another technique: remove negative instances that are close to the decision boundary
  [Figure: input text → |L| classifiers, one for each label → |L| ensemble classifiers]
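The cascade in the figure can be sketched as feature augmentation: the |L| first-stage scores are appended to the input, and a second set of |L| classifiers decides on the augmented vector. The sketch below uses hand-set linear scorers in place of trained SVMs; all weights and the helper names are hypothetical.

```python
def linear(weights, bias=0.0):
    """Return a linear scoring function; stands in for a trained binary classifier."""
    return lambda x: sum(w * v for w, v in zip(weights, x)) + bias

def cascade_predict(x, first_stage, second_stage):
    """Run the per-label classifiers, then the ensemble classifiers on x + their scores."""
    scores = [clf(x) for clf in first_stage]   # one first-stage score per label
    z = list(x) + scores                       # augmented feature vector
    return [1 if clf(z) > 0 else 0 for clf in second_stage]

# Two input features, two labels.
first_stage = [linear([1.0, 0.0]), linear([0.0, 1.0])]
second_stage = [
    linear([0.0, 0.0, 1.0, 0.0], bias=-0.5),   # label 0: relies on its own score
    linear([0.0, 0.0, 0.5, 1.0], bias=-0.5),   # label 1: also uses label 0's score
]

with_support = cascade_predict([1.0, 0.2], first_stage, second_stage)
without_support = cascade_predict([0.0, 0.2], first_stage, second_stage)
print(with_support, without_support)
```

In the example, weak evidence for label 1 is accepted only when label 0 is also supported, which is the kind of label correlation independent classifiers miss.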
Random k-Labelsets
  Randomly select size-k subsets from 2^L
  Train multi-class classifiers for the subsets
  Label a new instance by majority voting
  [Figure: input text → classifiers for size-k subsets Y_1, Y_2, Y_3, …, Y_m → outputs (1,0,0,1,0,…,0,0), (0,1,0,1,0,…,0,1), (1,0,0,0,1,…,0,1), (0,0,0,1,0,…,1,1) → majority voting → (0,0,0,1,0,…,0,1)]
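The selection and voting steps can be sketched as follows. Training of the multi-class classifiers is omitted; their outputs are given directly as hypothetical predictions, and the 0.5 voting threshold is an assumption. Enumerating all size-k combinations before sampling is only practical for small label sets.

```python
import random
from itertools import combinations

def sample_labelsets(labels, k, m, seed=0):
    """Pick m distinct size-k subsets of the label set at random."""
    rng = random.Random(seed)
    return rng.sample(list(combinations(sorted(labels), k)), m)

def vote(labelsets, predictions, labels, threshold=0.5):
    """Assign a label if more than `threshold` of the classifiers covering it predict it."""
    result = set()
    for label in labels:
        votes = [label in pred
                 for ls, pred in zip(labelsets, predictions) if label in ls]
        if votes and sum(votes) / len(votes) > threshold:
            result.add(label)
    return result

sampled = sample_labelsets(["A", "B", "C", "D"], k=2, m=3)

# Hypothetical outputs of three classifiers, one per labelset.
labelsets = [("A", "B"), ("B", "C"), ("A", "C")]
predictions = [{"A"}, {"B", "C"}, {"A", "C"}]
final = vote(labelsets, predictions, ["A", "B", "C"])
print(sorted(final))
```

Label B receives one vote out of two and is dropped, while A and C are predicted by every classifier that covers them.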
Other Approaches
  Learn a latent model to account for label correlations
    – K. Yu et al. Multi-label informed latent semantic indexing. SIGIR
    – J. Zhang et al. Learning multiple related tasks using latent independent component analysis. NIPS
    – V. Roth et al. Improved functional prediction of proteins by learning kernel combinations in multilabel settings. PMSB
  kNN-like algorithms
    – M-L Zhang et al. A k-nearest neighbor based algorithm for multi-label classification. IEEE Conference on Granular Computing
    – F. Kang et al. Correlated label propagation with application to multi-label learning. CVPR
    – K. Brinker et al. Case-based multilabel ranking. IJCAI 2007.
Summary
  Multi-label classification is an important and interesting problem
  Major issues:
    – Label correlation
    – Computational cost
  Many methods have been proposed
    – Basically, enhancements of the fundamental methods (subsets as atomic labels, a set of binary classifications)
  No existing method solves the problem completely
Future Directions
  Algorithm for an exact solution?
  Other learning algorithms
    – Via the machine learning template library
  Structurization of label sets
    – IS-A hierarchy → hierarchical multi-label
    – Exclusive labels
  Modeling of label distance
    – Redesign of objective functions
Possible Applications
  Any keyword assignment task
  Substitute for n-best/ranking
  Multi-label problems where label sets are not fixed
    – Keyword (key phrase) extraction: choose words/phrases from each document
    – Summarization by sentence extraction
  cf. D. Xin et al. Extracting redundancy-aware top-k patterns. KDD 2006.