CNN-RNN: A Unified Framework for Multi-label Image Classification Xueying Bai, Jiankun Xu
Multi-label Image Classification Co-occurrence dependency Higher-order correlation: one label can be predicted using the previous label Semantic redundancy: labels have overlapping meanings (cat and kitten)
Previous Models Multiple single-label classification Fail to model the dependency between multiple labels Graphic model Large amount of parameters; Can not model higher-order correlation
RNN-CNN Model Learn the semantic redundancy and the co-occurrence dependencies Have an end-to-end training process Predict more objects that need contexts (higher-order correlation)
CNN-RNN Framework
Joint Embedding Model Label embedding: the embedding vector in a low-d Euclidian space in which embeddings of semantically similar labels are close to each other Image embedding: the embedding vector close to that of its associated labels in the same space Exploit semantic redundancy problem: share classification parameters
Model Diagram Output of CNN: Image embedding Output of RNN (o(t)): new embedding including the information from previous label (to model higher order correlations)
LSTM
Recurrent Neural Network
Inference Prediction Path Beam Search Find top N labels in each time step as candidates Find top N prediction paths for each time (t+1)
Beam Search When comes to ‘End’: add to the candidate path set Termination condition: probability of current intermediate paths is smaller than that of all candidate paths.
Experiments CNN module uses the 16 layers VGG network Dimension of label embedding is 64 Dimension of LSTM RNN layer is 512 Test on Datasets: NUS-WIDE, MS COCO and VOC PASCAL 2007
Precision: correctly annotated labels/ generated labels Evaluation Metric Precision: correctly annotated labels/ generated labels Recall: correctly annotated labels/ ground-truth labels C-P, O-P; C-R, O-R C-Fl, O-Fl: geometrical average MAP
NUS-WIDE A web image dataset contains 269648 images and 5018 tags. Test on dataset with 1000 tags and 81 tags.
MS COCO It contains 123 thousand images of 80 objects types. Training data has 82783 images and testing data has 40504 images. Most images have multiple objects.
PASCAL VOC 2007 Training data has 5011 images and testing data has 4952 images. Use AP and mAP to evaluate.
Label embedding The model effectively learns a joint label embedding
Attention Visualization
Conclusion and Future Work Combines the advantages of the joint image/label embedding and label co-occurrence models by employing CNN and RNN Experimental results on several datasets show good performance Predicting small objects is still a challenge.
Reference: CNN-RNN: A Unified Framework for Multi-label Image Classification — Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu Questions?
Thank you all!