CNN-RNN: A Unified Framework for Multi-label Image Classification

CNN-RNN: A Unified Framework for Multi-label Image Classification
Presented by Xueying Bai and Jiankun Xu

Multi-label Image Classification
- Label co-occurrence dependency
- Higher-order correlation: a label can be predicted using the previously predicted labels
- Semantic redundancy: labels with overlapping meanings (e.g., cat and kitten)

Previous Models
- Multiple single-label classifiers: fail to model the dependencies between labels
- Graphical models: large number of parameters; cannot model higher-order correlations

CNN-RNN Model
- Learns both the semantic redundancy and the co-occurrence dependencies
- Trains end-to-end
- Better predicts objects that require context (higher-order correlation)

CNN-RNN Framework

Joint Embedding Model
- Label embedding: a vector in a low-dimensional Euclidean space in which embeddings of semantically similar labels are close to each other
- Image embedding: a vector in the same space, close to the embeddings of the image's associated labels
- Exploits semantic redundancy: labels share classification parameters through the embedding matrix

Model Diagram
- Output of the CNN: the image embedding
- Output of the RNN, o(t): a new embedding carrying information from the previously predicted labels (to model higher-order correlations)

LSTM

Recurrent Neural Network

Inference
- Prediction path: a sequence of labels ending in 'END'
- Beam search: keep the top-N labels at each time step as candidates, yielding the top-N prediction paths at time t+1

Beam Search
- When a path reaches 'END': add it to the candidate path set
- Termination condition: the probability of every current intermediate path is smaller than that of all candidate (completed) paths
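The beam-search procedure above can be sketched in pure Python. The `step_logprobs` callback (a hypothetical interface, not the paper's code) returns next-label log-probabilities given the labels predicted so far; completed paths move to the candidate set, and search stops once no surviving partial path can beat every completed one.

```python
def beam_search(step_logprobs, beam=3, end="END"):
    """step_logprobs(prefix) -> {label: log-prob} for the next step."""
    paths = [([], 0.0)]   # (label prefix, accumulated log-prob)
    done = []             # completed candidate paths
    while paths:
        expanded = []
        for prefix, lp in paths:
            for label, step_lp in step_logprobs(prefix).items():
                expanded.append((prefix + [label], lp + step_lp))
        expanded.sort(key=lambda p: p[1], reverse=True)
        paths = []
        for prefix, lp in expanded[:beam]:  # keep top-N paths
            if prefix[-1] == end:
                done.append((prefix[:-1], lp))  # reached 'END': candidate set
            else:
                paths.append((prefix, lp))
        # Terminate when every intermediate path scores below all candidates.
        if done and paths and max(lp for _, lp in paths) < min(lp for _, lp in done):
            break
    return max(done, key=lambda p: p[1])[0] if done else []

# Toy usage with a hand-crafted, deterministic step function (log-probs).
def toy_step(prefix):
    if not prefix:
        return {"cat": -0.5, "dog": -1.0, "END": -3.0}
    if prefix == ["cat"]:
        return {"dog": -0.7, "END": -0.3}
    return {"END": 0.0}

best = beam_search(toy_step, beam=2)  # -> ["cat"]
```

With beam width 2, the path cat→END (log-prob -0.8) beats dog→END (-1.0), so the returned label set is `["cat"]`.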

Experiments
- CNN module: the 16-layer VGG network (VGG-16)
- Dimension of the label embedding: 64
- Dimension of the LSTM RNN layer: 512
- Datasets: NUS-WIDE, MS COCO, and PASCAL VOC 2007

Evaluation Metrics
- Precision: correctly annotated labels / generated labels
- Recall: correctly annotated labels / ground-truth labels
- Per-class (C-P, C-R) and overall (O-P, O-R) variants
- C-F1, O-F1: harmonic mean (F1 score) of the corresponding precision and recall
- mAP: mean average precision
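The per-class (C-) and overall (O-) variants differ only in where the averaging happens: per-class metrics compute precision/recall for each label and then average, while overall metrics pool true positives across all labels first. A small NumPy sketch on a toy label matrix (the data here is made up for illustration):

```python
import numpy as np

# Toy ground truth and predictions: rows = images, columns = labels (0/1).
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

tp   = (y_true & y_pred).sum(axis=0)      # correctly annotated labels, per class
pred = y_pred.sum(axis=0)                 # generated labels, per class
true = y_true.sum(axis=0)                 # ground-truth labels, per class

C_P = np.mean(tp / np.maximum(pred, 1))   # per-class precision, then average
C_R = np.mean(tp / np.maximum(true, 1))   # per-class recall, then average
O_P = tp.sum() / pred.sum()               # pool all labels, then precision
O_R = tp.sum() / true.sum()               # pooled recall
C_F1 = 2 * C_P * C_R / (C_P + C_R)        # harmonic mean of C-P and C-R
O_F1 = 2 * O_P * O_R / (O_P + O_R)        # harmonic mean of O-P and O-R
```

On this toy data every generated label is correct (O-P = 1.0), but only 4 of the 6 ground-truth labels are recovered (O-R = 2/3), which the F1 scores summarise in a single number.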

NUS-WIDE
- A web image dataset with 269,648 images and 5,018 tags
- Evaluated on subsets with 1,000 tags and with 81 tags

MS COCO
- Contains 123 thousand images of 80 object types
- Training set: 82,783 images; test set: 40,504 images
- Most images contain multiple objects

PASCAL VOC 2007
- Training set: 5,011 images; test set: 4,952 images
- Evaluated with AP and mAP

Label Embedding
The model effectively learns a joint label embedding.

Attention Visualization

Conclusion and Future Work
- Combines the advantages of joint image/label embedding and label co-occurrence models by employing a CNN and an RNN
- Experimental results on several datasets show good performance
- Predicting small objects remains a challenge

Reference: CNN-RNN: A Unified Framework for Multi-label Image Classification. Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu. CVPR 2016.

Questions?

Thank you all!