CNN-RNN: A Unified Framework for Multi-label Image Classification

CNN-RNN: A Unified Framework for Multi-label Image Classification
Xueying Bai, Jiankun Xu

Multi-label Image Classification
Co-occurrence dependency
Higher-order correlation: one label can be predicted using the previous label
Semantic redundancy: labels have overlapping meanings (e.g. cat and kitten)

Previous Models
Multiple single-label classification: fails to model the dependency between multiple labels
Graphical models: large number of parameters; cannot model higher-order correlation

CNN-RNN Model
Learns the semantic redundancy and the co-occurrence dependencies
Is trained end to end
Predicts more of the objects that need context (higher-order correlation)

CNN-RNN Framework

Joint Embedding Model
Label embedding: a vector in a low-dimensional Euclidean space in which embeddings of semantically similar labels are close to each other
Image embedding: a vector in the same space that lies close to the embeddings of the image's associated labels
Exploits semantic redundancy: semantically similar labels share classification parameters
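
The joint embedding idea above can be illustrated with a minimal sketch. It assumes labels are scored by the dot product between an image embedding and each label embedding, followed by a softmax; the 2-D vectors, the label names, and the helper names (`dot`, `softmax`, `score_labels`) are hypothetical and chosen only for illustration, not taken from the slides.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D label embeddings: "cat" and "kitten" are semantically
# similar, so their vectors are close to each other.
label_embeddings = {
    "cat": [1.0, 0.1],
    "kitten": [0.9, 0.2],
    "car": [-0.8, 0.6],
}

def score_labels(image_embedding, label_embeddings):
    """Score every label by its dot product with the image embedding."""
    names = list(label_embeddings)
    scores = [dot(image_embedding, label_embeddings[n]) for n in names]
    return dict(zip(names, softmax(scores)))

# Hypothetical image embedding lying near the "cat"/"kitten" region.
image_embedding = [1.0, 0.0]
probs = score_labels(image_embedding, label_embeddings)
ranked = sorted(probs, key=probs.get, reverse=True)
```

Because nearby label vectors receive similar scores for the same image, similar labels effectively share classification parameters, which is the sense of "semantic redundancy" used above.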

Model Diagram
Output of CNN: image embedding
Output of RNN, o(t): a new embedding that includes the information from the previous label (to model higher-order correlations)
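
One plausible way to write the diagram down as equations is sketched here. Only o(t) is named on the slide; every other symbol is assumed notation: e_k is a one-hot vector for label k, U_l the label embedding matrix, r(t) the recurrent hidden state, I the CNN image feature, U_o^x and U_I^x projection matrices, h, h_o and h_r nonlinear functions, and s(t) the label scores at step t.

```latex
\begin{align}
w_k  &= U_l \, e_k \\
o(t) &= h_o\big(r(t-1),\, w_k(t)\big), \qquad
r(t)  = h_r\big(r(t-1),\, w_k(t)\big) \\
x_t  &= h\big(U_o^x \, o(t) + U_I^x \, I\big) \\
s(t) &= U_l^{\top} x_t
\end{align}
```

Under these assumptions the same matrix U_l both embeds the previous label and scores the next one, so the image and the labels live in one space.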

LSTM

Recurrent Neural Network

Inference
Prediction path
Beam search
Find the top N labels at each time step as candidates
Find the top N prediction paths at each time step t+1

Beam Search
When a path reaches 'End', add it to the candidate path set
Termination condition: the probability of the current intermediate paths is smaller than that of all candidate paths
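
The beam search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: `next_probs` stands in for the model's per-step label distribution, `toy_next_probs` is a hypothetical hand-written distribution, a label is never predicted twice in one path, and the candidate set is trimmed to the beam size.

```python
import math

END = "<end>"  # hypothetical end-of-sequence token

def beam_search(next_probs, beam_size, max_steps=10):
    """Return finished label paths as (labels, log_prob), best first."""
    beams = [((), 0.0)]   # intermediate paths
    candidates = []       # paths that have reached END
    for _ in range(max_steps):
        expanded = []
        for path, logp in beams:
            probs = next_probs(path)
            # Top-N next labels for this path, skipping repeated labels.
            options = [(l, p) for l, p in probs.items()
                       if l not in path and p > 0]
            options.sort(key=lambda lp: lp[1], reverse=True)
            for label, p in options[:beam_size]:
                expanded.append((path + (label,), logp + math.log(p)))
        # Keep the top-N paths overall for this time step.
        expanded.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for path, logp in expanded[:beam_size]:
            if path[-1] == END:
                candidates.append((path[:-1], logp))
            else:
                beams.append((path, logp))
        candidates.sort(key=lambda x: x[1], reverse=True)
        candidates = candidates[:beam_size]  # assumption: bounded candidate set
        if not beams:
            break
        # Stop once every intermediate path is less probable than
        # every candidate path.
        if candidates and max(lp for _, lp in beams) < candidates[-1][1]:
            break
    return candidates

def toy_next_probs(path):
    """Hypothetical stand-in for the model's next-label distribution."""
    if not path:
        return {"sky": 0.6, "cloud": 0.3, END: 0.1}
    if path[-1] == "sky":
        return {"cloud": 0.7, END: 0.3}
    if path[-1] == "cloud":
        return {"sky": 0.2, END: 0.8}
    return {END: 1.0}

results = beam_search(toy_next_probs, beam_size=2)
best_labels, best_logp = results[0]
```

Scores are accumulated as log probabilities so that long paths do not underflow; since path probability can only shrink as a path grows, the termination test is safe.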

Experiments
CNN module: the 16-layer VGG network
Label embedding dimension: 64
LSTM RNN layer dimension: 512
Datasets: NUS-WIDE, MS COCO and PASCAL VOC 2007

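Evaluation Metric
Precision: correctly annotated labels / generated labels
Recall: correctly annotated labels / ground-truth labels
C-P, O-P; C-R, O-R: per-class and overall precision and recall
C-F1, O-F1: F1 scores combining precision and recall (given here as the geometrical average; F1 is conventionally defined as the harmonic mean)
mAP: mean average precision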
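
The precision and recall definitions above can be made concrete with a short sketch. It assumes per-class scores are averaged over classes while overall scores pool counts across all classes, and it computes F1 with the standard harmonic-mean formula 2PR/(P+R), which is an assumption here. The function name `multilabel_metrics` and the toy data are hypothetical.

```python
def multilabel_metrics(predicted, ground_truth, labels):
    """predicted, ground_truth: lists of label sets, one set per image."""
    correct = {l: 0 for l in labels}    # correctly annotated, per label
    generated = {l: 0 for l in labels}  # predicted, per label
    truth = {l: 0 for l in labels}      # ground-truth occurrences, per label
    for pred, gt in zip(predicted, ground_truth):
        for l in pred:
            generated[l] += 1
            if l in gt:
                correct[l] += 1
        for l in gt:
            truth[l] += 1

    def ratio(a, b):
        # Treat 0/0 as 0 so unused labels do not raise an error.
        return a / b if b else 0.0

    def f1(p, r):
        return ratio(2 * p * r, p + r)

    # Per-class: average the per-label ratios.
    c_p = sum(ratio(correct[l], generated[l]) for l in labels) / len(labels)
    c_r = sum(ratio(correct[l], truth[l]) for l in labels) / len(labels)
    # Overall: pool the counts over all labels first.
    o_p = ratio(sum(correct.values()), sum(generated.values()))
    o_r = ratio(sum(correct.values()), sum(truth.values()))
    return {"C-P": c_p, "C-R": c_r, "C-F1": f1(c_p, c_r),
            "O-P": o_p, "O-R": o_r, "O-F1": f1(o_p, o_r)}

# Hypothetical predictions and annotations for three images.
labels = ["cat", "dog", "car"]
predicted = [{"cat", "dog"}, {"car"}, {"cat"}]
ground_truth = [{"cat"}, {"car", "dog"}, {"cat", "dog"}]
metrics = multilabel_metrics(predicted, ground_truth, labels)
```

In this toy case 3 of 4 generated labels are correct and 3 of 5 ground-truth labels are recovered, so the overall precision and recall differ from the per-class averages, which weight every label equally.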

NUS-WIDE
A web image dataset containing 269,648 images and 5,018 tags
Tested with both the 1,000-tag and the 81-tag label sets

MS COCO
Contains 123 thousand images of 80 object types
Training data: 82,783 images; test data: 40,504 images
Most images contain multiple objects

PASCAL VOC 2007
Training data: 5,011 images; test data: 4,952 images
Evaluated with AP and mAP
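
As a rough illustration of AP and mAP, the sketch below computes a non-interpolated average precision per class and averages it over classes. This is an assumption made for brevity: the official PASCAL VOC 2007 protocol uses an 11-point interpolated AP, which this sketch does not reproduce. The function names and the toy scores are hypothetical.

```python
def average_precision(scores, is_positive):
    """Non-interpolated AP for one class.

    scores: confidence per image; is_positive: ground truth per image.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class):
    """per_class: dict mapping class name to (scores, is_positive)."""
    aps = [average_precision(s, y) for s, y in per_class.values()]
    return sum(aps) / len(aps)

# Hypothetical confidences and ground truth for two classes.
per_class = {
    "cat": ([0.9, 0.8, 0.3, 0.1], [True, False, True, False]),
    "car": ([0.2, 0.7, 0.6], [True, False, False]),
}
ap_cat = average_precision(*per_class["cat"])
ap_car = average_precision(*per_class["car"])
m_ap = mean_average_precision(per_class)
```

AP rewards ranking the positive images ahead of the negatives, so the "car" class, whose only positive is ranked last, scores much lower than "cat".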

Label embedding
The model effectively learns a joint label embedding

Attention Visualization

Conclusion and Future Work
Combines the advantages of joint image/label embedding and label co-occurrence models by employing a CNN and an RNN
Experimental results on several datasets show good performance
Predicting small objects is still a challenge

Reference: Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu. CNN-RNN: A Unified Framework for Multi-label Image Classification.
Questions?

Thank you all!