Visual Question Generation
Jhih-Ciang Wu
Institute of Information Science, Academia Sinica
jcwu@iis.sinica.edu.tw
May 8, 2018
Overview
Backgrounds: ILSVRC, VGG, RNN, LSTM
Baseline model: CNN+RNN
References
ILSVRC ImageNet Large Scale Visual Recognition Challenge. Winners of the classification task over the years (VGGNet placed second in 2014): AlexNet (2012), ZFNet (2013), VGGNet (2014), ResNet (2015), SENet (2017).
VGG VGG uses very small 3×3 filters in all convolutional layers.
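To illustrate why small filters are attractive, the sketch below compares a stack of 3×3 convolutions with a single larger filter. It assumes stride-1 convolutions, no bias terms, and a hypothetical width of 64 channels chosen only for illustration; none of these numbers come from the slides.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the field by (kernel - 1)
    return rf


def conv_weights(kernel, channels):
    """Weights in one kernel x kernel conv with `channels` inputs and outputs (no bias)."""
    return kernel * kernel * channels * channels


channels = 64  # assumed layer width, for illustration only

# Three stacked 3x3 layers see the same 7x7 window as one 7x7 layer.
three_stacked = 3 * conv_weights(3, channels)
one_large = conv_weights(7, channels)

print(stacked_receptive_field(3), three_stacked, one_large)
```

Under these assumptions the stacked design covers the same window with roughly half the weights, and it also applies a nonlinearity after each of the three layers.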
RNN Recurrent Neural Network (RNN): a network whose hidden state is fed back at every time step, which allows it to exhibit dynamic temporal behavior.
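A minimal sketch of this recurrence, assuming the common vanilla formulation h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The weights and inputs are toy values invented for illustration, and plain Python lists stand in for tensors.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        s += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(s))
    return h


def run_rnn(xs, h0, W_xh, W_hh, b):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states


# Toy parameters (hypothetical): 2 inputs, 2 hidden units.
W_xh = [[0.5, -0.2], [0.1, 0.3]]
W_hh = [[0.0, 0.4], [-0.3, 0.2]]
b = [0.0, 0.0]

# The same input is presented twice; the outputs differ because of the state.
xs = [[1.0, 0.0], [1.0, 0.0]]
states = run_rnn(xs, [0.0, 0.0], W_xh, W_hh, b)
print(states)
```

The two hidden states differ even though the inputs are identical, which is the temporal behavior the slide refers to.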
LSTM Long Short-Term Memory (LSTM): a special kind of RNN capable of learning long-term dependencies.
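The sketch below shows one LSTM step under the standard gate formulation (input, forget, and output gates plus a candidate value), which is an assumption since the slides do not state the equations. The parameters are hand-picked toy values: the forget gate is saturated near 1 and the input gate near 0, so the cell state is carried almost unchanged across many steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; `p` maps gate names to weights and biases."""
    z = h_prev + x  # concatenation [h_{t-1}, x_t]
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]    # input gate
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]    # forget gate
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]    # output gate
    g = [math.tanh(a) for a in affine(p["W_g"], z, p["b_g"])]  # candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Hypothetical parameters: 1 hidden unit, 1 input, all weights zero.
zero = [[0.0, 0.0]]
params = {
    "W_i": zero, "b_i": [-10.0],  # input gate almost closed
    "W_f": zero, "b_f": [10.0],   # forget gate almost fully open
    "W_o": zero, "b_o": [0.0],
    "W_g": zero, "b_g": [0.0],
}

h, c = [0.0], [1.0]  # start with something stored in the cell
for _ in range(50):
    h, c = lstm_step([0.5], h, c, params)
print(c)
```

After 50 steps the cell still holds nearly all of its initial value, which is the mechanism that lets an LSTM retain information over long spans.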
Baseline model
CNN+LSTM Sample questions∗:
what color is the surfboard?
is this a zebra?
what color are the flowers?
what is the green vegetable?
how many people are in the picture?
∗learning rate = 0.00001, batch = 64, epochs = 100.
Modified MLP We use the K-means method to partition the training data into K clusters.
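The clustering step can be sketched with Lloyd's algorithm, the usual K-means procedure. This is only an illustration: the 2-D points are made up, the first K points serve as initial centroids for simplicity, and the slides do not say which features or initialization were actually used.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Naive initialization (an assumption): take the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data (hypothetical): two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Each training example then carries a cluster label, which a downstream model could use to treat the K groups separately.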
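References
A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. CVPR 2015.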