Watch, Listen & Learn: Co-training on Captioned Images and Videos. Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney.
Their goal: recognize natural-scene categories from images with captions, and recognize human actions in sports videos accompanied by commentary.
How do they go about it? Their model learns to classify images and videos from labelled and unlabelled multi-modal examples, a semi-supervised approach. It uses the image or video content together with its textual annotation (captions or commentary) to learn scene and action categories.
Features: visual features (static image features and motion descriptors from videos) and textual features.
Histogram of Oriented Gradients (HOG): the image is divided into small connected regions called cells, and for each cell a histogram of gradient directions (edge orientations) of the pixels within the cell is compiled. The combination of these histograms forms the descriptor.
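As an illustration only (the paper does not specify an implementation), a minimal HOG sketch using scikit-image; the cell and block sizes below are assumptions:

```python
# A minimal HOG sketch (illustrative; cell/block sizes are assumptions).
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_descriptor(image_rgb):
    """Divide the image into cells and compile per-cell histograms of
    gradient orientations; the concatenated histograms are the descriptor."""
    gray = rgb2gray(image_rgb)
    return hog(gray,
               orientations=9,           # gradient-direction bins per cell
               pixels_per_cell=(8, 8),   # the small connected regions (cells)
               cells_per_block=(2, 2),   # cells grouped for normalization
               feature_vector=True)      # flatten into one descriptor vector
```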
Gabor filter: a sinusoid modulated by a Gaussian envelope; convolving the image with a bank of Gabor filters is used for texture detection.
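A minimal Gabor texture sketch using scikit-image; the filter-bank frequencies, orientations, and summary statistics are illustrative assumptions, not the paper's settings:

```python
# A minimal Gabor texture sketch (filter-bank parameters are assumptions).
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor

def gabor_texture_features(image_rgb,
                           frequencies=(0.1, 0.2, 0.3),
                           thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Convolve the image with a bank of Gabor filters (a sinusoid modulated
    by a Gaussian envelope) and summarize each response as texture statistics."""
    gray = rgb2gray(image_rgb)
    feats = []
    for f in frequencies:
        for t in thetas:
            real, _ = gabor(gray, frequency=f, theta=t)   # filter response
            feats.extend([real.mean(), real.std()])       # per-filter statistics
    return np.array(feats)
```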
LAB color space: L is lightness, and a and b are the color-opponent dimensions. It includes all perceivable colors (a gamut exceeding that of RGB and CMYK), is device independent, and is designed to approximate human vision.
Static image features: each image is divided into a 4x6 grid of regions. For each region, texture features are computed with Gabor filters, and the mean, standard deviation, and skewness of the per-channel RGB and Lab pixel values are recorded, giving a 30-dimensional feature vector for each of the 24 regions. The region vectors are clustered with k-means, and each region of each image is assigned to the closest cluster centroid.
(The pooled region vectors form an N x 30 matrix, where N is the total number of regions across the training images.)
Bag of visual words: the final bag-of-visual-words representation of each image is a vector of k values, where the i-th element is the number of regions in the image assigned to the i-th cluster. A minimal sketch of this pipeline follows.
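A minimal sketch of the region-feature and bag-of-visual-words pipeline described above; the 4x6 grid follows the slides, while the channel stack passed in and the vocabulary size k are assumptions, and the exact composition of the paper's 30-dimensional vector is not reproduced:

```python
# A minimal region-feature / bag-of-visual-words sketch (grid from the slides;
# channel stack and k are assumptions).
import numpy as np
from sklearn.cluster import KMeans

def region_features(channels):
    """`channels` is an (H, W, C) array whose C channels hold the per-pixel
    values to summarize (e.g. RGB, Lab, Gabor responses). Each of the 4x6
    regions is described by the per-channel mean, std deviation and skewness."""
    h, w, _ = channels.shape
    feats = []
    for i in range(4):
        for j in range(6):
            region = channels[i * h // 4:(i + 1) * h // 4,
                              j * w // 6:(j + 1) * w // 6]
            px = region.reshape(-1, region.shape[-1]).astype(float)
            mean, std = px.mean(axis=0), px.std(axis=0)
            skew = ((px - mean) ** 3).mean(axis=0) / (std ** 3 + 1e-8)
            feats.append(np.concatenate([mean, std, skew]))
    return np.array(feats)                      # 24 regions x D features

def bag_of_visual_words(train_region_feats, per_image_region_feats, k=25):
    """Cluster the pooled N x D region matrix, then represent each image as a
    k-dimensional histogram of its regions' nearest cluster centroids."""
    km = KMeans(n_clusters=k, n_init=10).fit(np.vstack(train_region_feats))
    bags = [np.bincount(km.predict(f), minlength=k) for f in per_image_region_feats]
    return np.array(bags), km
```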
Motion descriptors: Laptev spatio-temporal motion descriptors (http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf). At each detected feature point, the surrounding patch is divided into 3x3x2 spatio-temporal blocks, a 4-bin HOG descriptor is calculated for each block, and a 72-element feature vector is obtained. The motion descriptors from all videos in the training set are clustered to form a vocabulary, and each video clip is represented as a histogram over this vocabulary.
(The pooled motion descriptors form an N x 72 matrix, where N is the total number of detected feature points across the training videos.) A minimal sketch of the vocabulary step follows.
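A minimal sketch of turning precomputed 72-dimensional Laptev descriptors into per-clip histograms; the spatio-temporal detector itself is external and not reimplemented here, and the vocabulary size is an assumption:

```python
# A minimal motion-word sketch over precomputed 72-dimensional Laptev
# descriptors (the detector is external; n_words=200 is an assumption).
import numpy as np
from sklearn.cluster import KMeans

def motion_vocabulary(train_clip_descriptors, n_words=200):
    """Cluster all training descriptors (pooled into an N x 72 matrix)."""
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(train_clip_descriptors))

def clip_histogram(vocab, clip_descriptors):
    """Represent one video clip as a histogram over the motion vocabulary."""
    return np.bincount(vocab.predict(clip_descriptors), minlength=vocab.n_clusters)
```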
Textual features: image captions or transcribed video commentary. Preprocessing: remove stop words and stem the remaining words. The frequencies of the resulting word stems comprise the feature set.
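A minimal textual-feature sketch with NLTK; the stop-word list and Porter stemmer are common choices, not necessarily the paper's exact setup:

```python
# A minimal textual-feature sketch (stop-word list and Porter stemming are
# assumptions). Requires the NLTK 'stopwords' and 'punkt' data packages.
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def text_features(caption_or_commentary):
    """Remove stop words, stem the remaining words, and count stem frequencies."""
    tokens = [t.lower() for t in word_tokenize(caption_or_commentary) if t.isalpha()]
    stems = [STEMMER.stem(t) for t in tokens if t not in STOP]
    return Counter(stems)    # word stem -> frequency
```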
Co-training: what is it? A semi-supervised learning paradigm that exploits two independent views of the data. The independent views in this case are the text view and the visual view, each with its own classifier (a text classifier and a visual classifier).
Co-training setup: each initially labelled instance has two views, a text view and a visual view, handled by a text classifier and a visual classifier respectively.
Supervised learning: both classifiers are first trained on the initially labelled instances.
Co-train: the trained classifiers are then used to label the unlabelled instances.
Classify: each classifier labels the unlabelled instances and selects those it is most confident about, yielding partially labelled instances.
Label all views: the confident predictions are used to label all views of those instances, producing classifier-labelled instances.
Retrain classifiers: both classifiers are retrained on the labelled set enlarged with the new labels, and the process repeats.
Classify a new instance: at test time, both the text view and the visual view of the instance are classified and the predictions are combined.
System input: a set of labelled and unlabelled examples, each with two sets of features (one per view). System output: two classifiers whose predictions can be combined to classify new test instances. A minimal sketch of this co-training loop follows.
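A minimal co-training sketch following the steps above; the classifier choice, batch size, threshold, and the rule for labelling both views are illustrative simplifications (see the Methodology slides for the values used in the paper):

```python
# A minimal co-training sketch (classifier, batch size, and threshold are
# illustrative assumptions, not the paper's exact configuration).
import numpy as np
from sklearn.svm import SVC

def co_train(X_text_l, X_vis_l, y_l, X_text_u, X_vis_u,
             n_rounds=20, batch=5, threshold=0.65):
    X_text_l, X_vis_l, y_l = map(np.asarray, (X_text_l, X_vis_l, y_l))
    X_text_u, X_vis_u = np.asarray(X_text_u), np.asarray(X_vis_u)
    text_clf = SVC(kernel='rbf', probability=True)
    vis_clf = SVC(kernel='rbf', probability=True)
    for _ in range(n_rounds):
        # Supervised step: train each view's classifier on the labelled set.
        text_clf.fit(X_text_l, y_l)
        vis_clf.fit(X_vis_l, y_l)
        if len(X_text_u) == 0:
            break
        # Each classifier labels the unlabelled pool and keeps its most
        # confident predictions above the threshold.
        picked_idx, picked_y = [], []
        for clf, X_u in ((text_clf, X_text_u), (vis_clf, X_vis_u)):
            probs = clf.predict_proba(X_u)
            conf = probs.max(axis=1)
            for i in np.argsort(conf)[::-1][:batch]:
                if conf[i] >= threshold and i not in picked_idx:
                    picked_idx.append(i)
                    picked_y.append(clf.classes_[probs[i].argmax()])
        if not picked_idx:
            break
        idx = np.array(picked_idx)
        # The confident predictions label BOTH views of those instances,
        # which are moved into the labelled set before retraining.
        X_text_l = np.vstack([X_text_l, X_text_u[idx]])
        X_vis_l = np.vstack([X_vis_l, X_vis_u[idx]])
        y_l = np.concatenate([y_l, np.array(picked_y)])
        keep = np.setdiff1d(np.arange(len(X_text_u)), idx)
        X_text_u, X_vis_u = X_text_u[keep], X_vis_u[keep]
    # Final retraining on the enlarged labelled set.
    text_clf.fit(X_text_l, y_l)
    vis_clf.fit(X_vis_l, y_l)
    return text_clf, vis_clf
```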
Early and late fusion. Early fusion: the visual and textual features are concatenated into a single fused vector, one classifier is trained on it, and its result is used for labeling test instances.
Late fusion: the visual features are fed to a visual classifier and the textual features to a text classifier; the two classifiers' results are combined, and the combined result is used for labeling test instances. A minimal sketch of both schemes follows.
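A minimal sketch contrasting the two fusion schemes; the classifier and the probability-averaging rule for late fusion are assumptions:

```python
# A minimal early-vs-late fusion sketch (classifier and combination rule are
# assumptions, not the paper's exact setup).
import numpy as np
from sklearn.svm import SVC

def early_fusion_predict(X_vis_tr, X_txt_tr, y_tr, X_vis_te, X_txt_te):
    """Early fusion: concatenate visual and textual features, train one classifier."""
    clf = SVC(kernel='rbf').fit(np.hstack([X_vis_tr, X_txt_tr]), y_tr)
    return clf.predict(np.hstack([X_vis_te, X_txt_te]))

def late_fusion_predict(X_vis_tr, X_txt_tr, y_tr, X_vis_te, X_txt_te):
    """Late fusion: train one classifier per view, then combine their outputs."""
    vis = SVC(kernel='rbf', probability=True).fit(X_vis_tr, y_tr)
    txt = SVC(kernel='rbf', probability=True).fit(X_txt_tr, y_tr)
    avg = (vis.predict_proba(X_vis_te) + txt.predict_proba(X_txt_te)) / 2
    return vis.classes_[avg.argmax(axis=1)]
```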
Semi-supervised EM and transductive SVMs. Semi-supervised Expectation Maximization: learn a probabilistic classifier from the labeled training data, then perform EM iterations. E-step: use the currently trained classifier to probabilistically label the unlabeled training examples. M-step: retrain the classifier on the union of the labeled data and the probabilistically labeled unlabeled examples. A minimal sketch follows.
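A minimal sketch of semi-supervised EM with a Naive Bayes classifier; handling the soft labels through sample weights is one possible implementation, not necessarily the paper's:

```python
# A minimal semi-supervised EM sketch with Naive Bayes; features are assumed
# to be nonnegative counts (e.g. bags of words), and soft labels are handled
# via sample weights (one weighted copy of each unlabeled example per class).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def semi_supervised_em(X_lab, y_lab, X_unlab, n_iter=10):
    clf = MultinomialNB().fit(X_lab, y_lab)      # initial classifier from labeled data
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled examples.
        probs = clf.predict_proba(X_unlab)       # shape (n_unlab, n_classes)
        # M-step: retrain on labeled plus soft-labeled unlabeled data.
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_lab))] +
                               [probs[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```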
Transductive SVMs: a method of improving the generalization accuracy of SVMs by using unlabeled data. It finds the labeling of the test examples that results in the maximum-margin hyperplane separating the positive and negative examples of both the training and the test data.
Transductive SVMs (contd.): transductive SVMs are designed to improve performance on the test data by exploiting its availability during training. They can also be used directly in the semi-supervised setting, where unlabeled data drawn from the same distribution as the test data is available during training. The standard objective is reproduced below for reference.
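For reference (from the standard formulation of Joachims, 1999, not from the slides), the soft-margin transductive SVM jointly optimizes the hyperplane and the labels of the unlabeled/test examples:

```latex
% Soft-margin transductive SVM objective (Joachims, 1999): choose the
% hyperplane and the labels y_j^* of the unlabeled/test examples jointly.
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \boldsymbol{\xi}^{*},\, y_1^{*},\dots,y_k^{*}}
  \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
  + C\sum_{i=1}^{n}\xi_i + C^{*}\sum_{j=1}^{k}\xi_j^{*}
\quad\text{s.t.}\quad
  y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\;
  y_j^{*}(\mathbf{w}\cdot\mathbf{x}_j^{*} + b) \ge 1 - \xi_j^{*},\;\;
  \xi_i \ge 0,\; \xi_j^{*} \ge 0 .
```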
Methodology: for co-training, a Support Vector Machine is the base classifier for both the image and text views. The authors use the Weka implementation of sequential minimal optimization (SMO), an algorithm for efficiently solving the optimization problem that arises during SVM training.
Methodology (continued). Parameters used in SMO: RBF kernel (γ = 0.01); batch size: 5. Confidence thresholds: image/video view: 0.65 for static images, 0.6 for videos; text view: 0.98 for static images, 0.9 for videos. A sketch of this setup follows.
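A sketch of the base-classifier setup under these parameters, using scikit-learn's RBF-kernel SVC as a stand-in for Weka's SMO; the helper `confident_batch` is hypothetical, not part of either library:

```python
# A base-classifier sketch with scikit-learn's SVC standing in for Weka SMO;
# `confident_batch` is a hypothetical helper for the confidence-based selection.
import numpy as np
from sklearn.svm import SVC

# Confidence thresholds from the slide, keyed by (view, dataset).
THRESHOLDS = {('image', 'static'): 0.65, ('image', 'video'): 0.60,
              ('text', 'static'): 0.98, ('text', 'video'): 0.90}

def confident_batch(X_lab, y_lab, X_unlab, view='image', data='static', batch=5):
    """Train the view's SVM (RBF kernel, gamma=0.01) and return up to `batch`
    unlabelled indices whose top predicted-class probability clears the view's
    confidence threshold, together with their predicted labels."""
    clf = SVC(kernel='rbf', gamma=0.01, probability=True).fit(X_lab, y_lab)
    probs = clf.predict_proba(X_unlab)
    conf = probs.max(axis=1)
    order = np.argsort(conf)[::-1][:batch]
    keep = np.array([i for i in order if conf[i] >= THRESHOLDS[(view, data)]],
                    dtype=int)
    labels = clf.classes_[probs[keep].argmax(axis=1)]
    return clf, keep, labels
```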
Methodology (continued). Ten iterations of ten-fold cross-validation are performed to obtain smoother, more reliable results. The test set is disjoint from both the labeled and the unlabeled training data. Learning curves are used to evaluate accuracy.
Learning curves: a learning curve shows at a glance the initial difficulty of learning something and, to an extent, how much remains to be learned after initial familiarity. The curves are generated so that at each point some fraction of the training data is labeled and the remainder is used as unlabeled training data. A minimal sketch follows.
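A minimal learning-curve sketch that varies the labelled fraction and records test accuracy; it reuses the hypothetical co_train sketch from the co-training slide, and simplifies the ten-times ten-fold protocol to a single split:

```python
# A minimal learning-curve sketch: at each point a fraction of the training
# data is labelled and the remainder is used as unlabelled data. Simplifies
# the 10x10-fold protocol to one split and reuses the earlier co_train sketch.
import numpy as np
from sklearn.model_selection import train_test_split

def learning_curve(X_text, X_vis, y, fractions=(0.1, 0.2, 0.4, 0.6, 0.8),
                   test_size=0.1, seed=0):
    points = []
    # Hold out a test set disjoint from both labelled and unlabelled training data.
    Xt_tr, Xt_te, Xv_tr, Xv_te, y_tr, y_te = train_test_split(
        X_text, X_vis, y, test_size=test_size, random_state=seed)
    for frac in fractions:
        n_lab = max(1, int(frac * len(y_tr)))
        text_clf, vis_clf = co_train(Xt_tr[:n_lab], Xv_tr[:n_lab], y_tr[:n_lab],
                                     Xt_tr[n_lab:], Xv_tr[n_lab:])
        # Combine both views' predictions on the held-out test set.
        avg = (text_clf.predict_proba(Xt_te) + vis_clf.predict_proba(Xv_te)) / 2
        pred = text_clf.classes_[avg.argmax(axis=1)]
        points.append((frac, float((pred == np.asarray(y_te)).mean())))
    return points
```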
Results. Classifying captioned static images: the image dataset is the Israel dataset, in which images have short text captions. Two classes: 'Desert' and 'Trees'.
Examples of images. DESERT: "Ibex in Judean Desert"; "Bedouin Leads His Donkey That Carries Load Of Straw". TREES: "Ibex Eating In The Nature"; "Entrance To Mikveh Israel Agricultural School".
Co-training Vs Supervised Classifiers
Co-training Vs. Semi-Supervised EM Co-training Vs. Transductive SVM
Results (contd.). Recognizing actions from commented videos: video clips of soccer and ice skating, resized to 240x360 resolution and then divided manually into short clips. Clip length varies from 20 to 120 frames. Four categories: kicking, dribbling, spinning, dancing.
Examples of Videos
Examples of Videos(contd.)
Co-training Vs. supervised learning on commented video dataset
Co-training Vs. supervised learning when text commentary is not available
Limitations of the approach: the dataset used is small and involves only binary classification. Only images that have explicit captions are used.
Questions?
THANK YOU Joydeep Sinha Anuhya Koripella Akshitha Muthireddy