Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. Andreas Veit*, Balazs Kovacs*, Sean Bell, Julian McAuley, Kavita Bala, Serge Belongie.


Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. Andreas Veit*¹, Balazs Kovacs*¹, Sean Bell¹, Julian McAuley³, Kavita Bala¹, Serge Belongie¹,². ¹Department of Computer Science, Cornell University; ²Cornell Tech; ³Department of Computer Science and Engineering, UC San Diego. ICCV 2015

OUTLINE Introduction Dataset Learning the style space Generating outfits Visualizing the style space Evaluation

Introduction ‘What outfit goes well with this pair of shoes?’

Introduction A novel learning framework that learns a feature transformation from images of items into a latent space (the style space) that expresses compatibility, and that can retrieve bundles of compatible objects (a bundle is a set of items from different categories).

Introduction Goal: learn visual compatibility across clothing categories. Four key components:

Introduction Goal: learn visual compatibility across clothing categories. Inputs: item images, category labels, and links between items (co-occurrences).

Introduction Goal: learn visual compatibility across clothing categories. Strategically sample training examples (positive/negative pairs) as heterogeneous dyads.

Introduction Goal: learn visual compatibility across clothing categories. Use Siamese CNNs [5] to learn a feature transformation from the image space to the latent style space. [5] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, vol. 2, pp. 1735–1742. IEEE, 2006.
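The Siamese idea can be sketched in a few lines of numpy: one shared transformation f (here a toy random linear map with made-up dimensions, standing in for the CNN) is applied to both images of a pair, and compatibility is read off as distance in the resulting space:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy weight matrix standing in for the shared CNN; sizes are illustrative.
W = rng.normal(size=(8, 32))

def embed(x):
    """Shared transformation f: image features -> style space.
    Both branches of the Siamese pair use the same weights W."""
    v = W @ x
    return v / np.linalg.norm(v)  # unit-normalize for stable distances

# Apply the SAME transformation to both items of a pair.
x_a, x_b = rng.normal(size=32), rng.normal(size=32)
d = float(np.linalg.norm(embed(x_a) - embed(x_b)))  # style-space distance
```

Because the two branches share weights, the learned distance is symmetric in the pair; only the loss (next slides) distinguishes compatible from incompatible pairs.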

Introduction Goal: learn visual compatibility across clothing categories. Use robust nearest-neighbor retrieval to generate structured bundles (outfits) of compatible items.

Dataset Positive/negative training examples of clothing pairs. Items of positive training examples are required to belong to different categories.

Dataset [14] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference, 2015.

Dataset Compatibility is derived from co-purchase data from Amazon (Amazon's recommendations [13]). Challenge: user behavior data is very sparse and often noisy (in the Amazon dataset, two items that are not labeled as compatible are not necessarily incompatible). [13] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.

Learning the style space

A novel sampling strategy to generate training sets that represent notions of style compatibility across categories, and how to train a Siamese CNN to learn a feature transformation from the image space into the latent style space.

(1) Heterogeneous dyadic co-occurrences. Two key concepts of the proposed sampling approach: heterogeneous dyads (pairs of items from two different categories) and co-occurrences. Co-occurrence between items is defined as co-purchase.

(2) Generating the training set. ~1.1 million clothing products with product images and class labels. First split the images into training, validation and test sets (80 : 1 : 19); for each of the three sets, generate positive and negative examples. Negative pairs are sampled randomly among pairs not labeled compatible (for each positive example, 16 negative examples are sampled). The training set is balanced across categories [2] (improves mean class accuracy), with a chosen training set size of 2 million pairs. [2] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the Materials in Context Database. In CVPR, 2015, pp. 3479–3487.
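The negative-sampling rule on this slide (16 random negatives per positive, drawn among pairs not labeled compatible) can be sketched as follows; item ids and catalog size are illustrative:

```python
import random

def make_pairs(positives, items, neg_per_pos=16, seed=0):
    """For every positive (co-purchase) pair, draw `neg_per_pos` random
    pairs that are not labeled compatible, as described in the paper."""
    rng = random.Random(seed)
    pos_set = {frozenset(p) for p in positives}
    pairs = [(a, b, 1) for a, b in positives]      # label 1 = compatible
    for _ in positives:
        drawn = 0
        while drawn < neg_per_pos:
            c, d = rng.sample(items, 2)
            if frozenset((c, d)) not in pos_set:   # not labeled compatible
                pairs.append((c, d, 0))            # label 0 = negative
                drawn += 1
    return pairs

items = list(range(12))                 # hypothetical small catalog
pairs = make_pairs([(0, 1), (2, 3)], items)
# 2 positives + 2 * 16 negatives = 34 labeled pairs
```

Note that, as the slide on the Amazon data cautions, these negatives are only "not labeled compatible", not known-incompatible.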

(2) Generating the training set. We use three different sampling strategies. 1. Naïve: all positive and negative training examples are sampled randomly. Positive as well as negative pairs can contain two items from within the same category or from two different categories.

(2) Generating the training set. 2. Strategic. Motivation: items from the same category are generally visually similar to each other, items from different categories dissimilar, and CNNs tend to map visually similar items close in the output feature space. To learn a notion of style across categories, all positive (close) training pairs are enforced to be heterogeneous dyads.
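A minimal illustration of the strategic constraint, with hypothetical item ids and categories: only cross-category (heterogeneous) pairs are kept as positives:

```python
# Hypothetical item ids and category labels, for illustration only.
category = {"shirt_1": "shirts", "shirt_2": "shirts", "shoe_1": "shoes"}

def is_heterogeneous(a, b):
    """Strategic sampling keeps only positive pairs whose items
    come from two different categories (heterogeneous dyads)."""
    return category[a] != category[b]

positives = [("shirt_1", "shoe_1"), ("shirt_1", "shirt_2")]
strategic = [p for p in positives if is_heterogeneous(*p)]
# Only the cross-category pair ("shirt_1", "shoe_1") survives.
```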

(2) Generating the training set. 3. Holdout categories: training and test sets built to evaluate the transferability of the learned notion of style to unseen categories. Training examples are sampled under the same rules as 'strategic', but the training set does not contain any objects from the holdout category; the test and validation sets contain only pairs with at least one item from the holdout category.

(3) Training the Siamese network. Following Bell and Bala [1]: take AlexNet and GoogLeNet (pretrained on ILSVRC2012 [17]), augment the networks with a 256-dimensional fully connected layer, and fine-tune them on about 2 million pairs (~24 hours on an Amazon EC2 g2.2xlarge instance using the Caffe library [7]). [1] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. ACM Trans. on Graphics (SIGGRAPH 2015), 34(4), 2015. [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. [7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint, 2014.

(3) Training the Siamese network. The network's objective: project positive pairs close together and negative pairs far apart. [Figure: embeddings from vanilla GoogLeNet trained on ImageNet vs. GoogLeNet trained with strategic sampling]
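This objective can be written as a contrastive loss in the style of Hadsell et al. [5]: positive pairs are pulled together, negative pairs are pushed beyond a margin. A numpy sketch (margin value illustrative):

```python
import numpy as np

def contrastive_loss(fa, fb, y, margin=1.0):
    """Contrastive loss over a batch of embedded pairs.
    y = 1 for positive (compatible) pairs, 0 for negatives."""
    d = np.linalg.norm(fa - fb, axis=1)                 # style-space distances
    pos = y * d ** 2                                    # pull positives together
    neg = (1 - y) * np.maximum(margin - d, 0.0) ** 2    # push negatives past margin
    return 0.5 * float(np.mean(pos + neg))
```

Negatives already farther apart than the margin contribute zero loss, so the network spends capacity only on pairs that violate the desired layout.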

Generating outfits Handpick sets of categories so that outfits are meaningful. Challenge: label noise, e.g. shirts falsely labeled as shoes. Since Siamese CNNs put similar-looking objects close, there is a high probability that a shirt labeled as a shoe will be closer to the queried shirt than real shoes! [Figure: outfit composition, with items projected from categories such as "dress" and "shirt"]

Generating outfits [Figure: style-space clusters, 20 centroids]
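One way to make retrieval robust to the label noise described above is to go through cluster centroids rather than raw nearest neighbors, since mislabeled outliers rarely form clusters. A sketch using a tiny Lloyd's k-means with deterministic first-k initialization (the slide shows 20 centroids; k=2 here only keeps the example small):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Tiny Lloyd's k-means; first-k initialization keeps this sketch
    deterministic (a real implementation would use random/k-means++ init)."""
    C = X[:k].astype(float).copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C

def robust_retrieve(query, X, k=2):
    """Instead of the raw nearest neighbor (which a mislabeled outlier
    could win), find the nearest cluster centroid to the query and
    return the item closest to that centroid."""
    C = kmeans(X, k)
    c = C[np.argmin(((C - query) ** 2).sum(-1))]
    return int(np.argmin(((X - c) ** 2).sum(-1)))
```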

Generating outfits

Visualizing the style space The t-SNE algorithm [19] projects the 256-dimensional embedding down to a 2-D embedding. The style space is then discretized into a grid, and one image is picked from each grid cell at random. [19] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2008.
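The grid-discretization step can be sketched as follows; the 2-D coordinates would come from t-SNE, but any 2-D embedding works for the sampling itself (cell count and ids are illustrative):

```python
import numpy as np

def grid_sample(xy, ids, cells=4, seed=0):
    """Discretize 2-D coordinates (e.g. from t-SNE) into a cells x cells
    grid and pick one item id per occupied cell at random."""
    rng = np.random.default_rng(seed)
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    # Map each point to an integer grid cell; epsilon keeps the max in range.
    bins = np.floor((xy - lo) / (hi - lo + 1e-9) * cells).astype(int)
    by_cell = {}
    for cell, item in zip(map(tuple, bins), ids):
        by_cell.setdefault(cell, []).append(item)
    return {cell: members[rng.integers(len(members))]
            for cell, members in by_cell.items()}
```

Empty cells are simply absent from the result, matching the "pick one image from each grid cell" description.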

Visualizing the style space

Visualizing stylistic insights the network learned: 1. Cluster the space for each category. 2. For each pair of categories, retrieve the closest / most distant clusters in the style space: "clothing that goes well together" vs. "clothing that doesn't go well together".

Evaluation 1. Test set prediction accuracy: measures the link-prediction performance of our algorithm on the (strategic) test set, with close and distant links in a 50 : 50 ratio. Four approaches compared: GoogLeNet trained with strategic sampling; AlexNet with strategic sampling; GoogLeNet with naïve sampling; vanilla ImageNet-trained GoogLeNet.

Evaluation ROC curve computed by sweeping a threshold value to predict if a link is close or distant.
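The AUC of this threshold-sweep classifier equals the probability that a random close (compatible) pair has a smaller style-space distance than a random distant pair, which gives a compact way to compute it; a numpy sketch:

```python
import numpy as np

def roc_auc(distances, labels):
    """AUC of the 'predict close if distance < threshold' classifier,
    computed via the rank statistic: P(random positive pair is nearer
    than random negative pair), with ties counted as half."""
    d = np.asarray(distances, dtype=float)
    y = np.asarray(labels)
    pos, neg = d[y == 1], d[y == 0]          # y=1: close link, y=0: distant
    wins = (pos[:, None] < neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

Sweeping the threshold explicitly and integrating the ROC curve would give the same number; the rank form avoids choosing a threshold grid.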

Evaluation AUC scores

Evaluation 2. Feature transferability: evaluate the transferability of the learned features to new, unseen categories (holdout categories). Three different holdout categories: shoes, jeans and shirts.

Evaluation ROC curves [panels: Jeans, Shirts, Shoes]

Evaluation AUC scores: 67.0%, 48.6%, 47.5%

Evaluation 3. Comparison to related work [14]: the learning task and training/test sets differ between their work and ours. They learn and separately optimize two models (predict if items are bought together; predict if they are also bought), and their test sets contain mostly links within the same category. Accuracy: 85% / 74%. On "bought together": ours 87.4% (compared to 92.5%); on "also bought": ours 83.1% (compared to 88.7%).

Evaluation 4. User study (online): how users think about style and compatibility; compares our learning framework against baselines. [Figure: a query pair of shoes with two candidate shirts] "Given this pair of shoes, which of the two presented shirts fits better?" Options generated by different networks plus the nearest-neighbor retrieval method.

Evaluation [Bar chart: random choice, GoogLeNet naïve, AlexNet strategic, vanilla GoogLeNet, GoogLeNet strategic] Dashed line: if both bars are below this line, the difference is not statistically significant.

Evaluation Survey (participating users): asked how they decide which option to pick. 1. Users tend to choose the option that fits in functionality. 2. Users sometimes choose the item that is stylistically similar, but not stylistically compatible. 3. Users sometimes pick the item they like more, not the item that better matches according to style. Choices are not based only on stylistic compatibility!