
Human Parsing with Contextualized Convolutional Neural Network
Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, Shuicheng Yan
Hello everyone, I am Xiaodan Liang from Sun Yat-sen University. It is my great pleasure to talk about my research on human parsing with a contextualized convolutional neural network.

Task: Human Parsing
- Decompose a human photo into semantic fashion/body items
- Pixel-level semantic labeling
Labels: upper-clothes, sunglasses, skirt, scarf, right-shoe, right-leg, right-arm, pants, left-shoe, left-leg, left-arm, hat, face, dress, belt, bag, hair, null
Human parsing refers to decomposing a human image into semantic clothes/body regions. It can be treated as a specific kind of semantic segmentation in which very fine-grained small parts, such as sunglasses, belts, and bags, must be predicted.

Human Parsing = Engine for Applications
Human parsing can enable many higher-level applications, such as virtual clothing fitting, targeted advertising based on clothing items, media management, and person re-identification in surveillance, where the face is often not clearly visible while the body is.

Why Feedback/Contexts Are Important: Multi-level Context
Background? Or a bag?
Feedback and contextual information are critical for precise pixel-wise labeling, especially multi-level context fusion. For example, one may easily take this black region for background from local cues alone; but if we also observe the arm and upper-clothes, the small bag strap reveals that the black region is in fact a bag. This illustrates the necessity of combining low-level and high-level context information.

Why Feedback/Contexts Are Important: Global Label Context
Dress or skirt?
Here is another common example in clothing item prediction. Judging from the lower body alone, we may think the green region is a skirt, because it differs from the neighboring upper-clothes.

Why Feedback/Contexts Are Important: Global Label Context
But after looking at the whole body, the collar with the same texture tells us it is a dress partly covered by upper-clothes. Global context can help rectify the local prediction.

Why Feedback/Contexts Are Important: Local Super-pixel Context
Very critical for segmenting semantic labels with small regions. Where are the left and right shoes?
In the fine-grained human parsing task, another important cue is detailed local information. For example, it is very difficult to distinguish the left and right shoes from the pants because of their very similar color and texture. But once we examine the detailed appearance at a higher resolution, the left and right shoes can be recognized from their fine boundaries. We therefore need informative feedback and multi-source contexts to make the right decision.

Contextualized Network
Contexts + fully convolutional neural network:
- Cross-layer context: multi-level feature fusion
- Global top-down context: coherence between pixel-wise labeling and image-level label prediction
- Local bottom-up context: within-super-pixel consistency and cross-super-pixel appearance consistency
This motivates our contextualized network, which integrates multi-source context into a unified network. Three kinds of contexts are considered. Cross-layer context performs multi-level feature fusion, extending the skip strategy of the FCN network. Global image-level context aims at coherence between pixel-wise labeling and image-level label prediction. Local super-pixel context preserves local boundaries and appearance consistency within super-pixels.

Human Parsing with Contextualized Convolutional Neural Network
Cross-layer context: four feature map fusions with 5x5 convolutions
Feature maps from deep layers often capture the global structure but are insensitive to local boundaries and spatial displacements. We therefore up-sample the feature maps from deep layers and combine them with feature maps from earlier layers at the same resolution. We perform four such feature map fusions, at four different spatial resolutions, to capture different levels of semantic information. Our basic structure thus hierarchically combines the low-level local details from the early layers with the high-level semantic information from the deep layers.
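A minimal PyTorch-style sketch of one such fusion step; the module name, channel sizes, and the concatenate-then-convolve wiring are our illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """Fuse an up-sampled deep feature map with an earlier, higher-resolution one."""
    def __init__(self, deep_channels, early_channels, out_channels):
        super().__init__()
        # 5x5 convolution applied to the fused maps, as on the slide
        self.conv = nn.Conv2d(deep_channels + early_channels, out_channels,
                              kernel_size=5, padding=2)

    def forward(self, deep_feat, early_feat):
        # Up-sample the coarse deep map to the early layer's resolution
        deep_up = F.interpolate(deep_feat, size=early_feat.shape[2:],
                                mode='bilinear', align_corners=False)
        # Concatenate and mix with the 5x5 convolution
        fused = torch.cat([deep_up, early_feat], dim=1)
        return F.relu(self.conv(fused))

# One of the four fusions, with hypothetical channel counts
fusion = CrossLayerFusion(deep_channels=512, early_channels=256, out_channels=256)
out = fusion(torch.randn(1, 512, 19, 19), torch.randn(1, 256, 38, 38))
```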

Human Parsing with Contextualized Convolutional Neural Network
Global image-level context: incorporate global image label prediction via image label concatenation and element-wise summation
To guarantee label coherence, we incorporate image-level label prediction into the pixel-wise categorization; a squared loss is used to predict the global labels. The predicted image-level label probabilities then guide feature learning in two ways. First, the probabilities are concatenated onto the feature maps of each intermediate layer to generate semantic-aware feature responses. Second, they are used to re-weight the pixel-wise label confidences in the prediction layer via element-wise summation.
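A minimal sketch of these two guidance mechanisms, assuming 18 semantic labels; the pooling head and the exact broadcasting are our assumptions, not the paper's exact wiring:

```python
import torch
import torch.nn as nn

class GlobalLabelContext(nn.Module):
    """Predict image-level label probabilities and use them to guide pixel-wise maps."""
    def __init__(self, in_channels, num_labels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_labels)

    def forward(self, feat, pixel_confidences):
        # Image-level label probabilities; per the slide, these are trained
        # with a squared (MSE) loss against the ground-truth label vector.
        probs = torch.sigmoid(self.fc(self.pool(feat).flatten(1)))      # (B, C)
        probs_map = probs[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        # 1) Image label concatenation: attach the probabilities to an
        #    intermediate feature map to get semantic-aware responses.
        guided_feat = torch.cat([feat, probs_map], dim=1)
        # 2) Element-wise summation: re-weight the pixel-wise label
        #    confidences in the prediction layer.
        reweighted = pixel_confidences + probs[:, :, None, None]
        return guided_feat, reweighted, probs

ctx = GlobalLabelContext(in_channels=256, num_labels=18)
guided, reweighted, probs = ctx(torch.randn(2, 256, 38, 38),
                                torch.randn(2, 18, 38, 38))
```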

Human Parsing with Contextualized Convolutional Neural Network
Global top-down context helps distinguish confusing labels.
[Figure: confidence maps for skirt, dress, and upper-clothes, comparing Co-CNN without the global label against the full Co-CNN]
Global image-level context can successfully distinguish confusing labels. For example, by using the image label probabilities to guide feature learning, the confidence maps for skirt and dress can be corrected.

Human Parsing with Contextualized Convolutional Neural Network
Local super-pixel context: integrate within-super-pixel smoothing and cross-super-pixel neighborhood voting into the training and testing process
Finally, within-super-pixel smoothing and cross-super-pixel neighborhood voting are leveraged to retain local boundaries and label consistency. They are formulated as natural sub-components of the network in both the training and the testing process.

Human Parsing with Contextualized Convolutional Neural Network
Local super-pixel context (continued)
Based on an over-segmentation of the image, within-super-pixel smoothing averages the confidences within each super-pixel. Cross-super-pixel neighborhood voting then takes larger neighboring regions into account for better inference, exploiting more structure and correlation between different super-pixels. The smoothing and voting can be seen as two types of pooling performed on the local responses within the irregular regions delineated by super-pixels.
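A minimal sketch of the two operations on a single image, assuming contiguous integer super-pixel ids and a precomputed neighbor map; the paper weights neighboring super-pixels by appearance similarity, whereas the uniform mean below is an illustrative simplification:

```python
import torch

def within_superpixel_smoothing(conf, sp):
    """Average pixel-wise confidences within each super-pixel.
    conf: (C, H, W) label confidences; sp: (H, W) long tensor of super-pixel ids."""
    C = conf.shape[0]
    flat = conf.reshape(C, -1)                    # (C, H*W)
    ids = sp.reshape(-1)                          # (H*W,)
    n_sp = int(ids.max()) + 1
    sums = torch.zeros(C, n_sp).index_add_(1, ids, flat)
    counts = torch.zeros(n_sp).index_add_(0, ids, torch.ones(ids.numel()))
    sp_conf = sums / counts                       # mean confidence per super-pixel
    smoothed = sp_conf[:, ids].reshape_as(conf)   # broadcast means back to pixels
    return smoothed, sp_conf

def cross_superpixel_voting(sp_conf, neighbors, alpha=0.5):
    """Blend each super-pixel's confidence with the mean of its neighbors'."""
    voted = sp_conf.clone()
    for s, nbrs in neighbors.items():             # neighbors: {id: [adjacent ids]}
        if nbrs:
            voted[:, s] = (alpha * sp_conf[:, s]
                           + (1 - alpha) * sp_conf[:, list(nbrs)].mean(dim=1))
    return voted
```

Because both operations are averages over fixed index sets, they are differentiable, which is what allows them to sit inside the network as sub-components during both training and testing.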

Results
Comparison of parsing performance with four state-of-the-art methods on the ATR dataset (7,700 images): a 12.57% increase in average F-1 score.

Method               | Accuracy | Foreground accuracy | Average precision | Average recall | F-1 score
Yamaguchi et al. [1] | 84.38    | 55.59               | 37.54             | 51.05          | 41.80
Paperdoll [2]        | 88.96    | 62.18               | 52.75             | 49.43          | 44.76
M-CNN [3]            | 89.57    | 73.98               | 64.56             | 65.17          | 62.81
ATR [4]              | 91.11    | 71.04               | 71.69             | 60.25          | 64.38
Co-CNN               | 95.23    | 80.90               | 81.55             | 74.42          | 76.95

Let us turn to the experiments. We compare our performance with four state-of-the-art methods on five metrics; our method outperforms all four baselines by over 12.57% in average F-1 score.

[1] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
[2] K. Yamaguchi, M. H. Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
[3] S. Liu, et al. Matching-CNN meets KNN: Quasi-parametric human parsing. In CVPR, 2015.
[4] X. Liang, et al. Deep human parsing with active template regression. In TPAMI, 2015.
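For reference, a minimal sketch of how the per-class precision, recall, and F-1 behind these tables might be computed from pixel-wise label maps; this is our illustration of the standard metrics, not the authors' evaluation code:

```python
import torch

def per_class_f1(pred, gt, num_classes):
    """Per-class precision, recall, and F-1 over pixel-wise label maps."""
    scores = []
    for c in range(num_classes):
        tp = ((pred == c) & (gt == c)).sum().item()   # true positives
        fp = ((pred == c) & (gt != c)).sum().item()   # false positives
        fn = ((pred != c) & (gt == c)).sum().item()   # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append((prec, rec, f1))
    return scores
```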

Dataset
- Existing ATR dataset [4]: 7,700 images, mostly near frontal views
- Our new dataset: 10,000 human pictures from "Chictopia.com", with arbitrary poses, views, and clothing styles
To promote future research on human parsing, we annotated 10,000 real-world human pictures to construct the largest dataset for this task. Compared to the existing ATR dataset of near frontal-view images, the new dataset mainly contains images in the wild, with more challenging poses, occlusions, and clothing styles.

Dataset
Pixel-wise annotations with 18 semantic labels.
Link: https://github.com/lemondan/HumanParsing-Dataset/
Eighteen clothing and body semantic labels are annotated for each pixel, and the human images, with their diverse poses and background clutter, are carefully annotated. Everyone can find the dataset at this link and use it to develop new methods for this task.

Results
We collected 10,000 additional human pictures from "chictopia.com":

Method                 | Accuracy | Foreground accuracy | Average precision | Average recall | F-1 score
ATR                    | 91.11    | 71.04               | 71.69             | 60.25          | 64.38
Co-CNN                 | 95.23    | 80.90               | 81.55             | 74.42          | 76.95
Co-CNN (+Chictopia10k) | 96.02    | 83.57               | 84.95             | 77.66          | 80.14

After training with the new dataset, our method further improves the average F-1 score by 3.19%.

Results: Analysis of Architectural Variants
Cross-layer context: 7.47% increase in average F-1 score

Method               | Accuracy | Foreground accuracy | Average precision | Average recall | F-1 score
Co-CNN w/o fusion    | 92.57    | 70.76               | 67.17             | 64.34          | 65.25
Co-CNN (cross-layer) | 94.41    | 78.54               | 76.62             | 71.24          | 72.72

[Figure: qualitative comparison of Co-CNN w/o cross-layer context vs. full Co-CNN]
We further evaluate the effectiveness of our three components. Using the cross-layer context offers a 7.47% increase in average F-1 score, since combining cross-layer information enables the network to capture multi-level context, making precise local predictions while respecting global semantic information. For example, the skirt and upper-clothes can be well recognized by considering this cross-layer context.

Results: Analysis of Architectural Variants
Global image label context: 2.55% increase in average F-1 score

Method                 | Accuracy | Foreground accuracy | Average precision | Average recall | F-1 score
Baseline (cross-layer) | 94.41    | 78.54               | 76.62             | 71.24          | 72.72
Co-CNN (global label)  | 94.87    | 79.86               | 78.00             | 73.94          | 75.27

[Figure: qualitative comparison of Co-CNN w/o global label vs. full Co-CNN]
By incorporating the global image label context, performance increases by 2.55%. Label exclusiveness and co-occurrence are well captured during dense pixel-wise prediction, for example producing coherent predictions for the two legs.

Results: Analysis of Architectural Variants
Local super-pixel context: 1.68% increase in average F-1 score

Method                          | Accuracy | Foreground accuracy | Average precision | Average recall | F-1 score
Baseline (cross-layer + global) | 94.87    | 79.86               | 78.00             | 73.94          | 75.27
Co-CNN (local super-pixel)      | 95.23    | 80.90               | 81.55             | 74.42          | 76.95

[Figure: qualitative comparison of Co-CNN w/o super-pixel context vs. full Co-CNN]
Our full network yields a 1.68% increase over the version without local super-pixel smoothing and voting, demonstrating that local super-pixel context helps preserve local boundaries and produce more precise classification.

Parsing Results
[Figure: test images parsed by Paperdoll, ATR, and Co-CNN]
We then show some qualitative comparisons of parsing results. Our Co-CNN outputs more meaningful and precise predictions despite large appearance and position variations. It successfully predicts the labels of small regions (e.g., hat, scarf, sunglasses, belt) and of confusing clothing items.

Online Human Parsing Engine (<0.15 s per image; 20 fps with a simplified model)
We have released an online human parsing demo that processes an image in about 0.15 seconds. Even when only part of the image is provided, the demo can still generate satisfactory results. Everyone is welcome to have a try.

Conclusion
- We propose a novel contextualized network for human parsing that integrates multiple sources of context.
- Our Co-CNN produces correspondingly-sized pixel-wise predictions in a fully end-to-end way.
- A new large dataset, "Chictopia10k", has been built and released: https://github.com/lemondan/HumanParsing-Dataset/

Questions? Thank you for your attention. I am happy to take any questions.