
1 Human Parsing with Contextualized Convolutional Neural Network
Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, Shuicheng Yan
Hello everyone, I am Liang Xiaodan from Sun Yat-sen University. It is my great pleasure to talk about my research on human parsing with a contextualized convolutional neural network.

2 Task: Human Parsing
Decompose a human photo into semantic fashion/body items via pixel-level semantic labeling.
Human parsing refers to decomposing a human image into semantic clothes/body regions. It can be treated as a specific semantic segmentation task in which very fine-grained small parts, such as sunglasses, belts and bags, must be predicted.
Labels: upper-clothes, sun-glasses, skirt, scarf, right-shoe, right-leg, right-arm, pants, left-shoe, left-leg, left-arm, hat, face, dress, belt, bag, hair, null.

3 Human Parsing = Engine for Applications
It can stimulate many higher-level applications, such as virtual clothing fitting, targeted advertising based on clothing items, media management, and person re-identification in surveillance, where the face is often not clearly visible while the body is.

4 Why Feedback/Contexts Important: multi-level context
Background? Bag?
Feedback and contextual information are critical for precise pixel-wise labeling, especially multi-level context fusion. For example, one may easily take this black region for background using only local region cues, but if we observe the arm and upper-clothes, the black region is in fact a bag, as indicated by the small bag strap. This illustrates the necessity of combining low-level and high-level context information.

5 Why Feedback/Contexts Important: global label context
Dress? Or skirt?
We show another common example from predicting clothing items. From the lower body alone, we may think the green region is a skirt, because it differs from the neighboring upper-clothes.

6 Why Feedback/Contexts Important: global label context
Skirt
But after looking at the whole body, the collar with the same texture tells us it is in fact a dress covered by upper-clothes. The global context can help rectify the local prediction.

7 Why Feedback/Contexts Important: local super-pixel context
Very critical for segmenting semantic labels with small regions!
More details: where are the left and right shoes?
In the fine-grained human parsing task, another important cue for prediction is very detailed information. For example, it is difficult to distinguish the left and right shoes from the pants because of their very similar color and texture. But once we look into the detailed appearance at a larger resolution of the image, the left and right shoes can be recognized from their fine boundaries. We therefore need informative feedback and multi-source contexts to make the right decision.

8 Contextualized Network
Contexts + fully convolutional neural network:
Cross-layer context: multi-level feature fusion
Global top-down context: coherence between pixel-wise labeling and image label prediction
Local bottom-up context: within-superpixel consistency and cross-superpixel appearance consistency
This motivates our contextualized network. Our work aims to integrate multi-source context into a unified network. Three kinds of contexts are considered. Cross-layer context performs multi-level feature fusion, an extension of the skip strategy in the FCN network. Global image-level context aims to achieve coherence between pixel-wise labeling and image-level label prediction. Local super-pixel context retains the local boundaries and appearance consistencies within superpixels.

9 Human Parsing with Contextualized Convolutional Neural Network
Cross-layer context: four feature-map fusions with 5x5 convolutions.
The feature maps from deep layers often focus on the global structure and are insensitive to local boundaries and spatial displacements. We up-sample the feature maps from the deep layers and then combine them with the feature maps from earlier layers at the same resolution. We perform four feature-map fusions to exploit cross-layer context, with four different spatial resolutions capturing different levels of semantic information. Our basic structure thus hierarchically encodes the local details from the early layers and the global semantic information from the deep layers.
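To make the fusion concrete, here is a minimal PyTorch sketch of one such fusion step. The class name, channel arguments, and the choice of channel concatenation (rather than element-wise summation) before the 5x5 convolution are our assumptions for illustration, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """One fusion step: upsample a deep, coarse feature map to the
    resolution of an earlier, finer map, combine the two, and refine
    with a 5x5 convolution (hypothetical module, for illustration)."""

    def __init__(self, deep_ch, early_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(deep_ch + early_ch, out_ch,
                              kernel_size=5, padding=2)

    def forward(self, deep_feat, early_feat):
        # Bring the deep map (global semantics) up to the early map's
        # spatial resolution (local detail).
        deep_up = F.interpolate(deep_feat, size=early_feat.shape[2:],
                                mode='bilinear', align_corners=False)
        # Combine and refine; the network stacks four such fusions
        # at four different resolutions.
        x = torch.cat([deep_up, early_feat], dim=1)
        return F.relu(self.fuse(x))
```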

10 Human Parsing with Contextualized Convolutional Neural Network
Global image-level context: incorporate global image label prediction by image-label concatenation and element-wise summation.
To guarantee label coherence, we incorporate image label prediction into the pixel-wise categorization. A squared loss is used to predict the global labels. We then use the predicted image-level label probabilities to guide feature learning in two ways. First, the probabilities are concatenated with the feature maps of intermediate layers so that they generate semantic-aware feature responses. Second, these probabilities are used to re-weight the pixel-wise label confidences in the prediction layer.
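A minimal PyTorch sketch of how such a global-label branch could be wired, following the transcript's description (concatenation into intermediate features, element-wise summation onto the pixel-wise confidences). All module and variable names here are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLabelContext(nn.Module):
    """Predict image-level label probabilities, then feed them back:
    (1) tiled and concatenated with an intermediate feature map, and
    (2) summed element-wise onto the pixel-wise confidence maps."""

    def __init__(self, feat_ch, num_labels):
        super().__init__()
        self.label_head = nn.Linear(feat_ch, num_labels)

    def forward(self, feat, pixel_conf):
        # feat: (N, C, H, W) features; pixel_conf: (N, L, H, W) confidences,
        # sharing the same spatial size in this simplified sketch.
        pooled = F.adaptive_avg_pool2d(feat, 1).flatten(1)    # (N, C)
        label_prob = torch.sigmoid(self.label_head(pooled))   # (N, L)
        # (1) Tile the probabilities over space and add them as channels,
        # yielding semantic-aware feature responses.
        tiled = label_prob[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        feat_aug = torch.cat([feat, tiled], dim=1)
        # (2) Re-weight pixel-wise confidences by element-wise summation.
        pixel_conf = pixel_conf + tiled
        return feat_aug, pixel_conf, label_prob

# The branch itself can be trained with a squared loss against the
# ground-truth image-level label vector, as the transcript describes:
#   loss = F.mse_loss(label_prob, gt_label_vector)
```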

11 Human Parsing with Contextualized Convolutional Neural Network
Global top-down context helps distinguish confusing labels.
(Figure: confidence maps for skirt, dress and upper-clothes, with and without the global image label.)
The global image-level context can successfully distinguish confusing labels. For example, by using the image label probabilities to guide feature learning, the confidence maps for skirt and dress are corrected.

12 Human Parsing with Contextualized Convolutional Neural Network
Local super-pixel context: integrate within-super-pixel smoothing and cross-super-pixel neighborhood voting into the training and testing process.
Finally, within-super-pixel smoothing and cross-super-pixel neighborhood voting are leveraged to retain the local boundaries and label consistencies. They are formulated as natural sub-components of the network in both the training and the testing process.

13 Human Parsing with Contextualized Convolutional Neural Network
Local super-pixel context (continued).
Based on an over-segmentation of the image, within-super-pixel smoothing averages the confidences within each superpixel. After that, cross-super-pixel neighborhood voting takes the neighboring larger regions into account for better inference, exploiting more structures and correlations between different super-pixels. Our local super-pixel smoothing and voting can be seen as two types of pooling, performed on the local responses within the irregular regions delineated by super-pixels.
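A NumPy sketch of the two pooling-style operations under our own simplified formulation; the function names, the appearance-similarity weights, and the mixing coefficient `alpha` are assumptions for illustration only:

```python
import numpy as np

def within_superpixel_smoothing(conf, sp):
    """Average the per-pixel label confidences within each superpixel.
    conf: (H, W, L) confidence maps; sp: (H, W) integer superpixel ids."""
    out = np.empty_like(conf)
    for s in np.unique(sp):
        mask = sp == s
        out[mask] = conf[mask].mean(axis=0)  # one vector per superpixel
    return out

def cross_superpixel_voting(sp_conf, neighbors, similarity, alpha=0.5):
    """Blend each superpixel's confidence vector with an appearance-
    weighted average of its neighbors'. sp_conf: id -> (L,) vector;
    neighbors: id -> list of neighbor ids; similarity: (i, j) -> weight."""
    voted = {}
    for i, c in sp_conf.items():
        pairs = [(similarity[(i, j)], sp_conf[j]) for j in neighbors[i]]
        total = sum(w for w, _ in pairs)
        if total > 0:
            nb = sum(w * cj for w, cj in pairs) / total
            voted[i] = (1 - alpha) * c + alpha * nb  # alpha: made-up mix
        else:
            voted[i] = c
    return voted
```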

14 Results
Comparison of parsing performance with four state-of-the-art methods on the ATR dataset (7,700 images): a 12.57% increase!

Method                 Accuracy   Fg. accuracy   Avg. precision   Avg. recall   F-1 score
Yamaguchi et al. [1]   84.38      55.59          37.54            51.05         41.80
PaperDoll [2]          88.96      62.18          52.75            49.43         44.76
M-CNN [3]              89.57      73.98          64.56            65.17         62.81
ATR [4]                91.11      71.04          71.69            60.25         64.38
Co-CNN                 95.23      80.90          81.55            74.42         76.95

Let us turn to the experiments. We compare our performance with four state-of-the-art methods on five metrics. Our method significantly outperforms all four baselines, improving the average F-1 score by 12.57%.

[1] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
[2] K. Yamaguchi, M. H. Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
[3] S. Liu, et al. Matching-CNN meets KNN: Quasi-parametric human parsing. In CVPR, 2015.
[4] X. Liang, et al. Deep human parsing with active template regression. In TPAMI, 2015.

15 Dataset
Existing ATR dataset [4]: 7,700 images, near frontal-view.
Our new dataset: 10,000 human pictures from "Chictopia.com", with arbitrary poses, views and clothing styles.
To promote future research on human parsing, we annotated 10,000 real-world human pictures to construct the largest dataset for this task. Compared to the existing ATR dataset of near frontal-view images, our new dataset mainly contains images in the wild (e.g., more challenging poses, occlusions and clothing styles).

16 Dataset
Pixel-wise annotations with 18 semantic labels. Link:
18 clothing and body semantic labels are annotated for each pixel. Everyone can find our dataset at this link and use it to develop new methods for this task. The human images, with diverse poses and background clutter, are carefully annotated.

17 Results
We collect 10,000 human pictures from "chictopia.com":

Method                   Accuracy   Fg. accuracy   Avg. precision   Avg. recall   F-1 score
ATR                      91.11      71.04          71.69            60.25         64.38
Co-CNN                   95.23      80.90          81.55            74.42         76.95
Co-CNN (+Chictopia10k)   96.02      83.57          84.95            77.66         80.14

After training with the new dataset, our method further improves the average F-1 score by 3.19%.

18 Results: analysis of architectural variants of our model
Cross-layer context: 7.47% increase.

Method                 Accuracy   Fg. accuracy   Avg. precision   Avg. recall   F-1 score
Co-CNN w/o fusion      92.57      70.76          67.17            64.34         65.25
Co-CNN (cross-layer)   94.41      78.54          76.62            71.24         72.72

(Figure: parsing results, Co-CNN w/o cross-layer vs. Co-CNN.)
We further evaluate the effectiveness of our three components. Using the cross-layer context offers a 7.47% increase in average F-1 score, since combining cross-layer information enables the network to capture multi-level context, making precise local predictions while respecting global semantic information. For example, the skirt and upper-clothes can be well recognized by considering this cross-layer context.

19 Results: global image label context: 2.55% increase

Method                  Accuracy   Fg. accuracy   Avg. precision   Avg. recall   F-1 score
Baseline                94.41      78.54          76.62            71.24         72.72
Co-CNN (global label)   94.87      79.86          78.00            73.94         75.27

(Figure: parsing results, Co-CNN w/o global label vs. Co-CNN.)
By incorporating the global image label context, the average F-1 score increases by 2.55%. Label exclusiveness and co-occurrences are well captured during dense pixel-wise prediction, for example producing coherent predictions for the two legs.

20 Results: local super-pixel context: 1.68% increase

Method                       Accuracy   Fg. accuracy   Avg. precision   Avg. recall   F-1 score
Baseline                     94.87      79.86          78.00            73.94         75.27
Co-CNN (local super-pixel)   95.23      80.90          81.55            74.42         76.95

(Figure: parsing results, Co-CNN w/o super-pixel vs. Co-CNN.)
Our full network yields a 1.68% increase over the version without local super-pixel smoothing and voting. This demonstrates that the local super-pixel contexts help preserve local boundaries and generate more precise classifications.

21 Parsing Results
(Figure columns: test image, PaperDoll, ATR, Co-CNN.)
We then show some qualitative comparisons of parsing results. Our Co-CNN outputs more meaningful and precise predictions despite large appearance and position variations. It successfully predicts the labels of small regions (e.g., hat, scarf, sun-glasses, belt) and of confusing clothing items.

22 Online Human Parsing Engine (<0.15 s per image; 20 fps for a simplified version)
We have released an online human parsing demo; it processes an image in about 0.15 seconds. Even when only part of the body appears in the image, the demo still generates satisfactory results. Everyone is welcome to try it.

23 Conclusion
To conclude: we propose a novel contextualized network for human parsing, which integrates multiple sources of context. Our Co-CNN produces correspondingly-sized pixel-wise predictions in a fully end-to-end way. A new large dataset, "Chictopia10k", has been built and released.

24 Questions? Thank you for your attention. I am happy to take any questions.

