MULTI-VIEW VISUAL SPEECH RECOGNITION BASED ON MULTI TASK LEARNING HouJeung Han, Sunghun Kang and Chang D. Yoo Dept. of Electrical Engineering, KAIST, Republic.

Slides:

Advertisements

Similar presentations

Automatic Feature Extraction for Multi-view 3D Face Recognition

Advertisements

Adviser：Ming-Yuan Shieh Student：shun-te chuang SN：M

Multiple View Based 3D Object Classification Using Ensemble Learning of Local Subspaces ( ThBT4.3 ) Jianing Wu, Kazuhiro Fukui

Spatial Pyramid Pooling in Deep Convolutional

Face Processing System Presented by: Harvest Jang Group meeting Fall 2002.

Multimodal Deep Learning

Kuan-Chuan Peng Tsuhan Chen

Automated Lip reading technique for people with speech disabilities by converting identified visemes into direct speech using image processing and machine.

Miguel Reyes 1,2, Gabriel Dominguez 2, Sergio Escalera 1,2 Computer Vision Center (CVC) 1, University of Barcelona (UB) 2

An Information Fusion Approach for Multiview Feature Tracking Esra Ataer-Cansizoglu and Margrit Betke ) Image and.

Multimodal Information Analysis for Emotion Recognition

Pairwise Linear Regression: An Efficient and Fast Multi-view Facial Expression Recognition By: Anusha Reddy Tokala.

Object Recognition in Images Slides originally created by Bernd Heisele.

A New Fingertip Detection and Tracking Algorithm and Its Application on Writing-in-the-air System The th International Congress on Image and Signal.

1 Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine.

Face Image-Based Gender Recognition Using Complex-Valued Neural Network Instructor :Dr. Dong-Chul Kim Indrani Gorripati.

Semantic Compositionality through Recursive Matrix-Vector Spaces

Using decision trees to build an a framework for multivariate time- series classification 1 Present By Xiayi Kuang.

Rich feature hierarchies for accurate object detection and semantic segmentation 2014 IEEE Conference on Computer Vision and Pattern Recognition Ross Girshick,

WLD: A Robust Local Image Descriptor Jie Chen, Shiguang Shan, Chu He, Guoying Zhao, Matti Pietikäinen, Xilin Chen, Wen Gao 报告人：蒲薇榄.

Face Detection 蔡宇軒.

Effect of Hough Forests Parameters on Face Detection Performance: An Empirical Analysis M. Hassaballah, Mourad Ahmed and H.A. Alshazly Department of Mathematics,

Facial Smile Detection Based on Deep Learning Features Authors: Kaihao Zhang, Yongzhen Huang, Hong Wu and Liang Wang Center for Research on Intelligent.

Under Guidance of Mr. A. S. Jalal Associate Professor Dept. of Computer Engineering and Applications GLA University, Mathura Presented by Dev Drume Agrawal.

Olivier Siohan David Rybach

When deep learning meets object detection: Introduction to two technologies: SSD and YOLO Wenchi Ma.

Big data classification using neural network

Hybrid Deep Learning for Reflectance Confocal Microscopy Skin Images

Deep Learning for Dual-Energy X-Ray

Object Detection based on Segment Masks

Deep Learning Amin Sobhani.

Compact Bilinear Pooling

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments Good morning, My name is Guan-Lin Chao, from Carnegie Mellon.

From Vision to Grasping: Adapting Visual Networks

Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment Xinyang Jiang, Fei Wu, Xi Li, Zhou Zhao, Weiming Lu, Siliang Tang, Yueting.

Saliency-guided Video Classification via Adaptively weighted learning

A Pool of Deep Models for Event Recognition

Article Review Todd Hricik.

Combining CNN with RNN for scene labeling (segmentation)

Part-Based Room Categorization for Household Service Robots

Seunghui Cha1, Wookhyun Kim1

Hybrid Features based Gender Classification

Hierarchical Deep Convolutional Neural Network

Deep Neural Networks based Text- Dependent Speaker Verification

Mean Euclidean Distance Error (mm)

Dynamic Routing Using Inter Capsule Routing Protocol Between Capsules

Finding Clusters within a Class to Improve Classification Accuracy

Bird-species Recognition Using Convolutional Neural Network

Two-Stream Convolutional Networks for Action Recognition in Videos

Introduction of MATRIX CAPSULES WITH EM ROUTING

[Figure taken from googleblog

Age and Gender Classification using Convolutional Neural Networks

John H.L. Hansen & Taufiq Al Babba Hasan

Predicting Body Movement and Recognizing Actions: an Integrated Framework for Mutual Benefits Boyu Wang and Minh Hoai Stony Brook University Experiments:

Graph Neural Networks Amog Kamsetty January 30, 2019.

Cheng-Kuan Wei1 , Cheng-Tao Chung1 , Hung-Yi Lee2 and Lin-Shan Lee2

Advances in Deep Audio and Audio-Visual Processing

Neural Network Pipeline CONTACT & ACKNOWLEDGEMENTS

Automatic Handwriting Generation

Deep Object Co-Segmentation

Image Processing and Multi-domain Translation

Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation

Presented By: Harshul Gupta

End-to-End Facial Alignment and Recognition

Week 3 Volodymyr Bobyr.

Report 2 Brandon Silva.

Introduction Face detection and alignment are essential to many applications such as face recognition, facial expression recognition, age identification,

Nguyen Ngoc Hoang, Guee-Sang Lee, Soo-Hyung Kim, Hyung-Jeong Yang

Shengcong Chen, Changxing Ding, Minfeng Liu 2018

Presentation transcript:

MULTI-VIEW VISUAL SPEECH RECOGNITION BASED ON MULTI TASK LEARNING HouJeung Han, Sunghun Kang and Chang D. Yoo Dept. of Electrical Engineering, KAIST, Republic of Korea Abstract Proposed Framework Result This paper proposes a pose-invariant network which can recognize words spoken from any arbitrary view input. The architecture that combines CNN with bidirectional LSTM is trained in a multi-task manner such that the pose and the word spoken are jointly classified where pose classification is considered as the auxiliary task. This deep model achieved recognition performance of 95.0% accuracy on OuluVS2 dataset. Confusion matrix of multi-view training with single task and multi task. Vertical axis is ground-truth labels and horizontal axis is predicted labels Accuracy of all views and comparison on frontal views with previous work. Introduction Until now, most VSR studies has focused on the frontal view, which may be impractical. In this paper, we present a model that can recognize visual speech from any arbitrary point of view To enhance lip reading, we present a learning method that jointly learns the position of the face and the words spoken. the model also auxiliarily trains with facial position labels. We believe subsidiary view information can assist better phrase recognition by positioning features of the same class more precisely according to position. Multi task classification t-SNE plot of multi-view training with single task and multi task in test phase. The final loss function 𝐻(𝑌,𝑉,𝑦,𝑣) is weighted sum of cross-entropy loss of phrase classification 𝐻 𝑝ℎ𝑟𝑎𝑠𝑒 (𝑌,𝑦) and view classification 𝐻 𝑣𝑖𝑒𝑤 (𝑉,𝑣), where Y,𝑉 are ground truth vector of phrase and view, 𝑦,𝑣 are predicted probability of phrase and view, and tribute. Experiments References [6] JS Chung and A Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision, 2016. [7] Stavros Petridis, Zuwei Li, and Maja Pantic, “End-toend visual speech recognition with lstms,” [10] Daehyun Lee, Jongmin Lee, and Kee-Eung Kim, “Multi-view automatic lip-reading using neural network,”in ACCV 2016 Workshop on Multi-view Lipreading Challenges. Asian Conference on Computer Vision (ACCV), 2016. [11] Joon Son Chung and Andrew Zisserman, “Out of time: automated lip sync in the wild,” in Workshop on Multiview Lip-reading, ACCV, 2016. [12] Iryna Anina, Ziheng Zhou, Guoying Zhao, and Matti Pietik¨ainen, “Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. IEEE, 2015, vol. 1, pp. 1–5. Dataset: OuluVS2[12] - 52 speakers saying 10 categories of utterances, 3 times each Simultaneously recorded on five different views: {0, 30, 45, 60, 90} degree. The mouth region of interests (ROIs) are provided and they are downscaled to 26 by 44 in order to keep the aspect ratio constant.