Lip Movement Synthesis from Text


Lip Movement Synthesis from Text By: Shishir Mathur Guided by: Prof. Vinay P. Namboodiri

Generative Adversarial Networks (GANs) GANs were first introduced by a group of researchers at the University of Montreal led by Ian Goodfellow. The main idea behind a GAN is to have two competing neural network models. One takes noise as input and generates samples (and so is called the generator). The other model (the discriminator) receives samples from both the generator and the training data, and has to distinguish between the two sources.

These two networks play a continuous game: the generator learns to produce more and more realistic samples, while the discriminator learns to get better and better at telling generated data from real data. Source: http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
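As a concrete illustration of this two-player game, here is a minimal training-loop sketch in PyTorch; the module names, loss, and hyperparameters are illustrative and not taken from the slides.

```python
# Minimal GAN training loop (illustrative sketch, not the author's code).
# G maps a noise vector to a sample; D is assumed to output a (batch, 1) logit.
import torch
import torch.nn as nn

def train_step(G, D, real_batch, opt_G, opt_D, noise_dim=100):
    bce = nn.BCEWithLogitsLoss()
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator step: distinguish real samples from generated ones.
    noise = torch.randn(batch_size, noise_dim)
    fake_batch = G(noise).detach()          # do not backprop into G here
    d_loss = bce(D(real_batch), real_labels) + bce(D(fake_batch), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: try to fool the discriminator into predicting "real".
    noise = torch.randn(batch_size, noise_dim)
    g_loss = bce(D(G(noise)), real_labels)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```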

Why are GANs interesting? GANs have primarily been applied to modelling natural images. Source: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

They can capture the underlying image semantics and can help with inpainting tasks. Source: Semantic Image Inpainting with Perceptual and Contextual Losses

Combined with a contextual Recurrent Neural Network, a GAN can be trained to generate the next image in a sequence of images. Source: Contextual RNN-GANs for Abstract Reasoning Diagram Generation

They can also be trained for super-resolution to produce photorealistic images, and for many more applications. Source: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

VideoGAN Based on Generating Videos with Scene Dynamics by Carl Vondrick et al. This approach applies the basic GAN framework to video samples by upsampling the noise with volumetric convolutions, generating video clips rather than just images. Source: http://carlvondrick.com/tinyvideo/
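To make the idea of upsampling noise with volumetric convolutions concrete, here is a rough PyTorch sketch of a VideoGAN-style generator; the layer sizes are illustrative, and the original two-stream foreground/background design is omitted.

```python
# Sketch of a VideoGAN-style generator: noise -> (channels, time, height, width).
# Layer counts and sizes are illustrative, not taken from the original paper.
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    def __init__(self, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # noise_dim x 1 x 1 x 1 -> 512 x 2 x 4 x 4
            nn.ConvTranspose3d(noise_dim, 512, kernel_size=(2, 4, 4)),
            nn.BatchNorm3d(512), nn.ReLU(True),
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1),
            nn.BatchNorm3d(256), nn.ReLU(True),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm3d(128), nn.ReLU(True),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(True),
            nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1),
            nn.Tanh(),                       # RGB video in [-1, 1]
        )

    def forward(self, z):
        # z: (batch, noise_dim) -> (batch, 3, 32, 64, 64)
        return self.net(z.view(z.size(0), -1, 1, 1, 1))
```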


Generative Adversarial Text to Image Synthesis Based on the paper by Scott Reed et al., this approach generates realistic images from text by feeding a text embedding into both the generator and the discriminator to ensure conditional generation. Source: Generative Adversarial Text to Image Synthesis


Aim of the work The aim of my work is to generate lip movement given text. There has been a lot of recent work on lip reading, such as Lip Reading in the Wild by Chung et al. and LipNet by Assael et al. (Oxford), but the reverse problem has not been well explored with existing neural network techniques. Image Source: An adaptive approach for lip-reading using image and depth data, A. Rekik et al.

Lip Movement Synthesis I am working on video generation with VideoGANs conditioned on text, so as to generate lip movement corresponding to the word being said. Within the VideoGAN framework, I condition the generation on a text embedding as done by Scott Reed in Generative Adversarial Text to Image Synthesis.

Dataset The dataset I am currently using for the task is the GRID audiovisual sentence corpus. The corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers. Advantages of the GRID corpus: good-quality video with a high frame rate for extracting lip features, and word captioning accurate to the millisecond. Source: http://spandh.dcs.shef.ac.uk/gridcorpus/

Preprocessing Methodology I generate just the lip region with the GAN. Since the datasets contain different faces, the only thing that remains invariant across video samples is the lip movement for a word utterance. Using Dlib facial keypoint extraction, we extracted the lip region in all video frames. The frames were then paired with the word-level timings to create 32-frame videos of lip movement.
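A rough sketch of this lip-crop step using Dlib's 68-point landmark model (the mouth corresponds to landmarks 48-67); the model path and crop margin below are illustrative, not taken from the slides.

```python
# Sketch of cropping the lip region from a frame with Dlib's 68-point
# facial landmark model. Paths and the margin are illustrative.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, margin=10):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    # Landmarks 48-67 outline the mouth.
    xs = [landmarks.part(i).x for i in range(48, 68)]
    ys = [landmarks.part(i).y for i in range(48, 68)]
    x0, x1 = max(min(xs) - margin, 0), max(xs) + margin
    y0, y1 = max(min(ys) - margin, 0), max(ys) + margin
    return frame[y0:y1, x0:x1]
```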

Network Structure I used a basic DCGAN structure with volumetric convolutions for video generation, with skip connections in both the generator and the discriminator for better visual quality and better discriminator learning. The text encoding is appended to the noise vector in the generator, and replicated and appended at a pre-final layer in the discriminator. A Dropout layer is added before every convolution and deconvolution for better learning.
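A condensed sketch of how the text encoding can enter both networks, concatenated with the noise in the generator and spatially replicated and appended near the end of the discriminator; dimensions are illustrative and most layers, including the skip connections and Dropout, are omitted for brevity.

```python
# Sketch of text conditioning for a volumetric DCGAN (dimensions illustrative).
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose3d(noise_dim + text_dim, 512, (2, 4, 4)),
            nn.BatchNorm3d(512), nn.ReLU(True),
            # ... further deconvolution blocks would follow here ...
        )

    def forward(self, z, text_emb):
        x = torch.cat([z, text_emb], dim=1)           # condition on the text
        return self.body(x.view(x.size(0), -1, 1, 1, 1))

class CondDiscriminator(nn.Module):
    def __init__(self, text_dim=128):
        super().__init__()
        self.video_net = nn.Sequential(
            nn.Conv3d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            # ... further convolution blocks would follow here ...
        )
        self.head = nn.Conv3d(64 + text_dim, 1, kernel_size=1)

    def forward(self, video, text_emb):
        feat = self.video_net(video)                  # (B, C, T, H, W)
        # Replicate the text embedding over the feature map and append it.
        t = text_emb[:, :, None, None, None].expand(-1, -1, *feat.shape[2:])
        return self.head(torch.cat([feat, t], dim=1))
```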

Training Methodology I experimented with different training methods for GANs, such as WGANs and conditional GANs, and finally settled on a training methodology similar to Scott Reed's: (1) train the discriminator with the real video and the corresponding text embedding (label TRUE); (2) train the discriminator with a real but mismatched video and the same text embedding (label FALSE); (3) train the discriminator with the generated video and the corresponding text embedding (label FALSE).

This format of training enforces conditioning: the discriminator learns not only to detect fake video samples but also to tie a video sample to its text embedding and to detect when the video and the embedding do not match. However, since the discriminator learns more than the generator at each step, it can overfit, which leads to a poor generator. To avoid this, we scaled down the discriminator error and added multiple Dropout layers.
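A sketch of the three discriminator terms described above, together with the down-scaled discriminator loss; the scale factor, and the assumption that the discriminator returns a (batch, 1) logit, are illustrative rather than taken from the slides.

```python
# Sketch of the three discriminator passes plus the down-scaled loss.
# mismatched_video is a real video whose text does not match; the 0.5
# scale factor is illustrative. D(video, text) is assumed to return a
# (batch, 1) logit.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(D, real_video, mismatched_video, fake_video, text_emb,
                       d_scale=0.5):
    ones = torch.ones(real_video.size(0), 1)
    zeros = torch.zeros(real_video.size(0), 1)
    loss = (bce(D(real_video, text_emb), ones)               # real video, right text
            + bce(D(mismatched_video, text_emb), zeros)      # real video, wrong text
            + bce(D(fake_video.detach(), text_emb), zeros))  # generated video
    return d_scale * loss
```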

Results

Future Work Working on multiple ways of embedding the text data for better generation. Tweaking the network's Batch-Norm and Dropout layers for better video generation. Stabilizing the generator and discriminator for better sample generation.

Some advice if you work on GANs Try to stick to the general network structure: you can increase the number of layers or kernels or add skip connections, but deviating too much from the basic DCGAN structure leads to worse results. Normalize inputs. Use spherical noise input. Use Batch Norm, LeakyReLU, and Dropout layers with the Adam optimizer. Use soft and noisy labels for training. Don't overtrain the discriminator or the generator. Keep looking at new GAN implementations for inspiration.
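As a small illustration of two of these tips (soft, noisy labels and spherical noise input), here is a snippet; the exact ranges are a matter of taste rather than values from the slides.

```python
# Illustration of two GAN training tips; exact ranges are a matter of taste.
import torch

batch_size, noise_dim = 64, 100

# Soft, noisy labels: real ~ [0.8, 1.0], fake ~ [0.0, 0.2] instead of hard 1/0.
real_labels = 0.8 + 0.2 * torch.rand(batch_size, 1)
fake_labels = 0.2 * torch.rand(batch_size, 1)

# "Spherical" noise: sample from a Gaussian rather than a uniform cube.
z = torch.randn(batch_size, noise_dim)
```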

Thank You!!