Lip Movement Synthesis from Text


Lip Movement Synthesis from Text By: Shishir Mathur Guided by: Prof. Vinay P. Namboodiri

Generative Adversarial Networks (GANs) GANs were first introduced by a group of researchers at the University of Montreal led by Ian Goodfellow. The main idea behind a GAN is to have two competing neural network models. One takes noise as input and generates samples (and so is called the generator). The other model (the discriminator) receives samples from both the generator and the training data, and has to distinguish between the two sources.

These two networks play a continuous game: the generator learns to produce more and more realistic samples, while the discriminator learns to get better and better at telling generated data from real data. Source: http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
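As a concrete illustration of this two-player game, here is a minimal training-loop sketch in PyTorch; the module names, loss, and hyperparameters are illustrative and not taken from the slides.

```python
# Minimal GAN training loop (illustrative sketch, not the author's code).
# G maps a noise vector to a sample; D is assumed to output a (batch, 1) logit.
import torch
import torch.nn as nn

def train_step(G, D, real_batch, opt_G, opt_D, noise_dim=100):
    bce = nn.BCEWithLogitsLoss()
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator step: distinguish real samples from generated ones.
    noise = torch.randn(batch_size, noise_dim)
    fake_batch = G(noise).detach()          # do not backprop into G here
    d_loss = bce(D(real_batch), real_labels) + bce(D(fake_batch), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: try to fool the discriminator into predicting "real".
    noise = torch.randn(batch_size, noise_dim)
    g_loss = bce(D(G(noise)), real_labels)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```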

Why are GANs interesting? GANs have primarily been applied to modelling natural images. Source: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

They can capture the underlying image semantics and can help with inpainting tasks. Source: Semantic Image Inpainting with Perceptual and Contextual Losses

Combined with a contextual Recurrent Neural Network, a GAN can be trained to generate the next image in a sequence of images. Source: Contextual RNN-GANs for Abstract Reasoning Diagram Generation

They can also be trained for super-resolution to produce photorealistic images, and for many more applications. Source: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

VideoGAN Based on Generating Videos with Scene Dynamics by Carl Vondrick et al. This approach applies the basic GAN framework to video samples by upsampling the noise with volumetric convolutions, generating video clips rather than just images. Source: http://carlvondrick.com/tinyvideo/
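To make the idea of upsampling noise with volumetric convolutions concrete, here is a rough PyTorch sketch of a VideoGAN-style generator; the layer sizes are illustrative, and the original two-stream foreground/background design is omitted.

```python
# Sketch of a VideoGAN-style generator: noise -> (channels, time, height, width).
# Layer counts and sizes are illustrative, not taken from the original paper.
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    def __init__(self, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # noise_dim x 1 x 1 x 1 -> 512 x 2 x 4 x 4
            nn.ConvTranspose3d(noise_dim, 512, kernel_size=(2, 4, 4)),
            nn.BatchNorm3d(512), nn.ReLU(True),
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1),
            nn.BatchNorm3d(256), nn.ReLU(True),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm3d(128), nn.ReLU(True),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(True),
            nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1),
            nn.Tanh(),                       # RGB video in [-1, 1]
        )

    def forward(self, z):
        # z: (batch, noise_dim) -> (batch, 3, 32, 64, 64)
        return self.net(z.view(z.size(0), -1, 1, 1, 1))
```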


Generative Adversarial Text to Image Synthesis Based on the paper by Scott Reed et al., this approach generates realistic images from text by feeding a text embedding into both the generator and the discriminator to ensure conditional generation. Source: Generative Adversarial Text to Image Synthesis


Aim of the work The aim of my work is to generate lip movement given text. There has been a lot of recent work on lip reading, such as Lip Reading in the Wild by Chung et al. and LipNet by Assael et al. (Oxford), but the reverse problem has not been well explored with existing neural network techniques. Image Source: An adaptive approach for lip-reading using image and depth data, A. Rekik et al.

Lip Movement Synthesis I am working on video generation with VideoGANs conditioned on text, so as to generate lip movement corresponding to the word being said. Within the VideoGAN framework, I condition the generation on a text embedding as done by Scott Reed in Generative Adversarial Text to Image Synthesis.

Dataset The dataset I am currently using for the task is the GRID audiovisual sentence corpus. The corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers. Advantages of the GRID corpus: good-quality video with a high frame rate for extracting lip features, and word captioning accurate to the millisecond. Source: http://spandh.dcs.shef.ac.uk/gridcorpus/

Preprocessing Methodology I generate just the lip region with the GAN. Since the datasets contain different faces, the only thing that remains invariant across video samples is the lip movement for a word utterance. Using Dlib facial keypoint extraction, we extracted the lip region in all video frames. The frames were then paired with the word-level timings to create 32-frame videos of lip movement.
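A rough sketch of this lip-crop step using Dlib's 68-point landmark model (the mouth corresponds to landmarks 48-67); the model path and crop margin below are illustrative, not taken from the slides.

```python
# Sketch of cropping the lip region from a frame with Dlib's 68-point
# facial landmark model. Paths and the margin are illustrative.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, margin=10):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    # Landmarks 48-67 outline the mouth.
    xs = [landmarks.part(i).x for i in range(48, 68)]
    ys = [landmarks.part(i).y for i in range(48, 68)]
    x0, x1 = max(min(xs) - margin, 0), max(xs) + margin
    y0, y1 = max(min(ys) - margin, 0), max(ys) + margin
    return frame[y0:y1, x0:x1]
```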

Network Structure I used a basic DCGAN structure with volumetric convolutions for video generation, with skip connections in both the generator and the discriminator for better visual quality and better discriminator learning. The text encoding is appended to the noise vector in the generator, and replicated and appended at a pre-final layer in the discriminator. A Dropout layer is added before every convolution and deconvolution for better learning.
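A condensed sketch of how the text encoding can enter both networks, concatenated with the noise in the generator and spatially replicated and appended near the end of the discriminator; dimensions are illustrative and most layers, including the skip connections and Dropout, are omitted for brevity.

```python
# Sketch of text conditioning for a volumetric DCGAN (dimensions illustrative).
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose3d(noise_dim + text_dim, 512, (2, 4, 4)),
            nn.BatchNorm3d(512), nn.ReLU(True),
            # ... further deconvolution blocks would follow here ...
        )

    def forward(self, z, text_emb):
        x = torch.cat([z, text_emb], dim=1)           # condition on the text
        return self.body(x.view(x.size(0), -1, 1, 1, 1))

class CondDiscriminator(nn.Module):
    def __init__(self, text_dim=128):
        super().__init__()
        self.video_net = nn.Sequential(
            nn.Conv3d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            # ... further convolution blocks would follow here ...
        )
        self.head = nn.Conv3d(64 + text_dim, 1, kernel_size=1)

    def forward(self, video, text_emb):
        feat = self.video_net(video)                  # (B, C, T, H, W)
        # Replicate the text embedding over the feature map and append it.
        t = text_emb[:, :, None, None, None].expand(-1, -1, *feat.shape[2:])
        return self.head(torch.cat([feat, t], dim=1))
```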

Training Methodology I experimented with different training methods for GANs, such as WGANs and conditional GANs, and finally settled on a training methodology similar to Scott Reed's: (1) train the discriminator with the real video and the corresponding text embedding (label TRUE); (2) train the discriminator with a real but mismatched video and the same text embedding (label FALSE); (3) train the discriminator with the generated video and the corresponding text embedding (label FALSE).

This format of training enforces conditioning: the discriminator learns not only to detect fake video samples but also to tie a video sample to its text embedding and to detect when the video and the embedding do not match. However, since the discriminator learns more than the generator at each step, it can overfit, which leads to a poor generator. To avoid this, we scaled down the discriminator error and added multiple Dropout layers.
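A sketch of the three discriminator terms described above, together with the down-scaled discriminator loss; the scale factor, and the assumption that the discriminator returns a (batch, 1) logit, are illustrative rather than taken from the slides.

```python
# Sketch of the three discriminator passes plus the down-scaled loss.
# mismatched_video is a real video whose text does not match; the 0.5
# scale factor is illustrative. D(video, text) is assumed to return a
# (batch, 1) logit.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(D, real_video, mismatched_video, fake_video, text_emb,
                       d_scale=0.5):
    ones = torch.ones(real_video.size(0), 1)
    zeros = torch.zeros(real_video.size(0), 1)
    loss = (bce(D(real_video, text_emb), ones)               # real video, right text
            + bce(D(mismatched_video, text_emb), zeros)      # real video, wrong text
            + bce(D(fake_video.detach(), text_emb), zeros))  # generated video
    return d_scale * loss
```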

Results

Future Work Working on multiple ways of embedding the text data for better generation. Tweaking the network's Batch-Norm and Dropout layers for better video generation. Stabilizing the generator and discriminator for better sample generation.

Some advice if you work on GANs Try to stick to the general network structure: you can increase the number of layers or kernels or add skip connections, but deviating too much from the basic DCGAN structure leads to worse results. Normalize inputs. Use spherical noise input. Use Batch Norm, LeakyReLU, and Dropout layers with the Adam optimizer. Use soft and noisy labels for training. Don't overtrain the discriminator or the generator. Keep looking at new GAN implementations for inspiration.
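As a small illustration of two of these tips (soft, noisy labels and spherical noise input), here is a snippet; the exact ranges are a matter of taste rather than values from the slides.

```python
# Illustration of two GAN training tips; exact ranges are a matter of taste.
import torch

batch_size, noise_dim = 64, 100

# Soft, noisy labels: real ~ [0.8, 1.0], fake ~ [0.0, 0.2] instead of hard 1/0.
real_labels = 0.8 + 0.2 * torch.rand(batch_size, 1)
fake_labels = 0.2 * torch.rand(batch_size, 1)

# "Spherical" noise: sample from a Gaussian rather than a uniform cube.
z = torch.randn(batch_size, noise_dim)
```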

Thank You!!