Improving Generative Adversarial Networks
http://dalimeeting.org/dali2017/generative-adversarial-networks.html
Not covered in this lecture: BEGAN (energy-based); Gang of GANs: Generative Adversarial Networks with Maximum Margin Ranking; MAGAN: Margin Adaptation for Generative Adversarial Networks; MIX+GAN (Generalization and Equilibrium in Generative Adversarial Nets (GANs)).
Outline: Quick review of GAN → Unified framework of GAN (f-divergence) → Selecting a better divergence: W-GAN and Improved W-GAN → Evaluation.
Review
Basic Idea of GAN: The data we want to generate has a distribution $P_{data}(x)$. (Figure: the image space, with regions of high and low probability under $P_{data}(x)$.)
Basic Idea of GAN: A generator G is a network. The network defines a probability distribution $P_G(x)$: sample $z$ from a normal distribution and output $x = G(z)$. We want $P_G(x)$ to be as close as possible to $P_{data}(x)$. It is difficult to compute $P_G(x)$; we do not know what the distribution looks like. https://blog.openai.com/generative-models/
Basic Idea of GAN: Sample from the normal distribution through generator v1 to obtain samples of $P_G(x)$, and sample real images from $P_{data}(x)$. Train discriminator v1 to output 1 for real images and 0 for generated ones. It can be proven that the loss of the discriminator is related to the JS divergence between $P_{data}$ and $P_G$.
Basic Idea of GAN: Next step, update the parameters of generator v1 to obtain generator v2, which minimizes the JS divergence. The generator output should be classified as "real" by discriminator v1 (output as close to 1 as possible). Generator + discriminator = one network: use gradient descent to update the parameters of the generator while keeping the discriminator fixed.
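To make this step concrete, here is a minimal PyTorch sketch (not from the slides; the network sizes and names such as `G`, `D`, `z_dim` are illustrative): take one gradient step on the generator so that its samples are pushed toward being classified as real, while only the generator's optimizer is stepped.

```python
import torch
import torch.nn as nn

z_dim = 100
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCELoss()

# One generator step: D is kept fixed (only g_opt steps); G is updated so that
# D(G(z)) moves toward 1 ("real").
z = torch.randn(64, z_dim)
fake = G(z)
loss_g = bce(D(fake), torch.ones(64, 1))  # want D(G(z)) close to 1
g_opt.zero_grad()
loss_g.backward()   # gradients flow through D into G, but only G's parameters are updated
g_opt.step()
```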
Unified Framework: The original GAN uses a binary classifier to evaluate the JS divergence of two distributions. Actually, we can evaluate any f-divergence. Roadmap: f-divergence → Fenchel conjugate → connection to GAN. Sebastian Nowozin, Botond Cseke, Ryota Tomioka, "f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization", NIPS, 2016.
f-divergence: $P$ and $Q$ are two distributions; $p(x)$ and $q(x)$ are the probabilities of sampling $x$.
$$D_f(P||Q) = \int_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$$
where $f$ is convex and $f(1) = 0$. If $p(x) = q(x)$ for all $x$, then $D_f(P||Q) = \int_x q(x)\, f(1)\, dx = 0$. The constraints on $f$ guarantee non-negativity: because $f$ is convex,
$$D_f(P||Q) = \int_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \ge f\!\left(\int_x q(x)\frac{p(x)}{q(x)}\, dx\right) = f(1) = 0$$
so when $P$ and $Q$ are the same distribution, $D_f(P||Q)$ attains its smallest value, 0.
http://www.stat.berkeley.edu/~aditya/resources/STAT212aOverview.pdf
f-divergence examples (with $D_f(P||Q) = \int_x q(x) f(p(x)/q(x))\, dx$, $f$ convex, $f(1) = 0$):
KL: $f(x) = x\log x$, so $D_f(P||Q) = \int_x q(x)\frac{p(x)}{q(x)}\log\frac{p(x)}{q(x)}\, dx = \int_x p(x)\log\frac{p(x)}{q(x)}\, dx$.
Reverse KL: $f(x) = -\log x$, so $D_f(P||Q) = \int_x q(x)\left(-\log\frac{p(x)}{q(x)}\right) dx = \int_x q(x)\log\frac{q(x)}{p(x)}\, dx$.
Chi-square: $f(x) = (x-1)^2$, so $D_f(P||Q) = \int_x q(x)\left(\frac{p(x)}{q(x)}-1\right)^2 dx = \int_x \frac{(p(x)-q(x))^2}{q(x)}\, dx$.
http://www.stat.berkeley.edu/~aditya/resources/STAT212aOverview.pdf
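As a small sanity check of these definitions (illustrative, not from the slides), the following sketch evaluates KL, reverse KL, and chi-square divergences for two discrete distributions; for discrete $P$, $Q$ the integral becomes a sum.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # P(x)
q = np.array([0.4, 0.4, 0.2])   # Q(x)

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x q(x) * f(p(x)/q(x))."""
    return np.sum(q * f(p / q))

kl         = f_divergence(p, q, lambda x: x * np.log(x))   # f(x) = x log x
reverse_kl = f_divergence(p, q, lambda x: -np.log(x))      # f(x) = -log x
chi_square = f_divergence(p, q, lambda x: (x - 1) ** 2)    # f(x) = (x-1)^2

print(kl, reverse_kl, chi_square)                    # all non-negative
print(f_divergence(p, p, lambda x: x * np.log(x)))   # 0.0 when P = Q
```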
Fenchel Conjugate: Every convex function $f$ has a conjugate function $f^*$:
$$f^*(t) = \sup_{x\in\mathrm{dom}(f)} \{xt - f(x)\}$$
For each fixed $x$, $xt - f(x)$ is a line in $t$ (e.g. $x_1 t - f(x_1)$, $x_2 t - f(x_2)$, $x_3 t - f(x_3)$, ...); $f^*(t)$ is the supremum (least upper bound) over all such lines at each $t$, e.g. $f^*(t_1)$ and $f^*(t_2)$ are read off the upper envelope.
http://www.seas.ucla.edu/~vandenbe/236C/lectures/conj.pdf
Proof: http://www.control.lth.se/media/Education/DoctorateProgram/2015/LargeScaleConvexOptimization/Lectures/conj_fcn.pdf
Fenchel Conjugate: Every convex function $f$ has a conjugate $f^*$, and $(f^*)^* = f$. Example: $f(x) = x\log x$. Each choice of $x$ gives a line in $t$: $0.1t - 0.1\log 0.1$, $1\cdot t - 0$, $10t - 10\log 10$, ... Taking the supremum over all these lines, $f^*(t)$ looks like an exponential: $f^*(t) = \exp(t-1)$.
http://www.seas.ucla.edu/~vandenbe/236C/lectures/conj.pdf
Fenchel Conjugate: For $f(x) = x\log x$, verify that $f^*(t) = \exp(t-1)$:
$$f^*(t) = \sup_{x\in\mathrm{dom}(f)} \{xt - x\log x\}$$
Let $g(x) = xt - x\log x$. Given $t$, find the $x$ maximizing $g(x)$: setting $g'(x) = t - \log x - 1 = 0$ gives $x = \exp(t-1)$. Substituting back,
$$f^*(t) = \exp(t-1)\cdot t - \exp(t-1)\cdot(t-1) = \exp(t-1)$$
http://www.seas.ucla.edu/~vandenbe/236C/lectures/conj.pdf
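A quick numeric check of this derivation (illustrative, not from the slides): maximize $xt - x\log x$ over a grid of $x$ values and compare with $\exp(t-1)$.

```python
import numpy as np

def conjugate_of_xlogx(t, xs=np.linspace(1e-6, 50, 200000)):
    """f*(t) = sup_x { x*t - x*log(x) } for f(x) = x log x, approximated on a grid."""
    return np.max(xs * t - xs * np.log(xs))

for t in [-1.0, 0.0, 1.0, 2.0]:
    print(t, conjugate_of_xlogx(t), np.exp(t - 1))  # the two columns agree closely
```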
Connection with GAN: Since $(f^*)^* = f$, we have $f(u) = \sup_{t\in\mathrm{dom}(f^*)}\{ut - f^*(t)\}$, so
$$D_f(P||Q) = \int_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx = \int_x q(x)\left(\sup_{t\in\mathrm{dom}(f^*)}\left\{\frac{p(x)}{q(x)}\, t - f^*(t)\right\}\right) dx$$
Imagine a function $D$ whose input is $x$ and whose output is $t$. For any such $D$, the supremum is at least the value at $t = D(x)$, so
$$D_f(P||Q) \ge \int_x q(x)\left(\frac{p(x)}{q(x)}\, D(x) - f^*(D(x))\right) dx$$
and taking the maximum over $D$ recovers (approximately) the divergence:
$$D_f(P||Q) \approx \max_D \left\{\int_x p(x)\, D(x)\, dx - \int_x q(x)\, f^*(D(x))\, dx\right\}$$
Connection with GAN:
$$D_f(P||Q) \approx \max_D \left\{\mathbb{E}_{x\sim P}[D(x)] - \mathbb{E}_{x\sim Q}[f^*(D(x))]\right\}$$
where the expectations are estimated with samples from $P$ and $Q$. Therefore
$$D_f(P_{data}||P_G) = \max_D \left\{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[f^*(D(x))]\right\}$$
Now you can minimize any f-divergence: $G^* = \arg\min_G \max_D V(G,D)$. The original GAN minimizes something related to the JS divergence.
Using the f-divergence you like: plug the corresponding conjugate $f^*$ into $D_f(P_{data}||P_G) = \max_D \{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[f^*(D(x))]\}$. The f-GAN paper tabulates many divergences and their conjugates: https://arxiv.org/pdf/1606.00709.pdf
Double-loop v.s. Single-step: $G^* = \arg\min_G \max_D V(G,D)$.
Original GAN paper: double-loop algorithm. In each iteration of the outer loop, given a generator $G_t$, find $D_t = \arg\max_D V(G_t, D)$ by updating the discriminator parameters $\theta_D$ many times (inner loop, one backpropagation per update); then update the generator once: $\theta_G^{t+1} \leftarrow \theta_G^t - \eta\, \nabla_{\theta_G} V(\theta_G^t, \theta_D^t)$.
f-GAN paper: single-step algorithm. In each iteration, update the discriminator and the generator once each (one backpropagation for each): $\theta_D^{t+1} \leftarrow \theta_D^t + \eta\, \nabla_{\theta_D} V(\theta_G^t, \theta_D^t)$ and $\theta_G^{t+1} \leftarrow \theta_G^t - \eta\, \nabla_{\theta_G} V(\theta_G^t, \theta_D^t)$. It will converge.
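A minimal PyTorch sketch of the single-step alternation (illustrative, not the paper's code; the objective is the generic f-GAN form with $f^*(t) = \exp(t-1)$, i.e. the KL case, on a toy 2-D problem, and names like `V`, `f_star` are mine):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))   # toy 2-D generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # critic outputs t

d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)

def f_star(t):          # Fenchel conjugate of f(x) = x log x (KL divergence)
    return torch.exp(t - 1)

def V(x_real, x_fake):  # V(G, D) = E_Pdata[D(x)] - E_PG[f*(D(x))]
    return D(x_real).mean() - f_star(D(x_fake)).mean()

for step in range(1000):
    x_real = torch.randn(64, 2) + 3.0        # stand-in for samples from P_data
    x_fake = G(torch.randn(64, 10))

    # Single-step: one gradient-ascent update on D ...
    d_loss = -V(x_real, x_fake.detach())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # ... then one gradient-descent update on G, with D held fixed.
    g_loss = V(x_real, G(torch.randn(64, 10)))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```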
Experimental Results Approximate a mixture of Gaussians
W-GAN: Using Earth Mover's Distance (Wasserstein Distance). For a quick start:
Read-through: Wasserstein GAN: http://www.alexirpan.com/2017/02/22/wasserstein-gan.html
令人拍案叫絕的Wasserstein GAN (an excellent Chinese-language write-up): https://zhuanlan.zhihu.com/p/25071913
Wasserstein GAN and the Kantorovich-Rubinstein Duality: https://vincentherrmann.github.io/blog/wasserstein/
(The distance is named after a Russian-American mathematician; see also the GAN lecture from January: https://www.youtube.com/watch?v=ymWDGzpQdls)
Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv preprint, 2017.
Earth Mover's Distance: Consider one distribution $P$ as a pile of earth and another distribution $Q$ as the target. The earth mover's distance is the average distance the earth mover has to move the earth. (Figure: if all the mass of $P$ is moved a distance $d$ to match $Q$, then $W(P,Q) = d$.) Image source: http://www.wxlhcc.com/product/328294.html
Earth Mover's Distance: If there are many possible "moving plans", use the one with the smallest average distance to define the earth mover's distance. Source of image: https://vincentherrmann.github.io/blog/wasserstein/
A "moving plan" is a matrix $\gamma$: the value of element $\gamma(x_g, x_d)$ is the amount of earth moved from position $x_g$ to position $x_d$. Average cost of a plan $\gamma$:
$$B(\gamma) = \sum_{x_g, x_d} \gamma(x_g, x_d)\, \|x_g - x_d\|$$
Earth Mover's Distance: the cost of the best plan among all possible plans $\Pi$,
$$W(P_G, P_{data}) = \min_{\gamma\in\Pi} B(\gamma)$$
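For 1-D discrete distributions the best moving plan can be computed exactly; a small SciPy sketch (illustrative, not from the slides) with four positions and two weight vectors:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Positions of the "piles of earth" and the amount of earth at each position.
positions = np.array([0.0, 1.0, 2.0, 3.0])
p_weights = np.array([0.4, 0.4, 0.1, 0.1])   # distribution P (e.g. P_G)
q_weights = np.array([0.1, 0.1, 0.4, 0.4])   # distribution Q (e.g. P_data)

# Earth mover's distance = cost of the cheapest plan reshaping P into Q.
w = wasserstein_distance(positions, positions,
                         u_weights=p_weights, v_weights=q_weights)
print(w)   # 1.2: on average the earth must be moved 1.2 units to the right
```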
Earth Mover's Distance vs. JS divergence: consider $P_{G_0}$, $P_{G_{50}}$, ..., $P_{G_{100}}$ gradually moving toward $P_{data}$ (separated by distances $d_0 > d_{50} > 0$):
$JS(P_{G_0}, P_{data}) = \log 2$, $JS(P_{G_{50}}, P_{data}) = \log 2$, $JS(P_{G_{100}}, P_{data}) = 0$
$W(P_{G_0}, P_{data}) = d_0$, $W(P_{G_{50}}, P_{data}) = d_{50}$, $W(P_{G_{100}}, P_{data}) = 0$
The JS divergence stays at $\log 2$ whenever the two distributions do not overlap, so it gives the generator no signal of progress; the Wasserstein distance decreases as the distributions get closer.
Back to the GAN framework:
$$W(P_{data}, P_G) = \max_{D\in 1\text{-Lipschitz}} \left\{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)]\right\}$$
A Lipschitz function satisfies $\|f(x_1) - f(x_2)\| \le K\|x_1 - x_2\|$; $K = 1$ for "1-Lipschitz" (Lipschitz continuity). Compare with the f-GAN form: $D_f(P_{data}||P_G) = \max_D \{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[f^*(D(x))]\}$.
Back to the GAN framework: why the Lipschitz constraint? Without it, maximizing $\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)]$ would drive $D(x_1)\to+\infty$ on real data and $D(x_2)\to-\infty$ on generated data. With the 1-Lipschitz constraint $\|D(x_1)-D(x_2)\|\le\|x_1-x_2\|$, when $P_{data}$ and $P_G$ are a distance $d$ apart the critic values can differ by at most $d$ (e.g. $k+d$ versus $k$ in the slide figure). Unlike the saturated discriminator of the original GAN (blue curve), the WGAN critic (green curve) keeps a slope, so it provides gradients that push $P_G$ towards $P_{data}$.
Back to the GAN framework: $W(P_{data}, P_G) = \max_{D\in 1\text{-Lipschitz}} \{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)]\}$. How can this be optimized with gradient descent while keeping $D$ Lipschitz? Weight clipping: clip all the weights $w$ to lie between $-c$ and $c$. After each parameter update, if $w > c$ set $w = c$; if $w < -c$ set $w = -c$. With clipping we only ensure $\|f(x_1)-f(x_2)\|\le K\|x_1-x_2\|$ for some $K$, not necessarily $K = 1$.
Algorithm of WGAN (differences from the original GAN: no sigmoid and no log in the critic output, weight clipping, and RMSProp instead of Adam). Initialize $\theta_d$ for D and $\theta_g$ for G. In each training iteration:
Learning D (repeat k times):
- Sample m examples $\{x^1, x^2, \dots, x^m\}$ from the data distribution $P_{data}(x)$.
- Sample m noise samples $\{z^1, z^2, \dots, z^m\}$ from the prior $P_{prior}(z)$ and obtain generated data $\{\tilde{x}^1, \dots, \tilde{x}^m\}$, $\tilde{x}^i = G(z^i)$.
- Update the discriminator parameters $\theta_d$ to maximize $\tilde{W} = \frac{1}{m}\sum_{i=1}^m D(x^i) - \frac{1}{m}\sum_{i=1}^m D(\tilde{x}^i)$: $\theta_d \leftarrow \theta_d + \eta\, \nabla\tilde{W}(\theta_d)$, then apply weight clipping.
Learning G (only once):
- Sample another m noise samples $\{z^1, \dots, z^m\}$ from the prior $P_{prior}(z)$.
- Update the generator parameters $\theta_g$ to minimize $\tilde{W}$ (equivalently, maximize the critic score of the generated samples): $\theta_g \leftarrow \theta_g - \eta\, \nabla\tilde{W}(\theta_g)$.
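A minimal PyTorch sketch of one WGAN training iteration as described above (illustrative; the architectures and hyperparameters such as `c = 0.01`, `k = 5` are placeholders, though they match common defaults):

```python
import torch
import torch.nn as nn

z_dim, c, k = 100, 0.01, 5
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))  # no sigmoid

d_opt = torch.optim.RMSprop(D.parameters(), lr=5e-5)  # RMSProp instead of Adam
g_opt = torch.optim.RMSprop(G.parameters(), lr=5e-5)

def train_iteration(real_batch):
    m = real_batch.size(0)
    # Learning D: repeat k times.
    for _ in range(k):
        fake = G(torch.randn(m, z_dim)).detach()
        w_hat = D(real_batch).mean() - D(fake).mean()   # W~ to maximize
        d_opt.zero_grad(); (-w_hat).backward(); d_opt.step()
        for p in D.parameters():                        # weight clipping
            p.data.clamp_(-c, c)
    # Learning G: only once.
    g_loss = -D(G(torch.randn(m, z_dim))).mean()        # minimize -E[D(G(z))]
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# Example call with a dummy batch of 64 flattened "images":
train_iteration(torch.randn(64, 784))
```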
https://arxiv.org/abs/1701.07875
Results from https://arxiv.org/abs/1701.07875, comparing W-GAN and GAN samples under three settings: DCGAN generator; DCGAN generator without batch normalization (bad structure); MLP generator.
Improved W-GAN: Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, "Improved Training of Wasserstein GANs", arXiv preprint, 2017.
Improved WGAN: replace the Lipschitz-constrained objective
$$W(P_{data}, P_G) = \max_{D\in 1\text{-Lipschitz}} \left\{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)]\right\}$$
with an unconstrained objective plus a gradient penalty:
$$W(P_{data}, P_G) \approx \max_{D} \left\{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)] - \lambda\, \mathbb{E}_{x\sim P_{inter}}\big[(\|\nabla_x D(x)\| - 1)^2\big]\right\}$$
where $P_{inter}$ samples points between $P_{data}$ and $P_G$.
Improved WGAN (https://arxiv.org/abs/1704.00028): Through experiments on small datasets, the paper shows how weight clipping in the discriminator leads to pathological behavior that hurts stability and performance. It proposes WGAN with gradient penalty, which avoids these problems, converges faster and generates higher-quality samples than standard WGAN, and provides stable GAN training with almost no hyperparameter tuning across many architectures for image generation and language modeling.
Improved WGAN: a differentiable function is 1-Lipschitz ($\|f(x_1)-f(x_2)\|\le\|x_1-x_2\|$) exactly when the norm of its gradient is at most 1 everywhere, so
$$W(P_{data}, P_G) = \max_{D\in 1\text{-Lipschitz}} \left\{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)]\right\}$$
$$\approx \max_{D} \left\{\mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_G}[D(x)] - \lambda\int_x \max\big(0, \|\nabla_x D(x)\| - 1\big)\, dx\right\}$$
(force the norm of the gradient to be smaller than 1 everywhere), which is relaxed to the penalty term
$$-\,\lambda\, \mathbb{E}_{x\sim P_{inter}}\big[\max\big(0, \|\nabla_x D(x)\| - 1\big)\big]$$
(penalize only on samples from $P_{inter}$) and finally to
$$-\,\lambda\, \mathbb{E}_{x\sim P_{inter}}\big[(\|\nabla_x D(x)\| - 1)^2\big]$$
(force the norm of the gradient to be close to 1).
Improved WGAN: why replace the integral $-\lambda\int_x \max(0, \|\nabla_x D(x)\|-1)\,dx$ with the expectation $-\lambda\,\mathbb{E}_{x\sim P_{inter}}[\max(0, \|\nabla_x D(x)\|-1)]$? $P_{inter}$ samples points between $P_{data}$ and $P_G$, so the gradient constraint is applied only to that region, because it is the region that influences how $P_G$ moves toward $P_{data}$. From the paper: "Given that enforcing the Lipschitz constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance."
Improved WGAN: why replace $\max(0, \|\nabla_x D(x)\|-1)$ with $(\|\nabla_x D(x)\|-1)^2$? The optimal critic has its largest gradients between $P_{data}$ and $P_G$. From the paper: "One may wonder why we penalize the norm of the gradient for differing from 1, instead of just penalizing large gradients. The reason is that the optimal critic ... actually has gradients with norm 1 almost everywhere under Pr and Pg (check the proof in the appendix). ... Simply penalizing overly large gradients also works in theory, but experimentally we found that this approach converged faster and to better optima."
Algorithm of Improved WGAN: only the discriminator update is modified (and Adam can be used again). Learning D, repeated k times in each training iteration:
- Sample m examples $\{x^1, \dots, x^m\}$ from the data distribution $P_{data}(x)$.
- Sample m noise samples $\{z^1, \dots, z^m\}$ from the prior $P_{prior}(z)$ and obtain generated data $\tilde{x}^i = G(z^i)$.
- Sample $\varepsilon^1, \dots, \varepsilon^m$ from $U[0,1]$ and form interpolated samples $\hat{x}^i \leftarrow \varepsilon^i x^i + (1-\varepsilon^i)\tilde{x}^i$.
- Update the discriminator parameters $\theta_d$ to maximize $W' = \frac{1}{m}\sum_{i} D(x^i) - \frac{1}{m}\sum_{i} D(\tilde{x}^i) - \lambda\frac{1}{m}\sum_{i}\big(\|\nabla_{\hat{x}} D(\hat{x}^i)\| - 1\big)^2$: $\theta_d \leftarrow \theta_d + \eta\, \nabla W'(\theta_d)$.
The generator update is the same as in WGAN.
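The penalty term can be computed with automatic differentiation; a minimal PyTorch sketch of the gradient-penalty part of the critic loss (illustrative; `D` is any critic without a sigmoid on flattened inputs, and `lam` is the penalty weight λ):

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    """Penalty lam * E_{x~P_inter}[(||grad_x D(x)|| - 1)^2] on interpolated samples."""
    eps = torch.rand(x_real.size(0), 1)                       # epsilon ~ U[0,1] per example
    x_hat = (eps * x_real + (1 - eps) * x_fake).detach()      # points between real and fake
    x_hat.requires_grad_(True)
    grad = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                               create_graph=True)[0]          # dD(x_hat)/dx_hat
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

# Critic update: maximize W' = E[D(x)] - E[D(x~)] - penalty, i.e. minimize
# d_loss = -(D(x_real).mean() - D(x_fake).mean()) + gradient_penalty(D, x_real, x_fake)
```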
Comparison of DCGAN, LSGAN, original WGAN, and improved WGAN under different architectures: G: CNN, D: CNN; G: CNN (no normalization), D: CNN (no normalization); G: CNN (tanh), D: CNN (tanh).
Comparison (continued): G: MLP, D: CNN; G: CNN (bad structure), D: CNN; G: 101 layer, D: 101 layer.
Sentence Generation: encode a sentence such as "good bye." character by character with 1-of-N (one-hot) encoding; each row is one character (g, o, o, d, <space>, ...), with a 1 in the column for that character. Consider the resulting matrix as an "image".
Sentence Generation: a real sentence is an exact 1-of-N matrix (every entry is 0 or 1), but the rows output by the generator are soft distributions (e.g. 0.9, 0.1, 0.7, 0.8, ...) and can never be exactly 1-of-N. A binary classifier can immediately find the difference, so the JS divergence saturates and provides no useful training signal. WGAN is helpful here.
Improved WGAN successfully generating sentences https://arxiv.org/abs/1704.00028
W-GAN – generating classical Chinese (Tang) poems. Thanks to classmate 李仲翊 for providing the experimental results. The generator (a CNN) outputs 32 characters per poem, including punctuation. Sample outputs:
升雲白遲丹齋取，此酒新巷市入頭。黃道故海歸中後，不驚入得韻子門。
據口容章蕃翎翎，邦貸無遊隔將毬。外蕭曾臺遶出畧，此計推上呂天夢。
新來寳伎泉，手雪泓臺蓑。曾子花路魏，不謀散薦船。
功持牧度機邈爭，不躚官嬉牧涼散。不迎白旅今掩冬，盡蘸金祇可停。
玉十洪沄爭春風，溪子風佛挺橫鞋。盤盤稅焰先花齋，誰過飄鶴一丞幢。
海人依野庇，為阻例沉迴。座花不佐樹，弟闌十名儂。
入維當興日世瀕，不評皺。頭醉空其杯，駸園凋送頭。
鉢笙動春枝，寶叅潔長知。官爲宻爛去，絆粒薛一靜。
吾涼腕不楚，縱先待旅知。楚人縱酒待，一蔓飄聖猜。
折幕故癘應韻子，徑頭霜瓊老徑徑。尚錯春鏘熊悽梅，去吹依能九將香。
通可矯目鷃須浄，丹迤挈花一抵嫖。外子當目中前醒，迎日幽筆鈎弧前。
庭愛四樹人庭好，無衣服仍繡秋州。更怯風流欲鴂雲，帛陽舊據畆婷儻。
http://leonli.eastasia.cloudapp.azure.com:8080/
More about Discrete Output SeqGAN Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, AAAI, 2017 Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, Adversarial Learning for Neural Dialogue Generation, arXiv preprint, 2017 Boundary seeking GAN R Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, Yoshua Bengio, “Boundary-Seeking Generative Adversarial Networks”, arXiv preprint, 2017 Gumbel-Softmax Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, arXiv preprint, 2016 Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative Adversarial Networks, arXiv preprint, 2017 https://arxiv.org/pdf/1611.01144.pdf
Conditional GAN
Conditional GAN: the training data consists of (text, image) pairs, e.g. c1: "a dog is running" with image x1, c2: "a bird is flying" with image x2. Text-to-image by traditional supervised learning: a NN takes the text (e.g. "train") as input and is trained to make its output as close as possible to the target image. Because several different images match the same text, the output is their average: a blurry image!
Conditional GAN: the generator takes both the condition c (e.g. "train") and noise z sampled from a prior distribution, and outputs x = G(c,z). The output is a distribution, which should approximate the distribution of images matching the text instead of averaging them into a blurry image.
Conditional GAN: x = G(c,z). Discriminator v1 takes only x and outputs a scalar: is x realistic or not? With this D, the generator may produce realistic x that is not related to c. Discriminator v2 takes both c and x and outputs a scalar: x is realistic AND c and x are matched. Positive examples: matched pairs from the training data, e.g. ("train", real train image). Negative examples: ("train", generated image) and mismatched pairs such as ("cat", real train image).
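A minimal sketch (not from the slides) of the v2 discriminator, which scores a (condition, image) pair; here the condition is assumed to already be a fixed-size embedding `c`, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Scores whether x is realistic AND matches the condition c."""
    def __init__(self, x_dim=784, c_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + c_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, c, x):
        return self.net(torch.cat([c, x], dim=1))  # one scalar per (c, x) pair

D = ConditionalDiscriminator()
c, x = torch.randn(16, 128), torch.randn(16, 784)
score = D(c, x)  # trained toward 1 for matched real pairs, toward 0 for the two
                 # kinds of negatives: (c, generated x) and (mismatched c, real x)
```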
Image-to-image Translation: the condition c is itself an input image (the example uses building façades), and the generator outputs x = G(c,z). https://arxiv.org/pdf/1611.07004
Image-to-image Translation, experimental results with traditional supervised learning: a NN is trained so that its output image is as close as possible to the target. At test time the output is blurry, because it is the average of several possible target images. (Figure columns: input, "close" baseline output.)
Image-to-image Translation, experimental results with GAN: the generator G takes the input image and noise z, and a discriminator D outputs 1/0. The test figure compares three settings: "close" (supervised closeness loss only), "GAN" (adversarial loss only), and "GAN + close" (adversarial loss plus the closeness loss).
Unpaired Transformation – Cycle GAN, Disco GAN: transform an object from one domain to another without paired data, e.g. Domain X: photos, Domain Y: van Gogh paintings. Disco GAN: https://arxiv.org/pdf/1703.05192.pdf
Cycle GAN (https://arxiv.org/abs/1703.10593, https://junyanz.github.io/CycleGAN/, Berkeley AI Research (BAIR) laboratory, UC Berkeley): a generator $G_{X\to Y}$ maps domain X to domain Y, and a discriminator $D_Y$ outputs 1/0 according to whether the input image belongs to domain Y. Training with $D_Y$ alone makes the output become similar to domain Y, but the generator may ignore its input, which is not what we want.
Cycle GAN: add a second generator $G_{Y\to X}$ that maps the output back to domain X; the reconstruction should be as close as possible to the original input (cycle consistency). If $G_{X\to Y}$ ignored its input, there would be a lack of information for the reconstruction, so $G_{X\to Y}$ is forced to preserve the content while $D_Y$ still checks that the intermediate image belongs to domain Y.
Cycle GAN (full, two-directional): in one direction, X → $G_{X\to Y}$ → Y → $G_{Y\to X}$ → X, with $D_Y$ (1/0) judging whether the intermediate image belongs to domain Y and a cycle-consistency loss making the reconstruction as close as possible to the input; symmetrically in the other direction, Y → $G_{Y\to X}$ → X → $G_{X\to Y}$ → Y, with $D_X$ (1/0) judging whether the intermediate image belongs to domain X.
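A minimal sketch of the two cycle-consistency terms added to the two adversarial losses (illustrative, not the official implementation; `G_xy`, `G_yx` are the two generators and `lam` weights the reconstruction term):

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_xy, G_yx, x_batch, y_batch, lam=10.0):
    """lam * ( ||G_yx(G_xy(x)) - x||_1 + ||G_xy(G_yx(y)) - y||_1 )."""
    l1 = nn.L1Loss()
    x_rec = G_yx(G_xy(x_batch))   # X -> Y -> X, should reconstruct x
    y_rec = G_xy(G_yx(y_batch))   # Y -> X -> Y, should reconstruct y
    return lam * (l1(x_rec, x_batch) + l1(y_rec, y_batch))

# Toy generators between two 64-dimensional "image" domains:
G_xy = nn.Sequential(nn.Linear(64, 64), nn.Tanh())
G_yx = nn.Sequential(nn.Linear(64, 64), nn.Tanh())
loss = cycle_consistency_loss(G_xy, G_yx, torch.randn(8, 64), torch.randn(8, 64))
# This loss is added to the adversarial losses from D_X and D_Y during training.
```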
Unpaired Transformation – turning real faces into anime characters (converting real-person portraits into anime-style characters): http://qiita.com/Hi-king/items/8d36d9029ad1203aac55 http://www.iis.ee.ic.ac.uk/cxiong/database.html
Unpaired Transformation – turning real faces into anime characters, using the code: https://github.com/Hi-king/kawaii_creator (note that this project is not Cycle GAN or Disco GAN).
Evaluation: how should we evaluate a GAN (beyond human judgment)?
Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, Roland Memisevic, "Generating images with recurrent adversarial networks", arXiv preprint, 2016
Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, Wenjie Li, "Mode Regularized Generative Adversarial Networks", ICLR, 2017
Lucas Theis, Aäron van den Oord, Matthias Bethge, "A note on the evaluation of generative models", arXiv preprint, 2015
Likelihood: sample real data $x^i$ not observed during training and evaluate the log-likelihood under the generator's distribution $P_G$:
$$L = \frac{1}{N}\sum_i \log P_G(x^i)$$
The generator maps the prior distribution to $P_G$, but we cannot compute $P_G(x^i)$; we can only sample from $P_G$.
Likelihood – Kernel Density Estimation: estimate the distribution $P_G(x)$ from samples. Draw samples from the generator; each sample becomes the mean of a Gaussian with a shared covariance (bandwidth), and a common choice is the bandwidth that minimizes an optimality criterion. Now we have an approximation of $P_G$, so we can compute $P_G(x^i)$ for each real data point $x^i$ and then compute the likelihood. https://stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms/244023
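A minimal sketch of this estimate using SciPy's Gaussian KDE (illustrative; a real evaluation would use a trained generator, many more samples, and a tuned bandwidth):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Stand-ins: samples drawn from the generator and held-out real data (2-D here).
generator_samples = rng.normal(loc=0.0, scale=1.0, size=(10000, 2))
real_data         = rng.normal(loc=0.0, scale=1.0, size=(100, 2))

# Fit a Gaussian centered on every generator sample (shared bandwidth, chosen by
# Scott's rule by default), giving an approximation of P_G.
kde = gaussian_kde(generator_samples.T)   # gaussian_kde expects shape (dims, n_samples)

# Average log-likelihood of the real data under the approximated P_G.
log_likelihood = np.mean(kde.logpdf(real_data.T))
print(log_likelihood)
```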
Likelihood – Kernel Density Estimation (https://arxiv.org/pdf/1511.01844.pdf): open issues include how many samples are needed, and the estimates can give weird results, e.g.:
Model | Likelihood
DBN | 138
GAN | 225
True Distribution | 243
K-means | 313 (?)
A simple k-means model scores higher than the true distribution, so the estimate is hard to trust.
Likelihood v.s. Quality:
Low likelihood, high quality? A generator that only reproduces images from the training set generates high-quality images, but it assigns $P_G(x^i) = 0$ to the held-out real data $x^i$, so the likelihood $L = \frac{1}{N}\sum_i \log P_G(x^i)$ is low.
High likelihood, low quality? Take any generator $G_1$ and build $G_2$ that outputs a sample from $G_1$ with probability 0.01 and noise with probability 0.99. Then
$$L = \frac{1}{N}\sum_i \log\big(0.01\, P_{G_1}(x^i)\big) = -\log 100 + \frac{1}{N}\sum_i \log P_{G_1}(x^i)$$
so the log-likelihood drops only by $\log 100 \approx 4.6$, even though 99% of $G_2$'s samples are garbage.
Evaluate by Other Networks: feed each generated image $x$ into a well-trained CNN classifier to obtain $P(y|x)$. For a single image, lower entropy of $P(y|x)$ means the classifier is confident about what the image is, i.e. higher visual quality. Averaging over many generated images $x^1, x^2, x^3, \dots$ gives the marginal distribution $\frac{1}{N}\sum_n P(y^n|x^n)$; higher entropy of this average distribution means higher variety.
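These two quantities are combined in the Inception Score, $\exp\big(\mathbb{E}_x[KL(P(y|x)\,||\,P(y))]\big)$, which is large when each $P(y|x)$ has low entropy but the marginal $P(y)$ has high entropy. A minimal sketch computing it from a matrix of classifier softmax outputs (illustrative, not from the slides; `probs[n, k]` = P(y = k | x^n) from any well-trained classifier):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs of a classifier on N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal P(y) over all generated images
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))           # exp of average KL(P(y|x) || P(y))

# Confident outputs spread over many classes -> high score (close to 10 here):
confident_varied = np.eye(10)
print(inception_score(confident_varied))
# All images confidently the same class (low variety) -> score close to 1:
print(inception_score(np.tile(np.eye(10)[0], (10, 1))))
```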
Evaluate by Other Networks - Inception Score http://www.cs.virginia.edu/~vicente/vislang/slides/stackgan.pdf http://deeploria.gforge.inria.fr/lecun-20160412-nancy-loria.pdf
Evaluate by Other Networks – Inception Score: scores reported for Improved W-GAN.
K-Nearest Neighbor (https://arxiv.org/pdf/1511.01844.pdf): use k-nearest neighbors in the training set to check whether the generator generates new objects or merely memorizes training examples.
Missing Mode (https://arxiv.org/pdf/1612.02136.pdf): if the generator misses some modes of the data, a discriminator always recognizes real examples from those modes as real with high confidence. Is the generator missing anything?
Energy-based GAN: another point of view
Original GAN: the role of the discriminator is to lead the generator. $D(x)$ is high on the data (target) distribution and low on the generated distribution, and its slope pulls the generated distribution toward the data. When the data distribution and the generated distribution are the same, the discriminator becomes useless (flat).
Original GAN The discriminator is flat in the end. Source: https://www.youtube.com/watch?v=ebMei6bYeWw (credit: Benjamin Striner)
Energy-based Model: we want to find an evaluation function F(x). Input: an object x (e.g. an image); output: a scalar measuring how good x is. Real x should have high F(x), and F(x) can be a network. Given F(x), we can find good x by generating x with large F(x). How do we find F(x)?
Energy-based GAN: how do we find F(x)? In the end, F(x) should still reflect the data distribution (high on real data, low elsewhere) rather than becoming flat like the original GAN discriminator, so it remains a usable evaluation function.
Energy-based GAN Preview: Framework of structured learning (Energy-based Model) ML Lecture 21: Structured Learning - Introduction https://www.youtube.com/watch?v=5OYu0vxXEv8 ML Lecture 22: Structured Learning - Linear Model https://www.youtube.com/watch?v=HfPw40JPays ML Lecture 23: Structured Learning - Structured SVM https://www.youtube.com/watch?v=YjvGVVrCrhQ ML Lecture 24: Structured Learning - Sequence Labeling https://www.youtube.com/watch?v=o9FPSqobMys