Do GANs actually learn the distribution? Some theory and empirics.

1 Do GANs actually learn the distribution? Some theory and empirics.
Sanjeev Arora, Princeton University and Institute for Advanced Study
GANs workshop, CVPR'18 (Funding: NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla, Amazon, DARPA-SRC)
Based on: "Generalization and Equilibrium in GANs," ICML'17 [Arora, Ge, Liang, Ma, Zhang] and "Do GANs actually learn the distribution? Some theory and empirics," ICLR'18 [Arora, Risteski, Zhang].
Coauthors: Rong Ge (Duke), Yingyu Liang (UW-Madison), Tengyu Ma (FAIR + Stanford), Andrej Risteski (MIT), Yi Zhang.

2 For purposes of this talk: GANs ≍ distribution learners
(in line with the canonical framework of [Goodfellow et al.'14] + many follow-up works). NB: in many vision applications GANs are used for image-to-image maps; not much theory there that I'm aware of.

3 How to quantify success of training for such implicit models?
Dominant mode of analysis: assume "very large" deep nets, training time, # training samples, etc. Main message of the rest of the talk: signs of trouble.

4 Things we'd like to understand about GANs
- What distributions can be generated (possibly approximately) by your favorite ConvNet architecture of size N?
- Does an equilibrium exist?
- Under what conditions (incl. how much training time + samples) does training converge to an equilibrium?
- Assuming equilibrium is reached, what guarantees do we have about the generator's distribution? Is it close to the training distribution?
[Goodfellow et al.'14]: All's well when nets, training samples, and training time are "large enough."
[This talk]: If the discriminator is finite and modest-sized, this message is incorrect (regardless of training time, # samples, training objective, etc.).

5 Part 1: Equilibrium and generalization for finite size nets
(from A, Ge, Liang, Ma, Zhang ICML’17)

6 In search of a "generalization" notion for generative models
[Figure: a generative model maps a seed drawn from N(0, I) to a point near the "manifold" of the real data distribution D_real.]
Traditional training: max E_x[log p(x)]; "generalization" = similar value on training and test data. Examples: denoising autoencoders [Vincent et al.'08], variational autoencoders [Kingma-Welling'14].
This is different from the usual deep learning apps you know of: GANs come with no estimate of log-likelihood.
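For reference (standard maximum-likelihood formulation; my notation, not spelled out on the slide), "generalization" here means the train and test log-likelihoods nearly agree:

$$\max_\theta \ \mathbb{E}_{x \sim D_{\text{real}}}[\log p_\theta(x)], \qquad \left|\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i^{\text{train}}) - \frac{1}{m}\sum_{j=1}^{m} \log p_\theta(x_j^{\text{test}})\right| \approx 0.$$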

7 Generative Adversarial Nets (GANs) [Goodfellow et al. 2014]
[Figure: generator G_u maps a seed h to a synthetic image (distribution D_synth); discriminator D_v labels inputs Real (1) or Fake (0).]
The discriminator is trained to output 1 on real inputs (from D_real) and 0 on synthetic inputs. The generator is trained to produce synthetic inputs that make the discriminator output high values.
u = trainable parameters of the generator net; v = trainable parameters of the discriminator net. [Excellent resource: Goodfellow's survey.]
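In formulas (the standard objective from [Goodfellow et al. 2014], not written out on the slide):

$$\min_u \max_v \ \mathbb{E}_{x \sim D_{\text{real}}}\big[\log D_v(x)\big] + \mathbb{E}_{h \sim N(0,I)}\big[\log\big(1 - D_v(G_u(h))\big)\big].$$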

8 GANs (W-GAN variant)
[Same figure as the previous slide.]
Wasserstein GAN [Arjovsky et al.'17]**: the objective is the difference in expected discriminator output on real vs. synthetic images. The discriminator is trained to output 1 on real inputs and 0 on synthetic inputs; the generator is trained to produce synthetic inputs that make the discriminator output 1.
(** NB: Our results will apply to most standard GAN objectives.)
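"Difference in expected output" as a formula (the standard W-GAN objective, with D_v ranging over discriminators of bounded Lipschitz constant; my rendering, not on the slide):

$$\min_u \max_v \ \mathbb{E}_{x \sim D_{\text{real}}}\big[D_v(x)\big] - \mathbb{E}_{h \sim N(0,I)}\big[D_v(G_u(h))\big].$$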

9 GANs: "Winning"/Equilibrium
[Same setup as the previous slide.]
The generator "wins" if the objective ≈ 0 and further training of the discriminator doesn't help. (An "equilibrium.")

10 Bad news [A., Ge, Liang, Ma, Zhang ICML'17]: If discriminator size = N, then ∃ a generator that generates a distribution supported on O(N log N) images and still wins against all possible discriminators (for all standard training objectives: JS, Wasserstein, etc.). (NB: the set of all possible images presumably has essentially infinite support.) ⇒ Small discriminators are inherently incapable of detecting mode collapse.

11 Counterintuitive nature of high-dimensional geometry
In ℝ^d, ∃ exp(d) directions whose pairwise angles are all > 60 degrees, and ∃ exp(d/ε) special directions s.t. every other direction is within angle ε of one of them (an "ε-net").
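A quick numerical illustration of the first claim (my own sketch, not from the slides): random unit vectors in high dimension are nearly orthogonal, so even a large random set keeps all pairwise angles well above 60 degrees.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 200                     # ambient dimension, number of directions

# n random unit vectors in R^d.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise angles (in degrees) between distinct vectors.
cos = np.clip(V @ V.T, -1.0, 1.0)
angles = np.degrees(np.arccos(cos[np.triu_indices(n, k=1)]))

# Random directions concentrate near 90 degrees, so every pairwise
# angle stays comfortably above 60 degrees.
print(f"min pairwise angle:  {angles.min():.1f} deg")
print(f"mean pairwise angle: {angles.mean():.1f} deg")
```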

12 Why a low-support distribution can "fool" small discriminators
(ε-net argument; used in most proofs of the paper)
Idea 1: Deep nets are "Lipschitz" with respect to their trainable parameters. (Changing the parameters by δ changes the deep net's output by < Cδ for some small C.)
Idea 2: If # of parameters = N, then there are ≤ exp(N/ε) "fundamentally distinct" deep nets; all others are ε-"close" to one of these (an "ε-net").
Let D_synth = a random sample of N log N/ε² images from D_real. For every fixed discriminator, Pr[it ε-discriminates between D_real and D_synth] ≤ exp(-N/ε).
Idea 2 + union bound ⇒ no discriminator can ε-discriminate.
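Schematically (my paraphrase; the slide's exponents are informal, and the paper balances the sample size so the product below is ≪ 1):

$$\Pr\big[\exists\, D \in \mathcal{N}_\varepsilon \text{ that } \varepsilon\text{-discriminates}\big] \ \le\ |\mathcal{N}_\varepsilon| \cdot \max_{D \in \mathcal{N}_\varepsilon} \Pr\big[D\ \varepsilon\text{-discriminates}\big],$$

and since every discriminator is ε-close to some member of the net (Idea 1 + Idea 2), fooling the whole ε-net fools all discriminators.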

13 Part 2: Encoder-decoder GANs do not get around these issues
(Arora, Risteski, Zhang ICLR'18)
[Audience reaction, Berkeley, Spring 2017:] "Interesting... the original GANs idea seems to have problems. I think encoder-decoder GANs get around the issues you show...", referring to BiGAN [Donahue et al.'17] and Adversarially Learned Inference (ALI) [Dumoulin et al.'17].

14 Encoder/Decoder intuition: "Manifold assumption"
[Figure: each image x has a "code" z on a low-dimensional manifold. A generator G maps codes to images and an encoder E maps images to codes ("Player 2"); a discriminator D_v ("Player 1") outputs 0/1 on (image, code) pairs.]

15 Encoder/Decoder intuition: "Manifold assumption"
[Same figure as the previous slide, with the encoder E and generator G labeled.]

16 Intuition behind BiGANs: Manifold assumption
[Figure: Setting 1: the discriminator is shown a real image x together with its encoding E(x).]

17 Intuition behind BiGANs: Manifold assumption
[Figure: Setting 2: the discriminator is shown a generated image G(z) together with the seed z.]
Player 1 trains to maximize the probability of discriminating between the two settings; Player 2 trains to minimize it. Intuition: this forces the encoder to map images to their codes.
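In formulas (the standard BiGAN/ALI objective, my rendering): the discriminator must tell pairs (x, E(x)) with x ~ D_real from pairs (G(z), z) with z ~ N(0, I):

$$\min_{G,E} \max_{v} \ \mathbb{E}_{x \sim D_{\text{real}}}\big[\log D_v(x, E(x))\big] + \mathbb{E}_{z \sim N(0,I)}\big[\log\big(1 - D_v(G(z), z)\big)\big].$$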

18 Theorem: ∃ approx. equilibrium s.t. Player 1 is fooled, but the generator produces a distribution of low support and the encoder maps images to white noise. (The worst imaginable failure mode.)
Proof idea: Assume all images are corrupted with white noise in a fixed pattern, e.g., every 50th pixel is random (clearly, they still look realistic); denote such an image by x + h.
Encoder E: map image x + h to its noise pattern h.
Generator G: randomly partition all Gaussian seeds into a small number of regions and associate each region with a single image. Given seed h, take the image X associated with h's region and output X + h.
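A toy sketch of this construction (illustrative only; all names, sizes, and the hashing trick are my own choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
PIXELS, STRIDE = 1024, 50                     # image length; every 50th pixel is "noise"
noise_idx = np.arange(0, PIXELS, STRIDE)      # fixed positions of the noise pattern

# A small pool of memorized images: a low-support generator.
memorized = rng.uniform(size=(16, PIXELS))

def generator(h):
    """Map seed h to a memorized image, planting h's values in the noise slots."""
    region = hash(h.tobytes()) % len(memorized)   # crude "partition of seeds into regions"
    x = memorized[region].copy()
    x[noise_idx] = h[:len(noise_idx)]             # the seed doubles as the noise pattern
    return x

def encoder(img):
    """Ignore image content entirely; output only the noise slots (white noise)."""
    return img[noise_idx]

h = rng.standard_normal(PIXELS)
img = generator(h)
print(np.allclose(encoder(img), h[:len(noise_idx)]))  # True: E(G(h)) recovers the "code"
```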

19 How to check the support size of the generator's distribution?
The theorem suggests the GANs training objective is not guaranteed to avoid mode collapse (the generator can "win" using distributions with low support). Does this happen during real-life training?
Part 3: Empirically detecting mode collapse (the Birthday Paradox Test) (from Arora, Risteski, Zhang ICLR'18)

20 If you put 23 random people in a room, the chance is > 1/2 that two of them share a birthday.
Suppose a distribution is supported on N images. Then Pr[a sample of size √N contains a duplicate image] > 1/2.
Birthday paradox test [Arora, Risteski, Zhang]: if a sample of size s has near-duplicate images with prob. > 1/2, then the distribution has only ≈ s² distinct images.
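A minimal sketch of the test (my own illustrative code; the paper has a human inspect the most similar pairs rather than relying on an automatic threshold):

```python
import numpy as np

def birthday_paradox_test(samples, top_k=20):
    """Return the top_k most similar pairs in a batch of generated images.

    samples: array of shape (s, d) holding s flattened images.
    A human inspects these pairs for near-duplicates; if batches of size s
    contain near-duplicates with probability > 1/2, the implied support
    size of the generator's distribution is ~ s**2.
    """
    s = len(samples)
    diff = samples[:, None, :] - samples[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    d2[np.tril_indices(s)] = np.inf        # keep each unordered pair once
    order = np.argsort(d2, axis=None)[:top_k]
    pairs = np.unravel_index(order, d2.shape)
    return [(int(i), int(j), float(np.sqrt(d2[i, j]))) for i, j in zip(*pairs)]

# Toy check: a "mode-collapsed" generator supported on only 100 images.
rng = np.random.default_rng(0)
support = rng.uniform(size=(100, 64))                     # the 100 distinct images
batch = support[rng.integers(0, 100, size=25)]            # s = 25 > sqrt(100)
batch = batch + 0.01 * rng.standard_normal(batch.shape)   # mild noise
print(birthday_paradox_test(batch)[:3])  # smallest distances reveal near-duplicates
```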

21 Duplicates: DC-GAN on CelebA (faces) [Radford et al.'15]
Near-duplicates found among 500 samples ⇒ implied support size ≈ 500² = 250k. (The training set has size 200k.)
Diversity ≈ 1M for BiGANs [Donahue et al.'17] and ALI [Dumoulin et al.'17] (encoder/decoder + GANs).
Support size seems to grow near-linearly with discriminator capacity, consistent with the theory. (Needs more complete testing...)

22 Stacked GAN on CIFAR-10 [Huang et al. 2016]
[Figure: duplicate images shown alongside the nearest image in the training set.]
Sample size needed for collisions (training set: 5K per category):
Truck: 100 (implied support ≈ 100² = 10K)
Horse: 200 (implied support ≈ 200² = 40K)
Dog:

23 Recap of talk (expository articles at "Off the convex path")
Theory suggests that GANs training can appear to succeed (near-equilibrium; generator wins, etc.) and yet the learnt distribution is far from the target distribution. Tinkering with the objective doesn't help.
A follow-up empirical study suggested this problem does arise in real-life training.
Open: can real-life GANs training be modified to avoid such issues? (Analysis must take into account some properties of the optimization path, or of the real-life distribution...) (A recent paper [Bai, Ma, Risteski'18] takes a first step in accounting for the distribution.)

24 Where to go from here? (Expository articles at "Off the convex path.")
Empirically check all GAN variants to see if any of them evade mode collapse.
Formalize other ways in which GANs are useful (e.g., as image-to-image mappings). Experiments suggest usefulness, but a theoretical formalization is needed.
Perhaps distribution learning is not the right way to do representation learning? (See the expository article.)
A more complete theory of GANs, and a formalization of generalization (ours is a first step). Also, an understanding of the dynamics of training.
THANK YOU!

