Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis
Speaker: ZHAO Jian (zhaojian90@u.nus.edu)
Homepage: https://zhaoj9014.github.io/
Affiliation: Learning and Vision Group, ECE, NUS
[Figure: qualitative results, synthesized (S) vs. ground truth (GT) frontal faces for inputs at 90°, 75°, and 45° yaw]
$L_{tv} = \sum_{i,j} \left( (x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2 \right)^{\beta/2}$
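As a sanity check, here is a minimal NumPy sketch of this total-variation term, assuming `x` is a single-channel H x W image and `beta` is the exponent above (the implementation details on the next slide use BETA = 1):

```python
import numpy as np

def tv_loss(x, beta=1.0):
    # dh[i, j] = x[i, j+1] - x[i, j]; dv[i, j] = x[i+1, j] - x[i, j],
    # both restricted to the common (H-1) x (W-1) region so the two
    # squared terms align elementwise before summation.
    dh = x[:-1, 1:] - x[:-1, :-1]
    dv = x[1:, :-1] - x[:-1, :-1]
    return np.sum((dh ** 2 + dv ** 2) ** (beta / 2.0))
```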
Implementation Details
- LR: 0.0001, BATCH_SIZE: 10, INPUT_SIZE: 128 x 128 x 3
- BETA: 1, ALPHA: 0.001
- LAMBDA_1: 0.3, LAMBDA_2: 0.001, LAMBDA_3: 0.003, LAMBDA_4: 0.0001
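For orientation, a hedged sketch of how these weights combine the TP-GAN generator losses, following the paper's synthesis objective L_syn = L_pixel + λ1·L_sym + λ2·L_adv + λ3·L_ip + λ4·L_tv; the individual loss values are assumed to be precomputed scalars:

```python
LAMBDA_1 = 0.3     # symmetry loss weight
LAMBDA_2 = 0.001   # adversarial loss weight
LAMBDA_3 = 0.003   # identity-preserving (ip) loss weight
LAMBDA_4 = 0.0001  # total-variation loss weight

def synthesis_loss(l_pixel, l_sym, l_adv, l_ip, l_tv):
    # The pixel loss carries an implicit weight of 1.0, which dwarfs
    # the other terms -- the imbalance criticized later in this deck.
    return (l_pixel
            + LAMBDA_1 * l_sym
            + LAMBDA_2 * l_adv
            + LAMBDA_3 * l_ip
            + LAMBDA_4 * l_tv)
```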
[Figure: synthesized frontal views for inputs at 90°, 75°, 60°, 45°, 30°, and 15° yaw, with ground truth (GT)]
Quantitative Results
Underlying Problems for Re-implementation
- The "Template Landmark Location" for patch position aggregation is not given. How should the feature maps of the 4 Landmark Located Patch Networks be fused? One option is to estimate the template from the frontal GT (a hypothetical sketch follows this slide).
- What if we cannot detect all 4 patches (e.g., left eye <-> right eye confusion under extreme pose)?
- The network architecture of the discriminator is not given; it might be the same as the encoder of the generator's global network.
- Training details are not given.
- Re-implementation is tricky, time-consuming, and not guaranteed to work properly.
- The weight for the pixel loss is too large, while the weights for the other losses are too small. The other losses therefore act more as gimmicks, which is unreasonable for a GAN-based framework and leads to a severe overfitting problem.
- Still, we can try some improvements (e.g., acceleration, adaptive aggregation of losses, a Siamese structure, learning to learn, more training data).
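One hypothetical way to fill in the unspecified patch aggregation, as mentioned above: estimate template landmark centers from the frontal GT images, paste each patch network's feature map onto a blank canvas at its template location, and max-fuse any overlaps. The template coordinates, canvas size, and fusion rule below are all assumptions, not the authors' design.

```python
import numpy as np

# Assumed template landmark centers (row, col) on a 128 x 128 canvas,
# e.g. estimated by averaging detected landmarks over frontal GT images.
TEMPLATE_CENTERS = {
    "left_eye":  (54, 44),
    "right_eye": (54, 84),
    "nose":      (74, 64),
    "mouth":     (98, 64),
}

def fuse_patches(patch_maps, canvas_hw=(128, 128)):
    """patch_maps: dict name -> (h, w, c) feature map; each patch must
    fit inside the canvas when centered at its template location."""
    channels = next(iter(patch_maps.values())).shape[-1]
    canvas = np.zeros(canvas_hw + (channels,), dtype=np.float32)
    for name, fmap in patch_maps.items():
        h, w, _ = fmap.shape
        r0 = TEMPLATE_CENTERS[name][0] - h // 2
        c0 = TEMPLATE_CENTERS[name][1] - w // 2
        # Max-fuse where patches overlap (a simple, arbitrary choice).
        canvas[r0:r0 + h, c0:c0 + w] = np.maximum(
            canvas[r0:r0 + h, c0:c0 + w], fmap)
    return canvas
```

Q & A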
Re-implementation results (preliminary)
- Dataset: Multi-PIE (4,324 images, 250 subjects, 0-90° yaw)
- LR: 0.0001, BATCH_SIZE: 10, INPUT_SIZE: 128 x 128 x 3
- BETA: 1, ALPHA: 0.001, LAMBDA_1: 0.3, LAMBDA_2: 0.001, LAMBDA_3: 0.003, LAMBDA_4: 0.0001
- Iterations: ~100k
- Training time: ~1 day; testing time: 20 ms / image
Re-implementation results (preliminary)
Problems: poor generalization capacity -> overfitting? Sub-optimal hyperparameters? Unreasonable loss weights?
[Figure: synthesis results w/o the pixel loss]
Re-implementation results (preliminary)
Problems: poor generalization capacity -> overfitting? Sub-optimal hyperparameters? Unreasonable loss weights?
[Figure: TP-GAN on other data]
Present results
Possible solutions
- Incorporate all available Multi-PIE data for training TP-GAN.
- Improve and modify the pixel loss (L1 loss), since this supervision signal is too strong and makes the network overfit quickly. In the original TP-GAN, the discriminator seems to contribute little to the optimization of the generator, which suggests the authors rely on the pixel loss (with its large weight of 1.0) to make the generator memorize each frontal ground truth. One possible remedy is sketched after this slide.
- Improvement on the generator: domain-adversarial training -> global generator & 4 local patch generators.
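A minimal sketch of one way to weaken the overly strong pixel supervision: decay the pixel-loss weight over training instead of fixing it at 1.0. The schedule and its constants (`half_life`, `floor`) are illustrative assumptions, not tuned values.

```python
def pixel_loss_weight(step, w0=1.0, half_life=20000, floor=0.1):
    # Halve the pixel-loss weight every `half_life` iterations,
    # never letting it drop below `floor`.
    return max(floor, w0 * 0.5 ** (step / half_life))

def l1_pixel_loss(pred, gt):
    # Plain L1 pixel loss between the synthesized and GT frontal faces;
    # works for NumPy arrays and PyTorch tensors alike.
    return abs(pred - gt).mean()

# Usage inside a training loop (other loss terms omitted):
#   loss = pixel_loss_weight(step) * l1_pixel_loss(pred, gt) + other_terms
```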
Possible solutions
- Improvement on the discriminator: adopt a Siamese-like discriminator and inject dynamic convolution to capture more information via domain transfer learning, i.e., use a pre-trained LightCNN to predict the dynamic kernel weights of the discriminator (replacing the ip loss), so as to better optimize the generator for photorealistic frontal face synthesis. A hypothetical sketch follows this slide.
- Tune the relevant parameters.
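A hypothetical PyTorch sketch of the dynamic-convolution idea above: an identity embedding (standing in for pre-trained LightCNN features) predicts a per-sample conv kernel that is applied to the discriminator's feature map. The layer sizes (`id_dim`, `in_ch`, `out_ch`, `k`) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvHead(nn.Module):
    """Discriminator head whose conv kernel is predicted per sample
    from an identity embedding (e.g., frozen LightCNN features)."""

    def __init__(self, id_dim=256, in_ch=64, out_ch=64, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        # Map the identity embedding to a full conv kernel.
        self.kernel_pred = nn.Linear(id_dim, out_ch * in_ch * k * k)

    def forward(self, feat, id_embed):
        # feat: (B, in_ch, H, W); id_embed: (B, id_dim) from the
        # pre-trained identity network (kept fixed during training).
        b = feat.size(0)
        w = self.kernel_pred(id_embed).view(
            b * self.out_ch, self.in_ch, self.k, self.k)
        # Grouped-conv trick: apply a different kernel to each sample.
        feat = feat.view(1, b * self.in_ch, *feat.shape[2:])
        out = F.conv2d(feat, w, padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, *out.shape[2:])
```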
Thank you!