Linear Storytelling: Projection Methods for StyleGANv2 Trained on Movies
Overview
Generative models for videos currently lag far behind models for images, both in quality and in how well we understand their internal workings. During my senior year, I worked with Jonas Wulff in Antonio Torralba’s lab on improving the quality of StyleGANv2 inversion. We experimented with new loss functions and inversion methods, and our two novel inversion methods greatly improved temporal consistency: the first used a latent distance loss, and the second used a novel scene-based approach.
Background on Generative Adversarial Networks
Generative adversarial networks (GANs) are neural networks trained to produce data that matches the distribution of their training data. They do this by training two networks side by side: a generator and a discriminator. The generator maps random latent vectors to data points and is trained to fool the discriminator, which in turn is trained to distinguish generated data from real training data. In theory, the generator eventually creates synthetic data that is indistinguishable from the training data.
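The alternating training described above can be sketched in a few lines. This is a minimal toy example, not any particular GAN: the network sizes, learning rates, and the 2-d "data" are all illustrative.

```python
import torch
import torch.nn as nn

# Toy setup: the generator maps 8-d noise to a 2-d "data point";
# the discriminator scores how real a point looks.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 2) * 0.5 + 1.0  # stand-in training batch

# Discriminator step: push real points toward label 1, fakes toward 0.
z = torch.randn(32, 8)
fake = G(z).detach()  # detach so this step does not update G
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
z = torch.randn(32, 8)
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In practice these two steps alternate for many iterations over the real dataset.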
Background on StyleGANv2
StyleGANv2 is a generative adversarial network that has had a great deal of success on images: it can create hyper-realistic pictures of faces from random noise. StyleGAN introduced a GAN architecture in which the input latent code is passed through a mapping network to produce an intermediate latent code. This intermediate latent code is fed into every layer of the generator during its upscaling operations, which lets it control the image at different scales and greatly increases the quality of generated images.
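The structure described above (mapping network, intermediate latent, per-layer style injection) can be illustrated with a heavily simplified sketch. This is not the real StyleGAN architecture: the layer counts, channel sizes, and the crude multiplicative modulation stand in for StyleGANv2's weight demodulation.

```python
import torch
import torch.nn as nn

class TinyStyleGenerator(nn.Module):
    """Minimal sketch of the StyleGAN idea: a mapping network turns z into
    an intermediate latent w, and w modulates every synthesis layer."""
    def __init__(self, z_dim=8, w_dim=8, channels=4, n_layers=3):
        super().__init__()
        self.mapping = nn.Sequential(          # z -> w mapping network
            nn.Linear(z_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim))
        self.const = nn.Parameter(torch.randn(1, channels, 4, 4))
        self.styles = nn.ModuleList(
            [nn.Linear(w_dim, channels) for _ in range(n_layers)])
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1)
             for _ in range(n_layers)])

    def forward(self, z):
        w = self.mapping(z)                     # intermediate latent code
        x = self.const.expand(z.shape[0], -1, -1, -1)
        for style, conv in zip(self.styles, self.convs):
            s = style(w)[:, :, None, None]      # per-channel style from w
            x = conv(x * (1 + s))               # crude style modulation
            x = nn.functional.interpolate(x, scale_factor=2)  # upscale
        return x

G = TinyStyleGenerator()
img = G(torch.randn(2, 8))  # spatial size grows 4x4 -> 32x32 over 3 layers
```

The key point the sketch captures is that the same w influences every resolution of the synthesis, which is what gives StyleGAN its scale-wise control.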
Background on the GAN Inversion Problem
GANs map random noise, or latents, to realistic data points similar to the training data. Given an unseen data point, for example a face the GAN was not trained on, one may want to know which latent, when passed to the generator, produces a similar face. This problem is known as GAN inversion. The naive way to solve it is to sample a random latent, pass it through the generator, compare the generated image against the target image using a perceptual loss, and backpropagate to update the latent. This is repeated for many iterations.
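The naive optimization loop above can be sketched as follows. For self-containment, a small random network stands in for a pretrained StyleGANv2, and a plain MSE loss stands in for the perceptual loss used in practice (e.g. LPIPS); only the latent is updated, never the generator.

```python
import torch
import torch.nn as nn

# Stand-in generator; in practice this is a frozen, pretrained StyleGANv2.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))
for p in G.parameters():
    p.requires_grad_(False)

target = G(torch.randn(1, 8)).detach()  # the "image" we want to invert

z = torch.randn(1, 8, requires_grad=True)  # the latent being optimized
opt = torch.optim.Adam([z], lr=0.05)

losses = []
for step in range(200):
    recon = G(z)                             # generate from current latent
    loss = ((recon - target) ** 2).mean()    # MSE in place of perceptual loss
    losses.append(loss.item())
    opt.zero_grad(); loss.backward(); opt.step()
```

After optimization, z is the recovered latent: feeding it to the generator reproduces an image close to the target.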
Experiments
We experimented with two improvements to the naive inversion method: a new temporally dependent latent distance loss metric and a novel scene-based inversion method.
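A temporally dependent latent distance loss can be sketched as a per-frame reconstruction term plus a penalty on the distance between consecutive frames' latents, which discourages the latent trajectory from jumping between frames. The exact loss and weight used in our experiments differ; the function below is an illustrative sketch, and `lam` is a hypothetical weight.

```python
import torch

def temporal_inversion_loss(recons, targets, latents, lam=0.1):
    """Sketch of a temporally dependent inversion loss.

    recons, targets: (T, D_img) reconstructed and target frames
    latents:         (T, D_lat) per-frame latents being optimized
    lam:             illustrative weight on the temporal term
    """
    recon = ((recons - targets) ** 2).mean()            # per-frame fidelity
    latent_dist = ((latents[1:] - latents[:-1]) ** 2).mean()  # smoothness
    return recon + lam * latent_dist
```

With perfect reconstructions, a sequence of identical latents incurs zero loss, while a jittery latent sequence is penalized, which is the temporal-consistency pressure this term provides.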
Findings
Both of our inversion methods greatly improved reconstruction quality and temporal consistency, and each has properties suited to different applications. Our sequential inversion method is suited to real-time applications while retaining most of our temporal consistency and quality improvements. When time is not limited, our scene-dependent inversion method better handles scene cuts and produces a linearly interpolatable latent, which may be useful for frame or motion interpolation.