I just finished implementing Generative Adversarial Networks. There are a few things that I'm confused about.
- In my experiments, training diverges whenever the target distribution has regions of zero density directly adjacent to regions of high density. An extreme case is a discrete distribution, where all of the mass is concentrated at single points.
I think this happens because the generator loss strongly penalizes assigning probability density to places where the target has no mass.
In the paper, the authors consider two different losses for the generator:
a: -1.0 * log(D(G(z)))
b: log(1.0 - D(G(z)))
As D(G(z)) -> 0.0, the first loss approaches infinity and the second loss approaches 0.0.
As D(G(z)) -> 1.0, the first loss approaches 0.0 and the second loss approaches negative infinity.
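To make the asymmetry concrete, here's a quick numerical check of the two losses near the endpoints (plain NumPy; values are clamped slightly away from 0 and 1 only to avoid log(0)):

```python
import numpy as np

def loss_a(d):
    # -log(D(G(z))): blows up as D(G(z)) -> 0, flat near D(G(z)) -> 1
    return -np.log(d)

def loss_b(d):
    # log(1 - D(G(z))): flat near D(G(z)) -> 0, blows up (negatively) near 1
    return np.log(1.0 - d)

for d in [1e-6, 0.5, 1.0 - 1e-6]:
    print(f"D(G(z)) = {d:.6f}  loss_a = {loss_a(d):9.4f}  loss_b = {loss_b(d):9.4f}")
```

So a sample that the discriminator confidently rejects costs loss (a) about 13.8 at D(G(z)) = 1e-6, while loss (b) charges it essentially nothing, which matches the intuition about mass in zero-density regions.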
I think that the second loss is much better when the true distribution has discontinuities, because it lets the network put a small amount of mass in places that shouldn't have any without facing a severe penalty. I ran a simple experiment where I trained the model to reproduce a mixture of a normal distribution and a gamma distribution. The gamma density is discontinuous at the edge of its support, which the first loss has a hard time handling:
Loss a: [figure]
Loss b: [figure]
In both cases green is the true distribution, blue is the generated distribution, and the dots are D(G(z)) from the discriminator. As you can see, the second loss performs a lot better. Nonetheless, the paper advocates for the first loss in Section 3.
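For anyone who wants to reproduce the setup, here's a sketch of a target distribution like the one I used. The mixture weight and the normal/gamma parameters here are made up for illustration (the exact values don't matter much); shape = 1 gives a gamma density with a jump at x = 0, i.e. zero density touching high density:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_target(n, p_normal=0.5):
    """Sample n points from a normal/gamma mixture (hypothetical parameters)."""
    from_normal = rng.random(n) < p_normal
    normal = rng.normal(loc=-2.0, scale=0.5, size=n)
    # shape=1.0 -> exponential-like density, discontinuous at x = 0
    gamma = rng.gamma(shape=1.0, scale=1.0, size=n)
    return np.where(from_normal, normal, gamma)

x = sample_target(10_000)
```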
Perhaps there's another generator loss that gets the benefits of both?
I've found that the generator is a lot harder to optimize than the discriminator.
The experimental results in the paper are for real data. Has anyone run experiments on synthetic data, where the true distribution is known, to check that the model is well calibrated?