Evaluating diversity in machine language generation

Yesterday I was talking with Han Shao about how to measure diversity in machine language generation. Suppose there are three systems A, B and C:

A generates 3 examples; each example is in a different pattern.
B generates 100 examples; each example is in one of 5 patterns, uniformly (i.e., each pattern has 20 examples).
C generates 100 examples, 96 of which are in pattern (a), while the remaining 4 are each in a distinct other pattern.

Which one is the most diverse? We found it difficult to answer this question quantitatively, but I somewhat convinced myself that diversity can be measured as follows, using simple equations from information theory.

Let’s assume that each pattern is independent of the others. We also need to assume that the observed empirical distribution is the true distribution the model represents, so we should let each model generate as many examples as it can to obtain a good estimate of that distribution.

Let P_{\Theta}(x) denote the probability that model \Theta generates pattern x. The entropy of this distribution is

H(P_\Theta ) = -\sum_x P_\Theta (x)\log P_\Theta(x)
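
To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original discussion; pattern_entropy is a hypothetical helper name). It estimates the empirical pattern distribution from a list of pattern labels, one label per generated example, and returns its entropy in nats:

```python
import math
from collections import Counter

def pattern_entropy(patterns):
    """Entropy (in nats) of the empirical distribution over patterns.

    `patterns` is a list of pattern labels, one label per generated example.
    """
    counts = Counter(patterns)
    n = len(patterns)
    # H = -sum_x p(x) log p(x), with p(x) estimated as count(x) / n
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```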

Larger entropy typically means better diversity. We can then compute the entropy of the above three distributions:

H(P_A) = -3 \cdot \frac13 \log\frac13 = \log 3

H(P_B) = -5 \cdot \frac15 \log\frac15 = \log 5 > H(P_A)

H(P_C) = -\frac{96}{100} \log\frac{96}{100} - 4\cdot \frac{1}{100} \log\frac{1}{100} = \frac{100\log{100} - 96\log 96 }{100} \approx 0.223 < \log 3 = H(P_A)
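
For a quick numerical check (again just a sketch; the pattern labels are made up), plugging the three systems into the helper above reproduces these values:

```python
# System A: 3 examples, 3 distinct patterns
H_A = pattern_entropy(["p1", "p2", "p3"])                              # log 3 ≈ 1.099
# System B: 100 examples, 5 patterns with 20 examples each
H_B = pattern_entropy([f"p{i}" for i in range(5) for _ in range(20)])  # log 5 ≈ 1.609
# System C: 96 examples of pattern (a), plus 4 singleton patterns
H_C = pattern_entropy(["a"] * 96 + ["b", "c", "d", "e"])               # ≈ 0.223 < log 3
```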
