This is a review note for the course TTIC 31230 Fundamentals of Deep Learning, instructed by Prof. David McAllester.
Given two discrete distributions $P$ and $Q$, the Kullback–Leibler divergence (KL divergence) between them can be written as

$$\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$

The cross entropy between them is

$$H(P, Q) = -\sum_{x} P(x) \log Q(x).$$
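As a quick numerical illustration, here is a minimal NumPy sketch that computes both quantities for two made-up distributions over four outcomes (the values of `P` and `Q` are arbitrary, chosen only for the example):

```python
import numpy as np

# Two made-up discrete distributions over four outcomes.
P = np.array([0.5, 0.25, 0.15, 0.10])
Q = np.array([0.4, 0.30, 0.20, 0.10])

kl = np.sum(P * np.log(P / Q))           # KL(P || Q)
cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)

print(f"KL(P || Q) = {kl:.4f}")
print(f"H(P, Q)    = {cross_entropy:.4f}")
```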
Given a fixed $P$, there is no difference between minimizing $H(P, Q)$ and minimizing $\mathrm{KL}(P \,\|\, Q)$ to find an optimal $Q$, since the two differ only by a term that does not depend on $Q$ (see the decomposition below). In fact, KL divergence is more natural than cross entropy as a measure of the difference between two distributions, because its lower bound is not affected by the distribution $P$: it can always reach $0$, namely when $Q$ is the same distribution as $P$, whereas the cross entropy is bounded below by the entropy $H(P)$.
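To spell out why the two objectives are interchangeable for a fixed $P$, expanding the definitions above gives

$$H(P, Q) = -\sum_x P(x)\log Q(x) = -\sum_x P(x)\log P(x) + \sum_x P(x)\log\frac{P(x)}{Q(x)} = H(P) + \mathrm{KL}(P \,\|\, Q),$$

so the two losses differ only by the entropy $H(P)$, which is constant with respect to $Q$, and the cross entropy is minimized exactly when the KL divergence is.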
However, most existing deep learning frameworks use cross entropy rather than KL divergence as the loss. Is there any reason besides the (slight) gain in computational efficiency?
The answer is yes, and the reason is theoretical. First, we have to distinguish between the population and the empirical distribution. Consider the problem of (English) language modeling: all the sentences in English form the population. We can never know the true probability of any given sentence, as the number of possible sentences is infinite. However, we could take Wikipedia, or any other corpus, as a set of sentences sampled from the population, and assume that the sample is unbiased (though this might not always be true). For a sentence (or any other instance) $x$, let $P(x)$ denote the true probability of $x$ (which we do not know), $\hat{P}(x)$ denote the empirical probability of $x$ (i.e., its probability in the samples), and $Q(x)$ denote the probability of $x$ estimated by the trained model. With the unbiased assumption, for a corpus of sampled sentences $x_1, \dots, x_N$ we have

$$H(P, Q) = -\sum_x P(x)\log Q(x) \approx -\sum_x \hat{P}(x)\log Q(x) = -\frac{1}{N}\sum_{i=1}^{N}\log Q(x_i),$$
which is exactly what we compute with the cross entropy loss, while

$$\mathrm{KL}(P \,\|\, Q) \approx \sum_x \hat{P}(x)\log\frac{P(x)}{Q(x)} = \frac{1}{N}\sum_{i=1}^{N}\log\frac{P(x_i)}{Q(x_i)}.$$

However, we are unable to know $P(x)$ for any sentence $x$, hence we are not able to measure $\mathrm{KL}(P \,\|\, Q)$.
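A small Monte Carlo sketch makes this concrete. Here $P$ is a toy distribution that is hard-coded only so the estimate can be compared against the exact values; in a real language-modeling setting only the samples would be available:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true" distribution P over a four-word vocabulary, and a model Q.
# In practice P is unknown; it is hard-coded here only for comparison.
P = np.array([0.5, 0.3, 0.15, 0.05])
Q = np.array([0.4, 0.3, 0.20, 0.10])

# An unbiased corpus: i.i.d. samples drawn from P.
corpus = rng.choice(len(P), size=100_000, p=P)

# Cross entropy estimated from the samples alone: -(1/N) * sum_i log Q(x_i).
ce_from_samples = -np.mean(np.log(Q[corpus]))

# Exact values, computable only because P is known in this toy setting.
ce_exact = -np.sum(P * np.log(Q))
kl_exact = np.sum(P * np.log(P / Q))

print(f"H(P, Q) from samples: {ce_from_samples:.4f}")
print(f"H(P, Q) exact:        {ce_exact:.4f}")
print(f"KL(P || Q) exact:     {kl_exact:.4f}")  # needs P(x), so not estimable from samples alone
```

The sampled estimate of $H(P, Q)$ needs only $\log Q(x_i)$ for the observed sentences, while estimating $\mathrm{KL}(P \,\|\, Q)$ the same way would require the unknown $P(x_i)$.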
From this point of view, it is worth noting that what we measure with the cross entropy loss is $H(P, Q)$ rather than $\mathrm{KL}(P \,\|\, Q)$, under the assumption that the data consists of samples independently drawn from the true distribution.
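This is also what the standard framework losses compute in practice. As a small sketch (the logits and labels below are made up), PyTorch's `F.cross_entropy` averages $-\log Q(x_i)$ over the observed labels, whereas its `F.kl_div` would additionally need the full target distribution, which real data never provides:

```python
import torch
import torch.nn.functional as F

# Made-up batch: logits over a five-token vocabulary for three examples.
logits = torch.randn(3, 5)
targets = torch.tensor([0, 2, 4])   # observed (sampled) tokens

# Cross entropy against the observed samples: the mean of -log Q(x_i).
ce = F.cross_entropy(logits, targets)

# The same value computed by hand from the model's log-probabilities.
log_q = F.log_softmax(logits, dim=-1)
ce_manual = -log_q[torch.arange(3), targets].mean()

# KL divergence needs the full target distribution P; a uniform placeholder
# is used here because the true P is unavailable for real data.
P_placeholder = torch.full((3, 5), 0.2)
kl = F.kl_div(log_q, P_placeholder, reduction="batchmean")

print(ce.item(), ce_manual.item(), kl.item())
```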