Why cross entropy instead of KL divergence?

This is a review note for the course TTIC 31230: Fundamentals of Deep Learning, taught by Prof. David McAllester.

Given two discrete distributions P, Q, the Kullback–Leibler divergence (KL divergence) between them can be written as

KL(P \| Q) = \mathbb{E}_{y\sim P} \log\frac{P(y)}{Q(y)}

The cross entropy between them is

\begin{aligned} H(P, Q) &= -\mathbb{E}_{y \sim P} \log Q(y) = KL(P \| Q) - \mathbb{E}_{y \sim P}  \log P(y) \\&= KL(P \| Q) + H(P) \end{aligned}
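As a quick sanity check, here is a minimal NumPy sketch, using two made-up distributions over four outcomes, that verifies the decomposition H(P, Q) = KL(P \| Q) + H(P) numerically:

```python
import numpy as np

# Two hypothetical discrete distributions over 4 outcomes (numbers made up for illustration).
P = np.array([0.1, 0.2, 0.3, 0.4])
Q = np.array([0.25, 0.25, 0.25, 0.25])

kl = np.sum(P * np.log(P / Q))          # KL(P || Q) = E_{y~P} log(P(y)/Q(y))
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q) = -E_{y~P} log Q(y)
entropy = -np.sum(P * np.log(P))        # H(P)

# The decomposition H(P, Q) = KL(P || Q) + H(P) holds up to floating-point error.
assert np.isclose(cross_entropy, kl + entropy)
print(kl, cross_entropy, entropy)
```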

Given a fixed P, minimizing H(P, Q) over Q is equivalent to minimizing KL(P \| Q). In fact, KL divergence is a more natural measure of the difference between two distributions than cross entropy, since its minimum value does not depend on P: it always reaches 0 exactly when Q is the same distribution as P, whereas the minimum of H(P, Q) is H(P). Nevertheless, most existing deep learning frameworks use cross entropy rather than KL divergence. Is there any reason besides (slight) computational efficiency?
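Before turning to that question, here is a toy check of the equivalence claim above. It assumes a fixed Bernoulli P (p = 0.3, made up) and scans a grid of candidate Bernoulli models Q; both objectives are minimized at the same Q, with the KL minimum at 0 and the cross-entropy minimum at H(P):

```python
import numpy as np

p = 0.3                                   # hypothetical fixed P = Bernoulli(0.3)
qs = np.linspace(0.01, 0.99, 99)          # candidate models Q = Bernoulli(q)

P = np.array([1 - p, p])
H_P = -np.sum(P * np.log(P))              # entropy of P, a constant w.r.t. Q

cross_entropy = np.array([-np.sum(P * np.log([1 - q, q])) for q in qs])
kl = cross_entropy - H_P                  # KL(P || Q) = H(P, Q) - H(P)

# Both objectives are minimized by the same q (namely q = p = 0.3);
# the KL minimum is (close to) 0, the cross-entropy minimum is H(P).
assert qs[np.argmin(cross_entropy)] == qs[np.argmin(kl)]
print(qs[np.argmin(cross_entropy)], kl.min(), cross_entropy.min(), H_P)
```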

The answer is yes, and the reason is theoretical. First, we have to distinguish the population from the empirical distribution. Consider the problem of (English) language modeling: all the sentences in English form the population. We can never know the probability of any given sentence, as the number of possible sentences is infinite. However, we can take Wikipedia, or any other corpus, as a set of sentences sampled from the population, and assume that the sample is unbiased (though this might not always be true). For a sentence (or any other instance) y, let Pop(y) denote the true probability of y (which we do not know), P(y) denote the empirical probability of y (i.e., its probability in the samples), and Q(y) denote the probability of y estimated by the trained model. Under the unbiasedness assumption, we have

H(Pop, Q) = -\mathbb{E}_{y\sim Pop} \log Q(y) =  -\mathbb{E}_{y\sim P} \log Q(y) = \sum_{y} -P(y)\log Q(y)

which is exactly what the cross entropy loss computes, while

\begin{aligned}  KL(Pop \| Q) &= H(Pop, Q) - H(Pop) \\ H(Pop) &=  -\mathbb{E}_{y\sim Pop} \log Pop(y) \end{aligned}

However, since we cannot know Pop(y) for any sentence y, we cannot compute H(Pop), and therefore we cannot measure KL(Pop \| Q).
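As a concrete toy illustration with made-up numbers, the sketch below draws a corpus from a known Pop, computes the cross entropy loss as the sample average of -log Q(y), and shows that it estimates H(Pop, Q); KL(Pop \| Q) would additionally require H(Pop), which is only available here because Pop is invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup with made-up numbers: a "vocabulary" of 5 sentences, a true
# distribution Pop (normally unknown), and a model distribution Q.
Pop = np.array([0.4, 0.25, 0.2, 0.1, 0.05])
Q   = np.array([0.3, 0.3, 0.2, 0.1, 0.1])

# An (assumed unbiased) corpus: i.i.d. samples drawn from Pop.
samples = rng.choice(len(Pop), size=100_000, p=Pop)

# What the cross-entropy loss actually computes: the average of -log Q(y) over
# the samples, i.e. -E_{y~P} log Q(y) under the empirical distribution P.
loss = -np.log(Q[samples]).mean()

# It estimates H(Pop, Q), which we can evaluate here only because Pop is made up.
H_Pop_Q = -np.sum(Pop * np.log(Q))

# KL(Pop || Q) additionally needs H(Pop), i.e. knowledge of Pop(y) itself,
# so it cannot be estimated the same way for real language data.
H_Pop = -np.sum(Pop * np.log(Pop))
print(loss, H_Pop_Q, H_Pop_Q - H_Pop)
```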

From this perspective, it is worth noting that the cross entropy loss is estimating H(Pop, Q) rather than H(P, Q), under the assumption that the data consists of samples drawn independently from the true distribution.