# Pursuing the state of the art or not: Standing on the overlap between academia and industry

In recent years, people doing artificial intelligence (AI) research have enjoyed pursuing the state of the art. Exciting work (e.g., ELMo, BERT, and GPT-2) has been proposed and shown to be extremely useful for reaching new state-of-the-art results on many downstream tasks (e.g., SentEval, GLUE, or any other specific task like text classification).

A recent success of ELMo appears in the paper Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders, NAACL 2019. The goal of the work is to parse a sentence into its constituency parse tree without any explicit labels, and it achieves the best performance so far on the task. The authors applied several cutting-edge techniques to reach this nice performance: they used ELMo contextual embeddings as pretrained word embeddings, and trained on the SNLI and MultiNLI datasets, which are significantly larger than the standard PTB training set.

It was fun to see an F1 score of 56 for unsupervised constituency parsing on the full PTB test set; however, the paper did not include the part I am personally most interested in. I would be more excited to see an in-depth analysis showing whether the gain comes from the novel model structure, from ELMo, or simply from the larger quantity of training data. For example, two of the baselines, the Parsing-Reading-Predict Network (PRPN; Shen et al., 2018) and Ordered Neurons (ON-LSTM; Shen et al., 2019), were trained on the standard PTB training set with no pretrained word embeddings. Given that, can one really conclude that 100% of the performance gain comes from the fantastic model structure?

There are two typical styles of AI research: boosting the performance (I call it B-style research), and analysis for boosting the performance (AB-style research). I would say there are more people doing B-style than AB-style research, but more and more people are turning to the AB-style. In an ideal world, any improvement should come with reasons and analysis, but the real world is tough: sometimes, or even usually, we just do better and do not know why. A popular example is artificial neural networks: they work well, but why? What have they learned? This remains an open question for the machine learning community.

It is also worth noting that there is not necessarily an explicit boundary between the two styles: one can definitely improve the performance while analyzing it in a detailed manner. That is my favorite style.

On the one hand, I am interested in work with high performance, like everyone else. On the other hand, as almost all researchers would agree, research is not a competition: pursuing the state of the art means nothing by itself. Then what can we do with the SotA? Personally, I can come up with two ultimate goals of research: improving the world, and expanding the boundary of human knowledge. I believe most SotA methods in AI research have contributed (or at least aimed to contribute) to the former. It is hard for me to call “a very task-specific model X works extremely well on task Y” general knowledge; but for “a very task-specific model X works extremely well on task Y because of a general reason Z (which is also applicable to some other X’s and Y’s)”, or even “X does not work on Y for a general reason Z”, I would say yes, that is knowledge!

In summary, I fully appreciate AI papers with all or most of the following characteristics:
1. Reach nice performance: this is the least important one, but almost always, a reasonably nice performance is required to show the work is on track.
2. Make fair comparisons: the experimental settings (number of model parameters, use of pretrained models, training data) of the proposed model and the baselines should be as similar as possible. Yoav Goldberg had a slide that I much appreciated: for engineering, tune models; for science, tune baselines. The engineering and science here may roughly correspond to industry and academia, as well as to B-style and AB-style research.
3. Provide analysis and ablation studies: which parts of the model make it work well? Could the important modules be applied to other tasks or models? Are there common features among the failure cases?

As current AI researchers, we stand on the overlap between academia and industry. Reaching higher performance is part of the joint goal for people on both sides, but I wish everyone in academia would tune baselines 🙂

# Why Cross Entropy instead of KL Divergence?

This is a review note of the course TTIC 31230 Fundamentals of Deep Learning instructed by Prof. David McAllester.

Given two discrete distributions $P, Q$, the Kullback–Leibler divergence (KL divergence) between them can be written as

$KL(P \| Q) = \mathbb{E}_{y\sim P} \log\frac{P(y)}{Q(y)}$

The cross entropy between them is

\begin{aligned} H(P, Q) &= -\mathbb{E}_{y \sim P} \log Q(y) = KL(P \| Q) - \mathbb{E}_{y \sim P} \log P(y) \\&= KL(P \| Q) + H(P) \end{aligned}

Given a fixed $P$, there is no difference between minimizing $H(P, Q)$ and minimizing $KL(P \| Q)$ to find an optimal $Q$. In fact, KL divergence is a more natural measure of the difference between two distributions than cross entropy, as its lower bound does not depend on $P$: it always reaches 0 when $Q$ is the same distribution as $P$. However, most existing deep learning frameworks use cross entropy instead of KL divergence. Is there any reason besides (slight) computational efficiency?
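The identity above is easy to check numerically. Here is a small sketch in pure Python with two made-up three-outcome distributions (the particular numbers are illustrative, not from the text), using the natural log throughout:

```python
import math

# Two example discrete distributions over the same 3-outcome support.
# These particular numbers are illustrative.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

def entropy(p):
    """H(P) = -sum_y P(y) log P(y)."""
    return -sum(py * math.log(py) for py in p)

def cross_entropy(p, q):
    """H(P, Q) = -E_{y~P} log Q(y) = -sum_y P(y) log Q(y)."""
    return -sum(py * math.log(qy) for py, qy in zip(p, q))

def kl_divergence(p, q):
    """KL(P || Q) = E_{y~P} log(P(y)/Q(y))."""
    return sum(py * math.log(py / qy) for py, qy in zip(p, q))

# The identity H(P, Q) = KL(P || Q) + H(P).
assert abs(cross_entropy(P, Q) - (kl_divergence(P, Q) + entropy(P))) < 1e-12

# KL(P || Q) reaches its lower bound 0 exactly when Q = P ...
assert kl_divergence(P, P) < 1e-12
# ... at which point the cross entropy equals H(P), not 0.
assert abs(cross_entropy(P, P) - entropy(P)) < 1e-12
```

Since $H(P)$ is a constant once $P$ is fixed, the two objectives differ only by that constant, so any $Q$ minimizing one also minimizes the other.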

The answer is yes, and the reason is theoretical. First, we have to distinguish the population from the empirical distribution. Consider the problem of (English) language modeling: all the sentences in English form the population. We can never know the probability of any given sentence, as the number of possible sentences is infinite. However, we can take Wikipedia, or any other corpus, as a set of sentences sampled from the population, and assume that the sample is unbiased (though this may not always hold). For a sentence (or any other instance) $y$, let $Pop(y)$ denote the true probability of $y$ (which we do not know), $P(y)$ denote the empirical probability of $y$ (i.e., its probability in the sample), and $Q(y)$ denote the probability of $y$ estimated by the trained model. With the unbiasedness assumption, we have

$H(Pop, Q) = -\mathbb{E}_{y\sim Pop} \log Q(y) = -\mathbb{E}_{y\sim P} \log Q(y) = \sum_{y} -P(y)\log Q(y)$

which is exactly what we computed by cross entropy loss, while

\begin{aligned} KL(Pop \| Q) &= H(Pop, Q) - H(Pop) \\ H(Pop) &= -\mathbb{E}_{y\sim Pop} \log Pop(y) \end{aligned}

However, we cannot know $Pop(y)$ for any sentence $y$, hence we cannot measure $KL(Pop \| Q)$.

From this perspective, it is worth noting that the cross-entropy loss measures $H(Pop, Q)$ rather than $H(P, Q)$, assuming the data consists of samples independently drawn from the true distribution.
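This point can be sketched with a toy simulation. Here I pretend we know a small "population" distribution (a stand-in for $Pop$, chosen only so we can compare against the exact value; in practice it is unknown), draw an i.i.d. sample from it as our "corpus", and check that averaging $-\log Q(y)$ over the sample estimates $H(Pop, Q)$:

```python
import math
import random

random.seed(0)

# A stand-in "population" distribution over 3 outcomes. In practice Pop is
# unknown; we fix it here only so we can compare against the exact value.
outcomes = [0, 1, 2]
Pop = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]  # a model's estimated distribution (illustrative)

# Exact population cross entropy: H(Pop, Q) = -sum_y Pop(y) log Q(y).
exact = -sum(p * math.log(q) for p, q in zip(Pop, Q))

# Empirical cross entropy: draw an i.i.d. sample from Pop (our "corpus")
# and average -log Q(y) over it. This is exactly what the cross-entropy
# loss computes on training data, and it estimates H(Pop, Q).
sample = random.choices(outcomes, weights=Pop, k=100_000)
empirical = -sum(math.log(Q[y]) for y in sample) / len(sample)

print(f"exact H(Pop, Q) = {exact:.4f}, empirical estimate = {empirical:.4f}")
assert abs(exact - empirical) < 0.01  # close for a large i.i.d. sample
```

Note that nothing in the empirical average requires knowing $Pop(y)$ itself, which is why cross entropy is measurable from data while $KL(Pop \| Q)$ (which needs $H(Pop)$) is not.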

> There is only one heroism in the world: to see the world as it is, and to love it.
>
> – Romain Rolland

I first felt that life is meaningless when I was a sophomore undergraduate. Although I cannot recall how this feeling emerged, I still remember how it grew rapidly, occupied my mind, and made me extremely unhappy. For almost twenty years, I had been told that the meaning of life is contributing to society, as many mainland Chinese children were. However, I realized that this claim totally ignores the thoughts of individuals, and thus it is definitely not true. On the other hand, considering the fact that all people eventually die, I was unable to find a substitute either. Life became really depressing after I started thinking about this difficult philosophical problem. I could write neither a program nor a poem for several months.

Over the years, I have heard of several talented peers who committed suicide. Before passing away, they claimed that they had found the true meaning of the world to be meaninglessness, and that they could not find anyone around them who understood this. I was always sad to hear such news, as I almost became one of them at some point.

I was fortunate to share my thoughts with my dear friends, and even more fortunate that they were able to understand what I was talking about. They strongly supported me by simply saying “you are not alone”. Afterwards, Ray Monk’s biography of Ludwig Wittgenstein pulled me out of the hole of depression; I was excited to find that many stories in the book had actually happened in my own life. Indeed, nearly everyone who thinks about the meaning of life faces the situation of nada.

Life is like climbing mountains. Reaching the peak of meaninglessness is not the end, but the start of seeing higher peaks, which are the edges of the world. What should we do after that? Romain Rolland has already given an answer that I like very much: to love our life, to live in the world, and to explore it.

Philosophy is always interesting to think about, though sometimes dangerous. Facing nada, we have no choice but to think about it, and to be aware that everyone is with us!