In recent years, people doing artificial intelligence (AI) research have enjoyed pursuing the state of the art. Exciting work (e.g., ELMo, BERT, and GPT-2) has been proposed and shown to be extremely useful for reaching new state-of-the-art results on many downstream tasks (e.g., SentEval, GLUE, or any other specific task like text classification).
A recent success of ELMo is shown in the paper Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders (NAACL 2019). The goal of the work is to parse a sentence into its constituency parse tree without any explicit labels, and it achieves the best performance so far on the task. The authors applied several cutting-edge technologies to reach this nice performance: they used ELMo contextual embeddings as pretrained word embeddings, and trained on the SNLI and MultiNLI datasets (which are significantly larger than the standard PTB training set).
It was fun to see an F1 score of 56 for unsupervised constituency parsing on the full PTB test set; however, the paper did not include the part I am personally most interested in. I would be more excited to see an in-depth analysis showing whether the gain comes from the novel and fancy model structure, from ELMo, or just from the larger quantity of data. For example, two of the baselines, the Parsing-Reading-Predict Network (PRPN; Shen et al., 2018) and Ordered Neurons (ON-LSTM; Shen et al., 2019), were trained on the standard PTB training set and used no pretrained word embeddings. With such mismatched settings, people can hardly conclude that 100% of the performance gain comes from the fantastic model structure.
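As an aside, the F1 here is the standard unlabeled bracketing F1 over constituent spans. Below is a minimal sketch of how it can be computed, assuming spans are represented as (start, end) word indices; this is my own illustration, not the official EVALB-style evaluation script.

```python
# Minimal sketch of unlabeled bracketing F1 for constituency parsing.
# Spans are (start, end) word indices over the sentence.

def bracket_f1(gold_spans, pred_spans):
    """F1 between the gold and predicted sets of constituent spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)  # brackets present in both trees
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example for "the cat sat": gold tree is ((the cat) sat)
gold = [(0, 2), (0, 3)]  # NP "the cat" plus the whole-sentence span
pred = [(1, 3), (0, 3)]  # the model grouped "cat sat" instead
print(bracket_f1(gold, pred))  # 0.5: one of the two brackets matches
```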
There are two typical styles of AI research: boosting the performance (I call it B-style research), and analysis for boosting the performance (AB-style research). I would say there seem to be more people doing B-style than AB-style research, but more and more people are doing the AB-style one. In an ideal world, any improvement should come with reasons and analysis, but the real world is tough: sometimes, or usually, we just do better and don't know why. A popular example is artificial neural networks: they work well, but why? What have they learned? This remains an open question for the machine learning community.
It is also worth noting that there does not necessarily exist an explicit boundary between the two styles: one can definitely improve the performance while analyzing it in a detailed manner. That is my favorite style.
On the one hand, I am interested in work with high performance, like everyone else. On the other hand, as almost any researcher would agree, research is not a competition, and pursuing the state of the art means nothing by itself. Then what can we do with the SotA? Personally, I can come up with two ultimate goals of research: improving the world, and expanding the boundary of human knowledge. I believe that most SotA methods in AI research have contributed (or at least aimed to contribute) to the former. It is hard for me to call “a very task-specific model X works extremely well on task Y” general knowledge; but for “a very task-specific model X works extremely well on task Y because of a general reason Z (which is also applicable to some other X’s and Y’s)”, or even “X doesn’t work on Y for a general reason Z”, I would say yes, that is knowledge!
In summary, I fully appreciate AI papers that have all or most of the following characteristics:
1. Reach nice performance: this is the least important one, but almost always, a reasonably nice performance is required to show the work is on track.
2. Do a fair comparison: the experimental settings (number of model parameters, whether pretrained models are used, data for training) of the proposed model and the baselines should be as similar as possible. Yoav Goldberg had a slide that I appreciated very much: for engineering, tune models; for science, tune baselines. The engineering and science here may roughly map to industry and academia, as well as to B-style and AB-style research.
3. Do a nice analysis and ablation study (as sketched below): which parts of the model make it work well? Could the important modules be applied to other tasks or models? Are there common features among all the failure cases?
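To make the third point concrete, here is a toy ablation grid for the DIORA-style question raised earlier: vary the model, the embeddings, and the training data independently, so each factor's contribution can be read off the results. Everything below, including the train_and_eval stub, is a hypothetical placeholder rather than any paper's actual pipeline.

```python
# Toy ablation grid: vary model, embeddings, and training data independently
# so the contribution of each factor can be isolated. All names are hypothetical.
from itertools import product

def train_and_eval(model, embeddings, train_data):
    """Placeholder stub: swap in a real training/evaluation pipeline."""
    return 0.0  # would return, e.g., unsupervised parsing F1 on the PTB test set

models = ["proposed", "PRPN", "ON-LSTM"]
embedding_options = ["ELMo", "randomly-initialized"]
train_sets = ["PTB", "PTB+SNLI+MultiNLI"]

for model, emb, data in product(models, embedding_options, train_sets):
    f1 = train_and_eval(model, emb, data)
    print(f"{model:10s} {emb:22s} {data:18s} F1={f1:.1f}")
```

A grid like this answers the attribution question directly: if the baselines with ELMo and the larger training data close most of the gap, the credit belongs to the resources rather than the model structure.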
As AI researchers today, we are standing at the intersection of academia and industry. Reaching higher performance is part of the joint goal for people in both, but I wish everyone in academia would tune baselines 🙂