On names

“You are Haoyue, not Freda. Be yourself.”

Fortunately, no one has ever said the above sentences to me, though I suspect some of my friends are hesitant to say so.

Yes, if I choose to be Haoyue, I believe kind people will learn to pronounce it correctly, though it would be unreasonable to ask non-native speakers of Chinese to pronounce “Haoyue” so precisely that I can recognize it with nearly zero error.

It has been extremely frustrating for me to hear “hey bro” in chats from people who have only seen my name in Chinese characters—the Hao, which means “great,” is typically used in boys’ names. Personally, I can completely erase these unpleasant memories simply by using the name Freda.

I adore and appreciate Chinese culture: I read and write poems in Chinese, and I converse with my Chinese friends in Chinese, yet I prefer to be Freda, at least in English contexts. I have the reasons stated above for my preference, but I believe I should not need to explain them for people to refer to me as Freda.

Chinese people have become more critical of other Chinese choosing English names for themselves, particularly in light of the current political climate, in which nationalism is increasingly popular among young Chinese people. However, I believe that one’s name should be up to the person: on the one hand, if they choose to use their name in their native language, reasonable others should support them and do their best to learn how to pronounce it; on the other hand, and equally important, if they choose to use another name, reasonable others should respect their choice rather than force the person to “be themselves.” It’s wonderful to encourage friends to be themselves, but preferring a name that isn’t in their native language rarely, if ever, means they’re losing themselves.

Whether I am Freda or Haoyue, I was, am, and will be myself.

Preserving the Variances: Transformers and Xavier Initialization

My committee recently assigned a paper on compressive transformers for my PhD qualifying exam, so I have been refreshing my memory of the details of the original Transformer. The first thing that came to mind was the \sqrt{d_k} scaling in the multi-head attention mechanism.

Dot-Product Attention in the Transformer

Let Q \in \mathbb{R}^{M \times d_k}, K \in \mathbb{R}^{N \times d_k}, V\in \mathbb{R}^{N\times d_v} denote the query matrix, the key matrix and the value matrix respectively. Given the queries Q, the Transformer computes a weighted sum over the values by

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^\top}{\sqrt{d_k}}) V \in \mathbb{R}^{M \times d_v}

Here, the process is identical to classical dot-product attention except for the scaling factor of \frac{1}{\sqrt{d_k}}. The authors claim that this scaling prevents the dot products from growing too large in magnitude, so that we get reasonable gradients through the softmax—this makes a lot of sense, but the question is: why do the authors choose \frac{1}{\sqrt{d_k}}, rather than values like \frac{1}{d_k} or \frac{1}{d_k^{2/3}}, for the scaled attention?
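As a concrete reference, here is a minimal NumPy sketch of the formula above (a single head, no masking; the function and variable names are my own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in the equation above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (M, N)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (M, d_v)
```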

Variance-Preserving Layers

The footnote in the original paper gives us a hint. Let us consider q, k \in \mathbb{R}^{d_k}. Assume that each dimension of both q and k is independently drawn from a zero-mean, unit-variance distribution. Recall that for independent random variables X, Y, we have

\begin{aligned} Var[XY] &= E[X^2Y^2] - E^2[XY] \\ &= (Cov[X^2, Y^2] + E[X^2]E[Y^2]) - (Cov[X,Y] + E[X]E[Y])^2  & ~~~~~(Cov[X,Y] = E[XY] - E[X]E[Y]) \\ &= E[X^2]E[Y^2] - E^2[X]E^2[Y] & ~~~~~(\text{the independence of } X, Y) \\ &= (Var[X] + E^2[X] )(Var[Y] + E^2[Y]) - E^2[X]E^2[Y] & ~~~~~(Var[X] = E[X^2] - E^2[X]) \\ &= Var[X]Var[Y] + Var[X]E^2[Y] + Var[Y]E^2[X] \end{aligned}

Thus, given that Var[q_i] = Var[k_i] = 1, and E[q_i] = E[k_i] = 0 for any i \in \{1, \ldots, d_k\},

Var[q\cdot k] = \sum_{i=1}^{d_k} Var[q_i k_i] = d_k

However, if we divide q \cdot k by \sqrt{d_k},

Var[\frac{q \cdot k}{\sqrt{d_k}}] = \sum_{i=1}^{d_k} Var[\frac{q_i k_i}{\sqrt{d_k}}] = \sum_{i=1}^{d_k} \frac{1}{d_k} Var[q_i k_i] = 1

This implies that the variance is (kind of) preserved before and after the dot product, no matter how large d_k is.
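Here is a quick simulation of this claim under the stated assumptions (independent standard-normal entries); the sample sizes and names are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

for d_k in (16, 64, 256, 1024):
    q = rng.standard_normal((n_samples, d_k))   # zero mean, unit variance
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)                  # q . k for each sample
    # unscaled variance grows like d_k; scaled variance stays near 1
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
```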

To make the above argument meaningful in practice, we would also like each element of Q and K to satisfy the basic assumption (zero mean and unit variance) as closely as possible. One may want to look at Xavier initialization, which is designed around the same variance-preserving idea.

Xavier Initialization

Let x \in \mathbb{R}^n denote a vector whose elements x_i are independently drawn from a zero-mean, unit-variance distribution. We compute the output y \in \mathbb{R}^m by

y_j = \sum_{i=1}^n x_i w_{i,j}

Xavier initialization randomly draws each w_{i,j} independently from the uniform distribution on the interval (-\sqrt{\frac3n}, \sqrt{\frac3n}), so that Var[w_{i,j}] = \frac1n, which gives each y_j the zero-mean and unit-variance properties as well.
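A small sanity check of this fan-in variant, again assuming standard-normal inputs (the dimensions below are arbitrary):

```python
import numpy as np

n, m = 512, 256
rng = np.random.default_rng(0)

limit = np.sqrt(3.0 / n)                      # Uniform(-limit, limit) has variance 1/n
W = rng.uniform(-limit, limit, size=(n, m))

x = rng.standard_normal((100_000, n))         # zero-mean, unit-variance inputs
y = x @ W                                     # y_j = sum_i x_i w_{i,j}
print(y.mean(), y.var())                      # approximately 0 and 1
```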

Evaluating diversity in machine language generation

Yesterday I was talking with Han Shao about how to measure diversity in machine language generation. Suppose there are three systems A, B and C:

A generates 3 examples; each example is in a different pattern.
B generates 100 examples; each example is in one of 5 patterns, uniformly (i.e., each pattern has 20 examples).
C generates 100 examples, 96 of which are in pattern (a), while the remaining 4 each follow a distinct other pattern.

Which one is the most diverse? We found it difficult to answer this question quantitatively, but I somewhat convinced myself that diversity can be measured as follows, using simple equations from information theory.

Let’s assume that each pattern is independent of the others. It’s also necessary to assume that the observed empirical distribution is the true distribution the model represents, though we should let each model generate as many examples as it can to obtain a good estimate of that true distribution.

Let P_{\Theta}(x) denote the probability that model \Theta generates pattern x; the entropy of this distribution is

H(P_\Theta ) = -\sum_x P_\Theta (x)\log P_\Theta(x)

Larger entropy typically means better diversity. We can then compute the entropy of the above three distributions:

H(P_A) = -3 \cdot \frac13 \log\frac13 = \log 3

H(P_B) = -5 \cdot \frac15 \log\frac15 = \log 5 > H(P_A)

H(P_C) = -\frac{96}{100} \log\frac{96}{100} - 4\cdot \frac{1}{100} \log\frac{1}{100} = \frac{100\log{100} - 96\log 96 }{100} < \log 3 = H(P_A)
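The same numbers, computed in a few lines (entropies in nats):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability patterns."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

P_A = [1/3] * 3
P_B = [1/5] * 5
P_C = [96/100] + [1/100] * 4

print(entropy(P_A), entropy(P_B), entropy(P_C))
# ~1.10 (= log 3), ~1.61 (= log 5), ~0.22
```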

Word embeddings: Distributional semantics and beyond.

How do you like word embeddings? I’ve been asked this question many times. In fact, I once had a chance to ask Martha Palmer a similar question: how do you like deep learning? She specifically mentioned that she likes word embeddings, which improve almost all downstream tasks in the neural network era and are somewhat interpretable.

Are word embeddings perfect? Definitely not. In this post, I’d like to share some thoughts and potential next steps derived from conversations with folks working on natural language processing, linguistics and cognitive science. If you’re curious about a comprehensive survey of word embeddings, I’d strongly recommend this one.

Learning word embeddings from a corpus.

The basic idea of word embeddings is mapping a word to a d-dimensional vector, i.e., embedding a word into a d-dimensional space. The idea is based on the distributional hypothesis of semantics: words that are used and occur in the same contexts tend to purport similar meanings. There are two popular approaches: continuous bag-of-words (CBoW) and skip-gram.

Many folks explain the difference between CBoW and skip-gram as “predicting a word given context” (CBoW) vs. “predicting context given a word” (skip-gram). I’ve also seen posts saying “the skip-gram model is just a flipped CBoW model”. Well, it’s somewhat true, but the skip-gram part was a bit confusing to me five years ago — at that time, I was an undergrad whose knowledge of NLP amounted to tf-idf. I was especially confused by the following questions: “How can you predict the context? Generate it all? How do you do that? Different positions are not equivalent, so how do you deal with each position?”

I would summarize my own understanding of the difference between CBoW and skip-gram as “predicting one given all” (CBoW) vs. “predicting one given one” (skip-gram). Mathematically, the CBoW objective is to maximize the overall probability of the observed word given its context:

\begin{aligned} \log P(\textit{word} \mid \textit{context}_\textit{word}) &\propto     \textit{sim}(\mathbf{w}_\textit{word}, \frac{1}{| \textit{context}_\textit{word} |}\sum_{c \in  \textit{context}_\textit{word} } \mathbf{w}_c) \end{aligned}

where \mathbf{w}_\textit{word} represents the corresponding vector (embedding) of a specific word, and \textit{sim} represents a similarity metric between two vectors (larger means more similar). Such similarity is usually measured by cosine similarity or negative Euclidean distance. Here, the context is a set (bag) of words.

As for skip-gram, the goal of the objective function is unchanged, but it instead models P(\textit{word} \mid \textit{c}) for each context word separately, assuming that each word in the context acts independently. Formally, the objective becomes to maximize

\begin{aligned} \sum_{\textit{c} \in \textit{context}_\textit{word}} \log P(\textit{word} \mid c) \propto  \sum_{\textit{c} \in \textit{context}_\textit{word}}  \textit{sim}( \mathbf{w}_\textit{word},  \mathbf{w}_\textit{c}) \end{aligned}

Unfortunately, there is a trivial solution to this optimization problem: if all words share one vector, the similarity is maximized. Mikolov et al. proposed training skip-gram with negative sampling, which not only maximizes the probability of the true word given the context, but also simultaneously minimizes that of (randomly sampled) false words. The objective functions are

\begin{aligned} & \text{maximize} \sum_{\textit{c} \in \textit{context}_\textit{word}} \log P( \textit{word}  \mid c) \propto  \sum_{\textit{c} \in \textit{context}_\textit{word}}  \textit{sim}( \mathbf{w}_\textit{word},  \mathbf{w}_\textit{c}) \\  & \text{minimize} \sum_{\textit{c} \in \textit{context}_\textit{word}} \log P( \textit{other word} \mid c) \propto  \sum_{\textit{c} \in \textit{context}_\textit{word}}  \textit{sim}( \mathbf{w}_\textit{other word},  \mathbf{w}_\textit{c})  \end{aligned}

There are also some tricks for training skip-gram efficiently, e.g., using different vector representations for a word as the center word and as a context word, smoothing the unigram distribution when sampling negative (false) words, etc.
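For concreteness, here is a minimal PyTorch sketch of skip-gram with negative sampling, using separate center/context embedding tables and a dot product as the similarity; the class and argument names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling (a rough sketch)."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)    # word as center word
        self.context = nn.Embedding(vocab_size, dim)   # word as context word

    def forward(self, center_ids, context_ids, negative_ids):
        c = self.center(center_ids)                    # (B, d)
        pos = self.context(context_ids)                # (B, d)
        neg = self.context(negative_ids)               # (B, K, d)
        pos_score = (c * pos).sum(-1)                  # (B,)
        neg_score = torch.bmm(neg, c.unsqueeze(-1)).squeeze(-1)  # (B, K)
        # maximize similarity with true context words, minimize it with sampled ones
        return -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()
```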

Interpretability of word embeddings.

For each word, the corresponding vector, as a whole, represents a meaning. Because of this, word embeddings can be intuitively interpreted by looking at each word’s nearest neighbors. The word “cat” often has a high similarity with “dog”, but probably a much lower similarity with “philosophy”.

Is there some direction in the vector space representing a concrete property (e.g., “singular or plural”, “present or past”, “animal or not”, “soft or not”, etc.)? QVEC by Yulia Tsvetkov et al. may serve as an example of understanding what happens inside the vectors: they propose to align each dimension of the word embeddings to a predefined linguistic property. Nevertheless, word embeddings might not be compositional in this sense — each dimension might represent multiple things, and several dimensions might jointly represent one thing. One could instead probe the correlation between a direction (rather than a single dimension) and a word property, as sketched below.
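A simple way to try this is to fit a linear probe for a binary word property and treat its weight vector as the candidate direction. A hedged sketch with placeholder data (in practice, `embeddings` and `labels` would come from real vectors and a real lexical resource):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 50))   # placeholder word vectors
labels = rng.integers(0, 2, size=1000)         # placeholder binary property labels

probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(embeddings, labels))
# `direction` is the direction in embedding space most correlated with the property
```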

Multi-sense word embeddings and metaphors.

A word can have multiple senses. For example, “bank” can be the slope beside a body of water, as well as a depository financial institution. Context is essential to fully understand which sense is intended. Researchers have developed multi-sense word embeddings, which assign each sense a separate vector.

For “supervised” multi-sense word embeddings, the model requires a labeled word sense resource (say, WordNet in English). It somehow matches each word in the corpus to a predefined sense, and trains typical word embeddings on the sense-labeled corpus. For “unsupervised” multi-sense word embeddings, folks (Huang et al., Neelakantan et al., Li et al.; inter alia) use clustering or probabilistic clustering techniques to select the corresponding sense for each word in the corpus, where the sense selection model and the word embeddings can be jointly optimized.

The idea of multi-sense word embeddings is simple and intuitive. Unfortunately, it is beaten by global (single-vector) word embeddings on almost all downstream tasks (e.g., sentiment analysis, text classification, etc.). Moreover, the unsupervised approaches ignore metaphors. Metaphorical usages often appear in a different context from literal ones, yet they share the same sense — clustering based on context might mistakenly assign metaphorical usages (especially the typical ones) to different senses.

Contextualized word embeddings.

Before introducing contextualized word embeddings, I want to propose two possible mechanisms for how humans understand words in context.

(1) word -> sense -> contextualized word meaning

(2) word -> contextualized word meaning

I personally prefer the latter. A rough illustration follows. Consider the following sentences:

(1) Cats are cute.

(2) (You have two American shorthair cats.) Your cats are cute.

Both occurrences of “cats” have the same sense. We don’t have a concrete picture of a particular cat in mind when reading the first sentence, but we do when reading the second. We don’t first map “your cats” to the general concept “cats” and then generate the picture; instead, the picture emerges immediately when we see the sentence. I wish I could point to work in psychology or psycholinguistics on this issue, but my searches came up empty. Any suggestions in the comments are welcome.

The idea of contextualized word embeddings is to also encode contextual information into the word embeddings: the same word in different sentences may have different embeddings. Contextualized word embeddings can be the hidden states of a recurrent neural network language model, the hidden states of a transformer language model, or the hidden states of models trained on any downstream task. Popular contextualized word embeddings trained on huge corpora (ELMo, BERT, inter alia) have achieved amazing performance on a wide range of downstream tasks.
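As an illustration (assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint), the hidden states of a pretrained model give a different vector for “cats” in each of the two sentences above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sentence in ["Cats are cute.", "Your cats are cute."]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, hidden_dim)
    cats_id = tokenizer.convert_tokens_to_ids("cats")
    position = inputs.input_ids[0].tolist().index(cats_id)
    print(sentence, hidden[position][:5])                  # contextualized "cats" vector
```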

Cross-lingual word embeddings.

I also want to mention cross-lingual word embeddings, as parallel words/sentences in different languages provide free grounding signals for training and interpreting word embeddings.

One interesting approach is aligning word embeddings across languages. Suppose we have trained word embeddings in two languages separately, and would like to align the two embedding spaces. Under the assumption that the word distribution is similar across languages, we can train a linear (or non-linear, but linear is simple and effective) transformation based on a small set of pivot words in the two languages. We can do even crazier things (e.g., unsupervised word translation!) using cross-lingual word embeddings and unsupervised alignment techniques.
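A common linear choice is orthogonal Procrustes: given embeddings of the pivot words in both languages, the best orthogonal map has a closed-form solution via an SVD. A minimal sketch (the matrix names are mine; rows of `X` and `Y` are the paired pivot-word vectors):

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimizing ||X W - Y||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage sketch:
#   W = procrustes_align(X_pivot, Y_pivot)   # pivot-word embeddings, shape (n, d)
#   source_aligned = X_all @ W               # map the whole source space to the target
```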

Bias from the corpus and downstream task.

We definitely don’t want race or gender stereotypes in the learned word embeddings — women can be computer programmers and men can be homemakers as well. However, almost all learned word embeddings, to some degree, think women are more likely to be homemakers than men. Formally, such a stereotype is reflected as follows:

\begin{aligned} \textit{sim}(\mathbf{w}_\textit{men},  \mathbf{w}_\textit{homemaker}) < \textit{sim}( \mathbf{w}_\textit{women},  \mathbf{w}_\textit{homemaker})\end{aligned}

I have to emphasize that this is an unwanted feature, not a bug — the phenomenon is consistent with the distributional hypothesis; the bias comes from the corpus itself, not from the word embedding method. It’s also great to see folks working on debiasing word embeddings and machine learning models more generally (Bolukbasi et al., 2016; Zhao et al., 2019; Sap et al., 2019; inter alia).
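A tiny sketch of how one could measure this, with toy vectors standing in for real pretrained embeddings (the helper names are mine):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def stereotype_score(emb, a, b, target):
    """Positive when `target` is closer to `b` than to `a` in embedding space."""
    return cosine(emb[b], emb[target]) - cosine(emb[a], emb[target])

# toy 2-d vectors just to show the computation; real use would load pretrained vectors
toy = {"man": np.array([1.0, 0.0]),
       "woman": np.array([0.0, 1.0]),
       "homemaker": np.array([0.2, 0.9])}
print(stereotype_score(toy, "man", "woman", "homemaker"))   # > 0 reproduces the stereotype
```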

Quantifying attention on each word in a deep learning model.

This topic is worth another post — I just want to briefly mention it here. Recently, I’ve seen lots of work applying the following attention mechanism. Let’s look at a typical supervised sentiment classifier as an example:

Given the words in a sentence, the model first maps each word to its word embedding (\mathbf{w}_1,  \mathbf{w}_2, \cdots,  \mathbf{w}_n), and then passes them into a long short-term memory network (LSTM) or a bidirectional LSTM to get the hidden states after seeing each word, denoted (\mathbf{h}_1,  \mathbf{h}_2, \cdots,  \mathbf{h}_n). Next, it computes “attention” weights over the hidden states using a multi-layer perceptron (MLP) as follows:

\begin{aligned} \alpha_i = \frac{\exp(MLP(\mathbf{h}_i))}{\sum_{j=1}^n  \exp(MLP(\mathbf{h}_j)) } \end{aligned}

and combines all the hidden states to get a fixed-dimensional sentence representation

\textit{repr} = \sum_{i=1}^n \alpha_i \mathbf{h}_i

Finally, such representation is passed into another MLP for sentiment classification.
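Putting the pieces together, here is a rough PyTorch sketch of such a classifier (the layer sizes and names are illustrative, not taken from any particular paper):

```python
import torch
import torch.nn as nn

class AttentionPoolingClassifier(nn.Module):
    """Embeddings -> BiLSTM -> MLP attention weights -> weighted sum -> MLP classifier."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn_mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                      nn.Tanh(),
                                      nn.Linear(hidden_dim, 1))
        self.clf = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                                     # (B, n)
        h, _ = self.lstm(self.emb(token_ids))                         # (B, n, 2*hidden)
        alpha = torch.softmax(self.attn_mlp(h).squeeze(-1), dim=-1)   # (B, n)
        repr_ = (alpha.unsqueeze(-1) * h).sum(dim=1)                  # (B, 2*hidden)
        return self.clf(repr_), alpha
```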

Is such attention interpretable? Do the weights reliably assign higher values to the positions of words that are important for the downstream task? My answer is no, as they are heavily affected by the contextualization step. While \mathbf{w}_i only carries information from the i-th word, \mathbf{h}_i actually carries information from both the i-th word and its context. It therefore doesn’t make sense to read words important for the downstream task off of \alpha_i.

How can we extract such important words? I’d recommend looking at the L1/L2 norm of the Jacobian (and I’ve done so in two of my papers (1, 2)!):

\mathbf{J} = \frac{\partial\mathcal{L}}{\partial \mathbf{W}}

where \mathcal{L} is the final loss value computed from the input sentence and the gold label, and \mathbf{W} = (\mathbf{w}_1,  \mathbf{w}_2, \cdots,  \mathbf{w}_n ) is the matrix of input word embeddings. ||\mathbf{J}_i||_k , in some sense, represents the importance of the i-th word. Here, ||\cdot||_k denotes the k-norm, where we usually use k=1 or 2.
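A hedged sketch of the computation (the `forward_fn` and `loss_fn` arguments are placeholders for whatever model and loss you use; the input is the sentence’s embedding matrix \mathbf{W}):

```python
import torch

def word_importance(W, forward_fn, gold_label, loss_fn, k=2):
    """Return ||J_i||_k for each word, where J = dL/dW.
    W: (n, d) word embeddings of the sentence; forward_fn: embeddings -> logits."""
    W = W.clone().detach().requires_grad_(True)
    loss = loss_fn(forward_fn(W), gold_label)
    loss.backward()
    return W.grad.norm(p=k, dim=-1)     # (n,): importance score of each word
```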

Beyond the distributional hypothesis?

Antonyms can often appear in similar contexts: for example, “The restaurant is nice.” vs. “The restaurant is terrible.” However, most word embeddings (even contextualized ones) cannot capture that “nice” and “terrible” here are antonyms. I’ve only seen models capture such relations for sentiment-bearing words in sentiment classification, and there’s little work showing that neutral antonyms (e.g., short vs. long/tall, thin vs. thick) can also be identified in an unsupervised way. I’d definitely be interested in seeing more work on this!

Why cross entropy instead of KL divergence?

This is a review note for the course TTIC 31230, Fundamentals of Deep Learning, taught by Prof. David McAllester.

Given two discrete distributions P, Q, the Kullback–Leibler divergence (KL divergence) between them can be written as

KL(P \| Q) = \mathbb{E}_{y\sim P} \log\frac{P(y)}{Q(y)}

The cross entropy between them is

\begin{aligned} H(P, Q) &= -\mathbb{E}_{y \sim P} \log Q(y) = KL(P \| Q) - \mathbb{E}_{y \sim P}  \log P(y) \\&= KL(P \| Q) + H(P) \end{aligned}

Given a fixed P, there is no difference between minimizing H(P, Q) and minimizing KL(P \| Q) to find an optimal Q. In fact, KL divergence is the more natural measure of the difference between two distributions, as its lower bound does not depend on P — it always reaches 0 when Q is the same distribution as P. However, most existing deep learning frameworks use cross entropy instead of KL divergence. Is there any reason besides (slight) computational efficiency?
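A quick numeric check of the identity H(P, Q) = KL(P \| Q) + H(P), with arbitrary made-up distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

H_P   = -(P * np.log(P)).sum()
H_PQ  = -(P * np.log(Q)).sum()
KL_PQ =  (P * np.log(P / Q)).sum()

print(np.isclose(H_PQ, KL_PQ + H_P))   # True
```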

The answer is yes, and the reason is theoretical. First, we have to distinguish the population from the empirical distribution. Consider the problem of (English) language modeling: all the sentences in English form the population. We can never know the probability of any given sentence, as the number of possible sentences is infinite. However, we could take Wikipedia, or any other corpus, as a set of sentences sampled from the population, and assume that the sample is unbiased (though this might not always be true). For a sentence (or any other instance) y, let Pop(y) denote the true probability of y (which we do not know), P(y) denote the empirical probability of y (i.e., its probability in the samples), and Q(y) denote the probability of y estimated by the trained model. With the unbiasedness assumption, we have

H(Pop, Q) = -\mathbb{E}_{y\sim Pop} \log Q(y) =  -\mathbb{E}_{y\sim P} \log Q(y) = \sum_{y} -P(y)\log Q(y)

which is exactly what we compute with the cross entropy loss, while

\begin{aligned}  KL(Pop \| Q) &= H(Pop, Q) - H(Pop) \\ H(Pop) &=  -\mathbb{E}_{y\sim Pop} \log Pop(y) \end{aligned}

However, we cannot know Pop(y) for any sentence y, hence we cannot measure H(Pop) or KL(Pop \| Q).

From this point of view, it is worth noting that the cross entropy loss estimates H(Pop, Q) rather than H(P, Q), assuming the data consists of samples independently drawn from the true distribution.
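A toy simulation of this point: the cross entropy loss computed on an unbiased sample is a Monte Carlo estimate of H(Pop, Q) (the distributions below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

Pop = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # the (in practice unknown) population
Q   = np.array([0.35, 0.25, 0.2, 0.1, 0.1])   # a model's estimated distribution

H_Pop_Q = -(Pop * np.log(Q)).sum()            # the quantity we actually care about

samples = rng.choice(len(Pop), size=100_000, p=Pop)   # an "unbiased corpus"
empirical_ce = -np.log(Q[samples]).mean()             # the cross entropy loss we compute

print(H_Pop_Q, empirical_ce)                  # the two agree up to sampling noise
```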

Everyone Faces Nada

There is only one heroism in the world: to see the world as it is, and to love it.
–Romain Rolland

I first felt that life was meaningless when I was a sophomore undergrad. Although I cannot recall how this feeling emerged, I still remember it growing rapidly, occupying my mind and making me extremely unhappy. For almost twenty years, I had been told, as many mainland Chinese children are, that the meaning of life is contributing to society. However, I realized that this claim totally ignores the thoughts of individuals, and thus it cannot be true. On the other hand, given that all people eventually die, I was unable to find a substitute either. Life became really depressing after I started thinking about this difficult philosophical problem. I could write neither a program nor a poem for several months.

Over the years, I have heard of several talented peers who died by suicide. Before passing away, they said they had found that the world is truly meaningless, and that they could not find anyone around them who understood this. I was always sad to hear such news, as I almost became one of them at some point.

I was fortunate to share my thoughts with my dear friends, and even more fortunate that they were able to understand what I was talking about. They strongly supported me simply by saying “you are not alone.” Afterwards, Ray Monk’s biography of Ludwig Wittgenstein pulled me out of the hole of depression — I was excited to find that many stories in the book had also happened in my own life. Indeed, nearly everyone who thinks about the meaning of life faces nada.

Life is like climbing mountains. Reaching the peak of meaninglessness is not the end, but the start of seeing higher peaks at the edges of the world. What should we do then? Romain Rolland has already given an answer that I like very much: to love our life, to live in the world, and to explore it.

Philosophy is always interesting to think about, though sometimes dangerous. Facing nada, we have no choice but to think it through, while remembering that everyone faces it with us!