I recently learned about two flavors of the Word2Vec model for word embeddings from the original paper and this TensorFlow tutorial. The two architectures are known as Skip-gram and Continuous Bag of Words (CBOW). In both models, the idea is to map words to vectors in a high-dimensional space such that words that are similar in concept or meaning end up as nearby vectors.
Consider the following sentence as input text for an embedding model:
The quick brown fox jumped over the lazy dog.
In both models, we take our input text and break it up into “windows” centered on a target word. Each window extends at most skip_window words to either side of the target word, so it spans 2 * skip_window + 1 words, and num_skips is the number of context words drawn from each window (at most 2 * skip_window). The words in a window that are not the target word are called context words.
For example, with skip_window = 1 and num_skips = 2, the windows we get from the above sentence are:
[“The”, “quick”, “brown”], [“quick”, “brown”, “fox”], [“brown”, “fox”, “jumped”], …
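The windowing step can be sketched in a few lines of Python. This is only an illustration: the helper name make_windows and the whitespace tokenization are assumptions made here, not taken from the paper or the tutorial.

```python
def make_windows(tokens, skip_window):
    """Slide a window of 2 * skip_window + 1 words over the token list."""
    span = 2 * skip_window + 1
    return [tokens[i:i + span] for i in range(len(tokens) - span + 1)]


# Naive whitespace tokenization of the example sentence (period dropped).
sentence = "The quick brown fox jumped over the lazy dog".split()
windows = make_windows(sentence, skip_window=1)

for window in windows[:3]:
    print(window)
```

With nine words and skip_window = 1, this yields seven windows, the first three of which are the ones listed above.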
In the Skip-gram model, for each window in a batch of data, given the target word we want to predict which words will be its context words. In the case of the example window [“quick”, “brown”, “fox”], there are two correct answers (both “quick” and “fox” are context words for the target word “brown”), which gives us two training observations to feed into the model: the data word “brown”, whose label is “quick”, and the data word “brown”, whose label is “fox.”
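As a rough sketch of how one window turns into Skip-gram training observations, the hypothetical helper below (skipgram_pairs is a name invented for this example) pairs the target word with every context word. It assumes num_skips equals the number of context words in the window, as in the example above.

```python
def skipgram_pairs(window):
    """Return (data, label) pairs: the target word paired with each context word."""
    center = len(window) // 2
    target = window[center]
    return [(target, word) for i, word in enumerate(window) if i != center]


pairs = skipgram_pairs(["quick", "brown", "fox"])
print(pairs)  # [('brown', 'quick'), ('brown', 'fox')]
```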
For each training observation, we start with the target word, look up its embedding, map it back into the vocabulary space using a learned weight matrix and bias vector, and apply the softmax function. The cross entropy between the resulting probability vector and the one-hot encoded context word is that observation's contribution to the loss of the model.
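The forward pass described above might look roughly like this in NumPy. Everything here is an assumption for illustration: the tiny vocabulary, the embedding size of 4, the randomly initialized parameters, and the function names. It computes a full softmax as described in the text; practical implementations often replace it with a cheaper approximation such as noise-contrastive estimation or negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and randomly initialized parameters (illustrative only).
vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
word_to_id = {word: i for i, word in enumerate(vocab)}
vocab_size, embed_dim = len(vocab), 4

embeddings = rng.normal(size=(vocab_size, embed_dim))  # one row per word
weights = rng.normal(size=(embed_dim, vocab_size))     # embedding -> vocabulary space
biases = np.zeros(vocab_size)


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def skipgram_loss(target, context):
    """Cross entropy between softmax(target embedding) and the one-hot context word."""
    hidden = embeddings[word_to_id[target]]
    probs = softmax(hidden @ weights + biases)
    # Cross entropy with a one-hot label reduces to -log of the label's probability.
    return -np.log(probs[word_to_id[context]])


# The window ["quick", "brown", "fox"] contributes two terms to the loss.
loss = skipgram_loss("brown", "quick") + skipgram_loss("brown", "fox")
print(loss)
```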
In CBOW, on the other hand, given an entire context for a word, we want to predict the target word. Each window from the input data is now just one training observation. For instance, the window [“quick”, “brown”, “fox”] produces a single training observation with data [“quick”, “fox”] whose label is “brown.”
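For comparison with Skip-gram, here is a minimal sketch of how a window becomes a single CBOW observation. The helper name cbow_observation is made up for this example.

```python
def cbow_observation(window):
    """Return (context words, target word) for a window centered on its target."""
    center = len(window) // 2
    context = window[:center] + window[center + 1:]
    return context, window[center]


data, label = cbow_observation(["quick", "brown", "fox"])
print(data, label)  # ['quick', 'fox'] brown
```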
Given all the words in the context data, we look at their embeddings and sum them to get a single vector in the embedding space, then apply weights, biases, and softmax as before to predict the probability that each vocabulary word is the desired target word.
The cross entropy of the resulting vector with the one-hot encoded target word contributes to the loss.
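A rough NumPy sketch of this CBOW forward pass follows. As before, the toy vocabulary, embedding size, random parameters, and function names are assumptions chosen for illustration, not the tutorial's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and randomly initialized parameters (illustrative only).
vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
word_to_id = {word: i for i, word in enumerate(vocab)}
vocab_size, embed_dim = len(vocab), 4

embeddings = rng.normal(size=(vocab_size, embed_dim))
weights = rng.normal(size=(embed_dim, vocab_size))
biases = np.zeros(vocab_size)


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def cbow_loss(context, target):
    """Sum the context embeddings, project to the vocabulary, and score the target."""
    context_ids = [word_to_id[word] for word in context]
    hidden = embeddings[context_ids].sum(axis=0)  # single vector in embedding space
    probs = softmax(hidden @ weights + biases)    # probability of each vocabulary word
    return -np.log(probs[word_to_id[target]])     # cross entropy with one-hot target


loss = cbow_loss(["quick", "fox"], "brown")
print(loss)
```

The only structural difference from the Skip-gram sketch is the input: a sum of several context embeddings instead of a single target embedding.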
Given the architectures of the two models, it seems plausible that Skip-gram would perform better than CBOW on small data sets. The reason is that in Skip-gram, each window produces num_skips training observations, whereas in CBOW, it produces just one. Skip-gram therefore gets num_skips times as many training observations from the same data set.
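To make the counting argument concrete, the sketch below tallies the observations each model would get from the example sentence. The function count_observations is hypothetical and assumes every window is fully contained in the text.

```python
def count_observations(num_tokens, skip_window, num_skips):
    """Count training observations per model for a text of num_tokens words."""
    num_windows = max(num_tokens - 2 * skip_window, 0)
    return {"skipgram": num_windows * num_skips, "cbow": num_windows}


# Nine words in "The quick brown fox jumped over the lazy dog".
counts = count_observations(num_tokens=9, skip_window=1, num_skips=2)
print(counts)  # {'skipgram': 14, 'cbow': 7}
```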