Understanding fastText
fastText is Facebook's open-source software for obtaining word embeddings (the original paper). Given a document indexed by $n$ and represented by a list of n-gram vectors $\{x_1, x_2, \cdots, x_N\}$, the objective their system tries to optimize is
$$ -\frac{1}{N} \sum_{n=1}^N y_n \log(f(BA x_n)) $$
where $B$ and $A$ are weight matrices factorized for performance reasons, $y_n$ is the class label, and $f(\cdot)$ is the softmax function.
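To make my reading concrete, here is a small NumPy sketch of this objective; the shapes, random values, and one-hot label are toy placeholders of mine, not anything taken from the paper or its implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # stabilized softmax, the f(.) above
    return e / e.sum()

rng = np.random.default_rng(0)
N, d, h, k = 4, 10, 5, 3             # toy sizes: n-grams, input dim, hidden dim, classes
X = rng.normal(size=(N, d))          # x_1 ... x_N, the n-gram vectors of one document
A = rng.normal(size=(h, d))          # first factor of the weight matrix
B = rng.normal(size=(k, h))          # second factor (classifier weights)
y = np.eye(k)[1]                     # one-hot class label

# Objective as written: average -y log f(B A x_n) over the n-gram vectors.
loss = -np.mean([y @ np.log(softmax(B @ A @ x)) for x in X])
print(loss)
```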
Despite the empirical gains reported in the paper, I find this formulation quite unusual, since we would normally obtain a representation of the entire document first. For example, if I use the average of the n-gram vectors as the document representation, i.e. $\frac{1}{N} \sum_{n=1}^N x_n$, then the objective I would use is
$$ -y_n \log(f(BA \frac{1}{N} \sum_{n=1}^N x_n)) $$
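For comparison, this is a sketch of the averaged-representation objective I would have expected instead (again with entirely made-up toy values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))         # toy n-gram vectors of one document
A = rng.normal(size=(5, 10))
B = rng.normal(size=(3, 5))
y = np.eye(3)[1]                     # one-hot class label

doc = X.mean(axis=0)                 # build the document representation first
loss = -(y @ np.log(softmax(B @ A @ doc)))   # then apply the classifier once
print(loss)
```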
So is my understanding of fastText correct? What is their rationale for doing it this way?
Topic: representation, word-embeddings, nlp
Category: Data Science