# Approximate Training
:label:sec_approx_train

Recall our discussions in :numref:sec_word2vec.
The main idea of the skip-gram model is
using softmax operations to calculate
the conditional probability of
generating a context word $w_o$
based on the given center word $w_c$
in :eqref:eq_skip-gram-softmax,
whose corresponding logarithmic loss is given by
the opposite of :eqref:eq_skip-gram-log.
Due to the nature of the softmax operation,
since a context word may be any word in the
dictionary $\mathcal{V}$,
the opposite of :eqref:eq_skip-gram-log
contains a summation
with as many terms as the entire vocabulary size.
Consequently,
the gradient calculation
for the skip-gram model
in :eqref:eq_skip-gram-grad
and that
for the continuous bag-of-words model
in :eqref:eq_cbow-gradient
both contain
this summation.
Unfortunately,
the computational cost
for such gradients
that sum over
a large dictionary
(often with
hundreds of thousands or millions of words)
is huge!
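
To make this cost concrete, here is a minimal NumPy sketch of the softmax conditional probability in :eqref:eq_skip-gram-softmax for a single center word. The matrices, vectors, and sizes (`U`, `v_c`, `vocab_size`, `embed_dim`) are made-up illustrations, not part of the model definition:

```python
import numpy as np

vocab_size, embed_dim = 10000, 64  # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(vocab_size, embed_dim))  # context ("output") vectors u_i
v_c = rng.normal(size=embed_dim)              # center word vector v_c

# P(w_o | w_c) = exp(u_o^T v_c) / sum_i exp(u_i^T v_c):
# the normalizing denominator alone requires |V| dot products.
scores = U @ v_c                # shape: (vocab_size,)
scores -= scores.max()          # subtract the max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.shape)              # (10000,): one term per word in the vocabulary
```
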
In order to reduce the aforementioned computational complexity, this section will introduce two approximate training methods: *negative sampling* and *hierarchical softmax*. Due to the similarity between the skip-gram model and the continuous bag-of-words model, we will just take the skip-gram model as an example to describe these two approximate training methods.

## Negative Sampling
:label:subsec_negative-sampling

Negative sampling modifies the original objective function.
Given the context window of a center word $w_c$,
the fact that any (context) word $w_o$
comes from this context window
is considered as an event
with the probability modeled by

$$P(D=1\mid w_c, w_o) = \sigma(\mathbf{u}_o^\top \mathbf{v}_c),$$

where $\sigma$ uses the definition of the sigmoid activation function:

$$\sigma(x) = \frac{1}{1+\exp(-x)}.$$
:eqlabel:eq_sigma-f

Let us begin by
maximizing the joint probability of
all such events in text sequences
to train word embeddings.
Specifically,
given a text sequence of length $T$,
denote by $w^{(t)}$ the word at time step $t$
and let the context window size be $m$.
Consider maximizing the joint probability

$$ \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(D=1\mid w^{(t)}, w^{(t+j)}).$$
:eqlabel:eq-negative-sample-pos
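
To see which events enter this product, the sketch below enumerates all positive (center, context) pairs on a toy token sequence; the sequence and window size `m = 2` are assumptions for illustration only:

```python
# Toy sequence and window size, chosen only for illustration.
tokens = ['the', 'man', 'loves', 'his', 'son']
m = 2

pairs = []  # all (w^{(t)}, w^{(t+j)}) events entering the product
for t in range(len(tokens)):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(tokens):
            pairs.append((tokens[t], tokens[t + j]))
print(pairs[:4])  # [('the', 'man'), ('the', 'loves'), ('man', 'the'), ('man', 'loves')]
```
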
However,
:eqref:eq-negative-sample-pos
only considers those events
that involve positive examples.
As a result,
the joint probability in
:eqref:eq-negative-sample-pos
approaches its maximum of 1
only if all the word vectors diverge to infinity.
Of course,
such results are meaningless.
To make the objective function
more meaningful,
negative sampling
adds negative examples sampled
from a predefined distribution.
Denote by $S$ the event that
a context word $w^{(t+j)}$
comes from the context window of a center word $w^{(t)}$.
For this event involving $w^{(t+j)}$,
sample $K$ *noise words*
that are not from this context window
from a predefined distribution $P(w)$.
Denote by $N_k$ the event that
a noise word $w_k$ ($k=1, \ldots, K$)
does not come from the context window of $w^{(t)}$.
Assume that these events involving
both the positive example and negative examples
$S, N_1, \ldots, N_K$
are mutually independent.
Negative sampling rewrites the joint probability
(involving only positive examples)
in :eqref:eq-negative-sample-pos
as

$$ \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}),$$

where the conditional probability is approximated through
events $S, N_1, \ldots, N_K$:
$$ P(w^{(t+j)} \mid w^{(t)}) =P(D=1\mid w^{(t)}, w^{(t+j)})\prod_{k=1,\ w_k \sim P(w)}^K P(D=0\mid w^{(t)}, w_k).$$
:eqlabel:eq-negative-sample-conditional-prob
Denote by $i_t$ and $h_k$
the indices of a word $w^{(t)}$
at time step $t$ of a text sequence
and a noise word $w_k$, respectively.
The logarithmic loss with respect to the conditional probabilities
in :eqref:eq-negative-sample-conditional-prob
is

$$\begin{aligned}
-\log P(w^{(t+j)} \mid w^{(t)})
=&\ -\log P(D=1\mid w^{(t)}, w^{(t+j)}) - \sum_{k=1,\ w_k \sim P(w)}^K \log P(D=0\mid w^{(t)}, w_k)\\
=&\ -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^K \log\left(1-\sigma\left(\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)\right)\\
=&\ -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^K \log\sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right).
\end{aligned}$$

We can see that
now the computational cost for gradients
at each training step
has nothing to do with the dictionary size,
but linearly depends on $K$.
When setting the hyperparameter $K$ to a smaller value,
the computational cost for gradients
at each training step with negative sampling is smaller.
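
As a concrete sketch of this loss, the NumPy snippet below evaluates it for a single (center, context) pair with `K = 5` noise words. All vectors and indices are made up, and the noise words are drawn uniformly here purely for simplicity; the actual predefined distribution $P(w)$ may differ. Note that only $1 + K$ rows of `U` are touched, not the entire vocabulary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, embed_dim, K = 10000, 64, 5  # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(vocab_size, embed_dim))  # context vectors u
v_center = rng.normal(size=embed_dim)         # center word vector v_{i_t}

context_idx = 42                                 # index i_{t+j} (made up)
noise_idx = rng.integers(0, vocab_size, size=K)  # indices h_1, ..., h_K

# -log sigma(u_{i_{t+j}}^T v) for the positive pair, plus
# -log sigma(-u_{h_k}^T v) for each of the K noise words.
loss = -np.log(sigmoid(U[context_idx] @ v_center))
loss -= np.log(sigmoid(-(U[noise_idx] @ v_center))).sum()
print(loss)  # scalar loss; cost is O(1 + K), independent of vocab_size
```
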
## Hierarchical Softmax

As an alternative approximate training method,
*hierarchical softmax*
uses the binary tree,
a data structure
illustrated in :numref:fig_hi_softmax,
where each leaf node
of the tree represents
a word in dictionary $\mathcal{V}$.

![Hierarchical softmax for approximate training, where each leaf node of the tree represents a word in the dictionary.](../img/hi-softmax.svg)
:label:fig_hi_softmax

Denote by $L(w)$ the number of nodes (including both ends)
on the path from the root node to the leaf node
representing word $w$ in the binary tree.
Let $n(w, j)$ be the $j^\textrm{th}$ node on this path,
with its context word vector being $\mathbf{u}_{n(w, j)}$.
For example, $L(w_3) = 4$ in :numref:fig_hi_softmax.
Hierarchical softmax approximates the conditional probability
in :eqref:eq_skip-gram-softmax
as

$$P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left( [\![ n(w_o, j+1) = \textrm{leftChild}(n(w_o, j)) ]\!] \cdot \mathbf{u}_{n(w_o, j)}^\top \mathbf{v}_c\right),$$

where function $\sigma$ is defined in :eqref:eq_sigma-f,
and $\textrm{leftChild}(n)$ is the left child node of node $n$:
if $x$ is true, $[\![x]\!] = 1$; otherwise $[\![x]\!] = -1$.
To illustrate,
let us calculate
the conditional probability
of generating word $w_3$
given word $w_c$ in :numref:fig_hi_softmax.
This requires dot products
between the word vector $\mathbf{v}_c$ of $w_c$
and non-leaf node vectors on the path
(the path in bold in :numref:fig_hi_softmax)
from the root to $w_3$,
which is traversed left, right, then left:

$$P(w_3 \mid w_c) = \sigma(\mathbf{u}_{n(w_3, 1)}^\top \mathbf{v}_c) \cdot \sigma(-\mathbf{u}_{n(w_3, 2)}^\top \mathbf{v}_c) \cdot \sigma(\mathbf{u}_{n(w_3, 3)}^\top \mathbf{v}_c).$$
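
A minimal NumPy sketch of this computation follows, with made-up node vectors and the left, right, left path encoded as the signs $+1, -1, +1$; the names and sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embed_dim = 64  # illustrative size
rng = np.random.default_rng(0)
v_c = rng.normal(size=embed_dim)             # center word vector v_c
path_u = rng.normal(size=(3, embed_dim))     # u_{n(w_3, 1)}, ..., u_{n(w_3, 3)}
signs = np.array([1.0, -1.0, 1.0])           # left (+1), right (-1), left (+1)

# P(w_3 | w_c): only L(w_3) - 1 = 3 dot products, versus |V| for full softmax.
prob = np.prod(sigmoid(signs * (path_u @ v_c)))
print(prob)
```
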
Since $\sigma(x)+\sigma(-x) = 1$,
it holds that the conditional probabilities
of generating all the words in dictionary $\mathcal{V}$
based on any word $w_c$
sum up to one:

$$\sum_{w \in \mathcal{V}} P(w \mid w_c) = 1.$$
:eqlabel:eq_hi-softmax-sum-one

Fortunately,
since $L(w_o)-1$ is on the order of
$\mathcal{O}(\log_2|\mathcal{V}|)$
due to the binary tree structure,
when the dictionary size $|\mathcal{V}|$ is huge,
the computational cost for each training step
using hierarchical softmax
is significantly reduced
compared with that without approximate training.

## Summary

- Negative sampling constructs the loss function by considering mutually independent events that involve both positive and negative examples. The computational cost for training is linearly dependent on the number of noise words at each step.
- Hierarchical softmax constructs the loss function using the path from the root node to the leaf node in the binary tree. The computational cost for training is dependent on the logarithm of the dictionary size at each step.

## Exercises

- How can we sample noise words in negative sampling?
- Verify that :eqref:eq_hi-softmax-sum-one holds.
- How can we train the continuous bag-of-words model using negative sampling and hierarchical softmax, respectively?