Discussion:
[gensim:11747] doc2vec infer_vector for DBOW mode
Yue
2018-11-07 05:24:09 UTC
Hello, I used the DBOW model without training word vectors, which
corresponds to the settings dm=0, dbow_words=0 in the model. After I trained
the model with these settings, what happens when I call infer_vector on an
unseen document?

Was there some kind of backpropagation going on that tries to maximize
some probability in the objective function (I used negative sampling)? If
so, why was infer_vector so quick, given that it also goes through 30
iterations (I used 30 when training the model)? I noticed that I got a vector
output immediately after calling infer_vector.

Also, since the word vectors are left in their initial randomized state, are
we actually only mapping documents to relative positions in space that
reflect their semantic relationships, while the word vectors don't reflect
any semantic relationship between the words? If so, what makes this mode
work well compared to the mixed mode where we also train the word vectors
simultaneously (dm=0, dbow_words=1)? I don't see how we could get document
vectors that reflect semantic relationships between documents without word
vectors that reflect semantic relationships between words.



Thank you,
Yue
Gordon Mohr
2018-11-08 15:41:42 UTC
`infer_vector()` treats the list-of-words much as if it were another training
text. It creates an initially-random, low-magnitude vector as a
candidate-vector for the text, then runs the model with that candidate
vector to try to predict the text's words, then backpropagates corrections.
However, everything in the trained model is held constant, and only the
single candidate-vector is adjusted. (During real training, the model's
internal weights would also be adjusted.)
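
For intuition, here's a rough sketch of that loop, outside of gensim, assuming
a hypothetical `frozen_model_gradient()` helper that returns the
prediction-loss gradient with respect to the candidate vector only:

```
import numpy as np

def infer_vector_sketch(frozen_model_gradient, words, size=100,
                        epochs=30, alpha=0.025, min_alpha=0.0001):
    """Toy illustration: only `doc_vec` ever changes; the model is read-only."""
    rng = np.random.default_rng()
    doc_vec = (rng.random(size) - 0.5) / size            # small random starting point
    for epoch in range(epochs):
        # simple linear learning-rate decay over the passes
        lr = alpha - (alpha - min_alpha) * epoch / max(1, epochs - 1)
        for word in words:
            grad = frozen_model_gradient(doc_vec, word)  # model weights never touched
            doc_vec -= lr * grad                         # only the candidate moves
    return doc_vec
```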

By default, it will do this over the text for as many passes as were
specified for training during model initialization. Doing 30 passes over a
single text isn't going to take very long.
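
For example (the saved-model path here is hypothetical, and the keyword
argument is `epochs` in recent gensim releases, while older versions called
it `steps`):

```
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_dbow_model")       # hypothetical previously-trained model
tokens = "some unseen document to infer".split()

vec = model.infer_vector(tokens)                    # uses the default number of passes
vec_more = model.infer_vector(tokens, epochs=100)   # or force more inference passes
```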

In this Doc2Vec 'Paragraph Vector' algorithm, document vectors aren't
directly composed out of per-word-vectors. Rather, per-document-vectors are
incrementally adjusted to become better at predicting document words. And in
this pure PV-DBOW mode (`dm=0, dbow_words=0`), there's no role for
word-vectors at all - so the fact that word-vectors are allocated,
initialized, and then left untouched in this mode is just a side-effect of
sharing code paths with the other modes.
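
For reference, a minimal sketch of setting up this pure PV-DBOW mode (the toy
corpus and parameter values are just placeholders):

```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tiny placeholder corpus; real training data would be far larger
corpus = [
    TaggedDocument("human machine interface for lab computer".split(), [0]),
    TaggedDocument("graph of paths and trees in random graphs".split(), [1]),
]

# dm=0, dbow_words=0: pure PV-DBOW - only doc-vectors are trained
model = Doc2Vec(corpus, dm=0, dbow_words=0, vector_size=100,
                negative=5, min_count=1, epochs=30)
```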

This PV-DBOW mode is fast, and the doc-vectors often work well, simply
because in order to predict similar words, doc-vectors for similar
documents get nudged via training to be closer together.

Modes that add word-training, such as enabling interleaved skip-gram
word-vector training (`dm=0, dbow_words=1`) or moving to PV-DM mode
(`dm=1`), are simultaneously trying to make the word-vectors predictive of
nearby words, too. That might help or hurt the doc-vector quality for a
particular final task evaluation, because the model is now splitting its
time/goal-seeking between making the standalone doc-vectors better and
making the word-vectors better.
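
Side by side, the configurations discussed here differ only in these flags (a
corpus and the other usual arguments would still be supplied):

```
from gensim.models.doc2vec import Doc2Vec

pure_dbow       = Doc2Vec(dm=0, dbow_words=0)  # doc-vectors only
dbow_with_words = Doc2Vec(dm=0, dbow_words=1)  # PV-DBOW + interleaved skip-gram words
pv_dm           = Doc2Vec(dm=1)                # PV-DM: doc-vector + context word-vectors
```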

- Gordon
Yue
2018-11-14 20:36:24 UTC
Hi Gordon, thank you for the answers. I have three follow-up questions
regarding the pure PV-DBOW mode.
First, I remember you mentioned in other posts that the parameter
'window' becomes irrelevant in pure PV-DBOW. Does the model then sample
words from the entire paragraph? Is there still a concept of context
somewhere? In the original doc2vec paper,
https://cs.stanford.edu/~quocle/paragraph_vector.pdf, in section 2.3,
the authors say "In reality, what this means is that at each iteration of
stochastic gradient descent, we sample a *text window*, then *sample a
random word from the text window* and form a classification task given the
Paragraph Vector." It seems there is still a window involved. How is this
implemented in Gensim doc2vec?

Second, in the pure PV-DBOW mode, are the vector sizes of the paragraph
vectors and the word vectors the same? I thought that since there is no
concatenation involved in this mode, they are of the same size.

Third, in the inference stage for pure PV-DBOW mode with negative sampling,
what, besides the word vectors, is fixed? Could you please give a
description? For example, in the original doc2vec paper, in section 2.2,
the authors say (this is for the softmax method) "In summary, the algorithm
itself has two key stages: 1) training to get word vectors W, softmax
weights U, b and paragraph vectors D on already seen paragraphs; and 2) “the
inference stage” to get paragraph vectors D for new paragraphs (never seen
before) by adding more columns in D and gradient descending on D *while
holding W, U, b fixed*." I wonder which weights are fixed in the inference
stage when using negative sampling.



Thank you,
Yue
Gordon Mohr
2018-11-14 23:36:50 UTC
Post by Yue
Hi Gordon, thank you for the answers. I have three follow-up questions
regarding the pure PV-DBOW mode.
First, I remember you mentioned in other posts that the parameter
'window' becomes irrelevant in pure PV-DBOW. Does the model then sample
words from the entire paragraph? Is there still a concept of context
somewhere? In the original doc2vec paper,
https://cs.stanford.edu/~quocle/paragraph_vector.pdf, in section 2.3,
the authors say "In reality, what this means is that at each iteration of
stochastic gradient descent, we sample a *text window*, then *sample a
random word from the text window* and form a classification task given
the Paragraph Vector." It seems there is still a window involved. How is
this implemented in Gensim doc2vec?
In pure PV-DBOW (without `dbow_words` set), the `window` parameter has no
effect on anything. The only 'window' is the whole document. In the Doc2Vec
implementation (like the Word2Vec implementation it was based on), there's
no random sampling of target words – rather, each word in the text gets a
chance, in order, to be the 'center' predicted word.
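
A schematic (not gensim's actual code) of how one pure PV-DBOW training pass
treats a single document, under that description:

```
def dbow_pass_sketch(doc_vector, words, predict_and_update):
    """Every in-vocabulary word, in order, becomes the prediction target;
    there is no window and no random sampling of targets.
    `predict_and_update` is a hypothetical stand-in for the forward pass,
    negative sampling, and gradient step."""
    for word in words:
        predict_and_update(input_vector=doc_vector, target_word=word)
```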

Post by Yue
Second, in the pure PV-DBOW mode, are the vector sizes of the paragraph
vectors and the word vectors the same? I thought that since there is no
concatenation involved in this mode, they are of the same size.
In all modes, word-vectors and doc-vectors are always the same size (a quick
check is sketched after the list below). Specifically:

* In PV-DM-with-sum-or-average, this is necessary to combine the word- and
doc-vectors.
* If interleaving skip-gram training with PV-DBOW using the `dbow_words`
option, this is necessary because the word-vectors and doc-vectors are
interchangeably supplied as input-activations to the same-sized neural
network.
* In pure PV-DBOW, word-vectors aren't actually involved in training, so
the answer has no real meaning (even though word-vectors of the same size are
allocated/initialized because of shared code paths, they're irrelevant to
the process).
* Theoretically, PV-DM with a concatenative input layer (`dm_concat=1`)
*could* have different-sized word-vectors and doc-vectors, but it'd make
the implementation much more complex with no clear benefit.
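
A quick way to see this on any trained model (attribute names per recent
gensim; older releases exposed the doc-vectors as `model.docvecs` rather than
`model.dv`):

```
# assumes `model` is a trained Doc2Vec instance, e.g. a pure PV-DBOW model
print(model.vector_size)            # e.g. 100
print(model.wv.vectors.shape[1])    # word-vector width: the same value
print(model.dv.vectors.shape[1])    # doc-vector width: the same value again
```
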
Post by Yue
Third, in the inference stage for pure PV-DBOW mode with negative sampling,
what, besides the word vectors, is fixed? Could you please give a
description? For example, in the original doc2vec paper, in section 2.2,
the authors say (this is for the softmax method) "In summary, the
algorithm itself has two key stages: 1) training to get word vectors W,
softmax weights U, b and paragraph vectors D on already seen paragraphs;
and 2) “the inference stage” to get paragraph vectors D for new
paragraphs (never seen before) by adding more columns in D and gradient
descending on D *while holding W, U, b fixed*." I wonder which weights
are fixed in the inference stage when using negative sampling.
In pure PV-DBOW, the word-vectors are more than just "fixed" - they're
totally irrelevant/unconsulted, just like they were irrelevant/unconsulted
during training.

During inference, the only thing that varies with backpropagated
corrections over the multiple inference passes is the single new candidate
doc-vector. Everything else about the model is frozen against further
changes. (Otherwise, each inference would mutate the model, changing its
behavior for future inferences!)
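
As a rough sketch of a single negative-sampling update during inference in
pure PV-DBOW (not gensim's actual code; `syn1neg` is gensim's name for the
output-layer weights used by negative sampling, though where that array lives
on the model object varies by gensim version):

```
import numpy as np

def ns_infer_step(doc_vec, target_idx, noise_idxs, syn1neg, lr):
    """One update: the output weights `syn1neg` are read but never written;
    only the candidate doc-vector is adjusted."""
    rows = [target_idx] + list(noise_idxs)                   # 1 positive + k negatives
    labels = np.array([1.0] + [0.0] * len(noise_idxs))
    scores = 1.0 / (1.0 + np.exp(-syn1neg[rows] @ doc_vec))  # sigmoid predictions
    errors = labels - scores
    grad = errors @ syn1neg[rows]                            # gradient w.r.t. doc_vec only
    # during training syn1neg[rows] would also be nudged here; during
    # inference it stays frozen, like every other model parameter
    return doc_vec + lr * grad
```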

- Gordon
Yue
2018-11-20 05:37:53 UTC
Pardon me for being obtuse here, Gordon, but could you please elaborate on
your statement:
"In pure PV-DBOW, the word-vectors are more than just "fixed" - they're
totally irrelevant/unconsulted, just like they were irrelevant/unconsulted
during training.
During inference, the only thing that varies with backpropagated
corrections over the multiple inference passes is the single new candidate
doc-vector. *Everything else about the model is frozen against further
changes*. (Otherwise, each inference would mutate the model, changing its
behavior for future inferences!)"
I understand that in the inference stage, the only things being nudged are
the paragraph vectors of the documents being inferred. But could you
specifically point out what is "frozen against further changes"? For
example, in section 2.2 of the original doc2vec paper
https://cs.stanford.edu/~quocle/paragraph_vector.pdf, the authors say W, U, b
are held fixed during inference, where W, U, b come from the softmax
equation, y = b + Uh(w_{t-k}, ..., w_{t+k}; W), equation (1) in the paper.
For negative sampling, is there also an equation for y with similar
parameters such as W, U, b (I see that you pointed out that W here is not
only fixed, but irrelevant) that are held fixed during inference? I just
want a more exact way to refer to the things that are fixed during
inference. I did not find an answer in the related doc2vec or word2vec
papers.



Thank you,
Yue