Discussion:
[gensim:11769] Some general guidelines for adjusting Doc2Vec parameters based on the number of documents that are available for training?
Alex H.
2018-11-12 19:03:37 UTC
I have over 30 million emails (each with more than 100 word tokens) on which
I would like to train paragraph embeddings with Doc2Vec. Are there any
general guidelines for how to adjust the parameters based on the size of the
data set? For example, with more documents, can we reduce the vector_size,
and perhaps increase min_count and max_vocab_size? And with more documents,
can we use fewer epochs with no ill effect, or should we increase them? Thanks!
Gordon Mohr
2018-11-12 22:11:22 UTC
The particulars depend a lot on your data & intended application, so
tinkering within your own evaluation setup is necessary. But a few areas to
explore (a combined parameter sketch follows these bullets):

* generally, more data can support *larger* vectors, but those take more
time and memory to train, and larger isn't always better for downstream
tasks.

* as your dataset & vocabulary grow, words with just a few instances may be
less interesting, and the added model size for a larger vocabulary may
become a concern, so increasing `min_count` can make sense.

* I wouldn't suggest ever using `max_vocab_size` unless it was the only way
to avoid memory errors; it forces a very crude pruning during the initial
vocabulary scan. On the other hand, `max_final_vocab` can be used, instead
of or in addition to `min_count`, to indicate: "keep no more than this many
of the most-frequent vocabulary words".

* in the typical case where every document has its own unique tag, that
doctag's vector only improves when that one document is being trained into
the model. So, reducing the number of passes likely makes that vector
worse, and more data/training on other vectors can't make up the
difference. (Note that this isn't the case for word-vectors: having twice
as much data, and thus twice as many natural examples of a word's use
spread throughout the corpus, might justify half as many training passes.)
An `epochs` value of 10-20 is still likely to be a good starting point,
perhaps trying more if your own evaluations can confirm improvement.

* larger corpora may benefit from a more-aggressive `sample` parameter (a
smaller value, e.g. `1e-05` or `1e-06`), discarding more of the
most-frequent words – and thus perhaps improving the influence of
less-common words, or freeing time for more passes or for expanding other
parameters that would otherwise slow training.

* larger corpora may do just as well with a smaller `negative` parameter
(saving some training time).

* a previously-fixed parameter, `ns_exponent`, is now available for
adjustment (see <https://github.com/RaRe-Technologies/gensim/pull/2093>)
and the research paper mentioned in that issue (which motivated the change)
suggests that non-default values may especially help improve vectors for
classification contexts.
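
Pulling these together, here is a minimal sketch of how the knobs above
might appear in a Doc2Vec setup. The corpus class, file name, and every
numeric value are illustrative assumptions rather than recommendations, and
passing `max_final_vocab` and `ns_exponent` as keyword arguments assumes a
recent gensim release where `Doc2Vec` accepts them:

```python
# Minimal sketch: a Doc2Vec setup exercising the parameters discussed above.
# The EmailCorpus class, the file name, and all numeric values here are
# illustrative assumptions, not recommendations.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class EmailCorpus:
    """Streams one pre-tokenized email per line from a text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Typical case: each document gets its own unique tag.
                yield TaggedDocument(words=line.split(), tags=[i])

corpus = EmailCorpus("emails_tokenized.txt")  # hypothetical corpus file

model = Doc2Vec(
    vector_size=300,          # more data can support larger vectors, at some cost
    min_count=20,             # discard rarer words as the vocabulary grows
    max_final_vocab=500_000,  # cap surviving vocabulary (preferred over max_vocab_size)
    sample=1e-5,              # more aggressive downsampling of very frequent words
    negative=5,               # a smaller `negative` may suffice on big corpora
    ns_exponent=0.75,         # now adjustable; 0.75 is the long-standing default
    epochs=15,                # 10-20 passes is a reasonable starting range
    workers=8,
)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

Afterwards, a vector for a new, unseen email can be obtained with
`model.infer_vector(tokens)`.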

- Gordon
Alex H.
2018-11-13 00:28:04 UTC
Thanks Gordon, that's very helpful!