Alex H.
2018-11-12 19:03:37 UTC
I have over 30 million emails (each with more than 100 word tokens) on which
I would like to train paragraph embeddings with Doc2Vec. I wonder if there
are any general guidelines for adjusting the parameters based on the size of
the data set. For example, if we have more documents, does that mean we can
reduce vector_size, and perhaps increase min_count and max_vocab_size? With
more documents, can we set a lower number of epochs with no ill effect, or
should we increase the epochs? Thanks!
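For reference, a minimal sketch of how these parameters appear in a gensim Doc2Vec call. The EmailCorpus iterator, the file path, and all the numeric values below are illustrative assumptions for a large streamed corpus, not recommendations:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class EmailCorpus:
    """Hypothetical streaming corpus: one pre-tokenized email per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

corpus = EmailCorpus("emails_tokenized.txt")  # assumed input file

model = Doc2Vec(
    vector_size=300,      # placeholder; the parameter in question
    min_count=20,         # placeholder; higher values shrink the vocabulary
    max_vocab_size=None,  # or an explicit cap if RAM during vocab counting is a concern
    epochs=10,            # placeholder; the epoch count being asked about
    workers=8,            # parallel training threads
    dm=1,                 # PV-DM mode (gensim's default)
)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

Streaming the corpus from disk (rather than holding 30M documents in memory) is the main design choice here; the parameter values themselves are only stand-ins for whatever the guidelines suggest.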