[gensim:11729] Tuning Hyperparameters of doc2vec

Discussion:

Rajat Mehta

2018-10-30 14:45:06 UTC

After spending days I am still not able to figure out how to automate the
hyperparameter tuning process of doc2vec. It would be really helpful for me
if someone who has implemented something similar can share some thoughts or
show a code snippet of how to implement GridSearchCV or any other
hyperparameter tuning process.

Here's how I am training my doc2vec model:

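from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import to_unicode

def train_doc2vec(
        self,
        X: List[List[str]],
        epochs: int = 10,
        learning_rate: float = 0.0002) -> Doc2Vec:
    # NOTE: `epochs` and `learning_rate` are not used below; all training
    # settings come from self.params_doc2vec.
    tagged_documents = list()

    # Tag each tokenized document with its index in X.
    for idx, w in enumerate(X):
        td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)])
        tagged_documents.append(td)

    model = Doc2Vec(**self.params_doc2vec)
    model.build_vocab(tagged_documents)

    model.train(tagged_documents,
                total_examples=model.corpus_count,
                epochs=model.epochs)

    return model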

Best Regards,
Rajat

Rajat Mehta

2018-11-12 11:48:51 UTC

Anyone who could give me some insights on this?

Gordon Mohr

2018-11-12 17:48:49 UTC

Neither the native gensim `Doc2Vec` class nor your `train_doc2vec()`
function is suitable for plugging into scikit-learn APIs like
`GridSearchCV`.

However, gensim's `sklearn_api.d2vmodel` module contains a `Doc2Vec` wrapper
class, `D2VTransformer`, that can be used inside `scikit-learn`
pipelines and cross-validation tools. See the docs at:

https://radimrehurek.com/gensim/sklearn_api/d2vmodel.html

Still, `Doc2Vec` is an unsupervised technique and does not itself
make scorable predictions. The doc-vectors need to be scored against some
downstream task in order to tune meta-parameters. So you have to choose
or create your own repeatable quantitative evaluation.
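
As one illustration of what a repeatable quantitative evaluation could look
like, here is a minimal sketch that scores a set of doc-vectors by
leave-one-out nearest-neighbor accuracy against known labels. The choice of
metric and the function names (`cosine`, `loo_nearest_neighbor_accuracy`) are
assumptions made for illustration, not something gensim provides; any
downstream task with labeled data could take its place.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def loo_nearest_neighbor_accuracy(
    vectors: List[Sequence[float]], labels: List[str]
) -> float:
    """Fraction of documents whose most similar *other* document shares their label."""
    if len(vectors) != len(labels) or len(vectors) < 2:
        raise ValueError("need at least two vectors and one label per vector")
    correct = 0
    for i, v in enumerate(vectors):
        # Most similar document, excluding the document itself.
        nearest = max(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine(v, vectors[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


# Toy vectors standing in for inferred doc-vectors (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["a", "a", "b", "b"]
print(loo_nearest_neighbor_accuracy(toy_vectors, toy_labels))
```

In real use the vectors would come from a trained model and the labels from
your own annotated data; the important property is that the same function is
applied unchanged to every candidate model.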

If using the `scikit-learn` classes, that'd likely be some other classifier
downstream of `D2VTransformer` which learns to predict labels, using the
output of the `D2VTransformer` along with some known-labeled training data.

But once you have your own preferred quality-evaluation method, doing a
grid-search in your own code is just: (1) methodically try lots of
meta-parameter combinations; (2) evaluate each resulting D2V model in a
fair way; (3) pick the best-scoring set.
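
The three steps above might be sketched as follows. This is a minimal,
library-independent outline: `grid_search` and `toy_score` are hypothetical
names, and `toy_score` is only a stand-in for a real function that would
train a `Doc2Vec` model with the given settings and return its evaluation
score. The parameter names in the grid (`vector_size`, `window`, `epochs`)
are real `Doc2Vec` options, but the values are arbitrary.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterable, List, Tuple


def grid_search(
    param_grid: Dict[str, Iterable[Any]],
    train_and_score: Callable[..., float],
) -> Tuple[Dict[str, Any], float, List[Tuple[Dict[str, Any], float]]]:
    """Score every parameter combination and return the best one plus all results."""
    names = sorted(param_grid)
    results = []
    # (1) Methodically try every combination of meta-parameters.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # (2) Evaluate each combination with the same scoring function.
        results.append((params, train_and_score(**params)))
    # (3) Pick the best-scoring set (higher score is assumed to be better).
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def toy_score(vector_size: int, window: int, epochs: int) -> float:
    """Stand-in scorer; replace with real model training plus evaluation."""
    return -abs(vector_size - 100) - abs(window - 5) - abs(epochs - 20)


grid = {"vector_size": [50, 100, 200], "window": [2, 5, 10], "epochs": [10, 20]}
best_params, best_score, results = grid_search(grid, toy_score)
print(best_params, best_score)
```

Keeping the scoring function as a separate callable means the search loop
stays the same whichever evaluation method you settle on.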

- Gordon
