Discussion:
[gensim:11788] Doc2vec model: compute similarity on a corpus obtained using a pre-trained doc2vec model
José Santos
2018-11-17 18:30:54 UTC
Permalink
Hi there,

I have a doc2vec model trained on multiple documents. I would like to use
that model to infer the vectors of another document, which I want to use
as the corpus for comparison. So, when I look for the sentence most
similar to one I introduce, the search should use these new document
vectors instead of the trained corpus.
Currently, I am using infer_vector() to compute the vector for each of the
sentences of the new document, but I can't use the most_similar()
function; I run into this error: AttributeError: 'numpy.ndarray' object
has no attribute 'most_similar'.

I would like to know if there's any way that I can compute these vectors
for the new document that will allow the use of the most_similar()
function, or if I have to compute the similarity between each one of the
sentences of the new document and the sentence I introduce individually (in
this case, is there any implementation in Gensim that allows me to compute
the cosine similarity between 2 vectors?).

I am new to Gensim and NLP, and I'm open to your suggestions.

Thank you,
José
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Gordon Mohr
2018-11-19 23:22:52 UTC
Permalink
`infer_vector()` only calculates a vector for a single text example. On
its single return vector, a plain numpy array, there's no meaningful
notion of `most_similar()`.

You can calculate the pairwise similarity between any pair of vectors,
either in your own calculations or using a utility static method such as
`WordEmbeddingsKeyedVectors.cosine_similarities()` (which compares a
single vector against an array of one or more vectors). See:

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.cosine_similarities

...or to view the source as a model for your own code...

https://github.com/RaRe-Technologies/gensim/blob/7e4965ee6c9d4e200dae6fb089b46c2ebc27e159/gensim/models/keyedvectors.py
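As a model for your own code, here is a minimal numpy sketch of the same computation that utility performs: the cosine similarity between one query vector and each row of a 2-D array of vectors. The function name and the toy vectors here are illustrative, not gensim's API:

```python
import numpy as np

def cosine_similarities(vector_1, vectors_all):
    """Cosine similarity between one vector and each row of a 2-D array.

    Mirrors the behaviour of gensim's cosine_similarities utility:
    dot product of each row with the query, divided by the norms.
    """
    norm = np.linalg.norm(vector_1)
    all_norms = np.linalg.norm(vectors_all, axis=1)
    dot_products = vectors_all @ vector_1
    return dot_products / (norm * all_norms)

# Toy example: a 2-D query against three 2-D "document" vectors.
query = np.array([1.0, 0.0])
corpus = np.array([[1.0, 0.0],    # same direction  -> similarity  1.0
                   [0.0, 1.0],    # orthogonal      -> similarity  0.0
                   [-1.0, 0.0]])  # opposite        -> similarity -1.0
sims = cosine_similarities(query, corpus)
```

In practice `query` and the rows of `corpus` would be the vectors returned by `infer_vector()` for your sentences.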

The various gensim `KeyedVectors` classes don't currently offer dynamic
addition of your own vectors (such as those returned by
`infer_vector()`)... so you'd have to collect those in your own object &
calculate/sort the most-similars, or after creating a large group of
vectors patch them into some `KeyedVectors` instance to re-use its
`most_similar()` logic.
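A sketch of that first option, collecting the inferred vectors yourself and sorting by cosine similarity. The random matrix below just stands in for the vectors you would get from repeated `model.infer_vector(tokens)` calls on a real model; the `most_similar()` helper here is your own code, not gensim's:

```python
import numpy as np

# Stand-in data: in real use, each row of `doc_vectors` would come from
# model.infer_vector(tokens) for one sentence of the new document.
sentences = ["first sentence", "second sentence", "third sentence"]
rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((len(sentences), 50))

# Normalize each row once, so dot products become cosine similarities.
unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def most_similar(query_vector, topn=2):
    """Return (sentence, cosine similarity) pairs, best first."""
    q = query_vector / np.linalg.norm(query_vector)
    sims = unit @ q
    best = np.argsort(sims)[::-1][:topn]  # indices of highest similarities
    return [(sentences[i], float(sims[i])) for i in best]

# Querying with a vector from the collection returns that sentence first,
# with similarity 1.0.
results = most_similar(doc_vectors[0])
```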

A version of `infer_vector()` that works on batches of documents has
sometimes been discussed, as have improved versions of `KeyedVectors` for
wider usage, but I don't know if any such improvements are likely soon.

- Gordon