Discussion:
[gensim:11790] Learning Doc2Vec from a pre-trained Word2Vec
Nafla Alrumayyan
2018-11-19 23:24:22 UTC
Permalink
Hello there,

My task requires training the Doc2vec model using *my pre-trained word2vec vectors*.

Actually, I'm using a special kind of word vector. So now I have the word vectors, and I need to use them to train the gensim Doc2vec and find the most_similar docs.

I read about intersect_word2vec_format and other solutions, but apparently it's available only for Python 2.

Is there any solution for doing this with Python 3?


Thanks
Nafla
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Gordon Mohr
2018-11-20 03:34:45 UTC
Permalink
Doc2Vec doesn't require word-vectors as an input. If you have enough data
to train the doc-vectors, you can train directly on that. (And if you're
using a mode of Doc2Vec that involves word-vectors, it can train those from
scratch simultaneous with the doc-vector training, from the same training
data.)

So, why do you want to use pre-trained word-vectors? What benefits do you
expect, and have you tried just doing a typical Doc2Vec training?

Some tricks with the experimental `intersect_word2vec_format()` could
sort-of work for pre-seeding a Doc2Vec model with external word-vectors, in
earlier versions of gensim, but that was broken by a recent refactoring.
(Python 2-vs-3 shouldn't be a factor at all.) Still, if you had a pressing
need, you could use that method's code as a model, and run the same sort of
patching-in-some-word-vectors process from outside the model. (It's only a
few dozen lines of code, and everything it did inside the model can still be
applied from outside the model.)
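Concretely, that patching amounts to overwriting rows of the model's word-vector weight array after `build_vocab()` and before `train()`. A minimal sketch, assuming gensim-3.x-style attributes (a `model.wv.vocab` dict whose entries carry an `.index`, and the weight array at `model.wv.vectors`); treat the names as illustrative, since they shift between gensim versions:

```python
def patch_word_vectors(model, external_vectors):
    """Overwrite a model's in-vocabulary word-vectors with external ones.

    `external_vectors` maps word -> vector (any sequence of floats with
    the model's dimensionality). Words missing on either side are left
    untouched, mirroring what intersect_word2vec_format() did.
    """
    patched = 0
    for word, vector in external_vectors.items():
        entry = model.wv.vocab.get(word)
        if entry is None:
            continue  # word not in the model's vocabulary: skip it
        model.wv.vectors[entry.index] = vector
        patched += 1
    return patched
```

To keep the patched words frozen during subsequent training, `intersect_word2vec_format()` additionally zeroed those words' entries in the lock-factor array (`model.trainables.vectors_lockf` in gensim 3.x); the same trick applies here.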

- Gordon
Nafla Alrumayyan
2018-11-20 17:59:11 UTC
Permalink
Actually, yes, I have a pressing need to do this. I'm using the word2vec
implementation illustrated in this paper: "Task-oriented Word Embedding for
Text Classification".

So I want to use the output word2vec vectors from the method described in
the paper to train the typical Doc2vec.

Yes, I tried the typical Doc2vec training.
Nafla Alrumayyan
2018-11-20 18:40:47 UTC
Permalink
Actually, yes, I tried the typical Doc2vec, but it doesn't serve my need.

I'm using a special word2vec implementation mentioned in this paper: " "

So now I have the output word vectors from that method, and I want to use
them with the typical Doc2vec to compute the similarity between the
documents.

Is there any way to do this with Python 3?

Thanks, Nafla
Gordon Mohr
2018-11-20 18:55:23 UTC
Permalink
(I saw your missing paper reference in your prior deleted message:
http://aclweb.org/anthology/C18-1172 )

Thanks! That paper itself doesn't seem to use `Paragraph Vectors` (the
algorithm in gensim's Doc2Vec), so I suppose your hope is that its style of
word-vectors might help in some text-classification problem you are facing?

Since that'd be a novel mixture of two techniques, and further the usual
use of Doc2Vec isn't seeded with word-vectors, the merging of the two would
take the kind of custom work described in my previous message. (And, again,
there's no Python 2-vs-3 issue – everything in gensim works equally well in
both. It's just that the `intersect_word2vec_format()` method doesn't
directly work with Doc2Vec in recent gensim versions. It could still be a
model for your own code.)

I only skimmed the paper, but it seems somewhat similar to some prior work
mixing classification into the training of word-vectors, such as:

* Yahoo's 'queryCategorizr', which intersperses known-categories into texts
to make word-embeddings more category-sensitive:
https://astro.temple.edu/~tua95067/grbovic2015wwwB.pdf
* Facebook's FastText classification mode, which trains the word-vectors to
predict known-categories (instead of just nearby-words), and then they work
better on text-categorization problems even when just modeling a text as an
average-of-word-vectors: https://arxiv.org/abs/1607.01759

I mention them because they might get similar classification benefits to
your proposed new hybrid approach, without new code. (It's not clear if
your main aim is evaluating a new publishable technique, or just to get
some classification problem done well.)

The `queryCategorizr` method can be approximated in gensim Word2Vec by
simple text-preprocessing (and indeed is vaguely similar to Doc2Vec
itself). Gensim has partial FastText support – though not its
'classification' mode, for which you could use Facebook's code.
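That preprocessing approximation is simple enough to sketch: intersperse a known-category token among each text's words before ordinary Word2Vec training, so words repeatedly co-occur with their category. A hypothetical helper (the token format and insertion interval are arbitrary choices, not anything from the paper):

```python
def intersperse_category(tokens, category, every=3):
    """Insert a synthetic category token after every `every` words, so
    ordinary Word2Vec training pulls word-vectors toward their known
    category (roughly the queryCategorizr idea)."""
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % every == 0:
            out.append("_CAT_" + category)
    return out

print(intersperse_category(["stocks", "fell", "on", "weak", "earnings"], "finance"))
# → ['stocks', 'fell', 'on', '_CAT_finance', 'weak', 'earnings']
```

The transformed token lists then feed straight into a normal gensim Word2Vec run, and the `_CAT_*` tokens come out with vectors of their own, near the words of their category.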

- Gordon
Nafla Alrumayyan
2018-11-21 22:24:49 UTC
Permalink
Yes, this is the paper, thank you :)


Yeah, I know what this paper does, but my task (it is not a classification
problem) is quite similar, because I need the model to learn the word
embeddings based on specific words in the corpus (the corpus contains 5
different categories); the documents are not labeled.

After that, I need to use the word vectors with gensim Doc2vec to measure
the similarity between the documents, and use infer_vector() with new,
unseen documents to measure their similarity with the others; it is a kind
of labeling the documents by similarity.
Nafla Alrumayyan
2018-11-21 22:26:25 UTC
Permalink
Do you think this is useful?
For everyone interested: train_model.py
<https://github.com/jhlau/doc2vec/blob/master/train_model.py>