Zakaria K
2018-12-02 11:45:09 UTC
Hello there,
I am working on a modified version of gensim's word2vec (the Cython version)
that allows multi-lingual training: it takes 2 aligned sentences from a
sentence-aligned corpus and trains them together. Beforehand, I capture
some multi-lingual features during scan_vocab and then use those features in
train_batch_sg to find the relevant multi-lingual window size.
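Roughly, a simplified pure-Python sketch of what I mean (the names below are
placeholders I made up for illustration; they are not gensim's actual
internals, and my real feature and window computations differ):

```python
from collections import defaultdict

def scan_vocab_multi(aligned_corpus):
    """Capture a per-word cross-lingual statistic while scanning the
    sentence-aligned corpus: here, the total number of aligned
    target-side tokens each source-side word co-occurs with."""
    features = defaultdict(int)
    for src_sent, tgt_sent in aligned_corpus:
        for word in src_sent:
            features[word] += len(tgt_sent)
    return features

def multi_window(word, features, base_window=5, cap=10):
    """Derive a per-word multi-lingual window size from the captured
    feature, clipped to a maximum (placeholder formula)."""
    return min(cap, base_window + features[word] // 100)

corpus = [(["le", "chat", "dort"], ["the", "cat", "sleeps"])]
feats = scan_vocab_multi(corpus)
```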
Questions:
1- Why is MAX_SENTENCE_LEN equal to 10000? Is there any reason for a batch
to be specifically this size?
2- In my modified version I call fast_sentence_sg_neg() 4 times: twice with
the mono-lingual window size (once per sentence), and twice with the
multi-lingual window size (again once per sentence). I have shared an
example [1] of this process below. Would this affect the training process,
since I am not respecting the original call order of fast_sentence_sg_neg()?
Or is there something else I should be careful about, for example the
MAX_SENTENCE_LEN from my first question?
[attached image: example of the 4-call training process]
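In pure-Python pseudocode, the call pattern I mean looks like this (the real
routine lives in gensim's Cython code; the stub below just records
invocations and ignores fast_sentence_sg_neg's actual signature):

```python
def train_aligned_pair(sent_src, sent_tgt, mono_window, multi_window, sg_neg):
    # Mono-lingual passes: each sentence is trained on its own context.
    sg_neg(sent_src, sent_src, mono_window)   # call 1
    sg_neg(sent_tgt, sent_tgt, mono_window)   # call 2
    # Multi-lingual passes: each sentence predicts the aligned sentence.
    sg_neg(sent_src, sent_tgt, multi_window)  # call 3
    sg_neg(sent_tgt, sent_src, multi_window)  # call 4

# Stub standing in for the Cython routine, to show the order of the 4 calls.
calls = []
def sg_neg_stub(center, context, window):
    calls.append((tuple(center), tuple(context), window))

train_aligned_pair(["le", "chat"], ["the", "cat"], 5, 3, sg_neg_stub)
```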
3- For multi-threaded training, my modified version is faster with one
worker than with multiple workers. Can you think of any reason why this
happens?
Gensim version: 3.2.0
Please let me know if you need any other details. By the way, I would love
to share my work with you once the paper I'm working on gets published.
Thank you very much for your time,
Best regards,
Zakaria
[1] Luong, Thang, Hieu Pham, and Christopher D. Manning. "Bilingual word
representations with monolingual quality in mind." *Proceedings of the 1st
Workshop on Vector Space Modeling for Natural Language Processing*. 2015.
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.