Discussion:
[gensim:11773] Question about using multiple tags for Doc2Vec
Alex H.
2018-11-13 00:36:19 UTC
As I mentioned in my previous post, I am trying to build a Doc2Vec model
with a set of emails. Each document has at least one tag. The tag that every
email has is its own unique email ID. In addition to that, a
large chunk of the emails have a second tag, which is the email sender's
email address. The idea of including the email sender tag is that perhaps
the algorithm can capture patterns that are characteristic of specific
email senders. However, a subset of the emails don't have the email sender
tag because of missing metadata. I wonder what the algorithm would do with
documents that don't have the second tag. Thanks!
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Gordon Mohr
2018-11-13 02:19:10 UTC
Only the non-default "PV-DM with input-concatenation" mode (`dm=1,
dm_concat=1`) needs a constant number of doc-tags per text. And, that mode
results in much larger models, much slower training, and has rarely had a
demonstrable benefit – so it is best considered experimental.
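
If you do want to experiment with that mode, a quick pre-flight check on the corpus might look like the sketch below. The helper name `has_constant_tag_count` and the sample emails are hypothetical, made up for illustration; the only thing assumed about the documents is that each one exposes a `tags` list, as `TaggedDocument` does.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags),
# so this sketch runs even where gensim is not installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def has_constant_tag_count(corpus):
    """Return True if every document carries the same number of tags."""
    return len({len(doc.tags) for doc in corpus}) <= 1


corpus = [
    TaggedDocument(["see", "you", "at", "noon"], ["email_1", "alice@example.com"]),
    TaggedDocument(["invoice", "attached"], ["email_2"]),  # sender unknown
]

# Mixed tag counts: fine in most modes, a problem for dm=1, dm_concat=1.
print(has_constant_tag_count(corpus))
```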

So, in most modes, you can create `TaggedDocument` texts that have one,
two, or more tags and mix them together in the same training corpus, and
training will handle it just fine.
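
As a minimal sketch of such a mixed corpus: the records, the `to_tagged` helper, and the training parameters below are illustrative assumptions, not values from this thread. Emails with a known sender get two tags, the rest get only their ID.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, so the corpus-building part runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical records: (email_id, sender or None, body text)
emails = [
    ("email_1", "alice@example.com", "Meeting moved to Friday"),
    ("email_2", None, "Invoice attached"),  # sender metadata missing
    ("email_3", "alice@example.com", "Agenda for Friday"),
]


def to_tagged(email_id, sender, text):
    """Always tag with the email ID; add the sender tag only when it is known."""
    tags = [email_id]
    if sender:
        tags.append(sender)
    # Naive whitespace tokenization, just for the example.
    return TaggedDocument(words=text.lower().split(), tags=tags)


corpus = [to_tagged(*record) for record in emails]


def train(corpus):
    """Train without input concatenation; requires gensim. Parameters are arbitrary."""
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec(corpus, dm=1, dm_concat=0, vector_size=50, min_count=1, epochs=20)
```

Calling `train(corpus)` on a real corpus would then learn one vector per email ID plus one per distinct sender address.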

Note that when you repeat a doc-tag across many texts, it's somewhat
analogous to having an extra synthetic training document which is the
concatenation of all those texts, which trains that shared doc-tag.
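
To make that analogy concrete, the sketch below collects the words each tag is trained against. It only illustrates the "synthetic document" picture and is not how gensim is implemented internally; the sample texts are hypothetical.

```python
from collections import defaultdict

# Hypothetical (words, tags) pairs; the last email has no sender tag.
texts = [
    (["meeting", "moved", "to", "friday"], ["email_1", "alice@example.com"]),
    (["friday", "works", "for", "me"], ["email_2", "bob@example.com"]),
    (["agenda", "for", "friday"], ["email_3", "alice@example.com"]),
    (["no", "sender", "metadata"], ["email_4"]),
]

# For each tag, gather the words of every text that carries it.
words_seen_by_tag = defaultdict(list)
for words, tags in texts:
    for tag in tags:
        words_seen_by_tag[tag].extend(words)

# The shared sender tag ends up associated with the words of both of Alice's emails,
# while each per-email ID tag only ever sees its own text.
print(words_seen_by_tag["alice@example.com"])
print(words_seen_by_tag["email_1"])
```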

Adding such extra tags will require extra training time and *might*
somewhat weaken the strength of the other per-document tags. The training
process is a sort of tug-of-war of all-against-all, with the model and each
vector being nudged towards increased text predictiveness when it is being
trained, but not necessarily in a way that helps other vectors. So at some
margins the quality of these extra 'sender' vectors may trade off against
the quality of the 'id' vectors.

It might still help, especially if there are meaningful sender-related
word-usage patterns, and the value of having such 'sender vectors' to you
is high. But it's not automatically beneficial or harmless, so you may want
to test it both ways.

- Gordon