Can you give an example of what you mean?
Three of the things you mention primarily create word-vectors (Word2Vec,
FastText, GloVe), and so wouldn't be directly comparable to Doc2Vec's main
output of document-vectors.
To the extent that some modes of Doc2Vec also create word-vectors, the
training is essentially identical to Word2Vec, so the quality of
word-vectors for the same number of training passes will be very, very
similar (only changed somewhat by the simultaneous interleaved updates of
extra doc-vectors). I'd be surprised if you had an example showing that
similar training-parameters on similar modes of Doc2Vec and Word2Vec
resulted in much-different end word-vector quality.
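For example, here's a rough gensim sketch of the kind of side-by-side check
I mean (assuming gensim 4.x parameter names like vector_size/epochs, and
using its tiny bundled common_texts toy corpus just to show the shape of the
comparison):

    from gensim.models import Word2Vec, Doc2Vec
    from gensim.models.doc2vec import TaggedDocument
    from gensim.test.utils import common_texts  # tiny bundled toy corpus

    shared = dict(vector_size=50, window=5, min_count=1, workers=4, epochs=5)

    # Plain Word2Vec (CBOW here; pass sg=1 for skip-gram)
    w2v = Word2Vec(sentences=common_texts, **shared)

    # Doc2Vec in PV-DM mode (dm=1) trains word-vectors too, interleaved
    # with the per-document vectors
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]
    d2v = Doc2Vec(documents=tagged, dm=1, **shared)

    # With matched parameters, the same probe word's neighborhood
    # should come out very similar in both models
    print(w2v.wv.most_similar('computer', topn=5))
    print(d2v.wv.most_similar('computer', topn=5))

On a real corpus, with enough epochs, I'd expect those two most_similar()
lists to be very close.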
With regard to the doc-vectors themselves, published Doc2Vec work tends to
use 10-20 iterations or more, to ensure that each doc-vector gets those
10-20 update cycles on its own text. It's similar to hoping for a word
corpus having 10-20 or more diverse examples of a word, throughout the whole
span of the corpus. (With a large enough corpus concerned only with
word-representations, the last 10 occurrences of word X may be just as
good/diverse/representative as the first 10 occurrences, so you might get
good word-vectors with just 1 or a few passes: every word was seen in
enough varied contexts, early and middle and late in the model's evolution.
With a unique doc-vector that only applies to a unique text, only explicit
multiple iterations can allow it to participate in influencing the model,
and being influenced by the improving model, early and middle and late.)
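As a concrete illustration (again a gensim 4.x sketch on the toy
common_texts corpus; real work would use a real corpus), the epochs value is
what gives each doc-vector its 10-20 update cycles, and the same applies to
inferring vectors for new texts afterwards:

    from gensim.models import Doc2Vec
    from gensim.models.doc2vec import TaggedDocument
    from gensim.test.utils import common_texts

    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]

    # Each doc-vector is only adjusted when its own text is visited, so it
    # gets exactly `epochs` update cycles; published work often uses 10-20+
    model = Doc2Vec(documents=tagged, vector_size=50, min_count=1,
                    dm=1, epochs=20, workers=4)

    # Trained vector for the document tagged 0
    print(model.dv[0][:5])

    # Inference for a new/held-out text also benefits from multiple passes
    new_doc = ['human', 'computer', 'interaction']
    vec = model.infer_vector(new_doc, epochs=20)
    print(vec[:5])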
GloVe uses a totally different method of dimensionality-reduction from the
co-occurrences statistics compared to the other neural-network-based
algorithms you list, so any "iterations"-like parameter in GloVe's setup is
unlikely to be comparable, in terms of total calculations required, to the
other algorithms, even if its parameter name or potential range-of-values
seem the same.
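To make the "not comparable in total calculations" point concrete, here's a
back-of-envelope sketch in plain Python; all the counts are made-up
placeholder magnitudes, not measurements:

    # Back-of-envelope: what one "iteration" sweeps in each algorithm.
    corpus_tokens = 100_000_000        # total word occurrences in the corpus
    nonzero_cooc_cells = 400_000_000   # nonzero cells in GloVe's co-occurrence matrix
    window = 5                         # context window size
    negative = 5                       # negative samples per (center, context) pair

    # One Word2Vec/Doc2Vec/FastText-style epoch: roughly one micro-update per
    # (center, context) pair, each involving (1 + negative) dot-products
    w2v_updates_per_epoch = corpus_tokens * 2 * window * (1 + negative)

    # One GloVe iteration: one weighted least-squares update per nonzero
    # co-occurrence cell, regardless of how often that pair occurred
    glove_updates_per_iter = nonzero_cooc_cells

    print(f"~{w2v_updates_per_epoch:.2e} updates per Word2Vec epoch")
    print(f"~{glove_updates_per_iter:.2e} updates per GloVe iteration")

Since the per-pass work differs by orders of magnitude, equal iteration
counts don't mean equal amounts of training.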
FastText usually involves training vectors for many more subword tokens in
addition to full words: a larger model with more cross-interference (or
reinforcement) between the many token-vectors being updated. Whether that
requires more or less training for a certain corpus/task, I'd have to see
experimental results, but it's certainly different enough that any
expectation that the exact same number of passes would be equivalent for
end-results would be unjustified. FastText vs Word2Vec does different
kinds/amounts of calculations on different factorings of the original data,
with the intent that one result might be better than the other, so why
would an adjustable parameter (training epochs) have the same optimal
value, just because it has a similar name?
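A quick gensim sketch (gensim 4.x names, the toy corpus again, and a
deliberately small bucket value) of how many more vectors FastText is
juggling per pass compared to Word2Vec:

    from gensim.models import Word2Vec, FastText
    from gensim.test.utils import common_texts

    params = dict(vector_size=50, min_count=1, epochs=5, workers=4)

    w2v = Word2Vec(sentences=common_texts, **params)
    ft = FastText(sentences=common_texts, min_n=3, max_n=6,
                  bucket=100_000, **params)

    # Word2Vec learns one vector per vocabulary word...
    print("Word2Vec word vectors:", len(w2v.wv))

    # ...while FastText also learns a vector per hashed character-n-gram
    # bucket, so far more free parameters get updated each epoch
    print("FastText word vectors:", len(ft.wv))
    print("FastText n-gram bucket vectors:", ft.wv.vectors_ngrams.shape[0])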
So, each 'epoch' actually involves a different amount of computation, on a
different number of free parameters, in each of these algorithms. Because
of that, perhaps a better aspect-of-comparison would be total training-time
to reach a particular, or best-achievable, performance. Maybe N FastText
epochs take as long as M Word2Vec epochs, because of its extra subword
training, but both take roughly the same T seconds to achieve the same Q
score on a downstream evaluation. That they both reach Q in T seconds, or
that one reaches a better Q' that the other can't reach no matter how much
runtime it's given, are each more relevant than any comparison between
internal parameters N, M.
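Something like the following sketch is what I have in mind; the corpus and
the path to a questions-words.txt-style analogy file are placeholders you'd
supply, and evaluate_word_analogies() is just one example of a downstream Q
score:

    import time
    from gensim.models import Word2Vec, FastText

    # Placeholders you would supply: a real tokenized corpus, and a copy of
    # the standard 'questions-words.txt' analogy file for the evaluation
    corpus = [['replace', 'with', 'your', 'tokenized', 'corpus']]
    eval_path = 'questions-words.txt'

    def time_to_quality(model_cls, epochs, **kwargs):
        # Train for a fixed number of passes and report wall-clock seconds
        # alongside a downstream quality score (word-analogy accuracy here)
        start = time.time()
        model = model_cls(sentences=corpus, epochs=epochs, **kwargs)
        elapsed = time.time() - start
        score, _sections = model.wv.evaluate_word_analogies(eval_path)
        return elapsed, score

    # Compare seconds-to-score, not raw epoch counts: maybe 5 FastText
    # epochs cost about as much time as 10 Word2Vec epochs
    print(time_to_quality(Word2Vec, epochs=10, vector_size=100))
    print(time_to_quality(FastText, epochs=5, vector_size=100))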
- Gordon
Post by t***@gmail.com
I'm curious: why do Word2Vec, Doc2Vec, FastText, and GloVe require very
different numbers of iterations for a comparable result on the same corpus?