Post by Gordon Mohr
print(sum(1 for _ in sentences)) # total count of training examples
1565475
first = iter(sentences).next() # get 1st item
print(len(first)) # 1st item's length in words 91
And what about the output of the 3rd print statement, "print(first[0:3]) #
1st item's 1st 3 words"?
Also, it would be better to simply run all four suggested lines as given,
after `sentences` was created, then copy & paste the exact 3 lines of
output, rather than pasting results at the end of each line. Now, I'm less
sure that all lines were run together, in order. (Doing that would have
also checked for another common error in people's corpus-iterable-object.
If you've collected the results for different lines in different runs, the
output isn't as useful. If you got any errors trying to run the 4 suggested
lines, that'd be useful info.)
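For reference, here's how I'd run all four lines together in one fresh session, right after `sentences` is created (a minimal sketch; the 2nd line assumes Python 3, where `.next()` becomes `next(...)`):

print(sum(1 for _ in sentences))  # total count of training examples
first = next(iter(sentences))  # get 1st item (on Python 2: iter(sentences).next())
print(len(first))  # 1st item's length in words
print(first[0:3])  # 1st item's 1st 3 words

If the 2nd line raises StopIteration even though the count succeeded, that's a sign `sentences` is a single-pass generator rather than a restartable iterable - one of the common corpus-object errors I mentioned.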
I have also attached the new training loss after I ran it again.
Those are very odd results, in that the difference-in-loss becomes 0 after
10 iterations.
I suspect some or all of:
(1) An error in your difference calculation/display (see the sketch after this list);
(2) A problem with your training corpus; running all 4 requested lines
together would help identify or rule out some of these potential problems.
(3) You've been changing other things about your parameters/code at the
same time as you're following my suggestions, introducing new problems. For
example, your previous strange output was for 20 iterations, and showed
essentially no decrease-in-epoch-loss over 20 passes. This new output shows
25 iterations, and a decrease-in-epoch-loss for the 1st 10 passes, then the
odd stabilization at per-epoch loss of 0. So it looks like you're trying
several things at the same time, without sharing all the details of what
you've changed, making it very hard to guess what could be causing that
output.
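On (1): one frequent pitfall is treating `get_latest_training_loss()` as if it were a per-epoch number, when it's a running tally over the current `train()` call, so the per-epoch loss has to be computed as the difference between successive readings. A rough sketch of one way to do that (assuming `compute_loss=True` and gensim's callback mechanism; your actual code may differ):

from gensim.models.callbacks import CallbackAny2Vec

class EpochLossLogger(CallbackAny2Vec):
    # prints the change in cumulative training loss after each epoch
    def __init__(self):
        self.epoch = 0
        self.last_cumulative = 0.0

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()  # running total for this train() call
        print('epoch %d: loss delta %.2f' % (self.epoch, cumulative - self.last_cumulative))
        self.last_cumulative = cumulative
        self.epoch += 1

# hypothetical usage: Word2Vec(sentences, compute_loss=True, callbacks=[EpochLossLogger()])

If your difference code instead re-reads the same stale value each epoch, you'd see exactly the kind of flat-zero deltas your output shows.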
Post by Gordon Mohr
If I cannot compare the training loss of 2 different models, then how can I
know which parameters are better suited for my data?
As mentioned in my 1st response on this thread:
"And while loss is definitionally the thing that the single Word2Vec model
is locally optimizing, it's not the thing to optimize in the whole system
of model-plus-downstream-uses. That should be some quantitative measurement
of model quality specific to your downstream tasks, and the smallest-loss
Word2Vec model is unlikely to be the best-general-performance model for
downstream tasks."
That means: you have to test the resulting model/word-vectors on some
version of the real task(s) where you want to use word-vectors. That's the
only real measure of whether you've chosen good parameters.
If you don't have a way to run such a test, you could look at other more
generic measures - there's a method `evaluate_word_analogies()` on the
word-vectors object (`model.wv`) that can be fed a series of word-analogy
problems from the original Google word2vec.c release, and return a score on
that task. But of course that may not test your corpus's most important
words, and further, word-vectors that do best on analogies may not do best
for classification problems, or info-retrieval, or other tasks. To know
which parameters are best for your project, you need to check them against
some version of that task.
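For example, a minimal sketch of that analogy check (assuming a trained model in `model`, and using the copy of `questions-words.txt` bundled with gensim's test data):

from gensim.test.utils import datapath

# returns an overall accuracy plus per-category section details
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print('analogy accuracy: %.4f' % score)

Higher is better, but again: treat it as a rough sanity check, not as the measure of what's best for your downstream task.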
- Gordon
Thanks
Post by Gordon Mohr
Post by Gordon Mohr
Post by Heta Saraiya
Okay, thank you so much for the help. I only have one more question. If I
change the parameters and train again, then can I compare the loss values to
the current values to see which model performs better?
No, as mentioned previously, the loss is not a reliable indicator of
overall model quality. The model with the lowest loss could perform worse
on real tasks, as in the given example of an overfit model. It's just an
indicator of training progress, and when loss stops improving it's a hint
that further training can't help.
Further, many of the parameters change the type/amount of training that
happens. For example, a different `negative` value means more
negative-examples are trained. A different `window` means more
(context->target) examples are constructed. A different `sample` value
drops a different proportion of words. A different `min_count` drops
different low-frequency words. The loss values are at best just comparable
within a single model, over the course of its training.
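As a rough sketch of why that matters (hypothetical settings; parameter names as in gensim 4.x - older releases use `size`/`iter` instead of `vector_size`/`epochs`):

from gensim.models import Word2Vec

# model_b constructs more (context -> target) examples per sentence (wider window)
# and trains more negative-examples per target, so its accumulated loss total is
# not on the same scale as model_a's
model_a = Word2Vec(sentences, vector_size=100, window=5, negative=5,
                   compute_loss=True, epochs=5)
model_b = Word2Vec(sentences, vector_size=100, window=10, negative=15,
                   compute_loss=True, epochs=5)
print(model_a.get_latest_training_loss(), model_b.get_latest_training_loss())
# comparing these two totals says little about which model is better; compare each
# model only against its own earlier readings, or evaluate both on a downstream test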
Is there a reason you can't share the `sentences` output I suggested to
debug your problem? Did you try that at all, and did it lead you to
discover an error you were making that explained the prior atypical loss
behavior?
- Gordon
Post by Heta Saraiya
Thanks
Post by Heta Saraiya
Hi,
I am training a dataset using Word2Vec and saving the training loss
after each epoch. But after some epochs the training loss stops decreasing
and instead increases. Can you give me any idea of why this happens?
Thanks