Thoughts:
* it'd still be useful to me and others to know what online guide was the
original basis for your approach
* it's not clear what you mean by "nonsensical"; typically if enough
training is happening, supplying the same words to `infer_vector()` as one
of the training-set documents will return that same document's unique
ID-tag as the top hit, or one of the top hits
* a usual `iter` (aka `epochs`) value in published Doc2Vec work is 10-20
(and sometimes more); the 5 you're now using makes for very short bulk
training, and very short inference for `infer_vector()`; if your
documents are short this will hurt even more
* without seeing what your `results_gen` is, it's unclear whether the
word-tokens are of the proper form; the loop as currently shown uses the exact same
`label1` and `label2` for *all* `TaggedDocuments`, meaning the final model
would only know two unique tags, which have essentially been trained with
just one mega-document (making for a worthless model without any idea of
many-document contrasts)
* using more-than-one tag per text is essentially an advanced/experimental
technique; I wouldn't recommend adding that complication until after having
a more basic setup working
* all your deviations from defaults seem unmotivated:
** 600 dimensions could be overkill
** `dm=0, window=10, dbow_words=1` mean word-vectors are getting about 10x
more training-attention than doc-vectors (which could hurt their quality)
** a smaller `min_count` than the default can hurt; with a corpus of 2
million docs and possibly a very-large vocabulary it'd be more common to
use a larger-than-default `min_count` to improve results than a
smaller-than-default value.
So my suggestion would be to keep the model very simple until you're
starting to see useful results, and only then tinker with things like
larger vectors, larger windows, different min_counts,
multiple-tags-per-doc, etc.
For example:
# ...fix creation of `tagged` list to ensure it has 2 million docs, each with one unique ID-tag...
model = Doc2Vec(tagged, dm=0, iter=20, workers=4)  # PV-DBOW, leave everything else default
# ...then run your ad-hoc doc-vector quality probes...
Unless there's something seriously weird or broken in your corpus, looking
for the `most_similar()` of an `infer_vector()` for a text that was in the
training set should return the same text's id as one of the very-top hits.
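A minimal sketch of that sanity check might look like this (the `docs` list of (unique_id, text) pairs is just a stand-in for however your corpus is loaded; the model parameters are the simple ones suggested above):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# docs: stand-in for your corpus, as (unique_id, text) pairs
tagged = [TaggedDocument(words=text.split(), tags=[doc_id]) for doc_id, text in docs]
model = Doc2Vec(tagged, dm=0, iter=20, workers=4)  # PV-DBOW, everything else default

# probe with a text that was in the training set
probe_id, probe_text = docs[0]
inferred = model.infer_vector(probe_text.split())
print(model.docvecs.most_similar(positive=[inferred], topn=10))  # expect probe_id at or near the top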
- Gordon
Post by d***@gmail.com
Hi,
I updated my code per Gordon's request, but unfortunately the results
are still "nonsensical".
I realize there is some guesswork involved in determining the optimal
hyper-parameters, but alas I only have so much time to hunt for such things.
model = Doc2Vec(size=600, window=10, min_count=3, alpha=0.025, min_alpha=0.001,
                dm=0, iter=5, dm_mean=1, dbow_words=1, workers=NUM_WORKERS)
tagged = []
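# (in the original post, the lines below presumably ran inside a loop over `results_gen`,
#  with a try/except producing the traceback/ERROR prints; the loop itself wasn't quoted here)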
text = result['text']
tdoc = TaggedDocument(words=text.split(), tags=[label1,label2])
tagged += [tdoc]
print("Tagged Doc["+str(i)+"]): "+result['_id'])
print(traceback.format_exc())
print("ERROR")
model.build_vocab(tagged)
model.train(tagged, epochs=model.iter, start_alpha=0.025, end_alpha=0.001,
            total_examples=model.corpus_count)
model.save("doc2vec_spec.model")
print("Saved.")
-----EVALUATE
text = """....."""
model = Doc2Vec.load("doc2vec_spec.model")
print("Done.")
print(len(model.wv.vocab)," unique words.")
vector = model.infer_vector(text.split())
print("VECTOR:",vector)
#print("MOST SIMILAR:",model.most_similar(positive=[vector], topn=10))
docvecs = model.docvecs.most_similar(positive=[vector], topn=10)
print(docvecs)
------
So far I can't get any logical or expected results that a simple search
engine (Solr, Lucene, etc.) would give.
My purpose here is to evaluate gensim Doc2Vec to see how its out-of-the-box
similarity stacks up to basic search (which we are moving away from).
Darren
Post by Gordon Mohr
That is a severely-outdated example that should be corrected, removed, or
further disclaimered (beyond the easy-to-miss notice at the top to consult
the improved tutorial notebook).
But that code also won't run without extra changes to how `train()` is
called, and doesn't choose contradictory numbers that result in a negative
learning-rate.
So I believe the actual path of this antipattern to most people is via some
other intermediate source, which is why I'm still asking what actual online
sources were used as a model.
- Gordon
Post by Rajat Mehta
Hi Gordon,
https://rare-technologies.com/doc2vec-tutorial/
Thanks,
Rajat
Post by Gordon Mohr
You do 20 loops of decreasing the learning-rate `alpha` by 0.002, so a
total decrease of 0.04. But you only start at an alpha of 0.025. So many of
your training passes will occur with a nonsensical negative `alpha` value,
meaning the model for those passes is trying to increase its training-loss.
`train()` itself can manage `alpha` properly for you, with a single
call. It's not recommended to call it more than once in your own loop, unless
you're sure you know why you need to.
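For example (a sketch only; the particular size/iter values here are illustrative, and `alpha`/`min_alpha` are left at their defaults so `train()` can decay the learning-rate itself):

model = Doc2Vec(size=100, dm=0, iter=20)  # illustrative settings; alpha/min_alpha at defaults
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.iter)  # one call; alpha decays internally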
What online guide/example did you use as the template for this code?
I'd like to get it corrected - because it's a nonsensical pattern, but
wherever it's still appearing, a lot of people are copying it and making
similar mistakes.
- Gordon
Post by d***@gmail.com
Gordon,
Thanks for the reply. My entire code is here. I will make some
adjustments per your advice and see how it turns out.
Darren
model = Doc2Vec(size=10, hs=1, alpha=0.025, min_alpha=0.025,
                min_count=3, dm=0)  # use fixed learning rate
tagged = []
docs = {}
i = 0
tdoc = TaggedDocument(words=result['_source']['claims'].split(),
                      tags=[result['_source']['appId'], result['_source']['pgpub']])
tagged += [tdoc]
i += 1
print("Tagged Doc["+str(i)+"]): "+result['_id'])
print(traceback.format_exc())
print("ERROR")
print(len(tagged))
model.build_vocab(tagged)
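# (per Gordon's reply above, the train/alpha-decrement lines below ran inside a loop of
#  20 epochs; the `for epoch in ...` line itself wasn't quoted here)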
print("EPOCH: ",epoch)
model.train(tagged, epochs=model.iter, total_examples=model.corpus_count)
model.alpha -= 0.002 # decrease the learning rate
model.min_alpha = model.alpha # fix the learning rate, no decay
import json
model.save("doc2vec_docs.model")
print("Saved.")
Post by Gordon Mohr
There are likely other issues with your not-shown preprocessing code,
training code, testing code, or text corpus.
The metaparameters shown in the single line of code shared raise a few concerns:
* typical vector-sizes are 100 dimensions or more. (Smaller sizes may
be helpful for tiny datasets, but these algorithms generally require larger
datasets to show their advantages, so 'toy-sized' examples will rarely be
satisfying.)
* why choose non-default `hs=1`, `min_count=3`, `min_alpha=0.025`?
That last choice is especially nonsensical: the usual form of training
requires the learning-rate to decrease, not stay constant, over training.
Enable logging at the INFO level. Watch the log output for warnings
or confirmation that the expected amount of training, on the right number
of words/texts, is occurring. If you're still having problems, show the full
code that performs the training & testing, and describe your training
corpus in more detail (such as number of documents & typical
words-per-document).
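For example, the usual way to enable that logging, before building the model:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)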
- Gordon
On Thursday, November 8, 2018 at 10:16:38 AM UTC-5,
Post by d***@gmail.com
Hi,
I loaded about 2 million documents into Doc2Vec using this:
model = Doc2Vec(size=10, hs=1, alpha=0.025, min_alpha=0.025, min_count=3, dm=0)
and trained for 20 epochs.
I use a large portion of one of the document texts with infer_vector
and then pass that vector to
docvecs.most_similar
and the top 10 results literally have nothing in common with the
submitted text. Even the document
it came from didn't show up in the results.
I realize people will clean the text, strip stopwords, stem, etc.
but I'm just trying to get back a known document and
it doesn't do that.
What am I missing here?