Discussion:
[gensim:11650] doc2vec optimal parameters for topic modeling task
Yue
2018-10-05 04:35:57 UTC
Hi, I have a little more than *70,000 science articles* (typically 4 pages
long) in text files, each consisting of about *3,000 tokens* on average.
I also have another 900 target articles (also on science) of similar
length, and for each of these target articles I want to find the several
most similar articles among the 70,000. So the task is essentially finding
articles whose topics are similar to a target article's and matching them up.

I approached this task by first training a doc2vec model on the 70,000
articles, and then using .infer_vector() and .most_similar() on each of the
target articles. To check the results, I sampled some of the matches and
read the articles that got paired up. The results certainly made sense.
However, since there is no ground truth for the best matches and I can only
judge quality by sampling a few results, I wonder whether the parameters I
used for model training are optimal.

The parameters I used are:

{'dm': 0, 'vector_size': 300, 'window': 3, 'alpha': 0.05, 'min_alpha': 0.025,
 'min_count': 15, 'workers': 12, 'epochs': 30, 'hs': 0, 'negative': 5,
 'dbow_words': 1}
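
In case it helps, here is a simplified sketch of what I'm doing (the file
paths and the whitespace tokenization are just placeholders for my real code):

    import glob
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # build the training corpus from the 70,000 article text files
    train_docs = []
    for path in glob.glob('articles/*.txt'):               # placeholder path
        with open(path, encoding='utf8') as f:
            tokens = f.read().split()                      # real tokenization is more involved
        train_docs.append(TaggedDocument(words=tokens, tags=[path]))

    model = Doc2Vec(train_docs, dm=0, vector_size=300, window=3,
                    alpha=0.05, min_alpha=0.025, min_count=15, workers=12,
                    epochs=30, hs=0, negative=5, dbow_words=1)

    # for each target article: infer a vector, then look up nearest training docs
    with open('target_article.txt', encoding='utf8') as f:  # placeholder
        target_tokens = f.read().split()
    inferred = model.infer_vector(target_tokens)
    print(model.docvecs.most_similar([inferred], topn=5))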


I am particularly interested in window, vector_size, and epochs (though
suggestions on other parameters are welcome as well). I saw that for window,
people usually use 5. My reason for using 3 is that for science articles the
topics are determined mostly by the key terms, and less by the structure of
the sentences, so a smaller window should best capture the key terms.
Another important reason is that the text files I have for the articles were
converted from PDFs. After the conversion, the order of the sections in the
original articles changed a lot. For example, the Introduction section may
come before the Abstract section, some paragraphs may end up intertwined,
and so on. Besides, lots of equations and messy symbols that don't really
help determine the topics are also in the text. So while the word order
within sections and paragraphs is mostly retained, the section and paragraph
order may not be. Thus I thought that a bigger window would cover words from
far-away paragraphs and make the context incoherent.
As for vector_size and epochs, are they suitable for 70,000 articles that
are typically 3,000 tokens long?


I also did this task on the abstracts of both the training articles and the
target articles. The abstracts are typically *140 tokens* long. For abstract
matching, the parameters I used are:
{'dm': 0, 'vector_size': 100, 'window': 5, 'alpha': 0.05, 'min_alpha': 0.025,
 'min_count': 2, 'workers': 12, 'epochs': 30, 'hs': 0, 'negative': 5,
 'dbow_words': 1}

For abstracts there are no issues of section or paragraph order, so I used a
bigger window. And given that the abstracts are much shorter, I used a
smaller vector size. I also decreased the min_count parameter for the
abstracts. Do these parameters make sense? I would really appreciate it if
someone could provide some insight.



Thank you,
Yue
Gordon Mohr
2018-10-06 04:27:00 UTC
Ultimately the best parameters are whatever works on your data/task. It's
always best if you can devise a consistent repeatable way of scoring a
model, so that you can automate trials across a larger range of parameters.

With even just a few crude models whose parameters were left at the
defaults, or tuned by eyeballing, you could then compare the top-N results
of different models, manually review them, and store the results for future
automated tests. For example, for one query-doc, look at the top-5
most-similar results for two contrasting models. Of those up-to-10 unique
results, hand-pick the best 5. Now you have a bunch of fixed quality
assertions to test against later models: "for this query-doc, each of these
5 docs should rank higher than these other reviewed-but-unchosen docs that
I, the reviewer, thought weren't as good". The more you do this, the better
your automated score may get, and the more likely it will mimic what
similar users might want.
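
As a rough illustration of how such fixed assertions could then score any
candidate model – the `judgments` structure and the doc tags below are made
up for the example, and the query tags are assumed to be tags in the trained
model:

    # hand-built judgments: for each query doc-tag, the docs judged better
    # should outrank the docs that were reviewed but not chosen
    judgments = {
        'query_doc_1': {'better': ['doc_a', 'doc_b'], 'worse': ['doc_x', 'doc_y']},
        # ...more reviewed query-docs...
    }

    def score_model(model, judgments, topn=100):
        """Return the fraction of (better, worse) pairs the model ranks correctly."""
        correct = total = 0
        for query_tag, pairs in judgments.items():
            ranked = [tag for tag, _ in model.docvecs.most_similar(query_tag, topn=topn)]
            rank = {tag: i for i, tag in enumerate(ranked)}
            worst = len(ranked)          # anything not returned ranks last
            for good in pairs['better']:
                for bad in pairs['worse']:
                    total += 1
                    if rank.get(good, worst) < rank.get(bad, worst):
                        correct += 1
        return correct / total if total else 0.0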

(And once you have a deployed system, actual user behavior – which results
they click on, which they quickly come back from to click elsewhere or
reformulate their query around, etc – may also generate hints about which
docs should rank highly for which queries.)

The original 'Paragraph Vector' papers also used pre-existing human-curated
categorization systems – Wikipedia categories, or categories of Arxiv
papers – as a source of automated evaluation data. They picked random
triplets of docs – two in the same category, one not – and gave a model a
point each time it reported the same-category docs as closer to each other
than to the third doc. You might also imagine taking real docs and splitting
them in half, or into fake docs each containing every other word of the
original doc – and testing how often a model places the two synthetic
docs derived from the same real doc closer to each other than to other
random docs. Such tricks won't necessarily simulate real user impressions
of relatedness well, but they can create lots of repeatable scoring trials
that *might* approximate user impressions.
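
A rough sketch of the split-in-half variant, just to show the shape of such
a test – the helper and its arguments are made up, and `tokenized_docs` is
assumed to be a list of token lists:

    import random
    import numpy as np

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def split_doc_score(model, tokenized_docs, n_trials=200, n_distractors=5):
        """Split random docs in half; count how often the two halves' inferred
        vectors are closer to each other than to random distractor docs."""
        hits = 0
        sample = random.sample(tokenized_docs, min(n_trials, len(tokenized_docs)))
        for tokens in sample:
            mid = len(tokens) // 2
            v1 = model.infer_vector(tokens[:mid])
            v2 = model.infer_vector(tokens[mid:])
            distractors = [model.infer_vector(random.choice(tokenized_docs))
                           for _ in range(n_distractors)]
            if cos(v1, v2) > max(cos(v1, d) for d in distractors):
                hits += 1
        return hits / len(sample)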

Regarding your reported setup:

- while `infer_vector()` is useful for docs not available at training time,
or for evaluating some held-out set as an estimate of how well the model
will work on an unbounded stream of future docs, if you really just need to
map 900 "unknown" docs against an existing literature of 70,000 docs, you
could include the 900 docs in your bulk training as well (see the sketch
after this list).

- your corpus and doc sizes are similar to those used in other published
`Doc2Vec` work, so it's reasonable to try this algorithm.

- some work suggests, at least in the word-vector case, that larger
`window` values tend to group vectors by topical-domain, while smaller ones
group words by syntactic interchangeability. But, as with other parameters,
it's best to test by rigorous scoring, because intuitions about when
narrow-vs-wide windows help can be misleading.

- if the doc-vectors are the main thing – and you don't separately need
word-vectors – you can leave `dbow_words` at its default of 0, making `dm=0`
pure "PV-DBOW" mode. Training will be faster – the `window` value is then
irrelevant – and the doc-vectors might be just as good or better,
especially for short texts like the abstracts.

- the most typical `epochs` values used in published work are 10 or 20,
though if more epochs help in your tests, by all means use larger values.

- while using a larger-than-default initial `alpha` learning rate, as
you've done with the choice of 0.05, is often done and might help in some
modes, it's very atypical to set the ending `min_alpha` so high. Usual
stochastic gradient descent ends training at a tiny, near-negligible
learning rate (the default `min_alpha` is 0.0001); you probably don't want
to change that unless you have proof your nonstandard choice is helping.

- you might want to train the abstracts & full articles together (see the
sketch after this list) – it's possible the abstracts would get stronger
vectors if based on a model backed by the larger corpus. (And, the vectors
for each would then be the same size and "in the same space", and thus
directly comparable.)
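
To make a few of those points concrete, here's a rough sketch of training
one pure PV-DBOW model over the full articles, the 900 targets, and the
abstracts together, with `alpha`/`min_alpha` left at their defaults. The
variable names and tag scheme are only illustrative, not a prescription:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # assume full_articles, target_articles and abstracts are lists of
    # (doc_id, token_list) pairs built elsewhere; the tag scheme is illustrative
    corpus = [TaggedDocument(words=tokens, tags=[doc_id])
              for doc_id, tokens in full_articles + target_articles + abstracts]

    model = Doc2Vec(corpus,
                    dm=0,           # PV-DBOW
                    dbow_words=0,   # no word-vector training; `window` then irrelevant
                    vector_size=300,
                    min_count=15,
                    epochs=20,
                    negative=5,
                    workers=12)     # alpha / min_alpha left at the library defaults

    # every doc – the 70,000 articles, the 900 targets, and the abstracts –
    # now has a trained vector in the same space, so they're directly comparable
    print(model.docvecs.most_similar('some_target_doc_id', topn=5))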

You're in a reasonable neighborhood for everything; it's just a matter of
tinkering & iteratively evaluating!

- Gordon
Yue
2018-10-06 05:15:53 UTC
Thank you Gordon. Very inspiring!