Denis Candido
2017-09-27 17:54:13 UTC
Hello,
For a few weeks now I've been doing some research with word2vec/doc2vec to
improve the search mechanism for a collection of specific documents (containing
logs, fault codes, descriptions, etc.). The search engine currently in use is
Elasticsearch.
At first the objective was to beat Elasticsearch in all cases, but after some
tests I realized that doc2vec does not perform as well as Elasticsearch in most
of them. Even so, there are specific cases where doc2vec does beat
Elasticsearch, for example when the goal is a contextual match.
So I set myself the objective of using both mechanisms to improve the system;
later I will decide how they will be merged... The main problem here is getting
the training and preprocessing right to achieve good performance.
The dataset consists of ~360k documents.
I'm testing several preprocessing combinations, such as stemming, stopword
filtering, and removing special characters.
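Roughly, the preprocessing options look like this (a minimal sketch assuming
NLTK for the stemmer and stopword list; not my exact code):

import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Illustrative pipeline: lowercase, strip special characters, optionally
# drop stopwords and stem. Requires nltk.download('stopwords') once.
stemmer = SnowballStemmer("english")
stop_set = set(stopwords.words("english"))

def preprocess(text, stem=True, remove_stopwords=True):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # filter special characters
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stop_set]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens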
For the hyperparameters, I'm using random search to look for a good
combination.
These are the combinations that I'm using:
hyperparams = {
'size': [100, 200],
'min_count': [1, 2, 3, 4, 5],
'iter': [50, 100, 150],
'window': [4, 5, 6, 7, 8],
'alpha': [0.025, 0.01, 0.05],
'min_alpha': [0.025, 1e-4],
}
Of all the possible combinations generated from these parameters, I test 50 of
them, unique and randomly selected.
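Roughly how I sample those 50 unique combinations (a sketch of the idea, not
necessarily my exact code):

import itertools
import random

def sample_hyperparams(grid, n=50, seed=0):
    # Expand every combination of the listed values, then draw a random
    # sample of n unique combinations from them.
    keys = list(grid.keys())
    all_combos = [dict(zip(keys, values))
                  for values in itertools.product(*(grid[k] for k in keys))]
    random.seed(seed)
    return random.sample(all_combos, min(n, len(all_combos)))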
The train function:
import gensim

def start_training(hyperparams, train_corpus):
    model = gensim.models.doc2vec.Doc2Vec(
        size=hyperparams['size'],
        min_count=hyperparams['min_count'],
        iter=hyperparams['iter'],
        window=hyperparams['window'],
        alpha=hyperparams['alpha'],
        min_alpha=hyperparams['min_alpha'],
        workers=4)
    print("Building vocabulary")
    model.random.seed(0)  # fixed seed so runs stay comparable
    model.build_vocab(train_corpus)
    print("Training the model")
    print(model)
    model.train(train_corpus,
                total_examples=model.corpus_count,
                epochs=model.iter)
    return model
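In case it's relevant, train_corpus is a collection of TaggedDocument objects,
built roughly like this (a sketch; preprocess() is the illustrative helper from
above, and tagging each document with its file name is just an example):

import os
from gensim.models.doc2vec import TaggedDocument

def build_corpus(docs_dir):
    # One TaggedDocument per file, tagged with its file name and tokenized
    # with the same preprocessing used everywhere else.
    corpus = []
    for fname in os.listdir(docs_dir):
        with open(os.path.join(docs_dir, fname), encoding='utf-8') as f:
            corpus.append(TaggedDocument(words=preprocess(f.read()), tags=[fname]))
    return corpus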
The evaluation method consists of searching for specific documents using a
piece of text that I'm sure is linked to each document.
The higher the target document is ranked for that text, the better.
For an accuracy rate I compute a weighted average based on the rank.
Unfortunately there are only a few evaluation files (about 15).
The evaluation function:
import os

def eval_model(model, eval_dir, hyperparams):
    ranked_eval = {}
    correct = 0
    eval_files_list = os.listdir(eval_dir)
    for file in eval_files_list:
        eval_file = eval_dir + file
        words_vec = get_word_vec(eval_file)  # preprocessed token list of the query text
        model.random.seed(0)
        steps = hyperparams['iter'] + 50
        inferred_vector = model.infer_vector(words_vec,
                                             alpha=hyperparams['alpha'],
                                             min_alpha=hyperparams['min_alpha'],
                                             steps=steps)
        similars = model.docvecs.most_similar([inferred_vector],
                                              topn=len(model.docvecs))
        target = eval_file[-18:]  # tag of the document this query should retrieve
        for i, sim in enumerate(similars):
            if sim[0] == target:
                print(file, "found in position", i)
                ranked_eval[file] = i
                # weighted credit depending on how high the target is ranked
                if i == 0:
                    correct += 1
                elif 1 <= i < 5:
                    correct += 0.9
                elif 5 <= i < 10:
                    correct += 0.7
                elif 10 <= i < 20:
                    correct += 0.4
                elif 20 <= i < 50:
                    correct += 0.2
                break
    accuracy_rate = (correct / len(eval_files_list)) * 100
    return accuracy_rate, ranked_eval
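And roughly how the random search ties the pieces together (a sketch; the paths
and the sample_hyperparams()/build_corpus() helpers above are illustrative):

train_corpus = build_corpus('./documents/')
results = []
for params in sample_hyperparams(hyperparams, n=50):
    model = start_training(params, train_corpus)
    accuracy, ranks = eval_model(model, './eval_files/', params)
    results.append({**params, 'accuracy_rate': accuracy, 'ranks': ranks})

# Keep the combination with the best accuracy rate
best = max(results, key=lambda r: r['accuracy_rate'])
print(best)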
It's important to note that I preprocess the "input text" of the eval files the
same way as the training set.
I *strongly* believe that Doc2Vec can bring real improvements to the search
engines most commonly used today, but it's still not producing satisfactory
results.
Remember that the objective is not to beat the Elasticsearch indexing
algorithm, but to complement it. So the idea is not to achieve great results in
every case, but at least to find good documents that Elasticsearch can't find.
For example, this is the best combination I've found so far:
{
    'size': 100,
    'min_count': 4,
    'iter': 100,
    'window': 4,
    'alpha': 0.025,
    'min_alpha': 0.025,
    'accuracy_rate': 34.285714285714285,
    'model_file': './trained_models/test09_360k/test09_360krandom_21.model'
}
{   # These are the ranked positions of the target documents. The lower, the better.
    'doc1.txt': 1567,
    'doc2.txt': 396,
    'doc3.txt': 10929,
    'doc4.txt': 3,
    'doc5.txt': 3,
    'doc6.txt': 0,
    'doc7.txt': 70868,
    'doc8.txt': 2334,
    'doc9.txt': 486,
    'doc10.txt': 0,
    'doc11.txt': 30569,
    'doc12.txt': 1571,
    'doc13.txt': 2088,
    'doc14.txt': 0
}
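About merging the two engines later: one simple option I'm considering is a
reciprocal-rank-fusion style combination of the two result lists, roughly like
this (a sketch only; the function and variable names are illustrative):

def fuse_rankings(es_ranking, d2v_ranking, k=60):
    # Reciprocal rank fusion: each document scores 1/(k + rank) in each
    # list, so documents found by only one engine still get some credit.
    scores = {}
    for ranking in (es_ranking, d2v_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)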
*Do you have any suggestions for improving this?* Maybe searching over other
hyperparameters (e.g. the negative sampling values), different preprocessing
schemes, or other ranges of hyperparameters.
*Am I doing something wrong?* Any opinion will help and is very motivating...
Thanks,
Denis.