Denis Candido
2017-09-27 17:54:13 UTC
Hello,
For a few weeks now I've been doing some research with word2vec/doc2vec to
improve the search mechanism for a collection of specific documents (containing
logs, fault codes, descriptions, etc.). The search engine currently in use is
Elasticsearch.
At first the objective was to beat Elasticsearch in all cases, but after some
tests I realized that doc2vec does not perform as well as Elasticsearch in most
of them. Even so, there are specific cases where doc2vec does beat
Elasticsearch, for example when the goal is a contextual match.
So I set myself the objective of using both mechanisms to improve the system;
later I will decide how they will be merged... The main problem here is getting
the training and preprocessing right to achieve good performance.
The dataset consists of ~360k documents.
I'm testing several preprocessing combinations, such as stemming, stopword
filtering, and removing special characters.
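Roughly, the preprocessing options look like this (a minimal sketch assuming
NLTK for the stemmer and stopword list; not my exact code):

import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Illustrative pipeline: lowercase, strip special characters, optionally
# drop stopwords and stem. Requires nltk.download('stopwords') once.
stemmer = SnowballStemmer("english")
stop_set = set(stopwords.words("english"))

def preprocess(text, stem=True, remove_stopwords=True):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # filter special characters
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stop_set]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens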
For the hyperparameters, I'm using random search to look for a good
combination.
These are the combinations that I'm using:
hyperparams = {
'size': [100, 200],
'min_count': [1, 2, 3, 4, 5],
'iter': [50, 100, 150],
'window': [4, 5, 6, 7, 8],
'alpha': [0.025, 0.01, 0.05],
'min_alpha': [0.025, 1e-4],
}
Of all the possible combinations generated from these parameters, I test 50 of
them, unique and randomly selected.
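Roughly how I sample those 50 unique combinations (a sketch of the idea, not
necessarily my exact code):

import itertools
import random

def sample_hyperparams(grid, n=50, seed=0):
    # Expand every combination of the listed values, then draw a random
    # sample of n unique combinations from them.
    keys = list(grid.keys())
    all_combos = [dict(zip(keys, values))
                  for values in itertools.product(*(grid[k] for k in keys))]
    random.seed(seed)
    return random.sample(all_combos, min(n, len(all_combos)))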
The train function:
import gensim

def start_training(hyperparams, train_corpus):
    model = gensim.models.doc2vec.Doc2Vec(
        size=hyperparams['size'],
        min_count=hyperparams['min_count'],
        iter=hyperparams['iter'],
        window=hyperparams['window'],
        alpha=hyperparams['alpha'],
        min_alpha=hyperparams['min_alpha'],
        workers=4)
    print("Building vocabulary")
    model.random.seed(0)  # fixed seed so runs stay comparable
    model.build_vocab(train_corpus)
    print("Training the model")
    print(model)
    model.train(train_corpus,
                total_examples=model.corpus_count,
                epochs=model.iter)
    return model
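In case it's relevant, train_corpus is a collection of TaggedDocument objects,
built roughly like this (a sketch; preprocess() is the illustrative helper from
above, and tagging each document with its file name is just an example):

import os
from gensim.models.doc2vec import TaggedDocument

def build_corpus(docs_dir):
    # One TaggedDocument per file, tagged with its file name and tokenized
    # with the same preprocessing used everywhere else.
    corpus = []
    for fname in os.listdir(docs_dir):
        with open(os.path.join(docs_dir, fname), encoding='utf-8') as f:
            corpus.append(TaggedDocument(words=preprocess(f.read()), tags=[fname]))
    return corpus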
The evaluation method consists of searching for specific documents using a
piece of text that I'm sure is linked to each document.
The higher the target document is ranked for that text, the better.
For an accuracy rate I compute a weighted average based on the rank.
Unfortunately there are only a few evaluation files (about 15).
The evaluation function:
import os

def eval_model(model, eval_dir, hyperparams):
    ranked_eval = {}
    correct = 0
    eval_files_list = os.listdir(eval_dir)
    for file in eval_files_list:
        eval_file = eval_dir + file
        words_vec = get_word_vec(eval_file)  # preprocessed token list of the query text
        model.random.seed(0)
        steps = hyperparams['iter'] + 50
        inferred_vector = model.infer_vector(words_vec,
                                             alpha=hyperparams['alpha'],
                                             min_alpha=hyperparams['min_alpha'],
                                             steps=steps)
        similars = model.docvecs.most_similar([inferred_vector],
                                              topn=len(model.docvecs))
        target = eval_file[-18:]  # tag of the document this query should retrieve
        for i, sim in enumerate(similars):
            if sim[0] == target:
                print(file, "found in position", i)
                ranked_eval[file] = i
                # weighted credit depending on how high the target is ranked
                if i == 0:
                    correct += 1
                elif 1 <= i < 5:
                    correct += 0.9
                elif 5 <= i < 10:
                    correct += 0.7
                elif 10 <= i < 20:
                    correct += 0.4
                elif 20 <= i < 50:
                    correct += 0.2
                break
    accuracy_rate = (correct / len(eval_files_list)) * 100
    return accuracy_rate, ranked_eval
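And roughly how the random search ties the pieces together (a sketch; the paths
and the sample_hyperparams()/build_corpus() helpers above are illustrative):

train_corpus = build_corpus('./documents/')
results = []
for params in sample_hyperparams(hyperparams, n=50):
    model = start_training(params, train_corpus)
    accuracy, ranks = eval_model(model, './eval_files/', params)
    results.append({**params, 'accuracy_rate': accuracy, 'ranks': ranks})

# Keep the combination with the best accuracy rate
best = max(results, key=lambda r: r['accuracy_rate'])
print(best)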
It's important to note that I preprocess the "input text" of the eval files the
same way as the training set.
I *strongly* believe that Doc2Vec can bring real improvements to the search
engines most commonly used today, but it's still not producing satisfactory
results.
Remember that the objective is not to beat the Elasticsearch indexing
algorithm, but to complement it. So the idea is not to achieve great results in
every case, but at least to find good documents that Elasticsearch can't find.
For example, this is the best combination I've found so far:
{
    'size': 100,
    'min_count': 4,
    'iter': 100,
    'window': 4,
    'alpha': 0.025,
    'min_alpha': 0.025,
    'accuracy_rate': 34.285714285714285,
    'model_file': './trained_models/test09_360k/test09_360krandom_21.model'
}
{   # These are the ranked positions of the target documents. The lower, the better.
    'doc1.txt': 1567,
    'doc2.txt': 396,
    'doc3.txt': 10929,
    'doc4.txt': 3,
    'doc5.txt': 3,
    'doc6.txt': 0,
    'doc7.txt': 70868,
    'doc8.txt': 2334,
    'doc9.txt': 486,
    'doc10.txt': 0,
    'doc11.txt': 30569,
    'doc12.txt': 1571,
    'doc13.txt': 2088,
    'doc14.txt': 0
}
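About merging the two engines later: one simple option I'm considering is a
reciprocal-rank-fusion style combination of the two result lists, roughly like
this (a sketch only; the function and variable names are illustrative):

def fuse_rankings(es_ranking, d2v_ranking, k=60):
    # Reciprocal rank fusion: each document scores 1/(k + rank) in each
    # list, so documents found by only one engine still get some credit.
    scores = {}
    for ranking in (es_ranking, d2v_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)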
*Do you have any suggestions for improving this?* Maybe searching over other
hyperparameters (e.g. the negative sampling values), different preprocessing
schemes, or other ranges of hyperparameters.
*Am I doing something wrong?* Any opinion will help and is very motivating...
Thanks,
Denis.