Yue
2018-10-05 04:35:57 UTC
Hi, I have a little more than *70,000 science articles* (typically 4 pages
in length) in text files, each of which consists of *3,000 tokens* on average.
I also have another 900 target articles (also on science) of similar
length, and for each of these target articles I want to find the few most
similar articles among the 70,000. So the task is to find articles whose
topics are similar to those of a target article and match them up.
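Conceptually, "finding the most similar articles" just means ranking the 70,000 document vectors by cosine similarity against each target article's vector. A minimal numpy sketch of that ranking step, with small random vectors standing in for the actual doc2vec output:

```python
import numpy as np

rng = np.random.default_rng(0)
article_vecs = rng.normal(size=(70, 300))  # stand-in for the 70,000 article vectors
target_vec = rng.normal(size=300)          # stand-in for one inferred target vector

# Cosine similarity = dot product of L2-normalized vectors.
a = article_vecs / np.linalg.norm(article_vecs, axis=1, keepdims=True)
t = target_vec / np.linalg.norm(target_vec)
sims = a @ t

# Indices of the 5 most similar articles, best match first.
top_k = np.argsort(sims)[::-1][:5]
print(top_k, sims[top_k])
```

This is essentially what .most_similar does internally, just without the vocabulary and tag bookkeeping.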
I approached this task by first training a doc2vec model on the 70,000
articles, and then calling .infer_vector and .most_similar on each of the
target articles. To test the results, I sampled the matches and read some
of the articles that got paired, and the results certainly made sense.
However, since there is no ground truth for the best matches and I could
only judge the quality by sampling a few of them, I wonder whether the
parameters I used for model training are optimal.
The parameters I used are:
{'dm': 0, 'vector_size': 300, 'window': 3, 'alpha': 0.05, 'min_alpha': 0.025,
 'min_count': 15, 'workers': 12, 'epochs': 30, 'hs': 0, 'negative': 5,
 'dbow_words': 1}
I am particularly interested in window, vector_size, and epochs
(suggestions on other parameters are welcome too). I saw that people
usually use 5 for window. My reason for using 3 is that for science
articles, the topics are determined mostly by the key terms and much less
by sentence structure, so a smaller window should capture the key terms
better. Another important reason is that my text files were converted from
PDFs, and the conversion shuffled the order of the sections quite a bit.
For example, the Introduction section may come before the Abstract, and
some paragraphs end up intertwined. In addition, the text contains lots of
equations and stray symbols that do not really help determine the topic.
So while the word order within sections and paragraphs is mostly retained,
the section and paragraph order may not be. I therefore thought that a
bigger window would pull in words from far-away paragraphs and make the
context incoherent.
As for vector_size and epochs, are they suitable for 70,000 articles that
are typically 3,000 tokens long?
I also ran this task on the abstracts of both the training articles and
the target articles. The abstracts are typically *140 tokens* long. For
matching abstracts, the parameters I used are:
{'dm': 0, 'vector_size': 100, 'window': 5, 'alpha': 0.05, 'min_alpha': 0.025,
 'min_count': 2, 'workers': 12, 'epochs': 30, 'hs': 0, 'negative': 5,
 'dbow_words': 1}
For abstracts there are no issues of section or paragraph order, so I used
a bigger window. And since the abstracts are much shorter, I used a
smaller vector size and also decreased min_count. Do these parameters make
sense? I would really appreciate it if someone could provide some
insights.
Thank you,
Yue
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.