Discussion:
[gensim:11845] Document similarity with Wikipedia or Google news
Jay Qadan
2018-12-01 02:53:34 UTC
I am trying to use this example, Doc2vec-wikipedia
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb>, but
with similarity queries against an arbitrary document, like the news article in the attached
sample. Due to computational challenges, I used 'text8' instead of the full
Wikipedia dump, using the gensim api.load("text8"):


- Is this the best approach (doc2vec) to find document similarity with a
large corpus? Any suggestion if there is a better method to get similarities
based on topic rather than similar words?
- As suggested, I used this code to look up similarity, keeping in mind that
I use a larger number of words than just 'machine', 'learning':
- print(model.docvecs.most_similar(positive=[model.infer_vector(['machine','learning'])],
topn=20))
- However, the result I get is in this format: (502,
0.5730128288269043), (94, 0.5560649633407593), (187, 0.5478538870811462),
not article titles as in the original example. Any suggestions on how to get
the article titles in the 'text8' corpus?
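
For reference, this is roughly the code I am running end-to-end (a sketch only; the
sample news-article tokens and the training parameters are placeholders):

import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# 'text8' loads as an iterable of token lists with no titles, so the only
# available document tags are running integers.
corpus = api.load("text8")
docs = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]

model = Doc2Vec(vector_size=100, min_count=5, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a tokenized news article and look up the nearest documents.
news_tokens = ['machine', 'learning', 'helps', 'newsrooms', 'sort', 'articles']
print(model.docvecs.most_similar(positive=[model.infer_vector(news_tokens)], topn=20))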
Gordon Mohr
2018-12-01 06:20:46 UTC
'text8' is just bulk text from part of Wikipedia, concatenated together for
compression tests. It has lost the article boundaries and titles, and further
may only cover "early" articles in some sorted collection of articles.
These factors make it almost entirely useless for meaningful doc-vector
training. (The fact that gensim's handling breaks it into manageably-sized
lines creates pseudo-documents, and because these lines often contain long
runs of sentences from individual articles, the doc-vectors may have some
slight topical power. But the doc-ids will still just be line numbers.)

You'd have to work with a better dump, where the documents are per-article
and the tags are article titles, to get more meaningful results back. (If
the full dataset is too large, discarding short articles and truncating
larger articles to a few hundred or thousand words might help make
training memory/time requirements more manageable.)
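
As a rough illustration of what I mean (a sketch only: it assumes the notebook's
enwiki pages-articles dump, the min/max word thresholds are arbitrary, and
depending on your gensim version get_texts() may yield bytes rather than str tokens):

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import TaggedDocument

wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
wiki.metadata = True  # make get_texts() also yield (page_id, title)

MIN_WORDS, MAX_WORDS = 50, 1000  # arbitrary thresholds

def tagged_articles(wiki_corpus):
    for tokens, (page_id, title) in wiki_corpus.get_texts():
        tokens = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
        if len(tokens) < MIN_WORDS:
            continue  # discard very short articles
        # truncate long articles; tag each document with its article title
        yield TaggedDocument(tokens[:MAX_WORDS], [title])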

- Gordon
Jay Qadan
2018-12-01 13:31:06 UTC
Suppose I want to train on only the titles of the Wikipedia dump, so that matching is against the article titles rather than the whole content. How would I do that in gensim?
Gordon Mohr
2018-12-01 17:28:27 UTC
You could find a list of all article titles, and use each title as both the
tokenized `words` of a document, and that document's single string `tag`.

But I wouldn't expect very good results from such an approach. Doc2Vec
works better with documents that are at least a few dozen words – and
documents that are just article titles would often be just 1-4 words.

Using the first few dozen to hundreds of words from each article would
likely work better, or some sort of abstract/summary of the articles.

I don't know the current format/quality of the abstracts available at
<https://dumps.wikimedia.org/enwiki/latest/>, but those might work.

Alternatively, there's a Wikipedia API call that gets back a "summary" of
an article (typically the first paragraph before any other named sections):
<https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title>.
You'd likely want to be careful if making bulk requests against this: make
requests at a measured pace, handle transient errors, and save the
results for reuse to avoid redundant requests.
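
A minimal sketch of that kind of careful fetching (the cache directory, delay,
and retry count are arbitrary placeholders; the endpoint is the REST summary
URL linked above):

import json, os, time
import requests

SUMMARY_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"
CACHE_DIR = "wiki_summaries"  # keep fetched results so re-runs don't re-request
os.makedirs(CACHE_DIR, exist_ok=True)

def fetch_summary(title, delay=1.0, retries=3):
    cache_path = os.path.join(CACHE_DIR, title.replace("/", "_") + ".json")
    if os.path.exists(cache_path):  # reuse a previously saved result
        with open(cache_path) as f:
            return json.load(f)
    for attempt in range(retries):
        url = SUMMARY_URL.format(requests.utils.quote(title.replace(" ", "_")))
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            data = resp.json()
            with open(cache_path, "w") as f:
                json.dump(data, f)
            time.sleep(delay)  # measured pace between successful requests
            return data
        time.sleep(delay * (attempt + 1))  # back off on transient errors
    return None

# e.g. text = (fetch_summary("Machine learning") or {}).get("extract", "")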

- Gordon
Jay Qadan
2018-12-02 10:44:50 UTC
Thanks Gordon. On your suggestion of truncating larger articles, how do I
achieve that? Suppose I want to download the "wiki-english-20171001" dataset
with api.load("wiki-english-20171001"): how would I truncate it?
Gordon Mohr
2018-12-02 19:36:41 UTC
You could follow the example of the doc2vec-wikipedia notebook
(https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb)
to the point of getting the Wikipedia data, but then write the items you
get back from `get_texts()` to an interim file of title & tokens –
discarding tokens in excess of some threshold before writing. (This
one-time process could also discard too-small articles.)

Then, read that file back into a new corpus-iterator to do your training.
On the downside, you'd still have to download and scan the full dump once.
On the upside, the truncated file may be much faster to re-iterate over for
multiple training passes – as it's now just the titles & plain text, rather
than the original XML dump.
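
A rough sketch of that two-step approach (file names, thresholds, and the
tab-separated interim format are all placeholder choices, and as before the
exact get_texts() output depends on your gensim version):

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

MIN_WORDS, MAX_WORDS = 50, 1000  # arbitrary thresholds

# Step 1 (one-time): scan the dump, write one "title<TAB>space-joined tokens" line per article.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
wiki.metadata = True  # get_texts() yields (tokens, (page_id, title))
with open("wiki_truncated.txt", "w", encoding="utf-8") as out:
    for tokens, (page_id, title) in wiki.get_texts():
        tokens = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
        if len(tokens) < MIN_WORDS:
            continue  # drop too-small articles
        out.write(title.replace("\t", " ") + "\t" + " ".join(tokens[:MAX_WORDS]) + "\n")

# Step 2: a lightweight corpus iterator over the interim file, for repeated training passes.
class TruncatedWikiCorpus(object):
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                title, text = line.rstrip("\n").split("\t", 1)
                yield TaggedDocument(text.split(), [title])

corpus = TruncatedWikiCorpus("wiki_truncated.txt")
model = Doc2Vec(vector_size=200, min_count=5, epochs=10, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)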

Alternatively, look into the abstracts-download or per-article summary
downloading I'd mentioned in the previous message.

I wouldn't recommend using `api.load()` for anything you could reasonably
do yourself; it hides steps/details in unhelpful ways.

- Gordon
Benedict Holland
2018-12-03 16:33:53 UTC
Use cosine similarity. Doc2vec gives word embeddings, not document similarity.
There are a variety of extensions using cosine similarity, like incorporating a
distance to important words.

Thanks,
~Ben
Gordon Mohr
2018-12-03 23:18:57 UTC
All the existing similarity methods on Word2Vec/Doc2Vec/KeyedVectors
already use cosine similarity.

Doc2Vec will always train vectors for the document tags provided, but only
train word-vectors in some modes. So Doc2Vec gives doc-embeddings that can
be used for document-similarity for sure, but only sometimes gives useful
word-embeddings.
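
For example (sketching against an already-trained model; the query tokens are
placeholders), the scores returned by most_similar() are just cosine similarities
and can be reproduced by hand:

import numpy as np

query = model.infer_vector(['machine', 'learning'])
tag, score = model.docvecs.most_similar(positive=[query], topn=1)[0]

doc_vec = model.docvecs[tag]
cosine = np.dot(query, doc_vec) / (np.linalg.norm(query) * np.linalg.norm(doc_vec))
print(tag, score, cosine)  # score and the hand-computed cosine should match closely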

- Gordon