Loreto Parisi
2018-10-08 15:52:23 UTC
I'm using WMD for a text-similarity task on short sentences (tweets).
According to the tutorial here
<http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/>,
the metric does not scale well once documents have more than about 10 tokens
(I assume the practical limit is somewhere around 10-30 tokens, i.e. roughly
tweet length) "because the flow becomes very convoluted between similar words".
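For reference, this is roughly how I compute the distances with gensim (a
minimal sketch; the vector file path and the two example documents are just
placeholders):

from gensim.models import KeyedVectors

# Load pretrained embeddings (placeholder path; any vectors in word2vec
# text format should work).
vectors = KeyedVectors.load_word2vec_format('vectors.vec', binary=False)

# Two short, tweet-like documents, lower-cased and tokenized.
doc1 = "obama speaks to the media in illinois".lower().split()
doc2 = "the president greets the press in chicago".lower().split()

# Word Mover's Distance between the two token lists
# (needs pyemd or POT installed, depending on the gensim version).
print(vectors.wmdistance(doc1, doc2))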
I'm trying to work through the Supervised WMD paper here
<https://papers.nips.cc/paper/6139-supervised-word-movers-distance.pdf>,
but it is not clear to me at which point the flow becomes convoluted. The
paper states that WMD learns the flow matrix T to minimize D(x_i, x_j), so
that documents that share many words (or even related ones) should have
smaller distances than documents with very dissimilar words.
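To check my understanding, this is how I read the WMD definition as a
transportation problem (a small numpy/scipy sketch of my reading, not the
gensim implementation; w1/w2 are the normalized bag-of-words weights of each
document):

import numpy as np
from scipy.optimize import linprog

def wmd(emb1, emb2, w1, w2):
    # emb1: (n1, d) word vectors of doc 1, emb2: (n2, d) word vectors of doc 2
    # w1: (n1,) normalized bag-of-words weights of doc 1 (sums to 1), w2 likewise
    n1, n2 = len(w1), len(w2)
    # Ground cost c(u, v): Euclidean distance between every pair of word vectors.
    cost = np.linalg.norm(emb1[:, None, :] - emb2[None, :, :], axis=2)
    # Decision variables are the entries of the flow matrix T (n1 x n2), flattened.
    # Row constraints: word u of doc 1 sends exactly w1[u] of its mass.
    A_rows = np.zeros((n1, n1 * n2))
    for u in range(n1):
        A_rows[u, u * n2:(u + 1) * n2] = 1.0
    # Column constraints: word v of doc 2 receives exactly w2[v].
    A_cols = np.zeros((n2, n1 * n2))
    for v in range(n2):
        A_cols[v, v::n2] = 1.0
    res = linprog(cost.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([w1, w2]),
                  bounds=(0, None))
    T = res.x.reshape(n1, n2)  # optimal flow, useful to inspect where mass goes
    return res.fun, T

Inspecting the returned T is how I'm trying to see where the flow "becomes
very convoluted" once the documents get longer.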
Okay, the k-NN objective is relaxed with Neighborhood Components Analysis
(NCA), so the loss and gradient are well-defined, and the optimization uses
batch gradient descent.
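This is the generic NCA-style objective as I read it (my own sketch over a
precomputed document-distance matrix, not the exact S-WMD formulation):

import numpy as np

def nca_loss(dist, labels):
    # dist[i, j] = D(x_i, x_j) between documents,
    # labels: 1-D numpy array with the class of each document.
    n = len(labels)
    # Softmax over negative distances gives soft neighbor probabilities p_ij;
    # a document is never its own neighbor, so the diagonal is masked out.
    logits = np.where(np.eye(n, dtype=bool), -np.inf, -dist)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Probability that each document is assigned to its own class by its
    # soft neighbors.
    same_class = labels[:, None] == labels[None, :]
    p_correct = (p * same_class).sum(axis=1)
    return -np.log(p_correct + 1e-12).sum()

Batch gradient descent would then update the parameters of the ground cost
(the word weights and the linear transformation in the paper) against this
loss.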
Also, looking at the evaluation datasets (Amazon Reviews, 20News, etc.;
fastText has good coverage of most of them here
<https://github.com/facebookresearch/fastText/blob/master/classification-results.sh>),
I don't see any maximum token count per document, or any similar upper bound.
Has anyone run into a problem like this?
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.