Discussion:
[gensim:11662] Why does WMD not scale well when working with many words in practice?
Loreto Parisi
2018-10-08 15:52:23 UTC
I'm using WMD for a text-similarity task on short sentences (like tweets).
According to the good tutorial here
<http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/>,
the metric does not scale well beyond roughly 10 tokens (I assume this means
not much more than that, say 10-30 tokens, i.e. about a tweet) "because the
flow becomes very convoluted between similar words". I'm trying to work
through the paper here
<https://papers.nips.cc/paper/6139-supervised-word-movers-distance.pdf>,
but it is not clear to me where it becomes convoluted. The definition
states that "the WMD learns T to minimize D(x_i, x_j), so that documents
that share many words (or even related ones) should have smaller
distances than documents with very dissimilar words." Okay, the k-NN
classification is based on Neighborhood Components Analysis (NCA), so the
loss and its gradient are well defined, and batch gradient descent is used
for optimization. Also, looking at the datasets (Amazon Reviews, 20News,
etc. - fastText covers most of them here
<https://github.com/facebookresearch/fastText/blob/master/classification-results.sh>),
there is no maximum token count or similar upper bound assigned.
Has anyone experienced a problem like this?
Gordon Mohr
2018-10-08 19:04:12 UTC
Nothing will fail about Word Mover's Distance with longer texts – it will
just take a lot longer.

In other comparisons, each text already has a fixed-length vector. For
example, maybe all the 100-dimensional word-vectors have been averaged
together, yielding a single 100-dimensional vector for a text of any
length. Or, something like Doc2Vec will have computed a 100-dimensional
summary vector for a text of any length. Then, the pairwise comparisons
between these vectors take the same amount of time no matter the text
length (and further parallelize well with the various array-math native
routines available.)
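
For example, a minimal sketch of that averaged-vector comparison (assuming a
gensim word-vector model `kv`, e.g. loaded via gensim.downloader; the helper
names here are only illustrative):

    import numpy as np

    def avg_vector(tokens, kv):
        # Average the vectors of in-vocabulary tokens: one fixed-length
        # vector per text, regardless of how long the text is.
        vecs = [kv[t] for t in tokens if t in kv]
        return np.mean(vecs, axis=0)

    def cosine_sim(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # kv = gensim.downloader.load("glove-wiki-gigaword-100")  # any KeyedVectors
    # cosine_sim(avg_vector("the cat sat".split(), kv),
    #            avg_vector("a dog ran".split(), kv))

Comparing two texts is then a single cosine between two small vectors, no
matter how many words each text contained.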

For WMD, the input representation of a text is a bag-of-word-vectors. So
instead of a 50-word document being just 100-dimensions, it's now a bag of
50 different 100-dimensional vectors. Calculating the WMD to another
50-word document starts with a step of calculating 50*50 pairwise
cosine-distances, between all word-pairs – that's already 2500 times the
calculation required for a single 100d-to-100d cosine-distance (when a text
is a single summary vector). Then WMD requires iteratively searching
through many possible "word weight shifts" to find the "cheapest" one – and
a simple greedy algorithm like "shift the 1st word to the word(s) in the
other doc that are closest to it, then do the 2nd word, etc" won't find the
best solution.
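
As a rough sketch of just that first step (assuming the same word-vector
model `kv`; the function below is illustrative, not a gensim API, while
`wmdistance()` is the actual gensim entry point):

    import numpy as np

    # The matrix WMD starts from: pairwise cosine-distances between the two
    # texts' word-vectors. For two 50-word texts that is a 50 x 50 matrix
    # (2500 distances), versus a single distance in the averaged-vector case.
    def pairwise_cosine_distances(tokens_a, tokens_b, kv):
        A = np.array([kv[t] for t in tokens_a if t in kv])
        B = np.array([kv[t] for t in tokens_b if t in kv])
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return 1.0 - A @ B.T   # shape (len_a, len_b)

    # gensim hides the whole computation (distance matrix plus the flow
    # optimization) behind one call:
    #   distance = kv.wmdistance(tokens_a, tokens_b)

And that distance matrix is only the setup: the iterative flow optimization on
top of it is where the remaining time goes.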

So: WMD is more costly to calculate on small texts, and further grows in
costliness as the texts get longer. If you're patient enough, have enough
CPUs/parallelization, or can apply other optimizations (like only
calculating WMD between texts that some other cheaper method already
identified as somewhat-similar candidates), it may still be fine for any
particular project.
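
A rough sketch of that candidate-filtering idea (token lists plus the same
word-vector model `kv`; the helper names are just illustrative):

    import numpy as np

    def prefilter_then_wmd(query_tokens, candidate_docs, kv, shortlist_size=20):
        # Cheap pass: rank all candidates by centroid cosine similarity,
        # then re-rank only the shortlist with the expensive WMD.
        def centroid(tokens):
            v = np.mean([kv[t] for t in tokens if t in kv], axis=0)
            return v / np.linalg.norm(v)

        q = centroid(query_tokens)
        scored = sorted(candidate_docs,
                        key=lambda doc: -np.dot(q, centroid(doc)))
        shortlist = scored[:shortlist_size]
        # Expensive pass: WMD only on the shortlist (lower = more similar).
        return sorted(shortlist, key=lambda doc: kv.wmdistance(query_tokens, doc))
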

(I suspect there are a bunch of potential WMD-like algorithms that could
offer most of the benefit with less of the calculation cost. There was a
measure called "Soft Cosine Similarity" that seems promising, and perhaps
long texts could usefully be preprocessed to reduce their
hundreds-of-word-vectors, via weighted clustering, to just a small fixed
number N of pseudoword-vectors before applying full WMD. But those would be
research/experimental questions...)
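
(For the curious, the soft-cosine idea can be sketched in plain numpy,
independent of any particular library API: it is an ordinary cosine over
bag-of-words vectors, except that a term-to-term similarity matrix S, built
for example from word-vector similarities, is mixed in.)

    import numpy as np

    def soft_cosine(a, b, S):
        # a, b: bag-of-words count vectors over the same vocabulary.
        # S: term-similarity matrix, S[i, j] = similarity of words i and j
        #    (e.g. clipped word-vector cosine), with S[i, i] = 1.
        num = a @ S @ b
        den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
        return num / den
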

- Gordon
Loreto Parisi
2018-10-12 07:30:05 UTC
Thank you very much for the detailed response. In most of the tests I have
run so far, cosine distance was not, in practice, the best metric for the
text-similarity task. I have used the Universal Sentence Encoder model
(TensorFlow Hub - https://tfhub.dev/google/universal-sentence-encoder/2)
for the STS task, but while its results look promising on the examples, on
real-world documents they do not hold up. We are now trying gensim's WMD on
documents (i.e. not short texts), and for our similarity purposes it seems
better at capturing what we consider a similar document. In particular,
when a document is transformed with typical edit operations, WMD clearly
seems to work better. I'm not sure whether there is a paper that states
this, but it is what my experiments on a sizeable document dataset have
shown so far. I was not aware of soft cosine similarity, but I have tried
different approaches such as SIF
(https://github.com/loretoparisi/docker/blob/master/sif/Dockerfile) and a
more classical word2vec centroid-based cosine similarity (with fastText or
StarSpace).
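
(For reference, the SIF-style weighted average I mention can be sketched
roughly as below; the word-frequency table and the `a` parameter come from
your own corpus, and this leaves out the optional principal-component
removal from the SIF paper.)

    import numpy as np

    def sif_embedding(tokens, kv, word_freq, a=1e-3):
        # Smooth-inverse-frequency weighted average of word-vectors:
        # frequent words are down-weighted by a / (a + p(w)).
        total = sum(word_freq.values())
        vecs, weights = [], []
        for t in tokens:
            if t in kv and t in word_freq:
                vecs.append(kv[t])
                weights.append(a / (a + word_freq[t] / total))
        return np.average(vecs, axis=0, weights=weights)
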