[gensim:11710] Choose among Cosine Similarity and WMD (Word Mover's Distance) Similarity
Loreto Parisi
2018-10-23 16:44:44 UTC
I'm using both Cosine Similarity and WMD to compare a list of documents to
an input document, where a document has multiple lines separated by one or
more '\n'.
I'm using a Word2Vec binary model from FastText English WikiNews with
embedding dimension 300.

Assume that I have defined these simple methods for text pre-processing,
centroid and cosine similarity calculation:

import numpy as np
from nltk import word_tokenize

def preprocess(doc, stop_words):
    doc = doc.lower()  # Lower the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    for w in sentence:
        if w in wv:
            v += wv[w]
    return v / len(sentence)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
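A note on the centroid helper above: it divides by len(sentence), which also counts tokens that are missing from the model, and an empty token list leads to a division by zero. A minimal sketch of a variant that averages only the in-vocabulary tokens follows. The names `safe_centroid` and the toy `wv` dict are hypothetical; the dict merely stands in for the real word-vector lookup, which is assumed to support `in` and `[]` as in the code above. Returning 0.0 for a zero-norm vector is one possible convention, not the only one.

```python
import numpy as np

def safe_centroid(tokens, wv, dim=300):
    """Average the vectors of the tokens found in wv; zero vector if none are."""
    vectors = [wv[w] for w in tokens if w in wv]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

def safe_cosine_sim(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Toy 3-dimensional "model" (hypothetical data, only for illustration).
wv = {
    "love": np.array([1.0, 0.0, 0.0]),
    "song": np.array([0.0, 1.0, 0.0]),
}

# "unknownword" is not in wv, so it does not dilute the average.
centroid = safe_centroid(["love", "song", "unknownword"], wv, dim=3)
empty = safe_centroid([], wv, dim=3)
```

Since cosine similarity ignores vector length, the choice of divisor does not change the cosine scores; the guard mainly avoids NaN values when a document has no usable tokens.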

I'm doing the following. First I take my input document and I calculate the
centroid from my Word2Vec model

inputv = sentence_centroid(preprocess(lyric_to_compare, stop_words), model)
wmd_distances = []
cosine_similarities = []

I then iterate over the list of documents.

for i in range(len(document_list)):
    l2 = document_list[i]

    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # wmd distance
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words),
                                  preprocess(l2, stop_words))
    wmd_distances.append(wmdistance)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)

So I now have the WMD distances and the cosine similarities for all documents
against the input document.
At this point I want to normalize these values.
I first calculate the WMD similarity as 1 - wmd_distance. In the code here
I'm actually normalizing against the max value, so I'm computing wmd_max - i,
where i is the i-th WMD distance value; then I normalize between min and max.

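# normalize similarity score
if len(wmd_distances) > 1:
    wmd_max = np.amax(wmd_distances)
    wmd_distances = [(wmd_max - i) for i in wmd_distances]
    # min-max scaling (assumed form: the original expressions were cut off)
    wmd_lo, wmd_hi = np.amin(wmd_distances), np.amax(wmd_distances)
    cos_lo, cos_hi = np.amin(cosine_similarities), np.amax(cosine_similarities)
    wmd_distances_norm = [(x - wmd_lo) / (wmd_hi - wmd_lo)
                          for x in wmd_distances]
    cosine_similarities_norm = [(x - cos_lo) / (cos_hi - cos_lo)
                                for x in cosine_similarities]
else:
    wmd_distances = [(1 - i) for i in wmd_distances]
    wmd_distances_norm = wmd_distances
    cosine_similarities_norm = cosine_similarities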

So my output is now a list of cosine similarity and WMD similarity
values, possibly normalized.

Applying this to different documents, I have some issues. First of all, I'm
not completely sure about using the max value to get the WMD similarity:

wmd_similarity[i] = max(wmd_distances) - wmd_distances[i]

Maybe it could be as simple as wmd_similarity[i] = 1 - wmd_distances[i],
but that could introduce negative values.

The second point is the normalization: assuming it makes sense at all, I
cannot reconcile the scales of the two metrics in order to choose the best
option.
Any hint?
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Gordon Mohr
2018-10-24 08:11:29 UTC
It's not clear what you mean by 'normalize', or ultimately hope to achieve
by this step. Are you sure you need it?

Cosine-similarities will already be in a range from -1.0 to 1.0. Further,
when they come from the same model/process, they'll be comparable to each
other. For example, for sentences a, b, c, d, e, and f, if cossim(a,b) >
cossim(d,e), then it'd be typical/defensible to say that "a and b are more
similar to each other than d and e".

However, if you also calculated cossim(a,c), and then *scaled* the
cossim(a,b) and cossim(a,c) values based on just the min/max seen in those
pairings, the scaled version wouldn't necessarily be meaningfully
comparable to some values scaled based on a different set of pairings. (And
if you didn't care about such longer-range comparability – just ranks – you
probably wouldn't be doing scaling at all.)
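To make the point about group-dependent scaling concrete, here is a small sketch with made-up similarity values. The helper `min_max_scale` and both groups are hypothetical and not taken from the code in this thread. After min-max scaling within each group, raw values of 0.60 and 0.75 both end up near 0.5, so the scaled numbers can no longer be compared across groups.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical cosine similarities of one query against two different groups.
group_1 = [0.80, 0.60, 0.40]
group_2 = [0.80, 0.75, 0.70]

scaled_1 = min_max_scale(group_1)
scaled_2 = min_max_scale(group_2)

# The middle items differ in raw similarity (0.60 vs 0.75) but land at
# roughly the same scaled value, because each group has its own min and max.
```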

For WMDistance, the values are positive and vary more – indeed I'm not sure
there is an obvious 'max' value to the distance, as longer and
more-different texts could get much larger distances. And for some
downstream tasks, there's no need to re-scale the values: the raw
distances, or sorted rank of results, or relative differences between raw
values, may be enough.
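As a sketch of the "sorted rank" option, assuming the raw scores are already collected in two lists: the numbers below are invented for illustration, and the variable names mirror those in the original post. Ranking needs no re-scaling at all; the only thing to keep straight is that smaller is better for a distance and larger is better for a similarity.

```python
import numpy as np

# Hypothetical raw scores for four candidate documents against one query.
wmd_distances = [1.9, 0.7, 3.2, 1.1]            # lower = more similar
cosine_similarities = [0.41, 0.88, 0.15, 0.63]  # higher = more similar

# Document indices ordered from most to least similar under each metric.
wmd_rank = np.argsort(wmd_distances)                 # ascending distance
cosine_rank = np.argsort(cosine_similarities)[::-1]  # descending similarity

best_by_wmd = int(wmd_rank[0])
best_by_cosine = int(cosine_rank[0])
```

With these made-up numbers both metrics produce the same ordering; on real data the two rankings may differ, and comparing the rankings is one way to compare the metrics without putting their raw values on a common scale.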

But if you do need some similarity-value that ranges from 0.0 to 1.0,
rather than scaling by observed ranges, a common transformation that's used
is:

similarity = 1 / (1 + distance)

Then the re-scaled values don't depend on what max happened to be in the
same grouping. (You could also then shift-and-scale that value to be in the
-1.0 to 1.0 range, by multiplying by 2 and subtracting 1, but even after
doing that, comparing the WMD-derived similarity with the cosine-similarity
might be nonsensical, given their very-different methods-of-calculation and
typical distributions.)
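A minimal sketch of the transformation described above, applied to a few made-up distances. The function names `distance_to_similarity` and `to_signed_range` are hypothetical. Each value is transformed on its own, so the result for one document does not change when other documents are added to or removed from the list.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1]; 0 distance gives 1."""
    return 1.0 / (1.0 + distance)

def to_signed_range(similarity):
    """Shift-and-scale a (0, 1] similarity into (-1, 1]."""
    return 2.0 * similarity - 1.0

# Hypothetical distances, including an infinite one.
distances = [0.0, 1.0, 3.0, float("inf")]

similarities = [distance_to_similarity(d) for d in distances]
signed = [to_signed_range(s) for s in similarities]
```

For finite distances the similarity never actually reaches 0.0; it only approaches it as the distance grows, while an infinite distance maps to exactly 0.0.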

- Gordon
Loreto Parisi
2018-10-30 11:56:58 UTC
Hey Gordon, thank you very much for your suggestions, this means a lot!
I think you were right: there is no need to scale them, because cosine-sim
is already in the [-1,1] range and, in the end, a direct comparison of the
two metrics in terms of distribution does not help.

Putting it all together, I have modified the code as follows,
according to your suggestions :)

inputv = sentence_centroid(preprocess(lyric_to_compare, stop_words), model)
wmd_similarities = []
cosine_similarities = []
for i in range(len(lyrics_list)):
    l2 = lyrics_list[i]['lyrics_body']
    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # wmd similarity
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words),
                                  preprocess(l2, stop_words))
    # https://groups.google.com/forum/#!topic/gensim/-pRZnsOEaPQ
    wmsimilarity = 1/(1+wmdistance)
    wmd_similarities.append(wmsimilarity)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)
# re-scaling wmdsim
wmd_similarities_norm = [(2*i-1) for i in wmd_similarities]

I have added the re-scaling of the WMD similarities to [-1,1] just for
testing purposes.
I will now try it on real-world data and see what happens, and whether this
approach is closer to what I would expect :)
My two cents: it might be worth adding your suggestions to the gensim
WMD/CosineSim tutorials, because they were definitely very helpful to me,
and hopefully will be for other gensim users :)
Thanks again.