Loreto Parisi
2018-10-23 16:44:44 UTC
I'm using both Cosine Similarity and WMD to compare a list of documents to
an input document, where a document has multiple lines separated by one or
more '\n'.
I'm using the Word2Vec binary model from fastText English WikiNews, with
embedding dimension 300.
Assume I have defined these simple methods for text pre-processing,
centroid computation and cosine similarity:
import numpy as np
from nltk.tokenize import word_tokenize

def preprocess(doc, stop_words):
    doc = doc.lower()  # Lowercase the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    for w in sentence:
        if w in wv:
            v += wv[w]
    return v / len(sentence)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
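One thing worth guarding against here: sentence_centroid divides by len(sentence), which is zero for an empty token list, and cosine_sim divides by the vector norms, which are zero when no token of a document is in the vocabulary. A sketch of safer variants (the _safe names are my own, and the toy dict below stands in for the real KeyedVectors):

```python
import numpy as np

def sentence_centroid_safe(sentence, wv, dim=300):
    # Average only the in-vocabulary vectors; fall back to a zero
    # vector when nothing matches (instead of dividing by zero).
    vecs = [wv[w] for w in sentence if w in wv]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine_sim_safe(a, b):
    # Cosine similarity with a guard against zero-norm vectors.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return np.dot(a, b) / (na * nb)
```

Note this also changes the averaging slightly: the original divides by all tokens (so out-of-vocabulary words dilute the centroid), while this version averages only the in-vocabulary ones.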
I'm doing the following. First I take my input document and compute its
centroid from my Word2Vec model:

inputv = sentence_centroid(preprocess(lyric_to_compare, stop_words), model)
wmd_distances = []
cosine_similarities = []
I then iterate over the list of documents:
for i in range(len(document_list)):
    l2 = document_list[i]
    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # WMD distance
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words),
                                  preprocess(l2, stop_words))
    wmd_distances.append(wmdistance)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)
So I now have the WMD distances and the cosine similarities of all
documents against inputv.
At this point I want to normalize these values.
I first turn each WMD distance into a similarity. In the code below I do
this against the max value, i.e. I compute wmd_max - d, where d is the i-th
WMD distance, and then I normalize between min and max.
# normalize similarity score
if len(wmd_distances) > 1:
    wmd_max = np.amax(wmd_distances)
    wmd_distances = [(wmd_max - d) for d in wmd_distances]
    wmd_distances_norm = [
        (x - np.min(wmd_distances)) / (np.max(wmd_distances) - np.min(wmd_distances))
        for x in wmd_distances
    ]
    cosine_similarities_norm = [
        (x - np.min(cosine_similarities)) / (np.max(cosine_similarities) - np.min(cosine_similarities))
        for x in cosine_similarities
    ]
else:
    wmd_distances = [(1 - d) for d in wmd_distances]
    wmd_distances_norm = wmd_distances
    cosine_similarities_norm = cosine_similarities
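As an aside, the max-shift followed by min-max scaling can be collapsed into a single helper, because shifting every value by the same constant does not change a min-max result. A sketch (minmax_norm is my own name, not a library function):

```python
import numpy as np

def minmax_norm(values):
    # Scale values to [0, 1]; a constant (or single-element) list maps to all ones.
    arr = np.asarray(values, dtype=float)
    lo, hi = arr.min(), arr.max()
    if hi == lo:
        return np.ones_like(arr)
    return (arr - lo) / (hi - lo)

# Negating each distance turns it into a similarity ordering; min-max
# scaling then maps it to [0, 1], matching wmd_max - d followed by min-max.
# wmd_similarities_norm = minmax_norm([-d for d in wmd_distances])
```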
So my output is now a list of cosine similarities and WMD similarities,
normalized when there is more than one document.
Applying this to different documents, I ran into some issues. First of all,
I'm not completely sure about using the max value to get the WMD similarity:

wmd_similarity[i] = max(wmd_distances) - wmd_distances[i]

Maybe it could be as simple as wmd_similarity[i] = 1 - wmd_distances[i],
but that may introduce negative values.
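A quick numeric check of that concern, with made-up toy distances rather than values from the real model:

```python
# Toy WMD distances (made up for illustration). WMD is unbounded above,
# so 1 - d can go negative, while 1 / (1 + d) stays within (0, 1]
# and preserves the same ranking.
distances = [0.4, 1.0, 2.5]
one_minus = [1 - d for d in distances]      # last value is negative
bounded = [1 / (1 + d) for d in distances]  # all within (0, 1]
```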
The second point is the normalization: assuming it makes sense at all, I
can't reconcile the scales of the two metrics to choose the best option.
Any hint?
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.