Loreto Parisi
2018-10-23 16:44:44 UTC
I'm using both Cosine Similarity and WMD to compare a list of documents to
an input document, where a document has multiple lines separated by one or
more '\n'.
I'm using the Word2Vec binary model from fastText English WikiNews, with
embedding dimension 300.
Assume I have defined these simple methods for text pre-processing,
centroid computation and cosine similarity:
import numpy as np
from nltk.tokenize import word_tokenize

def preprocess(doc, stop_words):
    doc = doc.lower()  # Lowercase the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    for w in sentence:
        if w in wv:
            v += wv[w]
    return v / len(sentence)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
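One thing worth guarding against here: sentence_centroid divides by len(sentence), which is zero for an empty token list, and cosine_sim divides by the vector norms, which are zero when no token of a document is in the vocabulary. A sketch of safer variants (the _safe names are my own, and the toy dict below stands in for the real KeyedVectors):

```python
import numpy as np

def sentence_centroid_safe(sentence, wv, dim=300):
    # Average only the in-vocabulary vectors; fall back to a zero
    # vector when nothing matches (instead of dividing by zero).
    vecs = [wv[w] for w in sentence if w in wv]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine_sim_safe(a, b):
    # Cosine similarity with a guard against zero-norm vectors.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return np.dot(a, b) / (na * nb)
```

Note this also changes the averaging slightly: the original divides by all tokens (so out-of-vocabulary words dilute the centroid), while this version averages only the in-vocabulary ones.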
I'm doing the following. First I take my input document and compute its
centroid from my Word2Vec model:

inputv = sentence_centroid(preprocess(lyric_to_compare, stop_words), model)
wmd_distances = []
cosine_similarities = []
I then iterate over the list of documents:
for i in range(len(document_list)):
    l2 = document_list[i]
    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # WMD distance
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words),
                                  preprocess(l2, stop_words))
    wmd_distances.append(wmdistance)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)
So I now have the WMD distances and the cosine similarities of all
documents against inputv.
At this point I want to normalize these values.
I first turn each WMD distance into a similarity. In the code below I do
this against the max value, i.e. I compute wmd_max - d, where d is the i-th
WMD distance, and then I normalize between min and max.
# normalize similarity score
if len(wmd_distances) > 1:
    wmd_max = np.amax(wmd_distances)
    wmd_distances = [(wmd_max - d) for d in wmd_distances]
    wmd_distances_norm = [
        (x - np.min(wmd_distances)) / (np.max(wmd_distances) - np.min(wmd_distances))
        for x in wmd_distances
    ]
    cosine_similarities_norm = [
        (x - np.min(cosine_similarities)) / (np.max(cosine_similarities) - np.min(cosine_similarities))
        for x in cosine_similarities
    ]
else:
    wmd_distances = [(1 - d) for d in wmd_distances]
    wmd_distances_norm = wmd_distances
    cosine_similarities_norm = cosine_similarities
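As an aside, the max-shift followed by min-max scaling can be collapsed into a single helper, because shifting every value by the same constant does not change a min-max result. A sketch (minmax_norm is my own name, not a library function):

```python
import numpy as np

def minmax_norm(values):
    # Scale values to [0, 1]; a constant (or single-element) list maps to all ones.
    arr = np.asarray(values, dtype=float)
    lo, hi = arr.min(), arr.max()
    if hi == lo:
        return np.ones_like(arr)
    return (arr - lo) / (hi - lo)

# Negating each distance turns it into a similarity ordering; min-max
# scaling then maps it to [0, 1], matching wmd_max - d followed by min-max.
# wmd_similarities_norm = minmax_norm([-d for d in wmd_distances])
```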
So my output is now a list of cosine similarities and WMD similarities,
normalized when there is more than one document.
Applying this to different documents, I ran into some issues. First of all,
I'm not completely sure about using the max value to get the WMD similarity:

wmd_similarity[i] = max(wmd_distances) - wmd_distances[i]

Maybe it could be as simple as wmd_similarity[i] = 1 - wmd_distances[i],
but that may introduce negative values.
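A quick numeric check of that concern, with made-up toy distances rather than values from the real model:

```python
# Toy WMD distances (made up for illustration). WMD is unbounded above,
# so 1 - d can go negative, while 1 / (1 + d) stays within (0, 1]
# and preserves the same ranking.
distances = [0.4, 1.0, 2.5]
one_minus = [1 - d for d in distances]      # last value is negative
bounded = [1 / (1 + d) for d in distances]  # all within (0, 1]
```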
The second point is the normalization: assuming it makes sense at all, I
can't reconcile the scales of the two metrics to choose the best option.
Any hint?
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.