[gensim:11863] Word2vec Alignment

Discussion:

e***@gmail.com

2018-12-04 02:13:59 UTC

I am currently using the following code to align several word2vec models
(named 'model_1950', 'model_1960', ..., 'model_2010').

Note: The two functions at the top are the alignment code.

from gensim.models import Word2Vec
import numpy as np
import gensim

def smart_procrustes_align_gensim(base_embed, other_embed, words=None):
    """Procrustes align two gensim word2vec models (to allow for comparison
    between same word across models).
    Code ported from HistWords <https://github.com/williamleif/histwords> by
    William Hamilton <***@stanford.edu>.
    (With help from William. Thank you!)

    First, intersect the vocabularies (see `intersection_align_gensim`
    documentation).
    Then do the alignment on the other_embed model.
    Replace the other_embed model's syn0 and syn0norm numpy matrices with the
    aligned version.
    Return other_embed.

    If `words` is set, intersect the two models' vocabulary with the vocabulary
    in words (see `intersection_align_gensim` documentation).
    """
    # patch by Richard So (https://twitter.com/richardjeanso) (thanks!) to
    # update this code for new version of gensim
    base_embed.init_sims()
    other_embed.init_sims()

    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = intersection_align_gensim(
        base_embed, other_embed, words=words)

    # get the embedding matrices
    base_vecs = in_base_embed.wv.syn0norm
    other_vecs = in_other_embed.wv.syn0norm

    # just a matrix dot product with numpy
    m = other_vecs.T.dot(base_vecs)
    # SVD method from numpy
    u, _, v = np.linalg.svd(m)
    # another matrix operation
    ortho = u.dot(v)
    # Replace original array with modified one
    # i.e. multiplying the embedding matrix (syn0norm) by "ortho"
    other_embed.wv.syn0norm = other_embed.wv.syn0 = (other_embed.wv.syn0norm).dot(ortho)
    return other_embed

def intersection_align_gensim(m1, m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as list or set), then the vocabulary is intersected with
    this list as well.
    Indices are re-organized from 0..N in order of descending frequency (=sum
    of counts from both m1 and m2).
    These indices correspond to the new syn0 and syn0norm objects in both
    gensim models:
        -- so that Row 0 of m1.syn0 will be for the same word as Row 0 of m2.syn0
        -- you can find the index of any word on the .index2word list:
           model.index2word.index(word) => 2
    The .vocab dictionary is also updated for each model, preserving the count
    but updating the index.
    """

    # Get the vocab for each model
    vocab_m1 = set(m1.wv.vocab.keys())
    vocab_m2 = set(m2.wv.vocab.keys())

    # Find the common vocabulary
    common_vocab = vocab_m1 & vocab_m2
    if words:
        common_vocab &= set(words)

    # If no alignment necessary because vocab is identical...
    if not vocab_m1 - common_vocab and not vocab_m2 - common_vocab:
        return (m1, m2)

    # Otherwise sort by frequency (summed for both)
    common_vocab = list(common_vocab)
    common_vocab.sort(
        key=lambda w: m1.wv.vocab[w].count + m2.wv.vocab[w].count,
        reverse=True)

    # Then for each model...
    for m in [m1, m2]:
        # Replace old syn0norm array with new one (with common vocab)
        indices = [m.wv.vocab[w].index for w in common_vocab]
        old_arr = m.wv.syn0norm
        new_arr = np.array([old_arr[index] for index in indices])
        m.wv.syn0norm = m.wv.syn0 = new_arr

        # Replace old vocab dictionary with new one (with common vocab)
        # and old index2word with new one
        m.wv.index2word = common_vocab
        old_vocab = m.wv.vocab
        new_vocab = {}
        for new_index, word in enumerate(common_vocab):
            old_vocab_obj = old_vocab[word]
            new_vocab[word] = gensim.models.word2vec.Vocab(
                index=new_index, count=old_vocab_obj.count)
        m.wv.vocab = new_vocab

    return (m1, m2)

#load word2vec models
model_1950 = Word2Vec.load('wv_1950')
model_1960 = Word2Vec.load('wv_1960')
model_1970 = Word2Vec.load('wv_1970')
model_1980 = Word2Vec.load('wv_1980')
model_1990 = Word2Vec.load('wv_1990')
model_2000 = Word2Vec.load('wv_2000')
model_2010 = Word2Vec.load('wv_2010')

#align and save
aligned_model_1960 = smart_procrustes_align_gensim(model_1950, model_1960,
words=None)
aligned_model_1960.save('aligned_wv_1960')

aligned_model_1970 = smart_procrustes_align_gensim(aligned_model_1960,
model_1970, words=None)
aligned_model_1970.save('aligned_wv_1970')

aligned_model_1980 = smart_procrustes_align_gensim(aligned_model_1970,
model_1980, words=None)
aligned_model_1980.save('aligned_wv_1980')

aligned_model_1990 = smart_procrustes_align_gensim(aligned_model_1980,
model_1990, words=None)
aligned_model_1990.save('aligned_wv_1990')

aligned_model_2000 = smart_procrustes_align_gensim(aligned_model_1990,
model_2000, words=None)
aligned_model_2000.save('aligned_wv_2000')

aligned_model_2010 = smart_procrustes_align_gensim(aligned_model_2000,
model_2010, words=None)
aligned_model_2010.save('aligned_wv_2010')

As shown above, I used 'model_1950' as my base model for the initial
alignment. Afterwards, I align the models chronologically, as shown in the
latter part of my code (the load-and-align section).

However, while searching for the word 'ctrx' I noticed that
'aligned_model_1980', 'aligned_model_1990', 'aligned_model_2000' and
'aligned_model_2010' do not have it in their vocabularies, whereas
'model_1950', 'aligned_model_1960' and 'aligned_model_1970' do.

The main reason for the alignment is so that we can compare the behaviour of
a word across different time periods. However, I am confused as to why my
vocabularies are different. Could you please let me know how to resolve this
issue?
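
For readers following the SVD step in `smart_procrustes_align_gensim`, here is a minimal sketch of the same three operations (dot product, SVD, `u.dot(v)`) on synthetic data. The helper name `procrustes_rotation` and the random matrices are illustrative assumptions, not part of the posted code or of gensim; the "other" matrix is built as a rotated copy of the "base" matrix, so a correct alignment should map it back onto the base.

```python
import numpy as np

def procrustes_rotation(other_vecs, base_vecs):
    """Orthogonal matrix R that minimises ||other_vecs @ R - base_vecs|| (Frobenius norm)."""
    m = other_vecs.T.dot(base_vecs)      # same product as in the posted function
    u, _, vt = np.linalg.svd(m)          # numpy returns V already transposed
    return u.dot(vt)

rng = np.random.default_rng(0)

# Synthetic "base" embeddings: 50 words, 5 dimensions (made-up data).
base = rng.normal(size=(50, 5))

# Build a random orthogonal matrix and use it to rotate the base space.
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
other = base.dot(q.T)

# Recover the rotation and apply it, as the posted code does with syn0norm.
ortho = procrustes_rotation(other, base)
aligned = other.dot(ortho)

print("aligned matches base:", np.allclose(aligned, base))
```

Real embeddings from different decades are not exact rotations of one another, so in practice the aligned matrix only approximates the base; the sketch just shows what the rotation is meant to do.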

--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gordon Mohr

2018-12-04 06:54:02 UTC

Which of the original models included the word?

I'm not familiar with this code; this sort of 'alignment' isn't a part of
gensim. You might have better luck asking the author of this code & this
technique.

But one of its comments says, "Only the shared vocabulary between [the two
models] is kept." That suggests that as soon as you try to "align" with
another model that doesn't contain a particular word, the result also won't
contain that word.

So the results you describe would be expected if the original models for
1950, 1960, and 1970 have the word, but the model for 1980 does not. Is
that the case? If so, mystery solved.

(Is there a reason why this particular not-very-natural word 'ctrx' is of
such interest? It looks like a peculiar abbreviation, I have no guess as to
what it could have ever meant, and thus it could easily fall into the
category of words that don't span eras, or have so few extant examples a
technique like this can't say much.)

- Gordon

Andrey Kutuzov

2018-12-04 11:57:08 UTC

Once again - this alignment code returns 'aligned' models which contain
only the words present in _both_ models which are being aligned. Thus,
if you align, say, the '1980' and '1990' models, the resulting aligned
model's vocabulary will keep only the words shared by 1980 and 1990. If,
for example, the word 'ctrx' is missing from either of them, it will
not be present in the resulting aligned model. Judging by your
description, it seems that this word disappeared from your training
corpora somewhere around 1980 (which probably means it went out of usage
around that time, but it depends on the nature of your corpora, of course).

The idea of aligning word embeddings is indeed to be able to compare the
meaning of words across different models (including those trained on
different time periods). But if a word X is not present in a model at
all (for example, because it does not occur in its training corpus),
then you can't 'compare' it to anything, it does not make sense.
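
The effect described above can be sketched with plain Python sets. The decade vocabularies below are invented toy data and `chained_vocabs` is a hypothetical helper; it only models the vocabulary each aligned model has at the moment it is produced by chronological alignment, not the vectors themselves.

```python
# Toy vocabularies per decade (invented for illustration only).
decade_vocabs = {
    1950: {"war", "radio", "ctrx"},
    1960: {"war", "radio", "ctrx", "tv"},
    1970: {"war", "radio", "ctrx", "tv"},
    1980: {"war", "radio", "tv", "computer"},
    1990: {"war", "radio", "tv", "computer"},
}

def chained_vocabs(vocabs):
    """Vocabulary of each model when aligned to the previous aligned model."""
    years = sorted(vocabs)
    running = set(vocabs[years[0]])
    result = {years[0]: set(running)}
    for year in years[1:]:
        # Each alignment keeps only the words shared with the model before it.
        running &= vocabs[year]
        result[year] = set(running)
    return result

result = chained_vocabs(decade_vocabs)
for year in sorted(result):
    print(year, "ctrx" in result[year], sorted(result[year]))
```

In this toy setup the word survives up to 1970 and is gone from 1980 onward, because the 1980 vocabulary never contained it.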

--
Solve et coagula!
Andrey

e***@gmail.com

2018-12-04 13:19:21 UTC

Dear Andrey,

Thank you for the reply. I was just wondering which of the following two
options is better. Please let me know your thoughts :)

#align chronologically (option 1)
aligned_model_1960 = smart_procrustes_align_gensim(model_1950, model_1960,
words=None)
aligned_model_1970 = smart_procrustes_align_gensim(aligned_model_1960,
model_1970, words=None)
aligned_model_1980 = smart_procrustes_align_gensim(aligned_model_1970,
model_1980, words=None)
aligned_model_1990 = smart_procrustes_align_gensim(aligned_model_1980,
model_1990, words=None)
aligned_model_2000 = smart_procrustes_align_gensim(aligned_model_1990,
model_2000, words=None)
aligned_model_2010 = smart_procrustes_align_gensim(aligned_model_2000,
model_2010, words=None)

#align considering 1950 as the base model (option 2)
aligned_model_1960 = smart_procrustes_align_gensim(model_1950, model_1960,
words=None)
aligned_model_1970 = smart_procrustes_align_gensim(model_1950, model_1970,
words=None)
aligned_model_1980 = smart_procrustes_align_gensim(model_1950, model_1980,
words=None)
aligned_model_1990 = smart_procrustes_align_gensim(model_1950, model_1990,
words=None)
aligned_model_2000 = smart_procrustes_align_gensim(model_1950, model_2000,
words=None)
aligned_model_2010 = smart_procrustes_align_gensim(model_1950, model_2010,
words=None)

Andrey Kutuzov

2018-12-04 13:23:05 UTC

What is your task?

e***@gmail.com

2018-12-04 13:45:34 UTC

Hi Andrey, my task is to see how words change over time. More
specifically, I am using the following code to measure semantic drift by
neighbourhood. From these measurements I am trying to draw some inferences.
:)

###### Calculate semantic shift by neighborhood

def measure_semantic_shift_by_neighborhood(model1, model2, word, k=25, verbose=False):
    """
    Basic implementation of William Hamilton (@williamleif) et al's measure of
    semantic change proposed in their paper "Cultural Shift or Linguistic Drift?"
    (https://arxiv.org/abs/1606.02821), which they call the "local neighborhood
    measure." They find this measure better suited to understand the semantic
    change of nouns owing to "cultural shift," or changes in meaning "local" to
    that word, rather than global changes in language ("linguistic drift") use
    that are better suited to a Procrustes-alignment method (also described in
    the same paper.)

    Arguments are:
    - `model1`, `model2`: Are gensim word2vec models.
    - `word` is a string representation of a given word.
    - `k` is the size of the word's neighborhood (# of its closest words in its
      vector space).
    """
    # Import function for cosine distance
    from scipy.spatial.distance import cosine

    # Check that this word is present in both models
    if word not in model1.wv.vocab or word not in model2.wv.vocab:
        print("!! Word %s not present in both models." % word)
        return None

    # Get the two neighborhoods (via .wv, the non-deprecated access path)
    neighborhood1 = [w for w, c in model1.wv.most_similar(word, topn=k)]
    neighborhood2 = [w for w, c in model2.wv.most_similar(word, topn=k)]

    # Print?
    if verbose:
        print('>> Neighborhood of associations of the word "%s" in model1:' % word)
        print(', '.join(neighborhood1))
        print()
        print('>> Neighborhood of associations of the word "%s" in model2:' % word)
        print(', '.join(neighborhood2))

    # Get the 'meta' neighborhood (both combined)
    meta_neighborhood = list(set(neighborhood1) | set(neighborhood2))

    # Filter the meta neighborhood so that it contains only words present in
    # both models
    meta_neighborhood = [w for w in meta_neighborhood
                         if w in model1.wv.vocab and w in model2.wv.vocab]

    # For both models, get a similarity vector between the focus word and all
    # of the words in the meta neighborhood
    vector1 = [model1.wv.similarity(word, w) for w in meta_neighborhood]
    vector2 = [model2.wv.similarity(word, w) for w in meta_neighborhood]

    # Compute the cosine distance *between* those similarity vectors
    dist = cosine(vector1, vector2)

    # Return this cosine distance -- a measure of the relative semantic shift
    # for this word between these two models
    return dist

Andrey Kutuzov

2018-12-05 21:14:38 UTC

Hi ***@gmail.com,

To use the code you gave in your previous message, you don't need
Procrustes alignment at all. It will work perfectly on the models as
they are.

If you still decide to align them, then there is not much difference
between the two modes you suggested ('chronological' and 'with one base
model').

The first option is better in that it makes 'post-1950' models closer to
each other, probably resulting in them being more comparable. The
downside is that with each iteration the size of the vocabulary will
shrink, and the last model will contain only the intersection of the
vocabularies of _all_ the models (probably not many words).

The second option lets you keep larger vocabularies for each aligned
model (each one being essentially an intersection of only 2
vocabularies). The downside is that since you align all the time spans
to one and the same 'base model' (1950 in your case), the other models
may end up not very comparable among themselves.

But of course you should choose the alignment method based on your
evaluation setup. And once again - for the strategy in your code
(comparing vectors of cosine similarities to the nearest neighbors), you
don't need alignment at all.
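
One way to see why the neighbourhood measure does not depend on the rotation step: the Procrustes step multiplies a model's vectors by an orthogonal matrix, and an orthogonal matrix preserves cosine similarities within that model. The sketch below checks this on random data; `cosine_similarity` is a small illustrative helper and the vectors are synthetic, not taken from any of the models in this thread. It covers only the rotation, not the vocabulary trimming that the alignment code also performs.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)

# Synthetic "embedding matrix": 10 words, 4 dimensions (made-up data).
vectors = rng.normal(size=(10, 4))

# A random orthogonal matrix, standing in for the matrix `ortho`
# computed by the alignment function.
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
rotated = vectors.dot(rotation)

# Similarity between word 0 and word 1, before and after the rotation.
before = cosine_similarity(vectors[0], vectors[1])
after = cosine_similarity(rotated[0], rotated[1])

print("difference:", abs(before - after))
```

Since every within-model similarity is unchanged up to floating-point error, the similarity vectors that the neighbourhood measure compares are the same whether or not the rotation was applied.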

e***@gmail.com

2018-12-06 00:38:28 UTC

Dear Andrey,

Thank you very much for your valuable thoughts. They are really useful.
Thank you very much once again :)

Emi