e***@gmail.com
2018-12-04 02:13:59 UTC
I am currently using the following code to align several word2vec models
(named 'model_1950', 'model_1960', ..., 'model_2010').
Note: The code in green color is the alignment code.
from gensim.models import Word2Vec
import numpy as np
import gensim
def smart_procrustes_align_gensim(base_embed, other_embed, words=None):
    """Procrustes align two gensim word2vec models (to allow for comparison
    between same word across models).
    Code ported from HistWords <https://github.com/williamleif/histwords> by
    William Hamilton <***@stanford.edu>.
    (With help from William. Thank you!)
    First, intersect the vocabularies (see `intersection_align_gensim`
    documentation).
    Then do the alignment on the other_embed model.
    Replace the other_embed model's syn0 and syn0norm numpy matrices with the
    aligned version.
    Return other_embed.
    If `words` is set, intersect the two models' vocabulary with the vocabulary
    in words (see `intersection_align_gensim` documentation).
    """
    # patch by Richard So (https://twitter.com/richardjeanso) (thanks!) to
    # update this code for new version of gensim
    base_embed.init_sims()
    other_embed.init_sims()
    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = intersection_align_gensim(
        base_embed, other_embed, words=words)
    # get the embedding matrices
    base_vecs = in_base_embed.wv.syn0norm
    other_vecs = in_other_embed.wv.syn0norm
    # just a matrix dot product with numpy
    m = other_vecs.T.dot(base_vecs)
    # SVD method from numpy
    u, _, v = np.linalg.svd(m)
    # another matrix operation
    ortho = u.dot(v)
    # Replace original array with modified one
    # i.e. multiplying the embedding matrix (syn0norm) by "ortho"
    other_embed.wv.syn0norm = other_embed.wv.syn0 = (
        other_embed.wv.syn0norm).dot(ortho)
    return other_embed
def intersection_align_gensim(m1, m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as list or set), then the vocabulary is intersected with
    this list as well.
    Indices are re-organized from 0..N in order of descending frequency (=sum
    of counts from both m1 and m2).
    These indices correspond to the new syn0 and syn0norm objects in both
    gensim models:
    -- so that Row 0 of m1.syn0 will be for the same word as Row 0 of m2.syn0
    -- you can find the index of any word on the .index2word list:
       model.index2word.index(word) => 2
    The .vocab dictionary is also updated for each model, preserving the count
    but updating the index.
    """
    # Get the vocab for each model
    vocab_m1 = set(m1.wv.vocab.keys())
    vocab_m2 = set(m2.wv.vocab.keys())
    # Find the common vocabulary
    common_vocab = vocab_m1 & vocab_m2
    if words:
        common_vocab &= set(words)
    # If no alignment necessary because vocab is identical...
    if not vocab_m1 - common_vocab and not vocab_m2 - common_vocab:
        return (m1, m2)
    # Otherwise sort by frequency (summed for both)
    common_vocab = list(common_vocab)
    common_vocab.sort(
        key=lambda w: m1.wv.vocab[w].count + m2.wv.vocab[w].count,
        reverse=True)
    # Then for each model...
    for m in [m1, m2]:
        # Replace old syn0norm array with new one (with common vocab)
        indices = [m.wv.vocab[w].index for w in common_vocab]
        old_arr = m.wv.syn0norm
        new_arr = np.array([old_arr[index] for index in indices])
        m.wv.syn0norm = m.wv.syn0 = new_arr
        # Replace old vocab dictionary with new one (with common vocab)
        # and old index2word with new one
        m.wv.index2word = common_vocab
        old_vocab = m.wv.vocab
        new_vocab = {}
        for new_index, word in enumerate(common_vocab):
            old_vocab_obj = old_vocab[word]
            new_vocab[word] = gensim.models.word2vec.Vocab(
                index=new_index, count=old_vocab_obj.count)
        m.wv.vocab = new_vocab
    return (m1, m2)
#load word2vec models
model_1950 = Word2Vec.load('wv_1950')
model_1960 = Word2Vec.load('wv_1960')
model_1970 = Word2Vec.load('wv_1970')
model_1980 = Word2Vec.load('wv_1980')
model_1990 = Word2Vec.load('wv_1990')
model_2000 = Word2Vec.load('wv_2000')
model_2010 = Word2Vec.load('wv_2010')
#align and save
aligned_model_1960 = smart_procrustes_align_gensim(model_1950, model_1960,
words=None)
aligned_model_1960.save('aligned_wv_1960')
aligned_model_1970 = smart_procrustes_align_gensim(aligned_model_1960,
model_1970, words=None)
aligned_model_1970.save('aligned_wv_1970')
aligned_model_1980 = smart_procrustes_align_gensim(aligned_model_1970,
model_1980, words=None)
aligned_model_1980.save('aligned_wv_1980')
aligned_model_1990 = smart_procrustes_align_gensim(aligned_model_1980,
model_1990, words=None)
aligned_model_1990.save('aligned_wv_1990')
aligned_model_2000 = smart_procrustes_align_gensim(aligned_model_1990,
model_2000, words=None)
aligned_model_2000.save('aligned_wv_2000')
aligned_model_2010 = smart_procrustes_align_gensim(aligned_model_2000,
model_2010, words=None)
aligned_model_2010.save('aligned_wv_2010')
As shown above, I used *'model_1950'* as my *base model* for the
*initial* alignment. After that, I align the models chronologically, each
one against the previously aligned model, as shown in the latter part of my
code (in black color).
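To make the alignment step itself concrete, here is a minimal sketch of what the three matrix lines in smart_procrustes_align_gensim do, using made-up random vectors instead of real models. The toy sizes, the random seed and all variable names other than m, u, v and ortho are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "base" embedding: 50 words, 5 dimensions, rows scaled to unit length.
base_vecs = rng.normal(size=(50, 5))
base_vecs /= np.linalg.norm(base_vecs, axis=1, keepdims=True)

# Build a random orthogonal matrix and rotate the base space with it, so the
# "other" embedding differs from the base only by that rotation.
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
other_vecs = base_vecs.dot(rotation)

# Same three steps as in smart_procrustes_align_gensim.
m = other_vecs.T.dot(base_vecs)
u, _, v = np.linalg.svd(m)
ortho = u.dot(v)

# Applying ortho to the rotated vectors should bring them back onto the base.
aligned_vecs = other_vecs.dot(ortho)
max_error = np.abs(aligned_vecs - base_vecs).max()
print(max_error)
```

Note that this only works when row i of both matrices belongs to the same word, which is why the vocabularies are intersected and re-indexed before the SVD.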
However, while searching for the word 'ctrx', I noticed that
*'aligned_model_1980', 'aligned_model_1990', 'aligned_model_2000',
'aligned_model_2010'* do not have it in their vocabularies, whereas
*'model_1950', 'aligned_model_1960', 'aligned_model_1970'* do have it.
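One way I could imagine this happening is sketched below with plain Python sets, no gensim involved. It assumes (I have not verified this) that 'ctrx' is simply missing from the original 1980 model, and it relies on the fact that intersection_align_gensim shrinks both models in place while each save() happens before the next intersection. The toy vocabularies and the helper name chained_alignment_snapshots are made up for illustration.

```python
# Hypothetical per-decade vocabularies; only the membership of 'ctrx' matters.
vocabs = {
    1950: {"the", "radio", "ctrx"},
    1960: {"the", "radio", "ctrx", "tv"},
    1970: {"the", "radio", "ctrx", "tv"},
    1980: {"the", "radio", "tv", "pc"},
    1990: {"the", "radio", "tv", "pc"},
}


def chained_alignment_snapshots(vocabs):
    """Mimic the align-then-save sequence, tracking vocabularies only."""
    years = sorted(vocabs)
    live = {year: set(words) for year, words in vocabs.items()}
    saved = {}
    for base_year, other_year in zip(years, years[1:]):
        common = live[base_year] & live[other_year]
        # The intersection step replaces the vocabulary of BOTH models.
        live[base_year] = set(common)
        live[other_year] = set(common)
        # Saving freezes whatever vocabulary the model has at this moment.
        saved[other_year] = set(common)
    return saved


saved = chained_alignment_snapshots(vocabs)
for year in sorted(saved):
    print(year, "ctrx" in saved[year], sorted(saved[year]))
```

In this toy run the 1960 and 1970 snapshots still contain 'ctrx' because they were saved before the 1980 vocabulary was intersected in, while every snapshot from 1980 onwards has lost it.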
The main reason for the alignment is so that we can compare the behaviour of
a word across different time periods. However, I am confused as to why the
vocabularies of my aligned models are not the same. Could you please let me
know how to resolve this issue?
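One option I am considering, as a rough sketch only, is to compute the vocabulary shared by all decades up front and pass it as the `words` argument, since the docstring says the two vocabularies are intersected with that list as well. The helper name shared_vocabulary and the toy vocabularies below are hypothetical.

```python
from functools import reduce


def shared_vocabulary(vocabularies):
    """Return the words present in every vocabulary (hypothetical helper)."""
    vocabularies = [set(v) for v in vocabularies]
    if not vocabularies:
        return set()
    return reduce(set.intersection, vocabularies)


# Toy stand-ins for the per-decade vocabularies.
decade_vocabs = [
    {"the", "radio", "ctrx"},
    {"the", "radio", "ctrx", "tv"},
    {"the", "radio", "tv", "pc"},
]
shared = shared_vocabulary(decade_vocabs)
print(sorted(shared))
```

With my real models the input would presumably be something like [set(m.wv.vocab) for m in models], and each call would then use words=shared. I assume a word such as 'ctrx' that is missing from any one decade would then be dropped from every aligned model, so all of them would at least end up with the same vocabulary.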
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.