Discussion:
[gensim:11776] loading data from gensim API
Joao
2018-11-13 10:19:20 UTC
Permalink
Hello all,

I've started using Gensim's API and the following is taking an inordinate
amount of time to load: api.load("word2vec-google-news-300").
I was wondering whether these can be saved locally after loading. If so,
how do you load it from your local computer?

Best,
Joao
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Mueller, Mark-Christoph
2018-11-13 10:36:37 UTC
Permalink
Hi Joao,


I had the same problem, not just with the Google embeddings but with any set of word embeddings. We came up with a Python-based solution that uses lazy loading of individual word vectors, which speeds things up considerably, apart from some other advantages:


https://github.com/nlpAThits/WOMBAT


You need to import your resource first, which might still take some time, but things are much faster from then on.

You also need to have the resource in plain text format. There are scripts out there that do that for you (I can also provide one).

Converting the resource to plain text prior to import also allows you to do some filtering of the vocabulary: the GoogleNews embeddings contain a *huge* number of phrases that are not really meaningful, and that will never be used anyway unless you have a tokenizer that is aware of these phrases.
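
A minimal sketch of that filtering step, not the WOMBAT import itself. It assumes the plain-text word2vec layout (a header line with vocabulary size and dimensionality, then one `word v1 v2 ...` line per entry) and assumes that phrases can be recognised by the underscore joining their parts, as in `New_York`. The function name `drop_phrases` is made up for illustration.

```python
def drop_phrases(src_path, dst_path):
    """Copy a plain-text word2vec file, skipping entries whose token contains '_'.

    Returns the number of entries kept.
    """

    def is_phrase(line):
        # Assumption: an underscore in the token marks a multi-word phrase.
        return "_" in line.split(" ", 1)[0]

    # First pass: count the surviving entries, because the header of the
    # output file has to state the new vocabulary size.
    with open(src_path, encoding="utf-8") as src:
        dims = src.readline().split()[1]  # header is "<vocab_size> <dimensions>"
        kept = sum(1 for line in src if line.strip() and not is_phrase(line))

    # Second pass: write the new header, then stream the surviving lines
    # so that the whole file never has to fit in memory.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        src.readline()  # skip the old header
        dst.write(f"{kept} {dims}\n")
        for line in src:
            if line.strip() and not is_phrase(line):
                dst.write(line if line.endswith("\n") else line + "\n")
    return kept
```

The underscore test is deliberately crude: it also drops single tokens that happen to contain an underscore, so a real filter would probably want a more careful rule.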


Best,

Christoph


Mark-Christoph Müller

Research Associate

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany

phone +49 6221 533 238
fax +49 6221 533 298
email mark-***@h-its.org
http://www.h-its.org
_________________________________________________
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Gordon Mohr
2018-11-13 18:21:06 UTC
Permalink
This `api.load()` function always downloads the data to a temporary
location. The `downloader` module's `load()` function has an optional
`return_path` argument that, if True, simply downloads the dataset and
returns the path to where it was saved. See the docs:

https://radimrehurek.com/gensim/downloader.html#gensim.downloader.load

It's also common for people to look up somewhere the dataset is mirrored
and download it using a web browser. Then it can be loaded with the
`KeyedVectors.load_word2vec_format(filepath)` method.
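
The two routes above might look like this in practice. This is only a sketch: the helper names `download_once` and `load_local` are invented here, and `binary=True` assumes the file is in the binary word2vec format (as the GoogleNews release is); a plain-text file would need `binary=False`.

```python
def download_once(name="word2vec-google-news-300"):
    """Fetch the dataset if needed and return the path of the local file."""
    # Imported inside the function so that gensim is only required when called.
    import gensim.downloader as api

    # return_path=True skips building the model and just hands back the path.
    return api.load(name, return_path=True)


def load_local(path, binary=True):
    """Load word2vec-format vectors from a file that is already on disk."""
    from gensim.models import KeyedVectors

    # binary=True is an assumption about the file format, see the note above.
    return KeyedVectors.load_word2vec_format(path, binary=binary)
```

Typical use would be `path = download_once()` followed by `vectors = load_local(path)`; for a file fetched with a browser, pass its location to `load_local()` directly.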

Note that it's many gigabytes in size, both on disk and when loaded into
RAM, and when you start doing `most_similar()` lookups, memory use nearly
doubles because a unit-normalized copy of every vector is created. So on
low-memory machines (like 4GB-8GB), using the full set commonly triggers
memory swapping, which makes operations very slow (since every
`most_similar()` accesses the whole dataset).
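
As a rough back-of-the-envelope check of those sizes, assuming the commonly cited shape of this dataset (3 million entries of 300 dimensions) and 4-byte floats. Vocabulary bookkeeping is ignored, so real usage will be somewhat higher. The helper name `vectors_gb` is made up.

```python
def vectors_gb(n_words, dims, bytes_per_value=4):
    """Size of an n_words x dims float array, in GB (10**9 bytes)."""
    return n_words * dims * bytes_per_value / 1e9


raw = vectors_gb(3_000_000, 300)  # the raw vectors as loaded
with_normed = 2 * raw             # plus a same-sized unit-normalized copy

print(f"raw: {raw:.1f} GB, with normalized copy: {with_normed:.1f} GB")
```

Under these assumptions the raw vectors alone take about 3.6 GB and the normalized copy brings the total to about 7.2 GB, which is why a 4GB-8GB machine ends up swapping.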

If you *only* need `most_similar()` operations, and are OK working with
only the unit-normalized vectors, you can call
`model.init_sims(replace=True)` after it's loaded. That will discard the
original raw vectors, keeping only the unit-normalized vectors, saving
about half the memory.
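
To see why the raw vectors are not needed for similarity lookups, here is a small pure-Python illustration (not gensim's actual implementation): cosine similarity computed from the raw vectors equals the plain dot product of their unit-normalized versions. The helper names are invented for this example.

```python
import math


def unit_normalize(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [4.0, 3.0]
dot_of_units = sum(x * y for x, y in zip(unit_normalize(a), unit_normalize(b)))

# Same value either way, so the normalized vectors are enough for lookups.
assert math.isclose(dot_of_units, cosine(a, b))
```

What is lost is the original vector magnitudes, so this only suits uses that never need them. Later gensim releases changed how normalized vectors are handled, so check the documentation for your installed version before relying on `init_sims()`.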

As it contains millions of words, but most of the value is in the
most-frequent words, and those are listed first, you can also use the
optional `limit` parameter on `load_word2vec_format()`. For example,
loading with `limit=500000` loads only the first (most-frequently-seen)
500K words, saving about 5/6ths of the memory.
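
A sketch of a limited load, together with the arithmetic behind the 5/6 figure. The helper names are hypothetical, `binary=True` assumes the binary GoogleNews file, and the total of 3 million entries is the commonly cited size of that dataset.

```python
def load_most_frequent(path, limit=500_000, binary=True):
    """Load only the first `limit` entries of a word2vec-format file."""
    # Imported inside the function so that gensim is only required when called.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=binary, limit=limit)


def fraction_saved(limit, total=3_000_000):
    """Share of the vectors (and so roughly of the memory) that is not loaded."""
    return 1 - limit / total


print(f"{fraction_saved(500_000):.3f}")  # 1 - 1/6, i.e. about five sixths
```

Because the file is ordered by frequency, the entries that are skipped are the rarest ones.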

- Gordon