Discussion:
[gensim:11479] Suggestion hardware config for training doc2vec on very large corpus
ziqi zhang
2018-08-23 15:45:52 UTC
Please excuse me for my lack of knowledge of gensim and document
embeddings, but I would like some help on the kind of hardware I need
in order to do the following task:

Corpus size:
- 1TB compressed (.txt format when uncompressed)
- 10 million docs
- # of tokens: unknown
- expected dimensions: 500

I learned from this thread: 'Best strategy to train doc2vec on a huge
corpus' that I can roughly estimate a memory requirement of 10 million
vectors * 500 dimensions * 4 bytes-per-float = 20 GB

But what about disk space, CPU, and time?

- The corpus is 1TB in size compressed; ideally I would like not to
uncompress it, as it will inflate 5-10x. Can gensim *handle
compressed text data*, like reading a .gz stream? I am really not sure how
many tokens/words there are in the data... I notice the literature mostly
discusses corpus size by number of tokens, not disk space, so I cannot
find a useful comparison on this...
- I suppose Gensim can take advantage of multiple cores. What would be a rough *estimated
time* for training on an 8-CPU node? (Or GPUs? I am not familiar with the
models and capacities of GPUs.)

And in general, what would be a recommended configuration to do this task
within 5 days?

Sorry if I have described the problem poorly... as said, I am not very
familiar with doc2vec; I have only used word2vec on small corpora before.
I would really appreciate suggestions, as well as questions that help me
clarify the problem!

Thanks
--
Gordon Mohr
2018-08-23 18:49:52 UTC
The 20GB RAM you've calculated would be just for the doc-vectors; the model
will also require more RAM based on its internal weights & the size of the
effective vocabulary (unique tokens), which will depend on your corpus and
`min_count` parameter.
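As a rough back-of-envelope sketch of that extra RAM (the 2-million-token
surviving vocabulary below is purely an assumed figure, not something derived
from your data):

    # Back-of-envelope RAM estimate; vocab_size is an assumption.
    doc_count = 10_000_000
    vocab_size = 2_000_000      # assumed unique tokens surviving min_count
    vector_size = 500

    doc_vectors_gb = doc_count * vector_size * 4 / 1e9           # ~20 GB
    word_weights_gb = vocab_size * vector_size * 4 * 2 / 1e9     # input + hidden layers, ~8 GB here
    print(doc_vectors_gb, word_weights_gb)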

Gensim Doc2Vec can stream from compressed files just fine, but if they
require significant preprocessing/tokenization, it's usually best for speed
to do that once, re-writing the corpus to a simple whitespace-delimited
source file (compressed if necessary), so that cost isn't paid redundantly
on each training pass. Also, if the corpus has an original ordering that clumps
related documents (same topics, words, etc.) together, it may be useful to
shuffle it once.
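For example, a minimal sketch of streaming such a pre-tokenized, shuffled,
compressed corpus (the filename and the one-document-per-line layout are
assumptions, not something your data already has):

    import gzip
    from gensim.models.doc2vec import TaggedDocument

    class GzipLineCorpus:
        """Stream TaggedDocuments from a gzip file holding one
        pre-tokenized, whitespace-delimited document per line."""
        def __init__(self, path):
            self.path = path
        def __iter__(self):
            with gzip.open(self.path, "rt", encoding="utf-8") as fin:
                for doc_id, line in enumerate(fin):
                    yield TaggedDocument(words=line.split(), tags=[doc_id])

    corpus = GzipLineCorpus("corpus_tokenized_shuffled.txt.gz")  # hypothetical path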

You should check a sample of the data to get a sense of document-sizes (in
tokens). If you are correct that it would inflate 5x, and if it were
English text with perhaps 1 token per 7 bytes, your numbers suggest each
document might be (5TB / 10 million =) 500KB and (500KB / 7 bytes ≈) 71,000 tokens.
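A quick way to check that guess against the real data is to count tokens in a
small sample; a sketch, assuming the hypothetical one-document-per-line file
from the earlier snippet:

    import gzip, itertools

    # Peek at the first few hundred documents to estimate tokens-per-document.
    with gzip.open("corpus_tokenized_shuffled.txt.gz", "rt", encoding="utf-8") as fin:
        lengths = [len(line.split()) for line in itertools.islice(fin, 500)]

    print("average tokens per document:", sum(lengths) / len(lengths))
    print("longest sampled document:   ", max(lengths))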

Such long documents would:

* exceed gensim's internal implementation limit of 10K tokens per document,
so you'd have to split them into smaller text examples (but by tagging
those examples with the same doc-tags, you can still learn a single vector
for the combined document - see the splitting sketch after this list)
* be much longer than the documents typically used with Doc2Vec - most work
seems to be with documents of a few dozen to a few thousand words
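A minimal sketch of that splitting, assuming you already have each document as
a list of tokens plus a unique tag for it:

    from gensim.models.doc2vec import TaggedDocument

    MAX_TOKENS = 10_000  # gensim's per-text limit

    def split_into_tagged_chunks(tokens, doc_tag, max_tokens=MAX_TOKENS):
        """Split one long document into <=10K-token pieces that all share
        the same tag, so a single doc-vector is learned for the document."""
        for start in range(0, len(tokens), max_tokens):
            yield TaggedDocument(words=tokens[start:start + max_tokens],
                                 tags=[doc_tag])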

Gensim can make use of multiple cores, but doesn't make use of GPUs.
However, current versions of gensim stop getting a training-throughput
benefit from more cores somewhere in the 3-12 range of worker-threads. The
exact optimal number will vary based on your system & training parameters –
so you can use trial-and-error with some alternate settings, checking what
logged throughput each reaches after a few minutes, and only
launch the full training once that's been determined.
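One way to run that trial-and-error, sketched against a small subsample
(the subsample size, worker counts, and parameters are illustrative; it
reuses the GzipLineCorpus iterator from the earlier sketch):

    import itertools, time
    from gensim.models.doc2vec import Doc2Vec

    subsample = list(itertools.islice(
        GzipLineCorpus("corpus_tokenized_shuffled.txt.gz"), 50_000))

    for workers in (4, 6, 8, 12):
        model = Doc2Vec(vector_size=500, min_count=5, workers=workers)
        model.build_vocab(subsample)
        start = time.time()
        model.train(subsample, total_examples=model.corpus_count, epochs=1)
        print("workers=%d: %.1fs per epoch" % (workers, time.time() - start))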

Overall speed depends on so many things (including parameters like
`min_count`, `sample`, `size`, `negative`, `window`, `dm`, etc.) that
there are no sure estimates, but you can start experimenting. The training
itself, once the initial vocabulary-scan finishes, won't require any more
memory, and its duration scales linearly with
corpus-size/training-passes. So starting a training run with logging on,
and observing its rate of progress after a few minutes, can give you a
fair estimate of the full time required.
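For example, a typical way to turn that logging on before training (standard
Python logging, nothing gensim-specific):

    import logging

    # INFO-level logging makes gensim print periodic progress/throughput
    # lines during the vocabulary scan and training, which you can use to
    # project the total run time.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )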

One of the most time-consuming and memory-intensive steps will be the
initial vocabulary-scan. It may take multiple tries/tweaks to get it to
succeed. But note that you *can* `model.save()` a model which has only
finished `build_vocab()` – then re-load that model, and directly tamper with
training parameters, to try different setups without having to pay the
vocabulary-scan cost again. (If you look inside `Doc2Vec.build_vocab()`'s source
and do its `scan_vocab()` step manually, you can even save at that point, then
tamper with `min_count`/`sample` before manually doing the later
prepare/etc. steps, to try alternate vocab settings without requiring a
re-scan.)
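A minimal sketch of the simpler save/re-load flow (paths and parameters are
illustrative, `corpus` is the streaming iterable from the earlier sketch, and
the deeper `scan_vocab()` tampering isn't shown):

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec(vector_size=500, min_count=5, sample=1e-5, workers=8)
    model.build_vocab(corpus)            # the slow one-time vocabulary scan
    model.save("d2v_vocab_only.model")   # checkpoint before any training

    # Later, for each experiment: reload the checkpoint and train.
    model = Doc2Vec.load("d2v_vocab_only.model")
    model.train(corpus, total_examples=model.corpus_count, epochs=10)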

So the next step to get better estimates & experience would be to start
experimenting, on a subset of the full data if necessary. You'd likely want
a machine with at least 8 cores, at least 64GB RAM, and if possible SSD(s)
for the tokenized corpus. Run tests with logging on, and take note of
where limits are hit and how the achievable vocabulary-scan and training
rates-of-progress vary with different options.

Having a process that works on this size of data, and completes in 5 days,
may be a big challenge that requires lots of tuning and
compromises/optimizations.

- Gordon
ziqi zhang
2018-08-23 19:31:43 UTC
Thank you so much for your extreme patience in writing such a detailed
explanation - it is more than the perfect answer I was looking for. I am
grateful!