ziqi zhang
2018-08-23 15:45:52 UTC
Please excuse my lack of knowledge of gensim and document
embeddings, but I would like some help on the kind of hardware I need
in order to do the following task:
Corpus size:
- 1 TB compressed (plain .txt when uncompressed)
- 10 million docs
- # of tokens: unknown
- expected dimensions: 500
I learned from the thread 'Best strategy to train doc2vec on a huge
corpus' that I can roughly estimate a memory requirement of 10 million
vectors * 500 dimensions * 4 bytes per float = 20 GB.
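For what it's worth, this is the back-of-the-envelope calculation I am working from (document vectors only; I assume the vocabulary and word vectors need extra memory on top of this, which is part of what I am unsure about):

    # rough memory estimate for the 10M document vectors alone
    n_docs = 10_000_000
    vector_size = 500
    bytes_per_float = 4
    print(n_docs * vector_size * bytes_per_float / 1e9, "GB")  # -> 20.0 GB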
But what about disk space, CPU, and time?
- The corpus is 1 TB compressed; ideally I would like to avoid
uncompressing it, since it would inflate roughly 5-10x. Can gensim *handle
compressed text data*, e.g. read a .gz stream directly? (I have sketched
what I mean right after this list.) I am also really not sure how many
tokens/words there are in the data... I notice the literature mostly
discusses corpus size by number of tokens, not disk space, so I cannot
find a useful comparison there.
- I suppose gensim can take advantage of multiple cores. What would be a rough
*estimated time* for training on an 8-CPU node? (Or on GPUs? I am not
familiar with GPU models and their capacities.)
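To make the compressed-data question concrete, here is a rough sketch of how I imagine feeding the corpus to gensim, assuming one document per line in gzipped .txt files (the file paths, tagging scheme, and preprocessing are just placeholders):

    import glob
    import gzip

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess


    class GzipCorpus:
        """Stream TaggedDocuments straight out of gzipped text files,
        one document per line, without uncompressing to disk."""

        def __init__(self, pattern):
            self.paths = sorted(glob.glob(pattern))

        def __iter__(self):
            doc_id = 0
            for path in self.paths:
                with gzip.open(path, "rt", encoding="utf-8") as fh:
                    for line in fh:
                        yield TaggedDocument(simple_preprocess(line), [doc_id])
                        doc_id += 1

    # corpus = GzipCorpus("corpus/*.txt.gz")   # hypothetical path
    # model = Doc2Vec(vector_size=500, workers=8)
    # model.build_vocab(corpus)
    # model.train(corpus, total_examples=model.corpus_count, epochs=5)

Would something like this be workable performance-wise at this corpus size, or is decompressing on the fly going to be the bottleneck?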
And in general, what would be a recommended configuration to do this task
within 5 days?
Sorry if I have described the problem poorly... as I said, I am not very
familiar with doc2vec; I have only used word2vec on small corpora before.
I would really appreciate any suggestions, as well as questions that would
help me clarify things!
Thanks