Virashree Patel
2018-10-26 19:59:14 UTC
Hi,
I am pretty new to topic modeling and Gensim, so I am still trying to
understand many of the concepts. I am trying to run Gensim's LDA model on my
corpus, which contains 25,446,114 tweets. I created a streaming corpus
and an id2word dictionary using Gensim. I am using num_topics = 100 and
chunksize = 85000 (loading 85000 tweets at a time).
I am using:
Gensim: 3.5.0
NumPy: 1.15.3
Here is the link to the corpus and the id2word dictionary:
https://drive.google.com/drive/folders/1FrJ8gJbiDqp3VC5syOjRVcQPcESdYOYa?usp=sharing
I don't know what I am doing wrong or how to solve this. Please help!
Here are the errors I get:
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:1023: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:690: RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:700: RuntimeWarning: invalid value encountered in multiply
  sstats *= self.expElogbeta
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:690: RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:700: RuntimeWarning: invalid value encountered in multiply
  sstats *= self.expElogbeta
Process ForkPoolWorker-30:
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamulticore.py", line 333, in worker_e_step
    worker_lda.do_estep(chunk)  # TODO: auto-tune alpha?
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 725, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 662, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 287500 is out of bounds for axis 1 with size 287500
Here is the code I am running:
import pprint
import logging
import gensim
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)
corpus = gensim.corpora.MmCorpus('disasterTweets.mm')
id2word = gensim.corpora.Dictionary.load('disasterTweets.dict')
id2word.filter_tokens(bad_ids=[id2word.token2id['eofeofeof']])
print('eofeofeof' in id2word.token2id)
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       chunksize=85000,
                                       num_topics=100)
pprint.pprint(lda_model.print_topics())
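The IndexError says a term id of 287500 was looked up in a matrix with only 287500 columns (valid ids 0 to 287499), so one thing that may be worth checking is whether the corpus contains ids larger than the dictionary allows. This is only a guess at the cause: as far as I understand, filter_tokens() reassigns ids after removing a token, while the saved .mm corpus still uses the old ids. Below is a minimal sketch of such a check in plain Python. The helper names max_term_id and ids_fit_vocabulary are made up for illustration and are not part of Gensim, and the toy corpus stands in for the real data.

```python
def max_term_id(bow_corpus):
    """Return the largest term id seen in a stream of bag-of-words documents.

    Each document is an iterable of (term_id, count) pairs, which is the
    format a Gensim corpus yields. Returns -1 for an empty corpus.
    """
    largest = -1
    for doc in bow_corpus:
        for term_id, _count in doc:
            if term_id > largest:
                largest = term_id
    return largest


def ids_fit_vocabulary(bow_corpus, vocab_size):
    """True if every term id in the corpus is a valid column index."""
    return max_term_id(bow_corpus) < vocab_size


# Toy stand-in for the real data: three "tweets" as (term_id, count) pairs.
toy_corpus = [[(0, 1), (2, 1)], [(1, 2)], [(3, 1)]]

# Ids 0..3 fit a vocabulary of 4 terms, but not one of 3 terms.
print(ids_fit_vocabulary(toy_corpus, vocab_size=4))
print(ids_fit_vocabulary(toy_corpus, vocab_size=3))
```

With the real objects, the equivalent call would presumably be ids_fit_vocabulary(corpus, len(id2word)), which makes one full pass over the tweets. Comparing corpus.num_terms with len(id2word) should give a faster first hint, since MmCorpus reads that value from the file header.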
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.