Discussion:
[gensim:11830] trying to reproduce distributed lda tutorial, LdaModel hangs.
John-Paul Robinson
2018-11-28 18:54:03 UTC
Hi,

I'm trying to use gensim's LDA topic modeling in a project using the
Wikipedia data set as described in the tutorials. I have successfully done
the dictionary conversion and matrix build.

I'm now attempting the distributed LDA training. I'm able to get a Pyro
network running, and LDA sees it and appears to use it. However, my workers
don't appear to be doing any work, and LDA doesn't make any forward
progress. Here is the setup code and its output:

import logging
import gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
# load id->word mapping (the dictionary), one of the results of step 2 above
id2word = gensim.corpora.Dictionary.load_from_text('wiki_wordids.txt.bz2')
# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_tfidf.mm')
# mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2')  # use this if you compressed the TFIDF output
print(mm)
2018-11-28 12:33:19,828 : DEBUG : {'kw': {}, 'mode': 'rb', 'uri': 'wiki_wordids.txt.bz2'}
2018-11-28 12:33:20,390 : DEBUG : {'kw': {}, 'mode': 'rb', 'uri': 'wiki_tfidf.mm.index'}
2018-11-28 12:33:20,984 : INFO : loaded corpus index from wiki_tfidf.mm.index
2018-11-28 12:33:20,986 : INFO : initializing cython corpus reader from wiki_tfidf.mm
2018-11-28 12:33:20,986 : DEBUG : {'kw': {}, 'mode': 'rb', 'uri': 'wiki_tfidf.mm'}
2018-11-28 12:33:21,031 : INFO : accepted corpus with 4562950 documents, 100000 features, 720997289 non-zero entries

MmCorpus(4562950 documents, 100000 features, 720997289 non-zero entries)



Here's the next block of code, the call to LdaModel and its output.
# extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100,
                                      update_every=1, chunksize=10000, passes=1,
                                      distributed=True)


2018-11-28 12:33:25,295 : INFO : using symmetric alpha at 0.01
2018-11-28 12:33:25,296 : INFO : using symmetric eta at 0.01
2018-11-28 12:33:25,435 : DEBUG : looking for dispatcher at PYRO:***@127.0.0.1:34370
2018-11-28 12:33:53,932 : INFO : using distributed version with 20 workers
2018-11-28 12:33:55,211 : INFO : running online (single-pass) LDA training, 100 topics, 1 passes over the supplied corpus of 4562950 documents, updating model once every 200000 documents, evaluating perplexity every 2000000 documents, iterating 50x with a convergence threshold of 0.001000
2018-11-28 12:33:55,213 : INFO : initializing 20 workers
2018-11-28 12:34:09,246 : DEBUG : {'kw': {}, 'mode': 'rb', 'uri': 'wiki_tfidf.mm'}
2018-11-28 12:34:16,371 : INFO : PROGRESS: pass 0, dispatching documents up to #10000/4562950
2018-11-28 12:34:22,982 : INFO : PROGRESS: pass 0, dispatching documents up to #20000/4562950
2018-11-28 12:34:28,247 : INFO : PROGRESS: pass 0, dispatching documents up to #30000/4562950
2018-11-28 12:34:32,974 : INFO : PROGRESS: pass 0, dispatching documents up to #40000/4562950
2018-11-28 12:34:36,557 : INFO : PROGRESS: pass 0, dispatching documents up to #50000/4562950
2018-11-28 12:34:38,797 : INFO : PROGRESS: pass 0, dispatching documents up to #60000/4562950
2018-11-28 12:34:40,663 : INFO : PROGRESS: pass 0, dispatching documents up to #70000/4562950
2018-11-28 12:34:42,512 : INFO : PROGRESS: pass 0, dispatching documents up to #80000/4562950
2018-11-28 12:34:46,500 : INFO : PROGRESS: pass 0, dispatching documents up to #90000/4562950
2018-11-28 12:34:51,097 : INFO : PROGRESS: pass 0, dispatching documents up to #100000/4562950
2018-11-28 12:34:55,239 : INFO : PROGRESS: pass 0, dispatching documents up to #110000/4562950
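
The intervals reported in the 12:33:55 INFO line above can be reproduced from the arguments of the call. This is a minimal sketch, assuming the update interval is update_every * workers * chunksize and that eval_every was left at 10 (the post does not set it; the value is inferred from the 2000000 figure in the log). training_schedule is a hypothetical helper written for illustration, not a gensim function.

```python
import math


def training_schedule(num_docs, chunksize, num_workers, update_every, eval_every):
    """Reproduce the schedule figures printed in the log (hypothetical helper)."""
    return {
        # number of chunks the corpus is split into for one pass
        "chunks_per_pass": math.ceil(num_docs / chunksize),
        # documents consumed between two model updates
        "update_after": min(num_docs, update_every * num_workers * chunksize),
        # documents consumed between two perplexity evaluations
        "eval_after": min(num_docs, eval_every * num_workers * chunksize),
    }


# Values from the post: 4562950 documents, chunksize=10000, 20 workers,
# update_every=1. eval_every=10 is an assumption, see the note above.
schedule = training_schedule(4562950, 10000, 20, 1, 10)
print(schedule)
```

Under these assumptions the first model update is only due after 200000 documents, so the run stopped at #110000 before any update was expected.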

The code stalls after this last PROGRESS update. I also don't see any CPU load from my lda workers, so I suspect they aren't actually doing any work.
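
The point at which the output stops may itself be informative. The following is a minimal simulation, assuming the dispatcher keeps pending chunks in a bounded queue of 10 jobs and that each PROGRESS line is logged just before the chunk is queued. Both are assumptions about gensim's internals, not something the log shows. If no worker ever takes a job, the eleventh put would block, which would match the last line at #110000.

```python
import queue


def dispatches_logged_before_stall(max_queue_size, total_chunks):
    """Count the PROGRESS lines emitted before a put() would block.

    Models a producer that logs first and queues second, with no consumer
    ever removing a job from the queue.
    """
    jobs = queue.Queue(maxsize=max_queue_size)
    logged = 0
    for chunk_no in range(total_chunks):
        logged += 1  # the log line is written before the put
        try:
            jobs.put(chunk_no, block=False)
        except queue.Full:
            break  # a blocking put would hang at this point
    return logged


# Assumed queue size of 10; 457 chunks of 10000 documents cover the corpus.
stalled_after = dispatches_logged_before_stall(10, 457)
print(stalled_after * 10000)
```

If the assumptions hold, the stall is consistent with jobs piling up in the dispatcher and never being fetched, rather than with slow workers.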

I'm running this in a Jupyter notebook. Here's the map of my Pyro network:

[***@c0101 nb]$ python -m Pyro4.nsc list
--------START LIST
Pyro.NameServer --> PYRO:***@0.0.0.0:9090
metadata: ['class:Pyro4.naming.NameServer']
gensim.lda_dispatcher --> PYRO:***@127.0.0.1:34370
gensim.lda_worker.149ae3 --> PYRO:***@172.20.201.102:38727
gensim.lda_worker.16ed2e --> PYRO:***@172.20.201.103:45434
gensim.lda_worker.1c77f3 --> PYRO:***@172.20.201.102:38357
gensim.lda_worker.2029a --> PYRO:***@172.20.201.103:40133
gensim.lda_worker.3da37b --> PYRO:***@172.20.201.102:33853
gensim.lda_worker.54c23b --> PYRO:***@172.20.201.102:44257
gensim.lda_worker.5966ea --> PYRO:***@172.20.201.102:44724
gensim.lda_worker.7263f7 --> PYRO:***@172.20.201.103:33762
gensim.lda_worker.74787b --> PYRO:***@172.20.201.102:46607
gensim.lda_worker.796f77 --> PYRO:***@172.20.201.102:40852
gensim.lda_worker.7d1c63 --> PYRO:***@172.20.201.103:33614
gensim.lda_worker.7e7056 --> PYRO:***@172.20.201.103:43702
gensim.lda_worker.9338a8 --> PYRO:***@172.20.201.102:32828
gensim.lda_worker.9e4860 --> PYRO:***@172.20.201.103:34055
gensim.lda_worker.a649cd --> PYRO:***@172.20.201.103:46621
gensim.lda_worker.e6b4f1 --> PYRO:***@172.20.201.103:42194
gensim.lda_worker.ecd5dc --> PYRO:***@172.20.201.102:38420
gensim.lda_worker.fb2844 --> PYRO:***@172.20.201.103:33647
gensim.lda_worker.fb4a65 --> PYRO:***@172.20.201.102:35575
gensim.lda_worker.fe046f --> PYRO:***@172.20.201.103:43371
--------END LIST

Not sure how to move past this point. Any pointers appreciated.
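
One detail in the listing above may be worth checking: the dispatcher is registered at 127.0.0.1, while the workers are registered at 172.20.201.102 and 172.20.201.103. If the workers need to call back to the dispatcher over the network (an assumption, not something the output confirms), a loopback address would not be reachable from the other hosts. The sketch below only parses the listing text and flags loopback registrations. The helper names are hypothetical and it does not talk to Pyro4 at all.

```python
import ipaddress
import re

# Matches lines such as "gensim.lda_dispatcher --> PYRO:***@127.0.0.1:34370"
ENTRY = re.compile(r"^(?P<name>\S+)\s+-->\s+PYRO:\S*@(?P<host>[^:\s]+):(?P<port>\d+)$")


def parse_listing(text):
    """Return (name, host, port) tuples for every registration line."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries.append((match.group("name"), match.group("host"), int(match.group("port"))))
    return entries


def loopback_entries(entries):
    """Names of objects registered on a loopback address."""
    flagged = []
    for name, host, _port in entries:
        try:
            if ipaddress.ip_address(host).is_loopback:
                flagged.append(name)
        except ValueError:
            pass  # a hostname, not an IP literal; cannot be classified here
    return flagged


# A shortened copy of the listing from the post.
LISTING = """\
Pyro.NameServer --> PYRO:***@0.0.0.0:9090
gensim.lda_dispatcher --> PYRO:***@127.0.0.1:34370
gensim.lda_worker.149ae3 --> PYRO:***@172.20.201.102:38727
gensim.lda_worker.16ed2e --> PYRO:***@172.20.201.103:45434
"""

entries = parse_listing(LISTING)
print(loopback_entries(entries))
```

On the shortened listing this flags only gensim.lda_dispatcher. Whether that is the actual cause of the hang would still need to be confirmed on the cluster.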
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
John-Paul Robinson
2018-12-01 05:55:48 UTC
I have found gensim.models.LdaMulticore(), and it appears to be progressing
reasonably well with 20 workers. I will post a follow-up with the results.
omar mustafa
2018-12-06 08:39:05 UTC
Dear John-Paul Robinson,

I'm trying to do exactly what you have done, but surprisingly the
dictionary I load as id2word comes back with zero tokens. Would you please
help me? I know there is some confusion between HashDictionary and the real
Dictionary, but unfortunately I could not solve the issue.

import bz2
from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import Dictionary, MmCorpus, HashDictionary, WikiCorpus
from gensim.utils import lemmatize

DEFAULT_DICT_SIZE = 100000
keep_words = DEFAULT_DICT_SIZE
dictionary = HashDictionary(id_range=keep_words, debug=False)
dictionary.allow_update = True
corpus = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")

wikiids = WikiCorpus(corpus, lemmatize=lemmatize, dictionary=dictionary)
MmCorpus.serialize('/Wiki-LSI/wiki-corpus-6122018.mm', wikiids, progress_cnt=10000)
dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)

dictionary.save_as_text("/Wiki-LSI/wiki_en_wordids_6122018.txt.bz2")
wikiids.save('/Wiki-LSI/wiki_en_wordids_pkl_6122018.pkl.bz2')
dictionary.allow_update = False

dictionary = Dictionary.load_from_text('/Wiki-LSI/wiki_en_wordids_6122018.txt.bz2')

# I have been told that the issue is due to the real Dictionary and the
# HashDictionary that were used to extract the word<->id mapping.

print(dictionary)
Dictionary(0 unique tokens: []) !!!!!!!!!!!!!!!!!!!??????????
Regards

Omar