Hi,
Yes, this is a very unfortunate problem that I'll be happy to fix.
Ok, so I double-checked that running in the virtual environment isn't
causing any problems. When I run outside the virtual environment I also get
26 processes, and nearly all of them are allocated to the same processor
(see the PSR column below):
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
[*snip*]
odemasi 61669 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61670 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61671 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61672 59981 0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61673 59981 0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61674 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61675 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61676 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61681 59981 0 2738764 9821680 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61682 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61683 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61684 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61685 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61686 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61687 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61688 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61689 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61694 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61698 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61699 59981 0 2738764 9821704 23 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61700 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61701 59981 0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61702 59981 0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi 61703 59981 0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
[*snip*]
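In case it's useful, this is roughly how I've been double-checking which cores the workers are allowed on. This is only a sketch: it assumes psutil is installed, and 59981 is just the parent PID from the listing above.

import psutil

# Walk the children of the parent RunLDA.py process (PID taken from the
# ps listing above) and print the cores each one is allowed to run on.
parent = psutil.Process(59981)
for child in parent.children(recursive=True):
    # cpu_affinity() with no argument returns the allowed-core list (Linux).
    print("%d -> %s" % (child.pid, child.cpu_affinity()))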
The standard output I'm getting is:
/home/odemasi/Packages/venv/lib/python2.6/site-packages/numpy/lib/utils.py:95: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
warnings.warn(depdoc, DeprecationWarning)
/home/odemasi/Packages/venv/lib/python2.6/site-packages/scipy/lib/_util.py:67: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
2015-06-25 03:36:38,835 : INFO : adding document #0 to Dictionary(0 unique
tokens: [])
2015-06-25 03:39:34,893 : INFO : built Dictionary(5060602 unique tokens:
[u'loyalsubscribers', u'iftheyclosedchipotleiddie',
u'\u666e\u6bb5\u306e\u53e3\u8abf\u3067\u4f55\u6ce3\u3044\u3066\u308b\u3093\u3067\u3059\u304b\u79c1\u306f\u3069\u3053\u306b\u3082\u884c\u304d\u307e\u305b\u3093\u304b\u3089\u5927\u4e08\u592b\u3067\u3059\u3092\u8a00\u3046',
u'deargodmakeatrade', u'billycorgan']...) from 1 documents (total 5060602
corpus positions)
2015-06-25 03:39:36,283 : INFO : using symmetric alpha at 0.01
2015-06-25 03:39:36,283 : INFO : using serial LDA version on this node
2015-06-25 03:42:20,479 : WARNING : input corpus stream has no len();
counting documents
2015-06-25 03:42:25,018 : INFO : running online LDA training, 100 topics, 1
passes over the supplied corpus of 100000 documents, updating every 48000
documents, evaluating every ~100000 documents, iterating 50x with a
convergence threshold of 0.001000
2015-06-25 03:42:25,018 : WARNING : too few updates, training might not
converge; consider increasing the number of passes or iterations to improve
accuracy
2015-06-25 03:42:25,023 : INFO : training LDA model using 24 processes
2015-06-25 03:42:27,407 : INFO : PROGRESS: pass 0, dispatched chunk #0 =
documents up to #2000/100000, outstanding queue size 1
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/queues.py", line 242, in _feed
send(obj)
SystemError: NULL result without error in PyObject_Call
2015-06-25 03:42:30,449 : INFO : PROGRESS: pass 0, dispatched chunk #1 =
documents up to #4000/100000, outstanding queue size 2
2015-06-25 03:42:30,612 : INFO : PROGRESS: pass 0, dispatched chunk #2 =
documents up to #6000/100000, outstanding queue size 3
2015-06-25 03:42:30,793 : INFO : PROGRESS: pass 0, dispatched chunk #3 =
documents up to #8000/100000, outstanding queue size 4
A little more about my application: each document is very small, and right
now I'm constraining training to 100,000 documents. It takes < 1 min to
load and stream through the data. I know that running with this little data
won't give me much of a performance gain, but until I can get it distributing
the work across cores I can't run with more data. The process has already
been running for 17 hours, which seems like a ridiculously long time for a
corpus that is only a few MB (the full 9 million documents is ~1.5 GB).
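For reference, the training call is essentially the stock LdaMulticore invocation. The following is only an illustrative sketch, not the exact RunLDA.py: the docs.txt file, doc_stream() helper and BowCorpus class are made up here, and the parameters simply mirror the log above.

import logging
from gensim import corpora, models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

def doc_stream():
    # Stand-in for the real document stream: one whitespace-tokenized
    # document per line of a text file (hypothetical file name).
    with open('docs.txt') as f:
        for line in f:
            yield line.lower().split()

dictionary = corpora.Dictionary(doc_stream())

class BowCorpus(object):
    # Streams bag-of-words vectors; it has no __len__, which is why gensim
    # logs "input corpus stream has no len(); counting documents".
    def __iter__(self):
        for doc in doc_stream():
            yield dictionary.doc2bow(doc)

lda = models.LdaMulticore(BowCorpus(), id2word=dictionary,
                          num_topics=100, passes=1,
                          chunksize=2000, workers=24)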
Any suggestions of what to check next?
Thanks!
Orianna
Post by Stephen Wu
Interesting, Orianna. My problem does reappear as well -- shutting down
processes and restarting them doesn't always work. Also, I suspect that
some of the methods may end up jumping on the same core later on in
processing? Could be totally wrong about that. Radim, is there
gensim-specific logging that you're looking for?
stephen
Post by o***@berkeley.edu
Hello,
I'm having the same problem and would also really appreciate some help.
Checking "ps -F -A | grep NameOfMyProgram" shows that Gensim is spawning
the correct number of processes by default, but that they are all on the
same processor (I'm on a 24 core Red Hat machine). I'm running inside a
virtual environment, but it looks like that shouldn't effect things and
when I launched from outside the virtual environment processes ran on 4
cores, which was better, but still not good. Note, I think I'm calling
Gensim correctly as it does distribute to the two cores on my laptop when I
run the same code there.
Any help or suggestions are really appreciated, as I'm not really sure
where to go from here.
Thanks.
Orianna
Post by Stephen Wu
Thanks for following up. I haven't actually gotten the training to work
in the end, so I'd welcome you looking at the issue!
I didn't see anything notable in INFO but unfortunately I don't have the
logs for LdaMulticore. I was running make_wiki simultaneously, though, and
it was trying to do everything on the same core that LdaMulticore was -- so
maybe there's something in that. The make_wiki process would have
completed but was just going really slow. Below is the fairly normal INFO
output of make_wiki, and where I cut it off.
stephen
2015-06-18 10:17:54,373 : INFO : adding document #2990000 to
Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'fi\u0250',
u'soestdijk', u'phintella']...)
2015-06-18 10:20:31,873 : INFO : discarding 37835 tokens: [(u'giravee',
1), (u'actuariesindia', 1), (u'wonho', 1), (u'nerdocrumbesia', 1),
(u'jidova', 1), (u'alfredomacias', 1), (u'ysa\u04f1e', 1), (u'saraldi', 1),
(u'belvilacqua', 1), (u'cargharay', 1)]...
2015-06-18 10:20:31,879 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 3000000 (=100.0%) documents
Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'fi\u0250',
u'soestdijk', u'phintella']...)
2015-06-18 10:20:43,940 : INFO : adding document #3000000 to
Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'fi\u0250',
u'soestdijk', u'phintella']...)^C
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/swu/trapit/research/.virt/lib/python2.7/site-packages/gensim/scripts/make_wiki.py", line 83, in <module>
wiki = WikiCorpus(inp, lemmatize=lemmatize) # takes about 9h on a macbook pro, for 3.5m articles (june 2011)
File "/home/swu/trapit/research/.virt/local/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 270, in __init__
self.dictionary = Dictionary(self.get_texts())
File "/home/swu/trapit/research/.virt/local/lib/python2.7/site-packages/gensim/corpora/dictionary.py", line 58, in __init__
self.add_documents(documents, prune_at=prune_at)
File "/home/swu/trapit/research/.virt/local/lib/python2.7/site-packages/gensim/corpora/dictionary.py", line 124, in add_documents
logger.info("adding document #%i to %s", docno, self)
File "/usr/lib/python2.7/logging/__init__.py", line 1140, in info
self._log(INFO, msg, args, **kwargs)
File "/usr/lib/python2.7/logging/__init__.py", line 1258, in _log
self.handle(record)
File "/usr/lib/python2.7/logging/__init__.py", line 1268, in handle
self.callHandlers(record)
File "/usr/lib/python2.7/logging/__init__.py", line 1308, in callHandlers
hdlr.handle(record)
File "/usr/lib/python2.7/logging/__init__.py", line 748, in handle
self.emit(record)
File "/usr/lib/python2.7/logging/__init__.py", line 867, in emit
stream.write(fs % msg)
KeyboardInterrupt
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in
_bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 85, in worker
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
racquire()
Post by Radim Řehůřek
Hello Stephen,
do you happen to have a log from when things didn't work (INFO level,
or preferably DEBUG)?
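The usual way to capture that is the standard Python logging setup; a small sketch below, where the log file name is arbitrary and DEBUG can be dropped back to INFO if it's too chatty.

import logging

# Route gensim's log output to a file at DEBUG level.
logging.basicConfig(filename='lda_debug.log',
                    format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.DEBUG)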
I'm thinking maybe one of the processes failed / died for some reason,
and the multiprocessing didn't recover. If that's the case, there should be
a stack trace in the log.
Just a wild hypothesis :)
Radim
Post by Stephen Wu
I killed the processes and reran them with no/minimal changes and
parallelization is working just fine. Unclear why, which is a bit
unsatisfying after several hours of digging.
Leading hypothesis: this was probably some OS-level thing, e.g.,
processes might have wanted to stay on the same processor to make use of
caches efficiently.
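If that's what's happening, one crude way to test it (a generic Linux workaround sketch, nothing gensim-specific; the mask is simply "all cores") is to clear the parent's affinity mask right after the heavy imports, before any workers are forked:

import multiprocessing
import os

# Reset this process's CPU affinity to all cores; workers forked later
# inherit the mask. Relies on the Linux `taskset` utility being on PATH.
mask = (1 << multiprocessing.cpu_count()) - 1
os.system("taskset -p 0x%x %d" % (mask, os.getpid()))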
stephen
Post by Stephen Wu
I'm running on a machine with 16 cores. LdaMulticore seems to
recognize that I have 16 cores and by default starts 16 workers. However,
all the workers are divvying up work on the same processor. So on my
900k-document corpus, this is taking a while.
I had a few hypotheses about why this was the case and talked to
others about some of these. So far, I don't think the culprit is any of these:
- I wrapped LdaMulticore in a custom scikit-learn estimator, and
this estimator does give real results after being trained.
- I am running on a 900k-doc corpus that sits in memory at about 10+GB
- I'm kicking it off within iPython within a screen session
- I've tested running a few other Python processes, and they all
use the same CPU (a minimal check like the sketch below). E.g., I'm trying to parse wikipedia using gensim, and
its worker(s) also use the same CPU.
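Something like the following sketch reproduces that check (illustrative only); run it while watching `ps -o pid,psr,cmd` or top's P column from another shell:

import multiprocessing
import os
import time

def burn(seconds):
    # Busy-loop so the scheduler has a reason to spread the workers out.
    deadline = time.time() + seconds
    n = 0
    while time.time() < deadline:
        n += 1
    return os.getpid()

if __name__ == '__main__':
    # Four CPU-bound workers; if affinity is broken they all show the
    # same PSR value in ps.
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(burn, [30] * 4))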
Any help appreciated.