Discussion:
[gensim:10855] Preparing the corpus freezes system
Axiombadger
2018-04-08 10:48:07 UTC
Hi,

I am just starting to use gensim and am having some issues with the
Wikipedia corpus.

Following the instructions here:

https://radimrehurek.com/gensim/wiki.html

I run the following:

python3.5 -m gensim.scripts.make_wiki /home/user/enwiki-latest-pages-articles.xml.bz2 /home/user/wiki

The program begins and I get output like:

2018-04-08 11:38:30,853 : INFO : running /home/user/.local/lib/python3.5/
site-packages/gensim/scripts/make_wiki.py /home/user/enwiki-latest-pages-
articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,936 : INFO : adding document #0 to Dictionary(0 unique
tokens: [])
2018-04-08 11:39:14,701 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['minikh', 'meteora', 'simbalist',
'burbano', 'aak']...)
2018-04-08 11:39:53,316 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['cerego', 'minikh', 'constantian',
'študovať', 'meteora']...)
2018-04-08 11:40:25,823 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['minikh', 'arisu', 'študovať', 'veitvet',
'djohor']...)
2018-04-08 11:40:55,901 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['glabrum', 'minikh', 'arisu', 'študovať',
'veitvet']...)
2018-04-08 11:41:19,130 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:32,992 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:45,127 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
2018-04-08 11:41:56,792 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)

It eventually reaches a point where it just freezes. I was using tmux to
drop in and out of the terminal, so I tried plugging a monitor into the
machine I am using as a server and running it from there directly, but the
system still locks up.

I am using Mint 18.3 with, as you can see, Python 3.5. I installed all of
the dependencies with pip using the --user flag, and I explicitly call python3.5.

When I run the same with enwiki-latest-pages-articles1.xml-p10p30302.bz2 (a
much smaller corpus) the task completes.

Is this just a RAM issue? I have 16GB and about 110GB free space on an
SSD. What would I need in order to run the above command?

I can use a smaller corpus; I only ask because this is the first command
listed in the above instructions and it fails. Has the dump growing from 8GB
at the time of writing to about 14GB now caused problems?

Where might I get logs for something crashing so unceremoniously?

Cheers.
--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Ivan Menshikh
2018-04-09 06:12:46 UTC
Hello,

it looks like you have enough resources for this command. Try watching what
happens to RAM/CPU at that moment using htop <https://hisham.hm/htop/> in
a different console.
Craig Thomson
2018-04-09 13:47:55 UTC
Thanks for the response.

I had top running a couple of the times it froze (not htop, although I have
switched to that now). In top, none of the 3-4 Python processes were above
3% RAM (and they possibly share some of that anyway?).

I have actually since had a hard system freeze when POS tagging with
spaCy. I added a forced garbage collection every 10k lines or so and now
it is fine (albeit taking hours, so I will need to wait to try gensim
again). spaCy was not close to running out of RAM either.
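For reference, the workaround looks roughly like this (a simplified sketch;
the `line.split()` call just stands in for the real tagging work, which I
have left out):

```python
import gc

def process_lines(lines, collect_every=10000):
    """Process lines one at a time, forcing a full GC sweep periodically."""
    collections = 0
    for i, line in enumerate(lines, start=1):
        _ = line.split()  # placeholder for the real POS-tagging work
        if i % collect_every == 0:
            gc.collect()  # force a full collection of all generations
            collections += 1
    return collections
```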

I am now running in a venv with Python 3.5.2 (Python is new to me; I come
from a Ruby, PHP and C++ background).

I will try to reproduce the freeze on my laptop with gensim on the same setup.

Is there some kind of setting to make Python more aggressive with garbage
collection, or am I barking up the wrong tree with that idea?
Ivan Menshikh
2018-04-10 03:06:59 UTC
Hi Craig,

about "more aggressive GC": try this
method: https://docs.python.org/2/library/gc.html#gc.set_threshold. I'm not
sure how useful it will be in this case, but feel free to try.
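For example (a rough sketch; the default thresholds are typically
(700, 10, 10), and lowering the first value makes generation-0 collections
run more often, trading CPU time for potentially lower peak memory):

```python
import gc

# Inspect the current thresholds; the first number is how far allocations
# may outpace deallocations before a generation-0 collection is triggered.
print(gc.get_threshold())

# Lower the generation-0 threshold so collections happen more frequently.
gc.set_threshold(100, 10, 10)
```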
Craig Thomson
2018-04-10 08:42:30 UTC
Thanks for the pointer; I will take a look at that as it may be useful
generally.

I had some POS tagging running, so I could not mess with my server until this
morning.

That is done, so I now do the following:

ssh into my server

# set up a tmux session
tmux

# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate

# run the conversion
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ~/outdir

I then set up some monitoring tools (also ssh in, then tmux). I ran:

- htop
- watch -n1 sensors
- watch -n10 df -lh

This was to keep an eye on disk space and to check for CPU overheating,
although the cores barely broke 70 degrees (CPU load holds at 80% on each
of the 4 cores) and there is plenty of disk space. Memory looks fine in
htop; it barely uses 1G (out of 16G).

It freezes after putting out the following output (there is more above,
obviously, but this is where it freezes). When this happens everything
stops: all the tmux sessions come down, and on my laptop I just get the last
readings from each monitor, which show the same CPU, RAM, temperature and
disk figures as before. Because it is crashing in such a way, I am not sure
how to get at any kind of actual error message.

2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu', 1),
('headstroke', 1), ('shawfielders', 1), ('sardisch', 1), ('luxsitpress',
1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1), ('pigita', 1),
('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
2018-04-10 09:31:47,379 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['quadricostate', 'vankulick', 'raniere', 'bakshi',
'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:08,284 : INFO : discarding 52395 tokens: [('willouby', 1),
('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1), ('tshogchungs', 1),
('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1), ('marketized', 1),
('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
2018-04-10 09:32:11,348 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['quadricostate', 'vankulick', 'raniere', 'bakshi',
'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
Ivan Menshikh
2018-04-11 04:19:38 UTC
Try running it in "detached" mode, like:

nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ~/outdir > log.log &

Now this doesn't block your console, keeps working after you disconnect, and
shouldn't affect your tmux session.
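Also, since the whole machine locks up, it is worth checking the kernel log
after a reboot for OOM-killer or hardware messages (a sketch; the log paths
are the usual Ubuntu/Mint defaults and may differ on your system):

```shell
# Look for OOM-killer activity in the current kernel ring buffer
dmesg | grep -i "out of memory"

# After a reboot, check the persisted logs (Ubuntu/Mint default locations)
grep -i "oom\|out of memory" /var/log/kern.log /var/log/syslog
```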
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be useful
generally.
I had some pos tagging running so could not mess with my server until this
morning.
ssh into my server
# setup a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
#
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir
- htop
- watch -n1 sensors
- watch -n10 def -lh
This was to keep an eye on HDD space and to check for CPU overheat
although the cores barely broke 70 degrees (cpu load holds at 80% on each
of the 4 cores) and there is plenty HDD space. Memory looks fine on htop,
it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more above
obviously but this is where it freezes). When this happens everything
stops, all the tmux sessions come down and on my laptop I just get the last
readings of each monitor which show the same CPU, RAM, temps and disk
space. Because it is crashing in such a way I am not sure how to get at
any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu', 1),
('headstroke', 1), ('shawfielders', 1), ('sardisch', 1), ('luxsitpress',
1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1), ('pigita', 1),
('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
2018-04-10 09:31:47,379 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['quadricostate', 'vankulick', 'raniere', 'bakshi',
'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:08,284 : INFO : discarding 52395 tokens: [('willouby',
1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1), ('tshogchungs', 1),
('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1), ('marketized', 1),
('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
2018-04-10 09:32:11,348 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['quadricostate', 'vankulick', 'raniere', 'bakshi',
'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
https://docs.python.org/2/library/gc.html#gc.set_threshold, I'm not sure
about the usefulness of this method in the current case, but feel free to
try.
Post by Craig Thomson
Thanks for the response.
I had top up a couple of the times it froze (not htop although I have
switched to that now). On top none of the 3-4 python processes were above
3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when pos tagging with
spaCy. I added a forced garbage collection every 10k lines or something
and now it is fine (albeit taking hours so I will need to wait to try
gensim again). SpaCy was not close to running out of RAM either.
I am now running in venv python 3.5.2 (python is new to me, Ruby, PHP
and C++ background).
I will try and freeze my laptop with gensim on the same setup.
Is there some kind of setting to make python more aggressive with
garbage collection or am I barking up the wrong tree with that idea?
Post by Ivan Menshikh
Hello,
looks like you have enough of resources for this command. Try to see
what happens with RAM/CPU at this moment using htop
<https://hisham.hm/htop/>in the different console.
Post by Axiombadger
Hi,
I am just starting to use gensim and am having some issues with the
wikipedia corpus.
https://radimrehurek.com/gensim/wiki.html
python3.5 -m gensim.scripts.make_wiki /home/user/enwiki-latest-pages-
articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,853 : INFO : running /home/user/.local/lib/python3
.5/site-packages/gensim/scripts/make_wiki.py /home/user/enwiki-latest-
pages-articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,936 : INFO : adding document #0 to Dictionary(0
unique tokens: [])
2018-04-08 11:39:14,701 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['minikh', 'meteora', 'simbalist',
'burbano', 'aak']...)
2018-04-08 11:39:53,316 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['cerego', 'minikh', 'constantian',
'študovať', 'meteora']...)
2018-04-08 11:40:25,823 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['minikh', 'arisu', 'študovať', 'veitvet',
'djohor']...)
2018-04-08 11:40:55,901 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['glabrum', 'minikh', 'arisu', 'študovať',
'veitvet']...)
2018-04-08 11:41:19,130 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:32,992 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:45,127 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
2018-04-08 11:41:56,792 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
It eventually reaches a point where it just freezes. I was using tmux
to drop in and out of the terminal, so I tried plugging a monitor into the
machine I am using as a server and running it from there, and the
system still locks up.
I am using Mint 18.3 with, as you can see, Python 3.5. I installed all
of the dependencies with pip and the --user flag, and I explicitly call
python3.5.
When I run the same command
with enwiki-latest-pages-articles1.xml-p10p30302.bz2 (a much smaller
corpus), the task completes.
Is this just a RAM issue? I have 16GB and about 110GB free space on
an SSD. What would I need in order to run the above command?
I can use a smaller corpus; I only ask because this is the first line of
code listed in the above instructions and it fails. Has the dump growing
from 8GB at the time of writing to about 14GB now caused problems?
Where might I get logs for something crashing so unceremoniously?
Cheers.
--
You received this message because you are subscribed to the Google
Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Craig Thomson
2018-04-11 08:56:35 UTC
Permalink
Thanks again,

I ran (within the python venv)

nohup python -m gensim.scripts.make_wiki
~/Development/corpus/downloads/enwiki-latest-pages-articles.xml.bz2
~/Development/corpus/output >log.log &

(I have been tidying up folders a bit, hence the slightly different paths.)

It crashes after varying amounts of time, both with and without tmux. When
it crashes in tmux it brings down every single tmux session with it.

I have nothing else at all on this system, so I can change distro, Python
environment, anything.

I cannot at the moment test the same thing on my laptop (also Mint 18.3,
with the same Python venv), as I have other work to do and am kind of
back-burnering this on the desktop, which is at home.

Watching this as much as I can in htop, it is still at 80% CPU per core and
about 600M - 1G of RAM at any given time.

log.log output (not on tmux) was:
2018-04-11 08:28:11,470 : INFO : running
/home/user/Development/python-env/lib/python3.5/site-packages/gensim/scripts/make_wiki.py
/home/user/Development/corpus/downloads/enwiki-latest-pages-articles.xml.bz2
/home/user/Development/corpus/output
2018-04-11 08:28:11,544 : INFO : adding document #0 to Dictionary(0 unique
tokens: [])
2018-04-11 08:28:53,148 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['sandez', 'brickyards', 'ettling',
'attis', 'jse']...)
2018-04-11 08:29:29,801 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['yemek', 'sandez', 'brickyards',
'ettling', 'attis']...)
2018-04-11 08:30:00,202 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['sandez', 'attis', 'jse', 'ejolts',
'skaphidia']...)
2018-04-11 08:30:28,594 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['sandez', 'attis', 'jse', 'ejolts',
'skaphidia']...)
2018-04-11 08:30:50,365 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:03,307 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:14,661 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:25,488 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:48,834 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:32:14,763 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['pentatonics', 'sandez', 'mizormac',
'attis', 'jse']...)
2018-04-11 08:32:39,305 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['pentatonics', 'svellnosbreen',
'sandez', 'mizormac', 'attis']...)
2018-04-11 08:33:02,130 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['pentatonics', 'svellnosbreen',
'sandez', 'mizormac', 'attis']...)
2018-04-11 08:33:23,947 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:33:46,905 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:34:08,194 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:31,281 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:51,961 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:12,260 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:30,444 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:49,295 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:09,197 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:31,133 : INFO : discarding 27140 tokens: [('talost', 1),
('zhizhuan', 1), ('trevuren', 1), ('callachan', 1), ('methylisation', 1),
('blacklo', 1), ('īshat', 1), ('ilahiyyat', 1), ('grīmekhalaṃ', 1),
('afferre', 1)]...
2018-04-11 08:36:31,133 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
2018-04-11 08:36:34,173 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts', 'hohnadel']...)
2018-04-11 08:36:34,241 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:55,451 : INFO : discarding 52425 tokens: [('xhjhs', 1),
('àŽ…àŽ•', 1), ('creekvale', 1), ('villavincie', 1), ('kurewen', 1),
('askamiciw', 1), ('askipiw', 1), ('manautou', 1), ('zichmini', 1),
('olenoides', 1)]...
2018-04-11 08:36:55,451 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
2018-04-11 08:36:58,468 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:36:58,541 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:19,504 : INFO : discarding 52345 tokens: [('προβούλευΌα',
1), ('adebanjos', 1), ('brajković', 1), ('sepn', 1), ('diastatops', 1),
('tamoshanters', 1), ('zumann', 1), ('тхайМОг', 1), ('rickhardt', 1),
('penceat', 1)]...
2018-04-11 08:37:19,504 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 240000 (=100.0%) documents
2018-04-11 08:37:22,698 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:22,772 : INFO : adding document #240000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:45,184 : INFO : discarding 42278 tokens: [('tizir', 1),
('jacupiranga', 1), ('Ќаєш', 1), ('ninkov', 1), ('chanuyot', 1), ('vođa',
1), ('zhǎnghǎi', 1), ('rpps', 1), ('domasław', 1), ('gaeege', 1)]...
2018-04-11 08:37:45,184 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 250000 (=100.0%) documents
2018-04-11 08:37:48,410 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:48,488 : INFO : adding document #250000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:11,016 : INFO : discarding 50985 tokens: [('zampognari',
1), ('doctorală', 1), ('trinominals', 1), ('mansenc', 1),
('globalresearch', 1), ('mansengou', 1), ('loughans', 1), ('busaeus', 1),
('hirsaugienses', 1), ('paralelă', 1)]...
2018-04-11 08:38:11,016 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 260000 (=100.0%) documents
2018-04-11 08:38:14,018 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:14,087 : INFO : adding document #260000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:36,723 : INFO : discarding 45170 tokens: [('kaikhah', 1),
('kohrausch', 1), ('levate', 1), ('pelagheia', 1), ('ehcr', 1), ('buccata',
1), ('taramov', 1), ('wauch', 1), ('eymer', 1), ('exradius', 1)]...
2018-04-11 08:38:36,724 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 270000 (=100.0%) documents
2018-04-11 08:38:39,635 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:39,704 : INFO : adding document #270000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:02,766 : INFO : discarding 50380 tokens: [('mbil', 1),
('λιΌΜίτης', 1), ('lovaart', 1), ('medabot', 1), ('tebirkes', 1), ('innjō',
1), ('issobel', 1), ('neeby', 1), ('τέΌπλος', 1), ('karlostachys', 1)]...
2018-04-11 08:39:02,766 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 280000 (=100.0%) documents
2018-04-11 08:39:05,693 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:05,760 : INFO : adding document #280000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:25,872 : INFO : discarding 52105 tokens: [('mbb_', 1),
('scioptric', 1), ('sashwo', 1), ('thirachai', 1), ('kharaillah', 1),
('dìnghǎilù', 1), ('taylormusic', 1), ('kongjiang', 1), ('vaughanmusic',
1), ('kòngjiānglù', 1)]...
2018-04-11 08:39:25,872 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 290000 (=100.0%) documents
2018-04-11 08:39:29,010 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:29,088 : INFO : adding document #290000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:48,652 : INFO : discarding 46673 tokens: [('goldsmithry',
1), ('fianṡruth', 1), ('flannacán', 1), ('schaid', 1), ('cayohoga', 1),
('rilasciata', 1), ('attancourt', 1), ('villefore', 1), ('jarsin', 1),
('qeyniy', 1)]...
2018-04-11 08:39:48,652 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 300000 (=100.0%) documents
2018-04-11 08:39:51,546 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:51,611 : INFO : adding document #300000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:11,566 : INFO : discarding 48892 tokens: [('pujadó', 1),
('tikalladislav', 1), ('hilsbach', 1), ('térygéza', 1), ('astly', 1),
('iiibes', 1), ('wartelle', 1), ('carmelito', 1), ('nosless', 1),
('vÀrldsspindeln', 1)]...
2018-04-11 08:40:11,566 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 310000 (=100.0%) documents
2018-04-11 08:40:14,717 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:14,795 : INFO : adding document #310000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:35,018 : INFO : discarding 50222 tokens: [('hinderlist',
1), ('tatenawate', 1), ('spofvenhielm', 1), ('vicarivs', 1), ('starenfelt',
1), ('神歊東埁', 1), ('МачальМОка', 1), ('𣋚𠉞', 1), ('âwaxsîdâr', 1),
('noterid', 1)]...
2018-04-11 08:40:35,018 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 320000 (=100.0%) documents
2018-04-11 08:40:37,915 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:37,980 : INFO : adding document #320000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:57,494 : INFO : discarding 42122 tokens: [('buahan', 1),
('olympisky', 1), ('ōio', 1), ('rasulova', 1), ('treesforlife', 1),
('倧井内芪王', 1), ('usubov', 1), ('blÃ¥dalsvatnet', 1), ('賀楜内芪王', 1),
('ghoraib', 1)]...
2018-04-11 08:40:57,494 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 330000 (=100.0%) documents
2018-04-11 08:41:00,425 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:00,496 : INFO : adding document #330000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:19,188 : INFO : discarding 38954 tokens: [('exoda', 1),
('黃䞖仲', 1), ('kriemelman', 1), ('mandeldrums', 1), ('hihihihi', 1), ('黃䌯思',
1), ('korostelyov', 1), ('skillometer', 1), ('lachrymology', 1),
('processid', 1)]...
2018-04-11 08:41:19,188 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 340000 (=100.0%) documents
2018-04-11 08:41:22,068 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:22,132 : INFO : adding document #340000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:41,704 : INFO : discarding 39759 tokens: [('úlfheðinn',
1), ('neustrasia', 1), ('magdelone', 1), ('hillopathes', 1), ('mexbol', 1),
('tirtonadi', 1), ('batizocoi', 1), ('triwindhu', 1), ('mejiso', 1),
('namerō', 1)]...
2018-04-11 08:41:41,704 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 350000 (=100.0%) documents
2018-04-11 08:41:44,641 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:44,712 : INFO : adding document #350000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:42:03,259 : INFO : discarding 38180 tokens: [('ffyr', 1),
('cösitzer', 1), ('crefft', 1), ('duppenbecker', 1), ('tenguzame', 1),
('dniestrem', 1), ('geronte', 1), ('serpari', 1), ('応汗州郜督府郜督', 1),
('техеМ', 1)]...
2018-04-11 08:42:03,259 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 360000 (=100.0%) documents
2018-04-11 08:42:06,137 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:06,202 : INFO : adding document #360000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:24,380 : INFO : discarding 36319 tokens: [('willingii',
1), ('woodmaniorum', 1), ('wildford', 1), ('canibungan', 1),
('arsenophonus', 1), ('nonhigh', 1), ('mineralocortoid', 1), ('kinek', 1),
('pakipasa', 1), ('mondjam', 1)]...
2018-04-11 08:42:24,381 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 370000 (=100.0%) documents
2018-04-11 08:42:27,324 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:27,395 : INFO : adding document #370000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:45,949 : INFO : discarding 41031 tokens: [('miamua', 1),
('transliterature', 1), ('thlen', 1), ('appeariq', 1), ('山鎫', 1),
('deeondeeup', 1), ('yadate', 1), ('travia', 1), ('ruakapanga', 1),
('краљу', 1)]...
2018-04-11 08:42:45,949 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 380000 (=100.0%) documents
2018-04-11 08:42:48,818 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:48,883 : INFO : adding document #380000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:09,037 : INFO : discarding 38122 tokens: [('megacrammer',
1), ('horroh', 1), ('nasrud', 1), ('proteax', 1), ('cultutes', 1),
('gammarotettix', 1), ('discopolis', 1), ('endophilic', 1), ('caliology',
1), ('mohni', 1)]...
2018-04-11 08:43:09,037 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 390000 (=100.0%) documents
2018-04-11 08:43:11,973 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:12,044 : INFO : adding document #390000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:30,113 : INFO : discarding 39509 tokens: [('clangourous',
1), ('sulfabid', 1), ('faddul', 1), ('oenephes', 1), ('sulmeprim', 1),
('mawaqif', 1), ('talactoferrin', 1), ('talaglumetad', 1), ('contrate', 1),
('πrad', 1)]...
2018-04-11 08:43:30,114 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 400000 (=100.0%) documents
2018-04-11 08:43:32,996 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:33,061 : INFO : adding document #400000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:51,657 : INFO : discarding 42423 tokens: [('medeswael',
1), ('wohnungs', 1), ('yŏnghŭng', 1), ('annebella', 1), ('dimmig', 1),
('gosdendiana', 1), ('iwasakisara', 1), ('allānâ', 1), ('teuflische', 1),
('polymelos', 1)]...
2018-04-11 08:43:51,657 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 410000 (=100.0%) documents
2018-04-11 08:43:54,749 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:54,827 : INFO : adding document #410000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:13,198 : INFO : discarding 42790 tokens:
[('bowersflybabycf', 1), ('skyote', 1), ('apathēs', 1), ('bolcon', 1),
('きらきらアフロ', 1), ('polysylabi', 1), ('kurtziella', 1), ('waringinkurung',
1), ('pyrgeometers', 1), ('badekuren', 1)]...
2018-04-11 08:44:13,198 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 420000 (=100.0%) documents
2018-04-11 08:44:16,074 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:16,139 : INFO : adding document #420000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:34,615 : INFO : discarding 39372 tokens: [('wasserdruck',
1), ('銬蟌', 1), ('sambiranoensis', 1), ('jouyaku', 1), ('aggregometry', 1),
('σጡς', 1), ('neovolcanica', 1), ('laternen', 1), ('victorianforts', 1),
('landfox', 1)]...
2018-04-11 08:44:34,615 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 430000 (=100.0%) documents
2018-04-11 08:44:37,639 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:37,714 : INFO : adding document #430000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:56,405 : INFO : discarding 44144 tokens: [('ດໃຈ', 1),
('πόλι', 1), ('chlíodhna', 1), ('francisquine', 1), ('postmemory', 1),
('bildetelegraph', 1), ('abdih', 1), ('kbfw', 1), ('kcbo', 1), ('kfmp',
1)]...
2018-04-11 08:44:56,405 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 440000 (=100.0%) documents
2018-04-11 08:44:59,457 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:59,527 : INFO : adding document #440000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:17,958 : INFO : discarding 37538 tokens: [('vrschighland',
1), ('xvictory', 1), ('semprill', 1), ('bordesi', 1), ('kokkim', 1),
('batzil', 1), ('kirix', 1), ('hersovits', 1), ('dtlgr', 1), ('bexi', 1)]...
2018-04-11 08:45:17,959 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 450000 (=100.0%) documents
2018-04-11 08:45:20,902 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:45:20,973 : INFO : adding document #450000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:39,269 : INFO : discarding 42590 tokens: [('achinsky', 1),
('景行倩皇四十䞉幎', 1), ('wmgx', 1), ('vetriera', 1), ('誉屋別皇子', 1),
('bemilleralbert', 1), ('coxfred', 1), ('gilburgtom', 1), ('takakiirihime',
1), ('greeneron', 1)]...
2018-04-11 08:45:39,269 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 460000 (=100.0%) documents
2018-04-11 08:45:42,131 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:45:42,196 : INFO : adding document #460000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
Post by Ivan Menshikh
Try to run it in "detached" mode like
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir >log.log&
This doesn't block your console, keeps working after you disconnect,
and shouldn't affect your tmux session.
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be useful
generally.
I had some pos tagging running so could not mess with my server until
this morning.
ssh into my server
# set up a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
# run make_wiki
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir
- htop
- watch -n1 sensors
- watch -n10 df -lh
This was to keep an eye on HDD space and to check for CPU overheating,
although the cores barely broke 70 degrees (CPU load holds at 80% on each
of the 4 cores) and there is plenty of HDD space. Memory looks fine in
htop; it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more above,
obviously, but this is where it freezes). When this happens everything
stops: all the tmux sessions come down, and on my laptop I just get the
last readings of each monitor, which show the same CPU, RAM, temps and
disk space. Because it crashes in such a way, I am not sure how to get
at any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu', 1),
('headstroke', 1), ('shawfielders', 1), ('sardisch', 1), ('luxsitpress',
1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1), ('pigita', 1),
('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
2018-04-10 09:31:47,379 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['quadricostate', 'vankulick', 'raniere', 'bakshi',
'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:08,284 : INFO : discarding 52395 tokens: [('willouby',
1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1), ('tshogchungs', 1),
('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1), ('marketized', 1),
('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
2018-04-10 09:32:11,348 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['quadricostate', 'vankulick', 'raniere', 'bakshi',
'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
about "more aggressive GC", try this method: https://docs.python.or
g/2/library/gc.html#gc.set_threshold, I'm not sure about the
usefulness of this method in the current case, but feel free to try.
Post by Craig Thomson
Thanks for the response.
I had top up a couple of the times it froze (not htop, although I have
switched to that now). In top, none of the 3-4 Python processes were above
3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when POS tagging with
spaCy. I added a forced garbage collection every 10k lines or so, and now
it is fine (albeit taking hours, so I will need to wait to try gensim
again). spaCy was not close to running out of RAM either.
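[A minimal sketch of that periodic-collection workaround; the per-line work here is a placeholder, not the actual spaCy tagging code:]

```python
import gc

def process_lines(lines, collect_every=10_000):
    """Run some per-line work, forcing a full GC sweep every `collect_every` lines."""
    processed = 0
    for line in lines:
        _ = line.strip()  # placeholder for the real work (e.g. POS tagging)
        processed += 1
        if processed % collect_every == 0:
            gc.collect()  # force a full collection of all generations
    return processed

print(process_lines(["one ", "two", "three"], collect_every=2))  # prints 3
```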
I am now running in a venv with Python 3.5.2 (Python is new to me; I come
from a Ruby, PHP and C++ background).
I will try and freeze my laptop with gensim on the same setup.
Is there some kind of setting to make python more aggressive with
garbage collection or am I barking up the wrong tree with that idea?
Ivan Menshikh
2018-04-12 04:54:49 UTC
Permalink
Very strange. About another distribution - of course you can try, but I see
no reason for it here (because this part is pure Python). I am puzzled.
Try running it without tmux (ssh to the machine + nohup with ">logfile.log&")
and watch the logfile and CPU/RAM. If this reproduces, ssh to the
machine again and check whether the process is still running (and again the
logfile, CPU, RAM).
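One way to keep a record that survives a hard freeze is to periodically log the target process's resident memory to a file. A minimal Linux-only sketch (my own illustration, not part of gensim; reads /proc, so it will not work elsewhere):

```python
import os


def rss_kb(pid):
    """Return a process's resident set size in kB, read from /proc (Linux only)."""
    with open("/proc/%d/status" % pid) as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports the value in kB
    return 0


# Example: sample our own process; in practice you would pass the PID of
# the make_wiki process and append the reading to a logfile on each sample.
print(rss_kb(os.getpid()))
```

Run it from a loop (or `watch`) and the last line written before the lock-up tells you whether memory was actually climbing.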
Post by Craig Thomson
Thanks again,
I ran (within the python venv)
nohup python -m gensim.scripts.make_wiki
~/Development/corpus/downloads/enwiki-latest-pages-articles.xml.bz2
~/Development/corpus/output >log.log &
(I have been tidying up folders a bit hence the slightly different paths).
It crashes after varying amounts of time both with and without tmux. When
it crashes in tmux it brings down every single tmux session with it.
I have nothing else at all on this system so I can change distro, python
environment, anything.
I cannot at the moment test the same thing on my laptop (also Mint 18.3,
with the same python venv) as I have other work to do and am kind of
back-burning this on the desktop which is at home.
Watching this as much as I can in htop, it is still 80% CPU per core, and
about 600M - 1G of RAM at any given time.
2018-04-11 08:28:11,470 : INFO : running
/home/user/Development/python-env/lib/python3.5/site-packages/gensim/scripts/make_wiki.py
/home/user/Development/corpus/downloads/enwiki-latest-pages-articles.xml.bz2
/home/user/Development/corpus/output
2018-04-11 08:28:11,544 : INFO : adding document #0 to Dictionary(0 unique
tokens: [])
2018-04-11 08:28:53,148 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['sandez', 'brickyards', 'ettling',
'attis', 'jse']...)
2018-04-11 08:29:29,801 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['yemek', 'sandez', 'brickyards',
'ettling', 'attis']...)
2018-04-11 08:30:00,202 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['sandez', 'attis', 'jse', 'ejolts',
'skaphidia']...)
2018-04-11 08:30:28,594 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['sandez', 'attis', 'jse', 'ejolts',
'skaphidia']...)
2018-04-11 08:30:50,365 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:03,307 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:14,661 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:25,488 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:48,834 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:32:14,763 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['pentatonics', 'sandez', 'mizormac',
'attis', 'jse']...)
2018-04-11 08:32:39,305 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['pentatonics', 'svellnosbreen',
'sandez', 'mizormac', 'attis']...)
2018-04-11 08:33:02,130 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['pentatonics', 'svellnosbreen',
'sandez', 'mizormac', 'attis']...)
2018-04-11 08:33:23,947 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:33:46,905 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:34:08,194 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:31,281 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:51,961 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:12,260 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:30,444 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:49,295 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:09,197 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:31,133 : INFO : discarding 27140 tokens: [('talost', 1),
('zhizhuan', 1), ('trevuren', 1), ('callachan', 1), ('methylisation', 1),
('blacklo', 1), ('īshat', 1), ('ilahiyyat', 1), ('grīmekhalaṃ', 1),
('afferre', 1)]...
2018-04-11 08:36:31,133 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
2018-04-11 08:36:34,173 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts', 'hohnadel']...)
2018-04-11 08:36:34,241 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:55,451 : INFO : discarding 52425 tokens: [('xhjhs', 1),
('àŽ…àŽ•', 1), ('creekvale', 1), ('villavincie', 1), ('kurewen', 1),
('askamiciw', 1), ('askipiw', 1), ('manautou', 1), ('zichmini', 1),
('olenoides', 1)]...
2018-04-11 08:36:55,451 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
2018-04-11 08:36:58,468 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:36:58,541 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:19,504 : INFO : discarding 52345 tokens: [('προβούλευΌα',
1), ('adebanjos', 1), ('brajković', 1), ('sepn', 1), ('diastatops', 1),
('tamoshanters', 1), ('zumann', 1), ('тхайМОг', 1), ('rickhardt', 1),
('penceat', 1)]...
2018-04-11 08:37:19,504 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 240000 (=100.0%) documents
2018-04-11 08:37:22,698 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:22,772 : INFO : adding document #240000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:45,184 : INFO : discarding 42278 tokens: [('tizir', 1),
('jacupiranga', 1), ('Ќаєш', 1), ('ninkov', 1), ('chanuyot', 1), ('vođa',
1), ('zhǎnghǎi', 1), ('rpps', 1), ('domasław', 1), ('gaeege', 1)]...
2018-04-11 08:37:45,184 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 250000 (=100.0%) documents
2018-04-11 08:37:48,410 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:48,488 : INFO : adding document #250000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:11,016 : INFO : discarding 50985 tokens: [('zampognari',
1), ('doctorală', 1), ('trinominals', 1), ('mansenc', 1),
('globalresearch', 1), ('mansengou', 1), ('loughans', 1), ('busaeus', 1),
('hirsaugienses', 1), ('paralelă', 1)]...
2018-04-11 08:38:11,016 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 260000 (=100.0%) documents
2018-04-11 08:38:14,018 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:14,087 : INFO : adding document #260000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:36,723 : INFO : discarding 45170 tokens: [('kaikhah', 1),
('kohrausch', 1), ('levate', 1), ('pelagheia', 1), ('ehcr', 1), ('buccata',
1), ('taramov', 1), ('wauch', 1), ('eymer', 1), ('exradius', 1)]...
2018-04-11 08:38:36,724 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 270000 (=100.0%) documents
2018-04-11 08:38:39,635 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:39,704 : INFO : adding document #270000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:02,766 : INFO : discarding 50380 tokens: [('mbil', 1),
('λιΌΜίτης', 1), ('lovaart', 1), ('medabot', 1), ('tebirkes', 1), ('innjō',
1), ('issobel', 1), ('neeby', 1), ('τέΌπλος', 1), ('karlostachys', 1)]...
2018-04-11 08:39:02,766 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 280000 (=100.0%) documents
2018-04-11 08:39:05,693 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:05,760 : INFO : adding document #280000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:25,872 : INFO : discarding 52105 tokens: [('mbb_', 1),
('scioptric', 1), ('sashwo', 1), ('thirachai', 1), ('kharaillah', 1),
('dìnghǎilù', 1), ('taylormusic', 1), ('kongjiang', 1), ('vaughanmusic',
1), ('kòngjiānglù', 1)]...
2018-04-11 08:39:25,872 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 290000 (=100.0%) documents
2018-04-11 08:39:29,010 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:29,088 : INFO : adding document #290000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:48,652 : INFO : discarding 46673 tokens: [('goldsmithry',
1), ('fianṡruth', 1), ('flannacán', 1), ('schaid', 1), ('cayohoga', 1),
('rilasciata', 1), ('attancourt', 1), ('villefore', 1), ('jarsin', 1),
('qeyniy', 1)]...
2018-04-11 08:39:48,652 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 300000 (=100.0%) documents
2018-04-11 08:39:51,546 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:51,611 : INFO : adding document #300000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:11,566 : INFO : discarding 48892 tokens: [('pujadó', 1),
('tikalladislav', 1), ('hilsbach', 1), ('térygéza', 1), ('astly', 1),
('iiibes', 1), ('wartelle', 1), ('carmelito', 1), ('nosless', 1),
('vÀrldsspindeln', 1)]...
2018-04-11 08:40:11,566 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 310000 (=100.0%) documents
2018-04-11 08:40:14,717 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:14,795 : INFO : adding document #310000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:35,018 : INFO : discarding 50222 tokens: [('hinderlist',
1), ('tatenawate', 1), ('spofvenhielm', 1), ('vicarivs', 1), ('starenfelt',
1), ('神歊東埁', 1), ('МачальМОка', 1), ('𣋚𠉞', 1), ('âwaxsîdâr', 1),
('noterid', 1)]...
2018-04-11 08:40:35,018 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 320000 (=100.0%) documents
2018-04-11 08:40:37,915 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:37,980 : INFO : adding document #320000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:57,494 : INFO : discarding 42122 tokens: [('buahan', 1),
('olympisky', 1), ('ōio', 1), ('rasulova', 1), ('treesforlife', 1),
('倧井内芪王', 1), ('usubov', 1), ('blÃ¥dalsvatnet', 1), ('賀楜内芪王', 1),
('ghoraib', 1)]...
2018-04-11 08:40:57,494 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 330000 (=100.0%) documents
2018-04-11 08:41:00,425 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:00,496 : INFO : adding document #330000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:19,188 : INFO : discarding 38954 tokens: [('exoda', 1),
('黃䞖仲', 1), ('kriemelman', 1), ('mandeldrums', 1), ('hihihihi', 1), ('黃䌯思',
1), ('korostelyov', 1), ('skillometer', 1), ('lachrymology', 1),
('processid', 1)]...
2018-04-11 08:41:19,188 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 340000 (=100.0%) documents
2018-04-11 08:41:22,068 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:22,132 : INFO : adding document #340000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:41,704 : INFO : discarding 39759 tokens: [('úlfheðinn',
1), ('neustrasia', 1), ('magdelone', 1), ('hillopathes', 1), ('mexbol', 1),
('tirtonadi', 1), ('batizocoi', 1), ('triwindhu', 1), ('mejiso', 1),
('namerō', 1)]...
2018-04-11 08:41:41,704 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 350000 (=100.0%) documents
2018-04-11 08:41:44,641 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:44,712 : INFO : adding document #350000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:42:03,259 : INFO : discarding 38180 tokens: [('ffyr', 1),
('cösitzer', 1), ('crefft', 1), ('duppenbecker', 1), ('tenguzame', 1),
('dniestrem', 1), ('geronte', 1), ('serpari', 1), ('応汗州郜督府郜督', 1),
('техеМ', 1)]...
2018-04-11 08:42:03,259 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 360000 (=100.0%) documents
2018-04-11 08:42:06,137 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:06,202 : INFO : adding document #360000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:24,380 : INFO : discarding 36319 tokens: [('willingii',
1), ('woodmaniorum', 1), ('wildford', 1), ('canibungan', 1),
('arsenophonus', 1), ('nonhigh', 1), ('mineralocortoid', 1), ('kinek', 1),
('pakipasa', 1), ('mondjam', 1)]...
2018-04-11 08:42:24,381 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 370000 (=100.0%) documents
2018-04-11 08:42:27,324 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:27,395 : INFO : adding document #370000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:45,949 : INFO : discarding 41031 tokens: [('miamua', 1),
('transliterature', 1), ('thlen', 1), ('appeariq', 1), ('山鎫', 1),
('deeondeeup', 1), ('yadate', 1), ('travia', 1), ('ruakapanga', 1),
('краљу', 1)]...
2018-04-11 08:42:45,949 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 380000 (=100.0%) documents
2018-04-11 08:42:48,818 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:48,883 : INFO : adding document #380000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:09,037 : INFO : discarding 38122 tokens: [('megacrammer',
1), ('horroh', 1), ('nasrud', 1), ('proteax', 1), ('cultutes', 1),
('gammarotettix', 1), ('discopolis', 1), ('endophilic', 1), ('caliology',
1), ('mohni', 1)]...
2018-04-11 08:43:09,037 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 390000 (=100.0%) documents
2018-04-11 08:43:11,973 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:12,044 : INFO : adding document #390000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:30,113 : INFO : discarding 39509 tokens: [('clangourous',
1), ('sulfabid', 1), ('faddul', 1), ('oenephes', 1), ('sulmeprim', 1),
('mawaqif', 1), ('talactoferrin', 1), ('talaglumetad', 1), ('contrate', 1),
('πrad', 1)]...
2018-04-11 08:43:30,114 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 400000 (=100.0%) documents
2018-04-11 08:43:32,996 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:33,061 : INFO : adding document #400000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:51,657 : INFO : discarding 42423 tokens: [('medeswael',
1), ('wohnungs', 1), ('yŏnghŭng', 1), ('annebella', 1), ('dimmig', 1),
('gosdendiana', 1), ('iwasakisara', 1), ('allānâ', 1), ('teuflische', 1),
('polymelos', 1)]...
2018-04-11 08:43:51,657 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 410000 (=100.0%) documents
2018-04-11 08:43:54,749 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:54,827 : INFO : adding document #410000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('bowersflybabycf', 1), ('skyote', 1), ('apathēs', 1), ('bolcon', 1),
('きらきらアフロ', 1), ('polysylabi', 1), ('kurtziella', 1), ('waringinkurung',
1), ('pyrgeometers', 1), ('badekuren', 1)]...
2018-04-11 08:44:13,198 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 420000 (=100.0%) documents
2018-04-11 08:44:16,074 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:16,139 : INFO : adding document #420000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:34,615 : INFO : discarding 39372 tokens: [('wasserdruck',
1), ('銬蟌', 1), ('sambiranoensis', 1), ('jouyaku', 1), ('aggregometry', 1),
('σጡς', 1), ('neovolcanica', 1), ('laternen', 1), ('victorianforts', 1),
('landfox', 1)]...
2018-04-11 08:44:34,615 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 430000 (=100.0%) documents
2018-04-11 08:44:37,639 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:37,714 : INFO : adding document #430000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:56,405 : INFO : discarding 44144 tokens: [('ດໃຈ', 1),
('πόλι', 1), ('chlíodhna', 1), ('francisquine', 1), ('postmemory', 1),
('bildetelegraph', 1), ('abdih', 1), ('kbfw', 1), ('kcbo', 1), ('kfmp',
1)]...
2018-04-11 08:44:56,405 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 440000 (=100.0%) documents
2018-04-11 08:44:59,457 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:59,527 : INFO : adding document #440000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('vrschighland', 1), ('xvictory', 1), ('semprill', 1), ('bordesi', 1),
('kokkim', 1), ('batzil', 1), ('kirix', 1), ('hersovits', 1), ('dtlgr', 1),
('bexi', 1)]...
2018-04-11 08:45:17,959 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 450000 (=100.0%) documents
2018-04-11 08:45:20,902 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:45:20,973 : INFO : adding document #450000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:39,269 : INFO : discarding 42590 tokens: [('achinsky',
1), ('景行倩皇四十䞉幎', 1), ('wmgx', 1), ('vetriera', 1), ('誉屋別皇子', 1),
('bemilleralbert', 1), ('coxfred', 1), ('gilburgtom', 1), ('takakiirihime',
1), ('greeneron', 1)]...
2018-04-11 08:45:39,269 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 460000 (=100.0%) documents
2018-04-11 08:45:42,131 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:45:42,196 : INFO : adding document #460000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
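For reference, the "discarding N tokens / keeping 2000000 tokens" lines above are the dictionary's periodic pruning: once the vocabulary exceeds a cap, the rarest tokens are dropped. A rough pure-Python sketch of that policy (a simplification of what gensim's Dictionary does internally via its prune_at parameter, not the actual implementation):

```python
from collections import Counter


def prune(counts, keep_n=2000000):
    """Drop the rarest tokens so at most keep_n remain (rough sketch of
    the pruning gensim's Dictionary performs when it hits prune_at)."""
    kept = dict(counts.most_common(keep_n))
    discarded = len(counts) - len(kept)
    return Counter(kept), discarded


# Toy example: keep the 2 most frequent tokens, discard the hapaxes.
counts = Counter({"the": 100, "of": 80, "hapax1": 1, "hapax2": 1})
pruned, n_discarded = prune(counts, keep_n=2)
```

This is why RAM stays bounded (~600M-1G here) even on the full dump: the dictionary never holds more than 2,000,000 tokens at once.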
On Wed, Apr 11, 2018 at 5:19 AM, Ivan Menshikh <
Post by Ivan Menshikh
Try to run it in "detached" mode like
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml
.bz2 ~/outdir >log.log&
This way it doesn't block your console, keeps working after you disconnect,
and shouldn't affect your tmux session.
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be useful
generally.
I had some pos tagging running so could not mess with my server until
this morning.
ssh into my server
# setup a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
#
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir
- htop
- watch -n1 sensors
- watch -n10 df -lh
This was to keep an eye on HDD space and to check for CPU overheating,
although the cores barely broke 70°C (CPU load holds at 80% on each
of the 4 cores) and there is plenty of HDD space. Memory looks fine in htop;
it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more above,
obviously, but this is where it freezes). When this happens everything
stops: all the tmux sessions come down, and on my laptop I just get the last
readings of each monitor, which show the same CPU, RAM, temps and disk
space. Because it crashes in such a way, I am not sure how to get at
any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu', 1),
('headstroke', 1), ('shawfielders', 1), ('sardisch', 1), ('luxsitpress',
1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1), ('pigita', 1),
('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:08,284 : INFO : discarding 52395 tokens: [('willouby',
1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1), ('tshogchungs', 1),
('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1), ('marketized', 1),
('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
See https://docs.python.org/2/library/gc.html#gc.set_threshold. I'm not
sure this method is useful in the current case, but feel free
to try.
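For what it's worth, tuning the thresholds looks like this (just an illustration of the gc API; whether it helps with this freeze is unclear):

```python
import gc

# Thresholds for the three GC generations; CPython's default is (700, 10, 10).
print(gc.get_threshold())

# Lowering generation 0's threshold makes collections run more often,
# trading CPU time for more aggressive reclamation.
gc.set_threshold(350, 10, 10)
```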
Post by Craig Thomson
Thanks for the response.
I had top up a couple of the times it froze (not htop, although I have
switched to that now). In top, none of the 3-4 python processes were above
3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when pos tagging with
spaCy. I added a forced garbage collection every 10k lines or something
and now it is fine (albeit taking hours so I will need to wait to try
gensim again). SpaCy was not close to running out of RAM either.
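A minimal sketch of that workaround (my own illustration of the pattern described, assuming a simple line-processing loop; not the actual spaCy code):

```python
import gc


def process_lines(lines, every=10000):
    """Process an iterable of lines, forcing a full GC sweep every
    `every` lines (the workaround described above)."""
    total = 0
    for i, line in enumerate(lines, 1):
        total += len(line.split())  # stand-in for the real per-line work
        if i % every == 0:
            gc.collect()  # force a full collection pass
    return total
```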
I am now running in venv python 3.5.2 (python is new to me, Ruby, PHP
and C++ background).
I will try and freeze my laptop with gensim on the same setup.
Is there some kind of setting to make python more aggressive with
garbage collection or am I barking up the wrong tree with that idea?
On Mon, 9 Apr 2018 at 07:12, Ivan Menshikh <
Post by Ivan Menshikh
Hello,
Looks like you have enough resources for this command. Try watching
what happens with RAM/CPU at that moment using htop
<https://hisham.hm/htop/> in a different console.
Craig Thomson
2018-04-13 08:35:59 UTC
Permalink
Thanks again,

I see no reason why the distro should matter either; clutching at straws. I
will try to replicate on my laptop.

To answer your question above: it crashes without tmux too. The processes are
not left running.

RAM and CPU just look like they do while the thing is running (except the
terminal is frozen).

The log is like the above, just INFO output, and it does not crash at the same
place in the corpus each time.

I ran this time:

nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
./output >logfile.log&
'nightriders']...)
2018-04-11 08:33:46,905 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:34:08,194 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:31,281 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:51,961 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:12,260 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:30,444 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:49,295 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:09,197 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:31,133 : INFO : discarding 27140 tokens: [('talost', 1),
('zhizhuan', 1), ('trevuren', 1), ('callachan', 1), ('methylisation', 1),
('blacklo', 1), ('īshat', 1), ('ilahiyyat', 1), ('grīmekhalaṃ', 1),
('afferre', 1)]...
2018-04-11 08:36:31,133 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
2018-04-11 08:36:34,173 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts', 'hohnadel']...)
2018-04-11 08:36:34,241 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:55,451 : INFO : discarding 52425 tokens: [('xhjhs', 1),
('àŽ…àŽ•', 1), ('creekvale', 1), ('villavincie', 1), ('kurewen', 1),
('askamiciw', 1), ('askipiw', 1), ('manautou', 1), ('zichmini', 1),
('olenoides', 1)]...
2018-04-11 08:36:55,451 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
2018-04-11 08:36:58,468 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:36:58,541 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('προβούλευΌα', 1), ('adebanjos', 1), ('brajković', 1), ('sepn', 1),
('diastatops', 1), ('tamoshanters', 1), ('zumann', 1), ('тхайМОг', 1),
('rickhardt', 1), ('penceat', 1)]...
2018-04-11 08:37:19,504 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 240000 (=100.0%) documents
2018-04-11 08:37:22,698 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:22,772 : INFO : adding document #240000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:45,184 : INFO : discarding 42278 tokens: [('tizir', 1),
('jacupiranga', 1), ('Ќаєш', 1), ('ninkov', 1), ('chanuyot', 1), ('vođa',
1), ('zhǎnghǎi', 1), ('rpps', 1), ('domasław', 1), ('gaeege', 1)]...
2018-04-11 08:37:45,184 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 250000 (=100.0%) documents
2018-04-11 08:37:48,410 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:48,488 : INFO : adding document #250000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:11,016 : INFO : discarding 50985 tokens: [('zampognari',
1), ('doctorală', 1), ('trinominals', 1), ('mansenc', 1),
('globalresearch', 1), ('mansengou', 1), ('loughans', 1), ('busaeus', 1),
('hirsaugienses', 1), ('paralelă', 1)]...
2018-04-11 08:38:11,016 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 260000 (=100.0%) documents
2018-04-11 08:38:14,018 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:14,087 : INFO : adding document #260000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:36,723 : INFO : discarding 45170 tokens: [('kaikhah',
1), ('kohrausch', 1), ('levate', 1), ('pelagheia', 1), ('ehcr', 1),
('buccata', 1), ('taramov', 1), ('wauch', 1), ('eymer', 1), ('exradius',
1)]...
2018-04-11 08:38:36,724 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 270000 (=100.0%) documents
2018-04-11 08:38:39,635 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:39,704 : INFO : adding document #270000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:02,766 : INFO : discarding 50380 tokens: [('mbil', 1),
('λιΌΜίτης', 1), ('lovaart', 1), ('medabot', 1), ('tebirkes', 1), ('innjō',
1), ('issobel', 1), ('neeby', 1), ('τέΌπλος', 1), ('karlostachys', 1)]...
2018-04-11 08:39:02,766 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 280000 (=100.0%) documents
2018-04-11 08:39:05,693 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:05,760 : INFO : adding document #280000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:25,872 : INFO : discarding 52105 tokens: [('mbb_', 1),
('scioptric', 1), ('sashwo', 1), ('thirachai', 1), ('kharaillah', 1),
('dìnghǎilù', 1), ('taylormusic', 1), ('kongjiang', 1), ('vaughanmusic',
1), ('kòngjiānglù', 1)]...
2018-04-11 08:39:25,872 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 290000 (=100.0%) documents
2018-04-11 08:39:29,010 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:29,088 : INFO : adding document #290000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('goldsmithry', 1), ('fianṡruth', 1), ('flannacán', 1), ('schaid', 1),
('cayohoga', 1), ('rilasciata', 1), ('attancourt', 1), ('villefore', 1),
('jarsin', 1), ('qeyniy', 1)]...
2018-04-11 08:39:48,652 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 300000 (=100.0%) documents
2018-04-11 08:39:51,546 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:51,611 : INFO : adding document #300000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:11,566 : INFO : discarding 48892 tokens: [('pujadó', 1),
('tikalladislav', 1), ('hilsbach', 1), ('térygéza', 1), ('astly', 1),
('iiibes', 1), ('wartelle', 1), ('carmelito', 1), ('nosless', 1),
('vÀrldsspindeln', 1)]...
2018-04-11 08:40:11,566 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 310000 (=100.0%) documents
2018-04-11 08:40:14,717 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:14,795 : INFO : adding document #310000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:35,018 : INFO : discarding 50222 tokens: [('hinderlist',
1), ('tatenawate', 1), ('spofvenhielm', 1), ('vicarivs', 1), ('starenfelt',
1), ('神歊東埁', 1), ('МачальМОка', 1), ('𣋚𠉞', 1), ('âwaxsîdâr', 1),
('noterid', 1)]...
2018-04-11 08:40:35,018 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 320000 (=100.0%) documents
2018-04-11 08:40:37,915 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:37,980 : INFO : adding document #320000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:57,494 : INFO : discarding 42122 tokens: [('buahan', 1),
('olympisky', 1), ('ōio', 1), ('rasulova', 1), ('treesforlife', 1),
('倧井内芪王', 1), ('usubov', 1), ('blÃ¥dalsvatnet', 1), ('賀楜内芪王', 1),
('ghoraib', 1)]...
2018-04-11 08:40:57,494 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 330000 (=100.0%) documents
2018-04-11 08:41:00,425 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:00,496 : INFO : adding document #330000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:19,188 : INFO : discarding 38954 tokens: [('exoda', 1),
('黃䞖仲', 1), ('kriemelman', 1), ('mandeldrums', 1), ('hihihihi', 1), ('黃䌯思',
1), ('korostelyov', 1), ('skillometer', 1), ('lachrymology', 1),
('processid', 1)]...
2018-04-11 08:41:19,188 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 340000 (=100.0%) documents
2018-04-11 08:41:22,068 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:22,132 : INFO : adding document #340000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:41,704 : INFO : discarding 39759 tokens: [('úlfheðinn',
1), ('neustrasia', 1), ('magdelone', 1), ('hillopathes', 1), ('mexbol', 1),
('tirtonadi', 1), ('batizocoi', 1), ('triwindhu', 1), ('mejiso', 1),
('namerō', 1)]...
2018-04-11 08:41:41,704 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 350000 (=100.0%) documents
2018-04-11 08:41:44,641 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:44,712 : INFO : adding document #350000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:42:03,259 : INFO : discarding 38180 tokens: [('ffyr', 1),
('cösitzer', 1), ('crefft', 1), ('duppenbecker', 1), ('tenguzame', 1),
('dniestrem', 1), ('geronte', 1), ('serpari', 1), ('応汗州郜督府郜督', 1),
('техеМ', 1)]...
2018-04-11 08:42:03,259 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 360000 (=100.0%) documents
2018-04-11 08:42:06,137 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:06,202 : INFO : adding document #360000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:24,380 : INFO : discarding 36319 tokens: [('willingii',
1), ('woodmaniorum', 1), ('wildford', 1), ('canibungan', 1),
('arsenophonus', 1), ('nonhigh', 1), ('mineralocortoid', 1), ('kinek', 1),
('pakipasa', 1), ('mondjam', 1)]...
2018-04-11 08:42:24,381 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 370000 (=100.0%) documents
2018-04-11 08:42:27,324 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:27,395 : INFO : adding document #370000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:45,949 : INFO : discarding 41031 tokens: [('miamua', 1),
('transliterature', 1), ('thlen', 1), ('appeariq', 1), ('山鎫', 1),
('deeondeeup', 1), ('yadate', 1), ('travia', 1), ('ruakapanga', 1),
('краљу', 1)]...
2018-04-11 08:42:45,949 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 380000 (=100.0%) documents
2018-04-11 08:42:48,818 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:42:48,883 : INFO : adding document #380000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('megacrammer', 1), ('horroh', 1), ('nasrud', 1), ('proteax', 1),
('cultutes', 1), ('gammarotettix', 1), ('discopolis', 1), ('endophilic',
1), ('caliology', 1), ('mohni', 1)]...
2018-04-11 08:43:09,037 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 390000 (=100.0%) documents
2018-04-11 08:43:11,973 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:12,044 : INFO : adding document #390000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('clangourous', 1), ('sulfabid', 1), ('faddul', 1), ('oenephes', 1),
('sulmeprim', 1), ('mawaqif', 1), ('talactoferrin', 1), ('talaglumetad',
1), ('contrate', 1), ('πrad', 1)]...
2018-04-11 08:43:30,114 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 400000 (=100.0%) documents
2018-04-11 08:43:32,996 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:33,061 : INFO : adding document #400000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:51,657 : INFO : discarding 42423 tokens: [('medeswael',
1), ('wohnungs', 1), ('yŏnghŭng', 1), ('annebella', 1), ('dimmig', 1),
('gosdendiana', 1), ('iwasakisara', 1), ('allānâ', 1), ('teuflische', 1),
('polymelos', 1)]...
2018-04-11 08:43:51,657 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 410000 (=100.0%) documents
2018-04-11 08:43:54,749 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:43:54,827 : INFO : adding document #410000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('bowersflybabycf', 1), ('skyote', 1), ('apathēs', 1), ('bolcon', 1),
('きらきらアフロ', 1), ('polysylabi', 1), ('kurtziella', 1), ('waringinkurung',
1), ('pyrgeometers', 1), ('badekuren', 1)]...
2018-04-11 08:44:13,198 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 420000 (=100.0%) documents
2018-04-11 08:44:16,074 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:16,139 : INFO : adding document #420000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('wasserdruck', 1), ('銬蟌', 1), ('sambiranoensis', 1), ('jouyaku', 1),
('aggregometry', 1), ('σጡς', 1), ('neovolcanica', 1), ('laternen', 1),
('victorianforts', 1), ('landfox', 1)]...
2018-04-11 08:44:34,615 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 430000 (=100.0%) documents
2018-04-11 08:44:37,639 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:37,714 : INFO : adding document #430000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:56,405 : INFO : discarding 44144 tokens: [('ດໃຈ', 1),
('πόλι', 1), ('chlíodhna', 1), ('francisquine', 1), ('postmemory', 1),
('bildetelegraph', 1), ('abdih', 1), ('kbfw', 1), ('kcbo', 1), ('kfmp',
1)]...
2018-04-11 08:44:56,405 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 440000 (=100.0%) documents
2018-04-11 08:44:59,457 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:44:59,527 : INFO : adding document #440000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('vrschighland', 1), ('xvictory', 1), ('semprill', 1), ('bordesi', 1),
('kokkim', 1), ('batzil', 1), ('kirix', 1), ('hersovits', 1), ('dtlgr', 1),
('bexi', 1)]...
2018-04-11 08:45:17,959 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 450000 (=100.0%) documents
2018-04-11 08:45:20,902 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:45:20,973 : INFO : adding document #450000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:39,269 : INFO : discarding 42590 tokens: [('achinsky',
1), ('景行倩皇四十䞉幎', 1), ('wmgx', 1), ('vetriera', 1), ('誉屋別皇子', 1),
('bemilleralbert', 1), ('coxfred', 1), ('gilburgtom', 1), ('takakiirihime',
1), ('greeneron', 1)]...
2018-04-11 08:45:39,269 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 460000 (=100.0%) documents
2018-04-11 08:45:42,131 : INFO : resulting dictionary: Dictionary(2000000
unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse', 'izg']...)
2018-04-11 08:45:42,196 : INFO : adding document #460000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
On Wed, Apr 11, 2018 at 5:19 AM, Ivan Menshikh <
Post by Craig Thomson
Try to run it in "detached" mode like
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir >log.log&
This way, it doesn't block your console, keeps working after you disconnect,
and shouldn't affect your tmux session.
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be useful
generally.
I had some pos tagging running so could not mess with my server until
this morning.
ssh into my server
# setup a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
#
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir
- htop
- watch -n1 sensors
- watch -n10 df -lh
This was to keep an eye on HDD space and to check for CPU overheat
although the cores barely broke 70 degrees (cpu load holds at 80% on each
of the 4 cores) and there is plenty HDD space. Memory looks fine on htop,
it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more above
obviously but this is where it freezes). When this happens everything
stops, all the tmux sessions come down and on my laptop I just get the last
readings of each monitor which show the same CPU, RAM, temps and disk
space. Because it is crashing in such a way I am not sure how to get at
any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu',
1), ('headstroke', 1), ('shawfielders', 1), ('sardisch', 1),
('luxsitpress', 1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1),
('pigita', 1), ('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:08,284 : INFO : discarding 52395 tokens: [('willouby',
1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1), ('tshogchungs', 1),
('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1), ('marketized', 1),
('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
about "more aggressive GC", try this method:
https://docs.python.org/2/library/gc.html#gc.set_threshold. I'm not sure
about the usefulness of this method in the current case, but feel free to try.
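For reference, a minimal sketch of what tuning those thresholds looks like; the values below are illustrative, not recommendations:

```python
import gc

# CPython's default generation thresholds are typically (700, 10, 10).
print(gc.get_threshold())

# Lower thresholds trigger collections more often -- a more "aggressive"
# GC, at the cost of extra CPU time spent collecting.
gc.set_threshold(100, 5, 5)
```

Whether this helps at all depends on the process actually leaking cyclic garbage, which is not established here.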
Post by Craig Thomson
Thanks for the response.
I had top up a couple of the times it froze (not htop, although I have
switched to that now). In top, none of the 3-4 python processes were above
3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when pos tagging with
spaCy. I added a forced garbage collection every 10k lines or something
and now it is fine (albeit taking hours so I will need to wait to try
gensim again). SpaCy was not close to running out of RAM either.
I am now running in a venv with Python 3.5.2 (Python is new to me; I come
from a Ruby, PHP and C++ background).
I will try and freeze my laptop with gensim on the same setup.
Is there some kind of setting to make python more aggressive with
garbage collection or am I barking up the wrong tree with that idea?
On Mon, 9 Apr 2018 at 07:12, Ivan Menshikh <
Post by Ivan Menshikh
Hello,
looks like you have enough of resources for this command. Try to see
what happens with RAM/CPU at this moment using htop
<https://hisham.hm/htop/>in the different console.
Post by Axiombadger
Hi,
I am just starting to use gensim and am having some issues with the
wikipedia corpus.
https://radimrehurek.com/gensim/wiki.html
python3.5 -m gensim.scripts.make_wiki /home/user/enwiki-latest-
pages-articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,853 : INFO : running /home/user/.local/lib/
python3.5/site-packages/gensim/scripts/make_wiki.py /home/user/
enwiki-latest-pages-articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,936 : INFO : adding document #0 to
Dictionary(0 unique tokens: [])
2018-04-08 11:39:14,701 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['minikh', 'meteora', 'simbalist',
'burbano', 'aak']...)
2018-04-08 11:39:53,316 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['cerego', 'minikh', 'constantian',
'študovať', 'meteora']...)
2018-04-08 11:40:25,823 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['minikh', 'arisu', 'študovať', 'veitvet',
'djohor']...)
2018-04-08 11:40:55,901 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['glabrum', 'minikh', 'arisu', 'študovať',
'veitvet']...)
2018-04-08 11:41:19,130 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:32,992 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:45,127 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
2018-04-08 11:41:56,792 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
It eventually reaches a point where it just freezes. I was using
tmux to drop in and out of the terminal, so I tried plugging a monitor into
the machine I am using as a server and just running it from there and the
system locks up.
I am using Mint 18.3 with, as you can see, Python 3.5. I installed
all of the dependencies with pip and the --user flag, and explicitly call
python3.5.
When I run the same with enwiki-latest-pages-articles1.xml-p10p30302.bz2
(a much smaller corpus) the task completes.
Is this just a RAM issue? I have 16GB and about 110GB free space
on an SSD. What would I need in order to run the above command?
I can use a smaller corpus; I just ask because it is the first line
of code listed in the above instructions and it fails. Has the file's growth
from 8GB at the time of writing to about 14GB now caused problems?
Where might I get logs for something crashing so unceremoniously?
Cheers.
--
You received this message because you are subscribed to the Google
Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it,
For more options, visit https://groups.google.com/d/optout.
Craig Thomson
2018-04-13 09:40:37 UTC
Permalink
I am wondering if there is some general file system problem on the machine,
although there should not be as there is a relatively new Samsung 850 Pro
SSD in there.

I tried the following Python code (based on
https://rare-technologies.com/word2vec-tutorial/). The files in /tagged each
contain one document per line, with space-delimited POS-tagged terms:


import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('./tagged')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)

This crashes too (after working for a short time). At the weekend I will
take the corpus files off the desktop, put them on the laptop, and see if
it's a hardware issue.
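The iterator above can also be sketched with a forced garbage collection every N lines, mirroring the spaCy workaround mentioned earlier in the thread; GCSentences and collect_every are illustrative names, not gensim API:

```python
import gc
import os

class GCSentences(object):
    """Yields one token list per line from every file in dirname,
    forcing a full garbage collection every `collect_every` lines."""

    def __init__(self, dirname, collect_every=10000):
        self.dirname = dirname
        self.collect_every = collect_every

    def __iter__(self):
        count = 0
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fh:
                for line in fh:
                    count += 1
                    if count % self.collect_every == 0:
                        gc.collect()  # force a full collection
                    yield line.split()
```

It could then be passed to Word2Vec the same way, e.g. `model = gensim.models.Word2Vec(GCSentences('./tagged'))`, though if the freeze is a hardware or kernel issue rather than memory pressure this would not be expected to help.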
Post by Craig Thomson
Thanks again,
I see no reason why distro should matter either. Clutching at straws. I
will try and replicate on my laptop.
To answer your above question, it crashes without tmux. The processes are
not left running.
RAM and CPU just look like they do whilst the thing is running (except the
terminal is frozen).
The log is like the above, just INFO output, and it is not the same place
in the corpus that it crashes each time.
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
./output >logfile.log&
Post by Ivan Menshikh
Very strange. About another distribution - of course, you can try, but I
see no reason for it here (because this part is pure Python). I am
discouraged.
Try to run it without tmux (ssh to the machine + nohup with ">logfile.log&")
and monitor the logfile & CPU/RAM. If this reproduces - ssh to the
machine again and check whether the process is still running or not (and
again logfile, CPU, RAM).
Post by Craig Thomson
Thanks again,
I ran (within the python venv)
nohup python -m gensim.scripts.make_wiki ~/Development/corpus/downloads
/enwiki-latest-pages-articles.xml.bz2 ~/Development/corpus/output
Post by Craig Thomson
log.log &
(I have been tidying up folders a bit hence the slightly different paths).
It crashes after varying amounts of time both with and without tmux.
When it crashes in tmux it brings down every single tmux session with it.
I have nothing else at all on this system so I can change distro, python
environment, anything.
I cannot at the moment test the same thing on my laptop (also Mint 18.3,
with the same python venv) as I have other work to do and am kind of
back-burning this on the desktop which is at home.
Watching this as much as I can in htop, it is still 80% CPU per core,
and about 600M - 1G of RAM at any given time.
2018-04-11 08:28:11,470 : INFO : running /home/user/Development/python-
env/lib/python3.5/site-packages/gensim/scripts/make_wiki.py
/home/user/Development/corpus/downloads/enwiki-latest-pages-articles.xml.bz2
/home/user/Development/corpus/output
2018-04-11 08:28:11,544 : INFO : adding document #0 to Dictionary(0
unique tokens: [])
2018-04-11 08:28:53,148 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['sandez', 'brickyards', 'ettling',
'attis', 'jse']...)
2018-04-11 08:29:29,801 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['yemek', 'sandez', 'brickyards',
'ettling', 'attis']...)
2018-04-11 08:30:00,202 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['sandez', 'attis', 'jse', 'ejolts',
'skaphidia']...)
2018-04-11 08:30:28,594 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['sandez', 'attis', 'jse', 'ejolts',
'skaphidia']...)
2018-04-11 08:30:50,365 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:03,307 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:14,661 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:25,488 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:31:48,834 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['sandez', 'mizormac', 'attis', 'jse',
'ejolts']...)
2018-04-11 08:32:14,763 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['pentatonics', 'sandez', 'mizormac',
'attis', 'jse']...)
2018-04-11 08:32:39,305 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['pentatonics', 'svellnosbreen',
'sandez', 'mizormac', 'attis']...)
2018-04-11 08:33:02,130 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['pentatonics', 'svellnosbreen',
'sandez', 'mizormac', 'attis']...)
2018-04-11 08:33:23,947 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:33:46,905 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['checotah', 'jse', 'ejolts', 'hohnadel',
'nightriders']...)
2018-04-11 08:34:08,194 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:31,281 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:34:51,961 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:12,260 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:30,444 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:35:49,295 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:09,197 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['checotah', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:31,133 : INFO : discarding 27140 tokens: [('talost',
1), ('zhizhuan', 1), ('trevuren', 1), ('callachan', 1), ('methylisation',
1), ('blacklo', 1), ('īshat', 1), ('ilahiyyat', 1), ('grīmekhalaṃ', 1),
('afferre', 1)]...
2018-04-11 08:36:31,133 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:34,241 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:55,451 : INFO : discarding 52425 tokens: [('xhjhs', 1),
('àŽ…àŽ•', 1), ('creekvale', 1), ('villavincie', 1), ('kurewen', 1),
('askamiciw', 1), ('askipiw', 1), ('manautou', 1), ('zichmini', 1),
('olenoides', 1)]...
2018-04-11 08:36:55,451 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:36:58,541 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('προβούλευΌα', 1), ('adebanjos', 1), ('brajković', 1), ('sepn', 1),
('diastatops', 1), ('tamoshanters', 1), ('zumann', 1), ('тхайМОг', 1),
('rickhardt', 1), ('penceat', 1)]...
2018-04-11 08:37:19,504 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 240000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:22,772 : INFO : adding document #240000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:45,184 : INFO : discarding 42278 tokens: [('tizir', 1),
('jacupiranga', 1), ('Ќаєш', 1), ('ninkov', 1), ('chanuyot', 1), ('vođa',
1), ('zhǎnghǎi', 1), ('rpps', 1), ('domasław', 1), ('gaeege', 1)]...
2018-04-11 08:37:45,184 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 250000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:48,488 : INFO : adding document #250000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('zampognari', 1), ('doctorală', 1), ('trinominals', 1), ('mansenc', 1),
('globalresearch', 1), ('mansengou', 1), ('loughans', 1), ('busaeus', 1),
('hirsaugienses', 1), ('paralelă', 1)]...
2018-04-11 08:38:11,016 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 260000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:14,087 : INFO : adding document #260000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:36,723 : INFO : discarding 45170 tokens: [('kaikhah',
1), ('kohrausch', 1), ('levate', 1), ('pelagheia', 1), ('ehcr', 1),
('buccata', 1), ('taramov', 1), ('wauch', 1), ('eymer', 1), ('exradius',
1)]...
2018-04-11 08:38:36,724 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 270000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:39,704 : INFO : adding document #270000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:02,766 : INFO : discarding 50380 tokens: [('mbil', 1),
('λιΌΜίτης', 1), ('lovaart', 1), ('medabot', 1), ('tebirkes', 1), ('innjō',
1), ('issobel', 1), ('neeby', 1), ('τέΌπλος', 1), ('karlostachys', 1)]...
2018-04-11 08:39:02,766 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 280000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:05,760 : INFO : adding document #280000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:25,872 : INFO : discarding 52105 tokens: [('mbb_', 1),
('scioptric', 1), ('sashwo', 1), ('thirachai', 1), ('kharaillah', 1),
('dìnghǎilù', 1), ('taylormusic', 1), ('kongjiang', 1), ('vaughanmusic',
1), ('kòngjiānglù', 1)]...
2018-04-11 08:39:25,872 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 290000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:29,088 : INFO : adding document #290000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('goldsmithry', 1), ('fianṡruth', 1), ('flannacán', 1), ('schaid', 1),
('cayohoga', 1), ('rilasciata', 1), ('attancourt', 1), ('villefore', 1),
('jarsin', 1), ('qeyniy', 1)]...
2018-04-11 08:39:48,652 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 300000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:51,611 : INFO : adding document #300000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:11,566 : INFO : discarding 48892 tokens: [('pujadó',
1), ('tikalladislav', 1), ('hilsbach', 1), ('térygéza', 1), ('astly', 1),
('iiibes', 1), ('wartelle', 1), ('carmelito', 1), ('nosless', 1),
('vÀrldsspindeln', 1)]...
2018-04-11 08:40:11,566 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 310000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:14,795 : INFO : adding document #310000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('hinderlist', 1), ('tatenawate', 1), ('spofvenhielm', 1), ('vicarivs',
1), ('starenfelt', 1), ('神歊東埁', 1), ('МачальМОка', 1), ('𣋚𠉞', 1),
('âwaxsîdâr', 1), ('noterid', 1)]...
2018-04-11 08:40:35,018 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 320000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:37,980 : INFO : adding document #320000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:57,494 : INFO : discarding 42122 tokens: [('buahan',
1), ('olympisky', 1), ('ōio', 1), ('rasulova', 1), ('treesforlife', 1),
('倧井内芪王', 1), ('usubov', 1), ('blÃ¥dalsvatnet', 1), ('賀楜内芪王', 1),
('ghoraib', 1)]...
2018-04-11 08:40:57,494 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 330000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:00,496 : INFO : adding document #330000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:19,188 : INFO : discarding 38954 tokens: [('exoda', 1),
('黃䞖仲', 1), ('kriemelman', 1), ('mandeldrums', 1), ('hihihihi', 1), ('黃䌯思',
1), ('korostelyov', 1), ('skillometer', 1), ('lachrymology', 1),
('processid', 1)]...
2018-04-11 08:41:19,188 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 340000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:22,132 : INFO : adding document #340000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:41,704 : INFO : discarding 39759 tokens: [('úlfheðinn',
1), ('neustrasia', 1), ('magdelone', 1), ('hillopathes', 1), ('mexbol', 1),
('tirtonadi', 1), ('batizocoi', 1), ('triwindhu', 1), ('mejiso', 1),
('namerō', 1)]...
2018-04-11 08:41:41,704 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 350000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:44,712 : INFO : adding document #350000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:42:03,259 : INFO : discarding 38180 tokens: [('ffyr', 1),
('cösitzer', 1), ('crefft', 1), ('duppenbecker', 1), ('tenguzame', 1),
('dniestrem', 1), ('geronte', 1), ('serpari', 1), ('応汗州郜督府郜督', 1),
('техеМ', 1)]...
2018-04-11 08:42:03,259 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 360000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:06,202 : INFO : adding document #360000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:24,380 : INFO : discarding 36319 tokens: [('willingii',
1), ('woodmaniorum', 1), ('wildford', 1), ('canibungan', 1),
('arsenophonus', 1), ('nonhigh', 1), ('mineralocortoid', 1), ('kinek', 1),
('pakipasa', 1), ('mondjam', 1)]...
2018-04-11 08:42:24,381 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 370000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:27,395 : INFO : adding document #370000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:45,949 : INFO : discarding 41031 tokens: [('miamua',
1), ('transliterature', 1), ('thlen', 1), ('appeariq', 1), ('山鎫', 1),
('deeondeeup', 1), ('yadate', 1), ('travia', 1), ('ruakapanga', 1),
('краљу', 1)]...
2018-04-11 08:42:45,949 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 380000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:48,883 : INFO : adding document #380000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('megacrammer', 1), ('horroh', 1), ('nasrud', 1), ('proteax', 1),
('cultutes', 1), ('gammarotettix', 1), ('discopolis', 1), ('endophilic',
1), ('caliology', 1), ('mohni', 1)]...
2018-04-11 08:43:09,037 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 390000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:12,044 : INFO : adding document #390000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('clangourous', 1), ('sulfabid', 1), ('faddul', 1), ('oenephes', 1),
('sulmeprim', 1), ('mawaqif', 1), ('talactoferrin', 1), ('talaglumetad',
1), ('contrate', 1), ('πrad', 1)]...
2018-04-11 08:43:30,114 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 400000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:33,061 : INFO : adding document #400000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:51,657 : INFO : discarding 42423 tokens: [('medeswael',
1), ('wohnungs', 1), ('yŏnghŭng', 1), ('annebella', 1), ('dimmig', 1),
('gosdendiana', 1), ('iwasakisara', 1), ('allānâ', 1), ('teuflische', 1),
('polymelos', 1)]...
2018-04-11 08:43:51,657 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 410000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:54,827 : INFO : adding document #410000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('bowersflybabycf', 1), ('skyote', 1), ('apathēs', 1), ('bolcon', 1),
('きらきらアフロ', 1), ('polysylabi', 1), ('kurtziella', 1), ('waringinkurung',
1), ('pyrgeometers', 1), ('badekuren', 1)]...
2018-04-11 08:44:13,198 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 420000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:16,139 : INFO : adding document #420000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('wasserdruck', 1), ('銬蟌', 1), ('sambiranoensis', 1), ('jouyaku', 1),
('aggregometry', 1), ('σጡς', 1), ('neovolcanica', 1), ('laternen', 1),
('victorianforts', 1), ('landfox', 1)]...
2018-04-11 08:44:34,615 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 430000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:37,714 : INFO : adding document #430000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:56,405 : INFO : discarding 44144 tokens: [('ດໃຈ', 1),
('πόλι', 1), ('chlíodhna', 1), ('francisquine', 1), ('postmemory', 1),
('bildetelegraph', 1), ('abdih', 1), ('kbfw', 1), ('kcbo', 1), ('kfmp',
1)]...
2018-04-11 08:44:56,405 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 440000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:59,527 : INFO : adding document #440000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('vrschighland', 1), ('xvictory', 1), ('semprill', 1), ('bordesi', 1),
('kokkim', 1), ('batzil', 1), ('kirix', 1), ('hersovits', 1), ('dtlgr', 1),
('bexi', 1)]...
2018-04-11 08:45:17,959 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 450000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:20,973 : INFO : adding document #450000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:39,269 : INFO : discarding 42590 tokens: [('achinsky',
1), ('景行倩皇四十䞉幎', 1), ('wmgx', 1), ('vetriera', 1), ('誉屋別皇子', 1),
('bemilleralbert', 1), ('coxfred', 1), ('gilburgtom', 1), ('takakiirihime',
1), ('greeneron', 1)]...
2018-04-11 08:45:39,269 : INFO : keeping 2000000 tokens which were in no
less than 0 and no more than 460000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:42,196 : INFO : adding document #460000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
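The repeated "discarding N tokens" / "keeping 2000000 tokens" lines above are gensim's Dictionary capping its vocabulary: the dictionary's default prune_at is 2,000,000 ids, and once the token count passes that, the rarest tokens are dropped (hence "in no less than 0 and no more than N (=100.0%) documents"). A rough pure-Python sketch of the capping idea, not gensim's actual code:

```python
from collections import Counter

def prune_vocab(doc_freq, prune_at=2000000):
    """Cap a token -> document-frequency map at prune_at entries,
    dropping the least frequent tokens (the rough idea behind the
    'keeping 2000000 tokens' log lines above)."""
    if len(doc_freq) <= prune_at:
        return doc_freq
    return Counter(dict(doc_freq.most_common(prune_at)))

# toy vocabulary of 5 tokens, capped at 3
df = Counter({'the': 100, 'wiki': 40, 'corpus': 7, 'talost': 1, 'zhizhuan': 1})
pruned = prune_vocab(df, prune_at=3)
print(sorted(pruned))  # ['corpus', 'the', 'wiki']
```

This pruning is bounded-memory by design, which is why the dictionary-building phase should not exhaust 16GB of RAM.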
On Wed, Apr 11, 2018 at 5:19 AM, Ivan Menshikh <
Post by Craig Thomson
Try to run it in "detached" mode like
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.x
ml.bz2 ~/outdir >log.log&
Now this doesn't block your console, keeps working after you disconnect,
and shouldn't affect your tmux session.
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be useful
generally.
I had some pos tagging running so could not mess with my server until
this morning.
ssh into my server
# setup a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
#
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2
~/outdir
- htop
- watch -n1 sensors
- watch -n10 df -lh
These were to keep an eye on HDD space and to check for CPU overheating,
although the cores barely broke 70 degrees (CPU load holds at 80% on each
of the 4 cores) and there is plenty of HDD space. Memory looks fine in htop;
it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more above,
obviously, but this is where it freezes). When this happens everything
stops: all the tmux sessions come down, and on my laptop I just get the last
readings of each monitor, which show the same CPU, RAM, temps and disk
space. Because it is crashing in such a way, I am not sure how to get at
any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu',
1), ('headstroke', 1), ('shawfielders', 1), ('sardisch', 1),
('luxsitpress', 1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1),
('pigita', 1), ('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
[('willouby', 1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1),
('tshogchungs', 1), ('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1),
('marketized', 1), ('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
about "more aggressive GC", try this method:
https://docs.python.org/2/library/gc.html#gc.set_threshold, I'm not sure
about the usefulness of this method in the current case, but feel free to try.
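As a minimal sketch of that suggestion (the lowered values below are arbitrary examples, not tuned recommendations):

```python
import gc

# CPython's cyclic collector defaults to thresholds (700, 10, 10): a
# generation-0 pass runs once allocations minus deallocations exceed 700.
# Lowering the numbers makes collection run more often.
print(gc.get_threshold())   # (700, 10, 10) on a stock interpreter
gc.set_threshold(100, 5, 5)
print(gc.get_threshold())   # (100, 5, 5)
```

Note this only affects the cyclic collector; reference-counted objects are freed immediately regardless.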
Post by Craig Thomson
Thanks for the response.
I had top running a couple of the times it froze (not htop, although I
have switched to that now). In top, none of the 3-4 python processes were
above 3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when pos tagging with
spaCy. I added a forced garbage collection every 10k lines or so
and now it is fine (albeit taking hours, so I will need to wait to try
gensim again). spaCy was not close to running out of RAM either.
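A forced collection every N lines is just a counter plus gc.collect(); a minimal sketch of that workaround (the 10k interval is the one mentioned, and the per-line work here is a placeholder):

```python
import gc

def process_lines(lines, collect_every=10000):
    """Run per-line work, forcing a full GC pass every collect_every lines."""
    count = 0
    for line in lines:
        _ = line.split()          # placeholder for the real tagging work
        count += 1
        if count % collect_every == 0:
            gc.collect()          # force a full collection
    return count

processed = process_lines(["a b c"] * 25000)
print(processed)  # 25000
```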
I am now running in venv python 3.5.2 (python is new to me, Ruby,
PHP and C++ background).
I will try to reproduce the freeze with gensim on my laptop, which has the same setup.
Is there some kind of setting to make python more aggressive with
garbage collection or am I barking up the wrong tree with that idea?
On Mon, 9 Apr 2018 at 07:12, Ivan Menshikh <
Post by Ivan Menshikh
Hello,
looks like you have enough of resources for this command. Try to
see what happens with RAM/CPU at this moment using htop
<https://hisham.hm/htop/>in the different console.
Post by Axiombadger
It eventually reaches a point where it just freezes. I was using
tmux to drop in and out of the terminal, so I tried plugging a monitor into
the machine I am using as a server and just running it from there and the
system locks up.
I am using Mint 18.3 with, as you can see, Python 3.5. I installed
all of the dependencies with pip and the --user flag, and I explicitly call
python3.5.
When I run the same with enwiki-latest-pages-articles1.xml-p10p30302.bz2
(a much smaller corpus) the task completes.
Is this just a RAM issue? I have 16GB and about 110GB free space
on an SSD. What would I need in order to run the above command?
I can use a smaller corpus; I just ask because this is the first command
listed in the above instructions and it fails. Has the file's growth
from 8GB at the time of writing to about 14GB now caused problems?
Where might I get logs for something crashing so unceremoniously?
Cheers.
--
You received this message because you are subscribed to the Google
Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it,
For more options, visit https://groups.google.com/d/optout.
Craig Thomson
2018-04-15 07:20:59 UTC
Permalink
Sorry I meant "general system problem" in that last response.

After running on my laptop this weekend which is the same OS, same python
env, I am not having issues.

I do not have a massive amount of time to diagnose the desktop machine,
although I will perhaps swap out the SSD, make sure the BIOS is up to date,
try the other SATA controller (the mobo has 2), and run a memtest. Basically
anything I can do quickly (in terms of my time) and inexpensively.

Thanks for the help.

Still not sure why this is the only thing crashing an otherwise stable
system.

I may try crashing it on Windows, to rule out any issues between the kernel
and the motherboard/controller, or anything weird like that.
Post by Craig Thomson
I am wondering if there is some general file system problem on the
machine, although there should not be, as there is a relatively new Samsung
850 Pro SSD in there.
I tried the following python code (based on the example from here https://rare-
). The files in /tagged each contain one document
per line, with pos-tagged terms that are space delimited:

import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('./tagged')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
This crashes too (after working for a short time). At the weekend I will
take the corpus files off the desktop, onto the laptop and see if its a
hardware issue.
Post by Craig Thomson
Thanks again,
I see no reason why the distro should matter either; clutching at straws. I
will try to replicate this on my laptop.
To answer your question above: it crashes without tmux too, and the
processes are not left running.
no less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:34,241 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['uluʁ', 'jse', 'izg', 'ejolts',
'hohnadel']...)
2018-04-11 08:36:55,451 : INFO : discarding 52425 tokens: [('xhjhs',
1), ('àŽ…àŽ•', 1), ('creekvale', 1), ('villavincie', 1), ('kurewen', 1),
('askamiciw', 1), ('askipiw', 1), ('manautou', 1), ('zichmini', 1),
('olenoides', 1)]...
2018-04-11 08:36:55,451 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:36:58,541 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('προβούλευΌα', 1), ('adebanjos', 1), ('brajković', 1), ('sepn', 1),
('diastatops', 1), ('tamoshanters', 1), ('zumann', 1), ('тхайМОг', 1),
('rickhardt', 1), ('penceat', 1)]...
2018-04-11 08:37:19,504 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 240000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:22,772 : INFO : adding document #240000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:45,184 : INFO : discarding 42278 tokens: [('tizir',
1), ('jacupiranga', 1), ('Ќаєш', 1), ('ninkov', 1), ('chanuyot', 1),
('vođa', 1), ('zhǎnghǎi', 1), ('rpps', 1), ('domasław', 1), ('gaeege',
1)]...
2018-04-11 08:37:45,184 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 250000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:48,488 : INFO : adding document #250000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('zampognari', 1), ('doctorală', 1), ('trinominals', 1), ('mansenc', 1),
('globalresearch', 1), ('mansengou', 1), ('loughans', 1), ('busaeus', 1),
('hirsaugienses', 1), ('paralelă', 1)]...
2018-04-11 08:38:11,016 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 260000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:14,087 : INFO : adding document #260000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:36,723 : INFO : discarding 45170 tokens: [('kaikhah',
1), ('kohrausch', 1), ('levate', 1), ('pelagheia', 1), ('ehcr', 1),
('buccata', 1), ('taramov', 1), ('wauch', 1), ('eymer', 1), ('exradius',
1)]...
2018-04-11 08:38:36,724 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 270000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:39,704 : INFO : adding document #270000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:02,766 : INFO : discarding 50380 tokens: [('mbil', 1),
('λιΌΜίτης', 1), ('lovaart', 1), ('medabot', 1), ('tebirkes', 1), ('innjō',
1), ('issobel', 1), ('neeby', 1), ('τέΌπλος', 1), ('karlostachys', 1)]...
2018-04-11 08:39:02,766 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 280000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:05,760 : INFO : adding document #280000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:25,872 : INFO : discarding 52105 tokens: [('mbb_', 1),
('scioptric', 1), ('sashwo', 1), ('thirachai', 1), ('kharaillah', 1),
('dìnghǎilù', 1), ('taylormusic', 1), ('kongjiang', 1), ('vaughanmusic',
1), ('kòngjiānglù', 1)]...
2018-04-11 08:39:25,872 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 290000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:29,088 : INFO : adding document #290000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('goldsmithry', 1), ('fianṡruth', 1), ('flannacán', 1), ('schaid', 1),
('cayohoga', 1), ('rilasciata', 1), ('attancourt', 1), ('villefore', 1),
('jarsin', 1), ('qeyniy', 1)]...
2018-04-11 08:39:48,652 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 300000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:51,611 : INFO : adding document #300000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:11,566 : INFO : discarding 48892 tokens: [('pujadó',
1), ('tikalladislav', 1), ('hilsbach', 1), ('térygéza', 1), ('astly', 1),
('iiibes', 1), ('wartelle', 1), ('carmelito', 1), ('nosless', 1),
('vÀrldsspindeln', 1)]...
2018-04-11 08:40:11,566 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 310000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:14,795 : INFO : adding document #310000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('hinderlist', 1), ('tatenawate', 1), ('spofvenhielm', 1), ('vicarivs',
1), ('starenfelt', 1), ('神歊東埁', 1), ('МачальМОка', 1), ('𣋚𠉞', 1),
('âwaxsîdâr', 1), ('noterid', 1)]...
2018-04-11 08:40:35,018 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 320000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:37,980 : INFO : adding document #320000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:57,494 : INFO : discarding 42122 tokens: [('buahan',
1), ('olympisky', 1), ('ōio', 1), ('rasulova', 1), ('treesforlife', 1),
('倧井内芪王', 1), ('usubov', 1), ('blÃ¥dalsvatnet', 1), ('賀楜内芪王', 1),
('ghoraib', 1)]...
2018-04-11 08:40:57,494 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 330000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:00,496 : INFO : adding document #330000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:19,188 : INFO : discarding 38954 tokens: [('exoda',
1), ('黃䞖仲', 1), ('kriemelman', 1), ('mandeldrums', 1), ('hihihihi', 1),
('黃䌯思', 1), ('korostelyov', 1), ('skillometer', 1), ('lachrymology', 1),
('processid', 1)]...
2018-04-11 08:41:19,188 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 340000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:22,132 : INFO : adding document #340000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('úlfheðinn', 1), ('neustrasia', 1), ('magdelone', 1), ('hillopathes', 1),
('mexbol', 1), ('tirtonadi', 1), ('batizocoi', 1), ('triwindhu', 1),
('mejiso', 1), ('namerō', 1)]...
2018-04-11 08:41:41,704 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 350000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:44,712 : INFO : adding document #350000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:42:03,259 : INFO : discarding 38180 tokens: [('ffyr', 1),
('cösitzer', 1), ('crefft', 1), ('duppenbecker', 1), ('tenguzame', 1),
('dniestrem', 1), ('geronte', 1), ('serpari', 1), ('応汗州郜督府郜督', 1),
('техеМ', 1)]...
2018-04-11 08:42:03,259 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 360000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:06,202 : INFO : adding document #360000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('willingii', 1), ('woodmaniorum', 1), ('wildford', 1), ('canibungan', 1),
('arsenophonus', 1), ('nonhigh', 1), ('mineralocortoid', 1), ('kinek', 1),
('pakipasa', 1), ('mondjam', 1)]...
2018-04-11 08:42:24,381 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 370000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:27,395 : INFO : adding document #370000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:45,949 : INFO : discarding 41031 tokens: [('miamua',
1), ('transliterature', 1), ('thlen', 1), ('appeariq', 1), ('山鎫', 1),
('deeondeeup', 1), ('yadate', 1), ('travia', 1), ('ruakapanga', 1),
('краљу', 1)]...
2018-04-11 08:42:45,949 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 380000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:48,883 : INFO : adding document #380000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('megacrammer', 1), ('horroh', 1), ('nasrud', 1), ('proteax', 1),
('cultutes', 1), ('gammarotettix', 1), ('discopolis', 1), ('endophilic',
1), ('caliology', 1), ('mohni', 1)]...
2018-04-11 08:43:09,037 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 390000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:12,044 : INFO : adding document #390000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('clangourous', 1), ('sulfabid', 1), ('faddul', 1), ('oenephes', 1),
('sulmeprim', 1), ('mawaqif', 1), ('talactoferrin', 1), ('talaglumetad',
1), ('contrate', 1), ('πrad', 1)]...
2018-04-11 08:43:30,114 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 400000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:33,061 : INFO : adding document #400000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('medeswael', 1), ('wohnungs', 1), ('yŏnghŭng', 1), ('annebella', 1),
('dimmig', 1), ('gosdendiana', 1), ('iwasakisara', 1), ('allānâ', 1),
('teuflische', 1), ('polymelos', 1)]...
2018-04-11 08:43:51,657 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 410000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:54,827 : INFO : adding document #410000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('bowersflybabycf', 1), ('skyote', 1), ('apathēs', 1), ('bolcon', 1),
('きらきらアフロ', 1), ('polysylabi', 1), ('kurtziella', 1), ('waringinkurung',
1), ('pyrgeometers', 1), ('badekuren', 1)]...
2018-04-11 08:44:13,198 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 420000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:16,139 : INFO : adding document #420000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('wasserdruck', 1), ('銬蟌', 1), ('sambiranoensis', 1), ('jouyaku', 1),
('aggregometry', 1), ('σጡς', 1), ('neovolcanica', 1), ('laternen', 1),
('victorianforts', 1), ('landfox', 1)]...
2018-04-11 08:44:34,615 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 430000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:37,714 : INFO : adding document #430000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:56,405 : INFO : discarding 44144 tokens: [('ດໃຈ', 1),
('πόλι', 1), ('chlíodhna', 1), ('francisquine', 1), ('postmemory', 1),
('bildetelegraph', 1), ('abdih', 1), ('kbfw', 1), ('kcbo', 1), ('kfmp',
1)]...
2018-04-11 08:44:56,405 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 440000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:59,527 : INFO : adding document #440000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('vrschighland', 1), ('xvictory', 1), ('semprill', 1), ('bordesi', 1),
('kokkim', 1), ('batzil', 1), ('kirix', 1), ('hersovits', 1), ('dtlgr', 1),
('bexi', 1)]...
2018-04-11 08:45:17,959 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 450000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:20,973 : INFO : adding document #450000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:39,269 : INFO : discarding 42590 tokens: [('achinsky',
1), ('景行倩皇四十䞉幎', 1), ('wmgx', 1), ('vetriera', 1), ('誉屋別皇子', 1),
('bemilleralbert', 1), ('coxfred', 1), ('gilburgtom', 1), ('takakiirihime',
1), ('greeneron', 1)]...
2018-04-11 08:45:39,269 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 460000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:42,196 : INFO : adding document #460000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
On Wed, Apr 11, 2018 at 5:19 AM, Ivan Menshikh <
Post by Craig Thomson
Try to run it in "detached" mode, like
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ~/outdir >log.log &
This doesn't block your console, keeps working after you disconnect, and
shouldn't affect your tmux session.
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be
useful generally.
I had some pos tagging running so could not mess with my server until
this morning.
ssh into my server
# setup a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
# run the make_wiki script
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ~/outdir
- htop
- watch -n1 sensors
- watch -n10 df -lh
These were to keep an eye on HDD space and to check for CPU overheating,
although the cores barely broke 70 degrees (CPU load holds at 80% on each
of the 4 cores) and there is plenty of HDD space. Memory looks fine in
htop; it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more
above, obviously, but this is where it freezes). When this happens
everything stops: all the tmux sessions come down, and on my laptop I just
get the last readings of each monitor, which show the same CPU, RAM, temps
and disk space. Because it crashes in such a way, I am not sure how to
get at any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu',
1), ('headstroke', 1), ('shawfielders', 1), ('sardisch', 1),
('luxsitpress', 1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1),
('pigita', 1), ('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
[('willouby', 1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1),
('tshogchungs', 1), ('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1),
('marketized', 1), ('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
about "more aggressive GC", try this method:
https://docs.python.org/2/library/gc.html#gc.set_threshold - I'm not sure
about the usefulness of this method in the current case, but feel free to try.
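As a minimal sketch of that suggestion (the threshold values here are
illustrative, not recommendations):

```python
import gc

# The three thresholds control how many allocations (minus deallocations)
# trigger a collection of generations 0, 1 and 2; CPython's gen-0 default
# is typically 700. Lower numbers mean more frequent ("more aggressive")
# collection, at some CPU cost.
gc.set_threshold(100, 10, 10)

# A full collection can also be forced directly at any point; it returns
# the number of unreachable objects found.
gc.collect()
```

Whether more frequent collection helps here is an open question, since the freeze does not appear to be memory pressure.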
Post by Craig Thomson
Thanks for the response.
I had top open a couple of the times it froze (not htop, although I
have switched to that now). In top, none of the 3-4 python processes were
above 3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when pos tagging
with spaCy. I added a forced garbage collection every 10k lines or
so and now it is fine (albeit taking hours, so I will need to wait to
try gensim again). SpaCy was not close to running out of RAM either.
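For reference, the "forced garbage collection every 10k lines" workaround can be sketched like this; `tag_line` is a hypothetical stand-in for the real spaCy call, not the poster's actual code:

```python
import gc

def tag_line(line):
    # hypothetical stand-in for the real per-line work (spaCy POS tagging)
    return line.split()

tagged = []
for i, line in enumerate(["an example input line"] * 25000):
    tagged.append(tag_line(line))
    if i > 0 and i % 10000 == 0:
        gc.collect()  # force a full collection every 10k lines
```

This trades a little CPU time for keeping the heap compact between batches.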
I am now running in a venv with python 3.5.2 (python is new to me; I
come from a Ruby, PHP and C++ background).
I will try and freeze my laptop with gensim on the same setup.
Is there some kind of setting to make python more aggressive with
garbage collection or am I barking up the wrong tree with that idea?
On Mon, 9 Apr 2018 at 07:12, Ivan Menshikh <
Post by Ivan Menshikh
Hello,
it looks like you have enough resources for this command. Try to
see what happens with RAM/CPU at that moment using htop
<https://hisham.hm/htop/> in a different console.
Post by Axiombadger
Hi,
I am just starting to use gensim and am having some issues with
the wikipedia corpus.
https://radimrehurek.com/gensim/wiki.html
python3.5 -m gensim.scripts.make_wiki /home/user/enwiki-latest-
pages-articles.xml.bz2 /home/user/wiki
It eventually reaches a point where it just freezes. I was using
tmux to drop in and out of the terminal, so I tried plugging a monitor into
the machine I am using as a server and just running it from there and the
system locks up.
I am using Mint 18.3, with as you can see Python 3.5. I
installed all of the dependencies with pip and the --user flag and
explicitly call python-3.5
When I run the same with enwiki-latest-pages-articles1.xml-p10p30302.bz2
(a much smaller corpus) the task completes.
Is this just a RAM issue? I have 16GB and about 110GB free space
on an SSD. What would I need in order to run the above command?
I can use smaller corpus, I just ask because it is the first line
of code listed in the above instructions and it fails, has the file creep
from 8GB at time of writing to about 14GB now caused problems?
Where might I get logs for something crashing so unceremoniously?
Cheers.
--
You received this message because you are subscribed to the Google
Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it,
For more options, visit https://groups.google.com/d/optout.
Ivan Menshikh
2018-04-16 01:59:19 UTC
Permalink
I'm glad that this works now, good luck with experiments!
Post by Craig Thomson
Sorry I meant "general system problem" in that last response.
After running it on my laptop this weekend, which has the same OS and the
same python env, I am not having issues.
I do not have a massive amount of time to diagnose the desktop machine,
although I will perhaps swap out the SSD, make sure the BIOS is up to date,
try the other SATA controller (the mobo has 2), and run a memtest. Basically
anything I can do quickly (in terms of my time) and inexpensively.
Thanks for the help.
Still not sure why this is the only thing crashing an otherwise stable
system.
I may try crashing it on Windows, to rule out any issues between the kernel
and the motherboard/controller, or anything weird like that.
Post by Craig Thomson
I am wondering if there is some general file system problem on the
machine, although there should not be as there is a relatively new Samsung
850 Pro SSD in there.
I tried the following code; the files in /tagged each contain one
document per line, pos tagged terms which are space delimited. I ran the
following python code (based on the example from here):

import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # one file at a time, one document (= one line) at a time
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('./tagged')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
This crashes too (after working for a short time). At the weekend I will
move the corpus files from the desktop onto the laptop and see if it's a
hardware issue.
Post by Craig Thomson
Thanks again,
I see no reason why distro should matter either. Clutching at straws.
I will try and replicate on my laptop.
To answer your above question, it crashes without tmux. The processes
are not left running.
RAM and CPU just look like they do whilst the thing is running (except
the terminal is frozen).
The log is like the above, just INFO output, and it is not the same place
in the corpus where it crashes each time.
nohup python -m gensim.scripts.make_wiki
enwiki-latest-pages-articles.xml.bz2 ./output >logfile.log&
On Thu, Apr 12, 2018 at 5:54 AM, Ivan Menshikh <
Post by Ivan Menshikh
Very strange. About another distribution - of course you can try, but I
see no reason for it here (because this part is pure Python). I am at a
loss.
Try to run it without tmux (ssh to the machine + nohup with
">logfile.log&") and monitor the logfile & CPU/RAM. If this reproduces,
ssh to the machine again and check whether the process is still running
(and again the logfile, CPU, RAM).
Post by Craig Thomson
Thanks again,
I ran (within the python venv)
nohup python -m gensim.scripts.make_wiki
~/Development/corpus/downloads/enwiki-latest-pages-articles.xml.bz2
~/Development/corpus/output >log.log &
(I have been tidying up folders a bit hence the slightly different paths).
It crashes after varying amounts of time both with and without tmux.
When it crashes in tmux it brings down every single tmux session with it.
I have nothing else at all on this system so I can change distro,
python environment, anything.
2018-04-11 08:37:45,184 : INFO : discarding 42278 tokens: [('tizir',
1), ('jacupiranga', 1), ('Ќаєш', 1), ('ninkov', 1), ('chanuyot', 1),
('vođa', 1), ('zhǎnghǎi', 1), ('rpps', 1), ('domasław', 1), ('gaeege',
1)]...
2018-04-11 08:37:45,184 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 250000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:37:48,488 : INFO : adding document #250000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('zampognari', 1), ('doctorală', 1), ('trinominals', 1), ('mansenc', 1),
('globalresearch', 1), ('mansengou', 1), ('loughans', 1), ('busaeus', 1),
('hirsaugienses', 1), ('paralelă', 1)]...
2018-04-11 08:38:11,016 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 260000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:14,087 : INFO : adding document #260000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:36,723 : INFO : discarding 45170 tokens: [('kaikhah',
1), ('kohrausch', 1), ('levate', 1), ('pelagheia', 1), ('ehcr', 1),
('buccata', 1), ('taramov', 1), ('wauch', 1), ('eymer', 1), ('exradius',
1)]...
2018-04-11 08:38:36,724 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 270000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:38:39,704 : INFO : adding document #270000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:02,766 : INFO : discarding 50380 tokens: [('mbil',
1), ('λιΌΜίτης', 1), ('lovaart', 1), ('medabot', 1), ('tebirkes', 1),
('innjō', 1), ('issobel', 1), ('neeby', 1), ('τέΌπλος', 1),
('karlostachys', 1)]...
2018-04-11 08:39:02,766 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 280000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:05,760 : INFO : adding document #280000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:25,872 : INFO : discarding 52105 tokens: [('mbb_',
1), ('scioptric', 1), ('sashwo', 1), ('thirachai', 1), ('kharaillah', 1),
('dìnghǎilù', 1), ('taylormusic', 1), ('kongjiang', 1), ('vaughanmusic',
1), ('kòngjiānglù', 1)]...
2018-04-11 08:39:25,872 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 290000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:29,088 : INFO : adding document #290000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('goldsmithry', 1), ('fianṡruth', 1), ('flannacán', 1), ('schaid', 1),
('cayohoga', 1), ('rilasciata', 1), ('attancourt', 1), ('villefore', 1),
('jarsin', 1), ('qeyniy', 1)]...
2018-04-11 08:39:48,652 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 300000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:39:51,611 : INFO : adding document #300000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:11,566 : INFO : discarding 48892 tokens: [('pujadó',
1), ('tikalladislav', 1), ('hilsbach', 1), ('térygéza', 1), ('astly', 1),
('iiibes', 1), ('wartelle', 1), ('carmelito', 1), ('nosless', 1),
('vÀrldsspindeln', 1)]...
2018-04-11 08:40:11,566 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 310000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:14,795 : INFO : adding document #310000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('hinderlist', 1), ('tatenawate', 1), ('spofvenhielm', 1), ('vicarivs',
1), ('starenfelt', 1), ('神歊東埁', 1), ('МачальМОка', 1), ('𣋚𠉞', 1),
('âwaxsîdâr', 1), ('noterid', 1)]...
2018-04-11 08:40:35,018 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 320000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:37,980 : INFO : adding document #320000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:40:57,494 : INFO : discarding 42122 tokens: [('buahan',
1), ('olympisky', 1), ('ōio', 1), ('rasulova', 1), ('treesforlife', 1),
('倧井内芪王', 1), ('usubov', 1), ('blÃ¥dalsvatnet', 1), ('賀楜内芪王', 1),
('ghoraib', 1)]...
2018-04-11 08:40:57,494 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 330000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:00,496 : INFO : adding document #330000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:19,188 : INFO : discarding 38954 tokens: [('exoda',
1), ('黃䞖仲', 1), ('kriemelman', 1), ('mandeldrums', 1), ('hihihihi', 1),
('黃䌯思', 1), ('korostelyov', 1), ('skillometer', 1), ('lachrymology', 1),
('processid', 1)]...
2018-04-11 08:41:19,188 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 340000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:22,132 : INFO : adding document #340000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
[('úlfheðinn', 1), ('neustrasia', 1), ('magdelone', 1), ('hillopathes', 1),
('mexbol', 1), ('tirtonadi', 1), ('batizocoi', 1), ('triwindhu', 1),
('mejiso', 1), ('namerō', 1)]...
2018-04-11 08:41:41,704 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 350000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:41:44,712 : INFO : adding document #350000 to
Dictionary(2000000 unique tokens: ['цО', 'uluʁ', 'jse', 'izg', 'ejolts']...)
2018-04-11 08:42:03,259 : INFO : discarding 38180 tokens: [('ffyr',
1), ('cösitzer', 1), ('crefft', 1), ('duppenbecker', 1), ('tenguzame', 1),
('dniestrem', 1), ('geronte', 1), ('serpari', 1), ('応汗州郜督府郜督', 1),
('техеМ', 1)]...
2018-04-11 08:42:03,259 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 360000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:06,202 : INFO : adding document #360000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('willingii', 1), ('woodmaniorum', 1), ('wildford', 1), ('canibungan', 1),
('arsenophonus', 1), ('nonhigh', 1), ('mineralocortoid', 1), ('kinek', 1),
('pakipasa', 1), ('mondjam', 1)]...
2018-04-11 08:42:24,381 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 370000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:27,395 : INFO : adding document #370000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:45,949 : INFO : discarding 41031 tokens: [('miamua',
1), ('transliterature', 1), ('thlen', 1), ('appeariq', 1), ('山鎫', 1),
('deeondeeup', 1), ('yadate', 1), ('travia', 1), ('ruakapanga', 1),
('краљу', 1)]...
2018-04-11 08:42:45,949 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 380000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:42:48,883 : INFO : adding document #380000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('megacrammer', 1), ('horroh', 1), ('nasrud', 1), ('proteax', 1),
('cultutes', 1), ('gammarotettix', 1), ('discopolis', 1), ('endophilic',
1), ('caliology', 1), ('mohni', 1)]...
2018-04-11 08:43:09,037 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 390000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:12,044 : INFO : adding document #390000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('clangourous', 1), ('sulfabid', 1), ('faddul', 1), ('oenephes', 1),
('sulmeprim', 1), ('mawaqif', 1), ('talactoferrin', 1), ('talaglumetad',
1), ('contrate', 1), ('πrad', 1)]...
2018-04-11 08:43:30,114 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 400000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:33,061 : INFO : adding document #400000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('medeswael', 1), ('wohnungs', 1), ('yŏnghŭng', 1), ('annebella', 1),
('dimmig', 1), ('gosdendiana', 1), ('iwasakisara', 1), ('allānâ', 1),
('teuflische', 1), ('polymelos', 1)]...
2018-04-11 08:43:51,657 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 410000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:43:54,827 : INFO : adding document #410000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('bowersflybabycf', 1), ('skyote', 1), ('apathēs', 1), ('bolcon', 1),
('きらきらアフロ', 1), ('polysylabi', 1), ('kurtziella', 1), ('waringinkurung',
1), ('pyrgeometers', 1), ('badekuren', 1)]...
2018-04-11 08:44:13,198 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 420000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:16,139 : INFO : adding document #420000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('wasserdruck', 1), ('銬蟌', 1), ('sambiranoensis', 1), ('jouyaku', 1),
('aggregometry', 1), ('σጡς', 1), ('neovolcanica', 1), ('laternen', 1),
('victorianforts', 1), ('landfox', 1)]...
2018-04-11 08:44:34,615 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 430000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:37,714 : INFO : adding document #430000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:56,405 : INFO : discarding 44144 tokens: [('ດໃຈ', 1),
('πόλι', 1), ('chlíodhna', 1), ('francisquine', 1), ('postmemory', 1),
('bildetelegraph', 1), ('abdih', 1), ('kbfw', 1), ('kcbo', 1), ('kfmp',
1)]...
2018-04-11 08:44:56,405 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 440000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:44:59,527 : INFO : adding document #440000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('vrschighland', 1), ('xvictory', 1), ('semprill', 1), ('bordesi', 1),
('kokkim', 1), ('batzil', 1), ('kirix', 1), ('hersovits', 1), ('dtlgr', 1),
('bexi', 1)]...
2018-04-11 08:45:17,959 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 450000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:20,973 : INFO : adding document #450000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
[('achinsky', 1), ('景行倩皇四十䞉幎', 1), ('wmgx', 1), ('vetriera', 1), ('誉屋別皇子',
1), ('bemilleralbert', 1), ('coxfred', 1), ('gilburgtom', 1),
('takakiirihime', 1), ('greeneron', 1)]...
2018-04-11 08:45:39,269 : INFO : keeping 2000000 tokens which were in
no less than 0 and no more than 460000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
2018-04-11 08:45:42,196 : INFO : adding document #460000 to
Dictionary(2000000 unique tokens: ['цО', 'daredevilry', 'uluʁ', 'jse',
'izg']...)
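The repeated "keeping 2000000 tokens" lines above are the Dictionary pruning itself: gensim's Dictionary.add_documents caps the vocabulary at prune_at tokens (2,000,000 by default), dropping the lowest-document-frequency tokens whenever the cap is exceeded, which is why the count sits at exactly 2000000 while tokens keep being "discarded". A rough stdlib-only sketch of that behaviour (the cap, names and toy data here are illustrative, not gensim's actual code):

```python
from collections import Counter

def add_documents(docfreq, docs, prune_at):
    """Loose sketch of Dictionary.add_documents(prune_at=...):
    once the vocabulary exceeds prune_at, only the prune_at most
    frequent tokens are kept and the rest are discarded."""
    for doc in docs:
        docfreq.update(set(doc))  # count each token once per document
        if len(docfreq) > prune_at:
            kept = dict(docfreq.most_common(prune_at))
            docfreq.clear()
            docfreq.update(kept)
    return docfreq

docs = [["a", "b"], ["a", "c"], ["a", "d"], ["b", "e"], ["f", "g"]]
freqs = add_documents(Counter(), docs, prune_at=4)
# frequent tokens survive pruning: freqs["a"] == 3, freqs["b"] == 2
```

This is memory-bounded by design, which is one reason the dictionary-building pass by itself should not exhaust RAM.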
On Wed, Apr 11, 2018 at 5:19 AM, Ivan Menshikh <
Post by Ivan Menshikh
Try to run it in "detached" mode, like
nohup python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ~/outdir >log.log &
This doesn't block your console, keeps working after you disconnect, and
shouldn't affect your tmux session.
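The same detached pattern can be sketched generically (the log filename and output directory are placeholders):

```shell
# Start the job immune to hangups; stdout+stderr are captured in
# make_wiki.log and the trailing & backgrounds it immediately.
nohup python -m gensim.scripts.make_wiki \
    enwiki-latest-pages-articles.xml.bz2 ~/outdir > make_wiki.log 2>&1 &

# Remember the PID in case the job needs to be checked or killed later.
echo "started as PID $!"

# Follow progress without attaching to the process.
tail -n 50 make_wiki.log   # or: tail -f make_wiki.log
```

Because the process is detached from the terminal, a dropped SSH or tmux session no longer matters; only the log file needs to be consulted.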
Post by Craig Thomson
Thanks for the pointer, I will take a look at that as it may be
useful generally.
I had some pos tagging running so could not mess with my server
until this morning.
ssh into my server
# set up a tmux session
tmux
# enter the python virtual environment (3.5.2)
source Development/python-env/bin/activate
# run make_wiki
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ~/outdir
- htop
- watch -n1 sensors
- watch -n10 df -lh
This was to keep an eye on HDD space and to check for CPU overheating,
although the cores barely broke 70 degrees (CPU load holds at 80% on each
of the 4 cores) and there is plenty of HDD space. Memory looks fine in
htop; it barely uses 1G (out of 16G).
It freezes after putting out the following output (there is more
above, obviously, but this is where it freezes). When this happens
everything stops: all the tmux sessions come down, and on my laptop I just
get the last readings from each monitor, which show the same CPU, RAM,
temps and disk space. Because it crashes in such a way, I am not sure how
to get at any kind of actual error message.
2018-04-10 09:27:01,699 : INFO : adding document #90000 to
Dictionary(1115408 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:27,986 : INFO : adding document #100000 to
Dictionary(1216455 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:27:52,628 : INFO : adding document #110000 to
Dictionary(1306640 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'wbss', 'raniere']...)
2018-04-10 09:28:16,163 : INFO : adding document #120000 to
Dictionary(1385497 unique tokens: ['quadricostate', 'wjwz', 'unenthused',
'gewÀsen', 'wbss']...)
2018-04-10 09:28:38,390 : INFO : adding document #130000 to
Dictionary(1455322 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:01,829 : INFO : adding document #140000 to
Dictionary(1532614 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'aranga']...)
2018-04-10 09:29:23,102 : INFO : adding document #150000 to
Dictionary(1621284 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:29:45,919 : INFO : adding document #160000 to
Dictionary(1699019 unique tokens: ['quadricostate', 'àž­àžœàž²à¹€àžŠ', 'raniere',
'bakshi', 'kanenas']...)
2018-04-10 09:30:06,493 : INFO : adding document #170000 to
Dictionary(1763625 unique tokens: ['quadricostate', 'vankulick', 'àž­àžœàž²à¹€àžŠ',
'raniere', 'bakshi']...)
2018-04-10 09:30:26,547 : INFO : adding document #180000 to
Dictionary(1816463 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:30:44,641 : INFO : adding document #190000 to
Dictionary(1873986 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:03,103 : INFO : adding document #200000 to
Dictionary(1932042 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:22,488 : INFO : adding document #210000 to
Dictionary(1982135 unique tokens: ['quadricostate', 'ballycommon',
'vankulick', 'àž­àžœàž²à¹€àžŠ', 'raniere']...)
2018-04-10 09:31:44,125 : INFO : discarding 27140 tokens: [('ziedu',
1), ('headstroke', 1), ('shawfielders', 1), ('sardisch', 1),
('luxsitpress', 1), ('fameuil', 1), ('munkaszolgálat', 1), ('batruna', 1),
('pigita', 1), ('goreiro', 1)]...
2018-04-10 09:31:44,125 : INFO : keeping 2000000 tokens which were
in no less than 0 and no more than 220000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:31:47,454 : INFO : adding document #220000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
[('willouby', 1), ('debuchii', 1), ('llanwynno', 1), ('scurfpea', 1),
('tshogchungs', 1), ('dorsalateral', 1), ('cjmi', 1), ('chierichetti', 1),
('marketized', 1), ('eubetchia', 1)]...
2018-04-10 09:32:08,284 : INFO : keeping 2000000 tokens which were
in no less than 0 and no more than 230000 (=100.0%) documents
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
2018-04-10 09:32:11,421 : INFO : adding document #230000 to
Dictionary(2000000 unique tokens: ['quadricostate', 'vankulick', 'raniere',
'bakshi', 'vermoorde']...)
On Tue, Apr 10, 2018 at 4:06 AM, Ivan Menshikh <
Post by Ivan Menshikh
Hi Craig,
You can tune the collector via gc.set_threshold
(https://docs.python.org/2/library/gc.html#gc.set_threshold). I'm
not sure about the usefulness of this method in the current case, but feel
free to try.
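For what it's worth, a minimal sketch of making the collector more eager via gc.set_threshold (the threshold values here are illustrative; lower thresholds mean more frequent collections, trading CPU for lower peak memory in allocation-heavy loops):

```python
import gc

# Defaults are typically (700, 10, 10): collect generation 0 after the
# allocation/deallocation difference exceeds 700, and promote collections
# to the older generations every 10 runs.
print(gc.get_threshold())

# Lower thresholds trigger collections more aggressively.
gc.set_threshold(100, 5, 5)
```

Whether this helps depends on whether memory is actually held in uncollected cycles; it does nothing for memory retained by live references.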
Post by Craig Thomson
Thanks for the response.
I had top up a couple of the times it froze (not htop, although I
have switched to that now). In top, none of the 3-4 Python processes were
above 3% RAM (and they possibly share some of that anyway?).
I have actually since had a hard system freeze when POS tagging
with spaCy. I added a forced garbage collection every 10k lines or
so and now it is fine (albeit taking hours, so I will need to wait to
try gensim again). spaCy was not close to running out of RAM either.
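The "forced garbage collection every 10k lines" workaround described above can be sketched like this (process_line and the interval are placeholders for whatever the real loop does):

```python
import gc

GC_EVERY = 10000  # lines between forced collections (arbitrary interval)

def process_corpus(lines, process_line):
    """Run process_line over lines, forcing a full GC every GC_EVERY lines."""
    forced = 0
    for i, line in enumerate(lines, start=1):
        process_line(line)
        if i % GC_EVERY == 0:
            gc.collect()  # full (generation-2) collection; frees cycles now
            forced += 1
    return forced

# 25000 dummy "lines" -> the collector is forced twice (at 10000 and 20000)
forced = process_corpus(range(25000), lambda line: None)
```

gc.collect() only reclaims objects in reference cycles; if memory is held by live references (or by a C extension), forcing collections will not reduce it.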
I am now running in a venv with Python 3.5.2 (Python is new to me;
Ruby, PHP and C++ background).
I will try to reproduce the freeze on my laptop with gensim on the
same setup.
Is there some kind of setting to make Python more aggressive with
garbage collection, or am I barking up the wrong tree with that idea?
On Mon, 9 Apr 2018 at 07:12, Ivan Menshikh <
Post by Ivan Menshikh
Hello,
It looks like you have enough resources for this command. Try to
see what happens with RAM/CPU at that moment using htop
(<https://hisham.hm/htop/>) in a different console.
Post by Axiombadger
Hi,
I am just starting to use gensim and am having some issues with
the Wikipedia corpus.
https://radimrehurek.com/gensim/wiki.html
python3.5 -m gensim.scripts.make_wiki /home/user/enwiki-latest-pages-articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,853 : INFO : running /home/user/.local/lib/
python3.5/site-packages/gensim/scripts/make_wiki.py /home/user/
enwiki-latest-pages-articles.xml.bz2 /home/user/wiki
2018-04-08 11:38:30,936 : INFO : adding document #0 to
Dictionary(0 unique tokens: [])
2018-04-08 11:39:14,701 : INFO : adding document #10000 to
Dictionary(446822 unique tokens: ['minikh', 'meteora', 'simbalist',
'burbano', 'aak']...)
2018-04-08 11:39:53,316 : INFO : adding document #20000 to
Dictionary(642024 unique tokens: ['cerego', 'minikh', 'constantian',
'študovať', 'meteora']...)
2018-04-08 11:40:25,823 : INFO : adding document #30000 to
Dictionary(779925 unique tokens: ['minikh', 'arisu', 'študovať', 'veitvet',
'djohor']...)
2018-04-08 11:40:55,901 : INFO : adding document #40000 to
Dictionary(903213 unique tokens: ['glabrum', 'minikh', 'arisu', 'študovať',
'veitvet']...)
2018-04-08 11:41:19,130 : INFO : adding document #50000 to
Dictionary(982874 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:32,992 : INFO : adding document #60000 to
Dictionary(1001051 unique tokens: ['glabrum', 'minikh', 'arisu', 'kittan',
'tennapel']...)
2018-04-08 11:41:45,127 : INFO : adding document #70000 to
Dictionary(1018903 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
2018-04-08 11:41:56,792 : INFO : adding document #80000 to
Dictionary(1034231 unique tokens: ['glabrum', 'minikh', 'labokla',
'middelmatig', 'arisu']...)
It eventually reaches a point where it just freezes. I was
using tmux to drop in and out of the terminal, so I tried plugging a
monitor into the machine I am using as a server and just running it from
there, and the system still locks up.
I am using Mint 18.3 with, as you can see, Python 3.5. I
installed all of the dependencies with pip and the --user flag, and I
explicitly call python3.5.
When I run the same command
with enwiki-latest-pages-articles1.xml-p10p30302.bz2 (a much smaller
corpus), the task completes.
Is this just a RAM issue? I have 16GB of RAM and about 110GB of
free space on an SSD. What would I need in order to run the above command?
I can use a smaller corpus; I just ask because this is the first
line of code listed in the above instructions and it fails. Has the file's
growth from 8GB at the time of writing to about 14GB now caused problems?
Where might I get logs for something crashing so unceremoniously?
Cheers.
--
You received this message because you are subscribed to the
Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it,
For more options, visit https://groups.google.com/d/optout.