Discussion:
[gensim:11719] Efficient way to run 1.5GB .txt file through preprocess_text and then get the unigram count?
Jeremy Gollehon
2018-10-27 22:57:16 UTC
Permalink
Hi. I'm hoping to get pointed in the right direction.

I'm using Python 3.6 and loading a 1.5GB text file and trying to get the
unigram count after preprocessing.

The code takes 5 minutes to run on my computer.

text = (Path() / "output" / "longabstract_corpus.txt").read_text(encoding="utf-8")
word_list = text.split()
unigram_count = Counter(word_list)

I stopped the process after 60 minutes when running this code.

text = (Path() / "output" / "longabstract_corpus.txt").read_text(encoding="utf-8")
word_list = gensim.parsing.preprocessing.preprocess_string(text)
unigram_count = Counter(word_list)

Any ideas on how to speed up preprocessing?

Thanks!
Jeremy
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Alistair Windsor
2018-11-13 16:36:47 UTC
Permalink
I'm not sure what the problem is, but as your code is written, the
preprocessing step does not seem to be doing anything unusual. The
particular preprocessing steps are the second argument to
preprocess_string. More importantly, the read_text method reads the entire
1.5GB file in as a single string, which you then pass to the preprocessing
step. Plenty of things inside the preprocessing command could be massively
inefficient when handed such a huge string.

The solution is to read your file one line at a time, pass each line to
the preprocessing command, and then call the .update method on your
Counter.

from collections import Counter
import gensim.parsing.preprocessing

unigram_count = Counter()
with open(filename, 'r', encoding='utf-8') as text_file:
    for line in text_file:
        # Pass your chosen filters as the second argument (INSERT SOME
        # FILTERS HERE); omitting it applies gensim's DEFAULT_FILTERS.
        word_list = gensim.parsing.preprocessing.preprocess_string(line)
        unigram_count.update(word_list)


Something like this should work. Tell me if it improves the running time.
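To illustrate the streaming pattern without needing gensim installed, here is a minimal, self-contained sketch. The `preprocess` function is a stand-in for `preprocess_string` (just lowercasing and splitting), and `io.StringIO` stands in for the opened corpus file; the key point is that `Counter.update` accumulates counts line by line, so only one line is ever held in memory.

```python
from collections import Counter
import io

def preprocess(line):
    # Stand-in for gensim.parsing.preprocessing.preprocess_string:
    # just lowercase and split on whitespace.
    return line.lower().split()

unigram_count = Counter()
# io.StringIO stands in for open(filename, 'r', encoding='utf-8').
corpus = io.StringIO("The cat sat\nthe dog ran\n")
for line in corpus:
    # Update the running counts with this line's tokens only.
    unigram_count.update(preprocess(line))

print(unigram_count["the"])  # counted across both lines
```

Because the counts accumulate incrementally, memory use depends on the vocabulary size rather than the file size, which is why this scales to a 1.5GB corpus where the all-at-once version stalls.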