Discussion:
[gensim:11800] Wikipedia extraction error
t***@gmail.com
2018-11-23 06:35:01 UTC
There is an example here:
<https://398-1349775-gh.circle-artifacts.com/0/documentation/html/generated/gensim.corpora.wikicorpus.WikiCorpus.html>

from gensim.corpora import WikiCorpus, MmCorpus
wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)  # another 8h, creates a file in MatrixMarket format and mapping

I downloaded enwiki-20181020-pages-articles-multistream.xml.bz2 from here
<https://dumps.wikimedia.org/enwiki/20181020/>.

Unfortunately, the interpreter reports an error:

Process InputQueue-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 1197, in run
    wrapped_chunk = [list(chunk)]
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/wikicorpus.py", line 667, in <genexpr>
    ((text, self.lemmatize, title, pageid, tokenization_params)
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/wikicorpus.py", line 419, in extract_pages
    for elem in elems:
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/wikicorpus.py", line 404, in <genexpr>
    elems = (elem for _, elem in iterparse(f, events=("end",)))
  File "<string>", line 107, in next
ParseError: no element found: line 43, column 0
Alistair Windsor
2018-11-25 13:00:02 UTC
You are using a multistream bz2 archive under Python 2. The page you link to has a big red box saying you cannot use multistream archives under Python 2.7 due to limitations of the bz2 package under Python 2. Download the non-multistream archive or switch to Python 3.
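
Untested, but with the non-multistream dump the original example should then run as written (the filename is just whichever dump you download):

from gensim.corpora import WikiCorpus, MmCorpus

# Under Python 2.7 use the non-multistream dump; under Python 3 either variant works.
wiki = WikiCorpus('enwiki-20181120-pages-articles.xml.bz2')  # builds the word -> word_id mapping
MmCorpus.serialize('wiki_en_bow.mm', wiki)  # streams the corpus out in MatrixMarket format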

Alistair
t***@gmail.com
2018-11-25 16:32:25 UTC
@Alistair
Thanks, I got it wrong.
I guess enwiki-20181120-pages-articles.xml.bz2 should be fine. Downloading.
omar mustafa
2018-11-26 14:48:50 UTC
Dear @Alistair

I have a problem related to the extraction of word2ids from the Wikipedia
corpus.

Here is my code:

import bz2
from gensim.test.utils import datapath  # needed for datapath() below
from gensim.corpora import Dictionary, MmCorpus, HashDictionary, WikiCorpus

DEFAULT_DICT_SIZE = 100000
keep_words = DEFAULT_DICT_SIZE
dictionary = HashDictionary(id_range=keep_words, debug=True)
dictionary.allow_update = True
corpus = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")

wikiids = WikiCorpus(corpus, dictionary=dictionary)

dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)
dictionary.save_as_text('Wiki-LSI/wiki_en_wordids_26112018.txt.bz2')
wikiids.save('wiki_en_wordids_pkl_26112018.pkl.bz2')
dictionary.allow_update = False

The text file that is supposed to contain the word2ids from the wiki corpus is written to my PC when I run the above code, but it is an empty file!

I would be thankful if you can help.
Regards
Omar
Alistair Windsor
2018-11-26 18:19:35 UTC
I see what you are trying to do. Since you provided a dictionary, no dictionary is constructed when you call WikiCorpus.

You need to use the constructed iterator to do something in order to capture the debug information. I believe that if you add

MmCorpus.serialize('filename.mm', wikiids)

before the call to

dictionary.save_as_text('Wiki-LSI/wiki_en_wordids_26112018.txt.bz2')

then you will get something.

Out of interest, what is the running time of this? I have not tried it, but it seems that the call to WikiCorpus(corpus, dictionary=dictionary) with a dictionary should be very fast. The code does not do anything but construct an iterator. That iterator must be used to produce the dictionary you want to save, and that is going to take a while. The use of a hash function means that this requires only one pass. I would save the dictionary once PRIOR to filtering and then again after filtering.
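
For concreteness, here is a minimal sketch of that order of operations (I have not run this; the output filenames are illustrative):

from gensim.corpora import HashDictionary, MmCorpus, WikiCorpus
from gensim.test.utils import datapath

DEFAULT_DICT_SIZE = 100000
dictionary = HashDictionary(id_range=DEFAULT_DICT_SIZE, debug=True)
dictionary.allow_update = True
corpus = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
wikiids = WikiCorpus(corpus, dictionary=dictionary)

# Serializing iterates over the corpus, which is what actually populates the dictionary.
MmCorpus.serialize('wiki_bow.mm', wikiids)

dictionary.save_as_text('wiki_wordids_unfiltered.txt.bz2')  # save PRIOR to filtering
dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)
dictionary.save_as_text('wiki_wordids_filtered.txt.bz2')    # and again after filtering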

Yours,

Alistair
omar mustafa
2018-11-27 05:09:14 UTC
Dear @Alistair

As you may have noticed, I haven't used the whole Wikipedia corpus; it is only a sample of the wiki that contains 105 documents, and the running time is around 20 seconds.

The PROBLEM is that the kernel keeps dying!

code
_______________________________________________________________________________________________________

import logging, gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import bz2
from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import Dictionary, MmCorpus, HashDictionary, WikiCorpus
from gensim.utils import lemmatize

DEFAULT_DICT_SIZE = 100000
keep_words = DEFAULT_DICT_SIZE
dictionary = HashDictionary(id_range=keep_words, debug=False)
dictionary.allow_update = True

corpus = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")

#tmp_fname = get_tmpfile("/Wiki-LSI/wiki_en_wordids_26112018.txt.bz2")
wikiids = WikiCorpus(corpus, lemmatize=lemmatize, dictionary=dictionary)
MmCorpus.serialize('/Wiki-LSI/wiki-corpus-27112018.mm', wikiids, progress_cnt=10000)

dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)

dictionary.save_as_text('/wiki_en_wordids_H_27112018.txt')
wikiids.save('/wiki_en_wordids_H_pkl_27112018.pkl')
dictionary.allow_update = False

purpose of the implementation
_______________________________________________________________________________________________________
I want to apply LSI to the wiki corpus, which needs:
the mapping between words and their integer ids
the bag-of-words (word counts) representation
_______________________________________________________________________________________________________
When I print the wiki BOW, I get the following, which means the BOW is successfully created:

from gensim import corpora, models, similarities
wikimapcorpus = corpora.MmCorpus('Wiki-LSI/wiki-corpus-27112018.mm')
print(wikimapcorpus)

MmCorpus(105 documents, 100000 features, 110797 non-zero entries)


problem
________________________________________________________________________________________________
BUT, the number of ids is zero!

dictionary = Dictionary.load_from_text('/Wiki-LSI/wiki_en_wordids_27112018.txt')
print(dictionary)

Dictionary(0 unique tokens: [])

___________________________________________________________________________________________

This issue makes the kernel die when I then run LSI or LDA.
Alistair Windsor
2018-11-28 07:39:29 UTC
My long reply seems to have gone astray, so let me send a quick one. I think the basic source of your problem is thinking of the HashDictionary as a type of Dictionary (whatever could have suggested that). A HashDictionary does not produce word <-> id pairings. It produces, without training, a map word -> id. If we set debug=True then it also retains information on which words were mapped to which ids. This is unlikely to give us a map id -> word, since multiple words are likely mapped to the same id (this is called a hash collision). Just as importantly, HashDictionary is not a derived class of Dictionary but its own class (based off gensim.utils.SaveLoad).

In your code you call

HashDictionary.save_as_text(dictionary, '/wiki_en_wordids_H_27112018.txt')

but then call

Dictionary.load_from_text('/Wiki-LSI/wiki_en_wordids_27112018.txt')

which will not work, since the save_as_text format for a HashDictionary is not the same as the save_as_text format for a Dictionary. Thus trying to load_from_text a Dictionary from a HashDictionary save_as_text will fail. But, horror, there is no load_from_text for a HashDictionary! It doesn't need one.

I suggest that you look at the provided script

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/make_wikicorpus.py

and also ask yourself carefully whether you want a HashDictionary or a real Dictionary. The HashDictionary is faster and quite possibly good enough for word-embedding purposes, but if you ever need to use the id -> word mapping to "interpret" something, be it a topic or an embedding dimension, then you may wish to use a true Dictionary. Otherwise, if you have a dictionary of relevant words, you can infer post hoc an id -> word mapping, or at least an id -> word(s) mapping, for a HashDictionary by brute force. This may be better than keeping the debug information on the HashDictionary, which will be huge for a wikidump.
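
If you do want interpretable ids, a rough sketch of the Dictionary route (untested; filenames are illustrative) is simply to let WikiCorpus build a real Dictionary itself:

from gensim.corpora import Dictionary, MmCorpus, WikiCorpus

# With no dictionary argument, WikiCorpus builds a plain Dictionary in one pass over the dump.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)
wiki.dictionary.save_as_text('wiki_wordids.txt.bz2')
MmCorpus.serialize('wiki_bow.mm', wiki)

# Dictionary.load_from_text matches Dictionary.save_as_text, so this round-trips.
id2word = Dictionary.load_from_text('wiki_wordids.txt.bz2')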

Hope that helps,

Alistair
omar mustafa
2018-11-30 04:22:21 UTC
Dear @Alistair

Thank you for your helpful reply.

Unfortunately, I'm not able to figure out the problem, even though I have tried to follow the explanation in "gensim/scripts/make_wikicorpus.py"
<https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/make_wikicorpus.py>.

Would you please share with me the code needed to produce id<->word pairs using the hash function, and how I can use them for LDA topic modeling as mentioned here:
https://radimrehurek.com/gensim/wiki.html.

I know the question seems easy and I should have solved it myself, but I have tried many ways and unfortunately failed.

Thank you for your time and support.
Regards
Omar
Alistair Windsor
2018-12-01 15:22:34 UTC
I think you are still confused about what a HashDictionary is. I cannot provide "the required code that is needed to produce id<->word pairs using the hash function" because that is not how a HashDictionary works. If you need id<->word pairs then you must use a Dictionary and not a HashDictionary. A HashDictionary produces a word -> id mapping that is probably not one-to-one.

For what you appear to be doing, the premade script "gensim/gensim/scripts/make_wikicorpus.py" looks like it should provide a complete solution.
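
Once that script (or a Dictionary-based equivalent) has produced the wordids and bag-of-words files, the LDA step from https://radimrehurek.com/gensim/wiki.html is roughly this (a sketch, not tested; filenames are illustrative):

from gensim import corpora, models

# Load the id -> word mapping and the bag-of-words corpus produced earlier.
id2word = corpora.Dictionary.load_from_text('wiki_en_wordids.txt.bz2')
mm = corpora.MmCorpus('wiki_en_bow.mm')

# Train online LDA; the Dictionary is what makes the topics interpretable.
lda = models.LdaModel(corpus=mm, id2word=id2word, num_topics=100)
lda.print_topics(10)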

Yours,

Alistair
omar mustafa
2018-12-03 03:44:18 UTC
Thank you for your response.
I appreciate your help.