Thank you for your helpful reply.
Could you give a "/make_wikicorpus.py" explanation: how does it
produce id<->word pairs using the hash function, and how can I use them for
https://radimrehurek.com/gensim/wiki.html?
I know the question seems easy, and I should have solved it myself, but I
have tried many ways and unfortunately failed to do so.
Post by Alistair Windsor
My long reply seems to have gone astray, so let me send a quick one. I
think the basic source of your problem is thinking of HashDictionary
as a type of Dictionary (whatever could have suggested that). HashDictionary
does not produce word <-> id pairings. It produces, without
training, a map word -> id. If we set debug=True then it also retains
information on which words were mapped to which ids. This is unlikely to
give us a map id -> word, since multiple words are likely mapped to the same
id (this is called a hash collision). Just as importantly, HashDictionary
is not a derived class of Dictionary but its own class (based on
gensim.utils.SaveLoad).
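The no-training word -> id map can be sketched without gensim at all: the id is a
checksum of the word's UTF-8 bytes reduced modulo id_range (using zlib.adler32 here,
which is my assumption about gensim's default hash; the helper name hash_id is mine
for illustration). With a small id_range, collisions are unavoidable:

```python
import zlib

def hash_id(word, id_range=100000):
    # HashDictionary-style mapping: checksum of the UTF-8 bytes, mod id_range.
    # No training pass is needed; the id is a pure function of the word.
    return zlib.adler32(word.encode('utf-8')) % id_range

# With a tiny id range, distinct words inevitably share an id (a hash
# collision), which is why there is no reliable id -> word inverse.
ids = {w: hash_id(w, id_range=10) for w in ("topic", "model", "corpus", "wiki")}
```

Because the map is a pure function, two runs always assign the same id to the same
word, which is exactly what makes a single pass over the corpus sufficient.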
In your code you call
HashDictionary.save_as_text(dictionary,
'/wiki_en_wordids_H_27112018.txt')
but then call
Dictionary.load_from_text('/Wiki-LSI/wiki_en_wordids_27112018.txt')
which will not work, since the save_as_text format for a HashDictionary
is not the same as the save_as_text format for a Dictionary. Thus trying to
load_from_text a Dictionary from a HashDictionary save_as_text will fail.
But, horror, there is no load_from_text for a HashDictionary! It doesn't need
one.
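Because HashDictionary persists through its gensim.utils.SaveLoad base class, the
round trip is save()/load() (a pickle of the whole object), not a text format. A
minimal stdlib sketch of that pattern (the SaveLoad and HashDict classes below are
illustrative stand-ins, not gensim's actual implementations):

```python
import pickle

class SaveLoad:
    # Stand-in for gensim.utils.SaveLoad: the whole object is pickled,
    # so any state it carries survives the round trip unchanged.
    def save(self, fname):
        with open(fname, 'wb') as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, fname):
        with open(fname, 'rb') as f:
            return pickle.load(f)

class HashDict(SaveLoad):
    # Toy object with some state worth persisting.
    def __init__(self, id_range):
        self.id_range = id_range
```

Since the hash map itself needs no stored state, pickling the object's settings is
all the persistence a HashDictionary requires, hence no load_from_text.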
I suggest that you look at the provided script
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/make_wikicorpus.py
and also ask yourself carefully whether you want a HashDictionary or a
real Dictionary. The HashDictionary is faster and quite possibly good
enough for word-embedding purposes, but if you ever need to use the
id -> word mapping to "interpret" something, be it a topic or an embedding
dimension, then you may wish to use a true Dictionary. Otherwise, if you
have a dictionary of relevant words, you can post hoc infer an
id -> word mapping, or at least an id -> word(s) mapping, for a HashDictionary
by brute force. This may be better than keeping the debug information on
the HashDictionary, which will be huge for a wiki dump.
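The brute-force inference described above can be sketched as follows: hash every
word in a candidate vocabulary and group the words by the id they land on (again
assuming the adler32-mod-id_range scheme; infer_id2words is a hypothetical helper
name, not a gensim function):

```python
import zlib

def infer_id2words(vocab, id_range=100000):
    # Brute-force inversion: hash every candidate word and record which id
    # it lands on. Collisions naturally yield an id -> word(s) mapping
    # (a set per id) rather than a clean id -> word one.
    id2words = {}
    for word in vocab:
        wid = zlib.adler32(word.encode('utf-8')) % id_range
        id2words.setdefault(wid, set()).add(word)
    return id2words
```

This only covers the words you supply, but for interpreting a topic a shortlist of
relevant words is usually enough, and it avoids carrying the debug mapping for a
full wiki dump.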
Hope that helps,
Alistair
Post by omar mustafa
As you may have noticed, I haven't used the whole Wikipedia corpus; it
is only a sample of the wiki containing 105 documents,
where the running time is around 20 seconds.
The PROBLEM is that the kernel keeps dying!
code
_______________________________________________________________________________________________________
import logging
import bz2
import gensim
from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import Dictionary, MmCorpus, HashDictionary, WikiCorpus
from gensim.utils import lemmatize

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

DEFAULT_DICT_SIZE = 100000
keep_words = DEFAULT_DICT_SIZE

dictionary = HashDictionary(id_range=keep_words, debug=False)
dictionary.allow_update = True
corpus = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
#tmp_fname = get_tmpfile("/Wiki-LSI/wiki_en_wordids_26112018.txt.bz2")
wikiids = WikiCorpus(corpus, lemmatize=lemmatize, dictionary=dictionary)
MmCorpus.serialize('/Wiki-LSI/wiki-corpus-27112018.mm', wikiids, progress_cnt=10000)
dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)
dictionary.save_as_text('/wiki_en_wordids_H_27112018.txt')
wikiids.save('/wiki_en_wordids_H_pkl_27112018.pkl')
dictionary.allow_update = False
purpose of the implementation
_______________________________________________________________________________________________________
I want to apply LSI to the wiki corpus:
the mapping between words and their integer ids
the bag-of-words (word counts) representation
_______________________________________________________________________________________________________
When I print the wiki BOW I get the following, which means the BOW
was successfully created:
from gensim import corpora, models, similarities
wikimapcorpus = corpora.MmCorpus('Wiki-LSI/wiki-corpus-27112018.mm')
print(wikimapcorpus)
MmCorpus(105 documents, 100000 features, 110797 non-zero entries)
problem
________________________________________________________________________________________________
BUT the number of ids is zero!
dictionary = Dictionary.load_from_text('/Wiki-LSI/wiki_en_wordids_27112018.txt')
print(dictionary)
Dictionary(0 unique tokens: [])
___________________________________________________________________________________________
This issue makes the kernel die when I run LSI or LDA.
Post by Alistair Windsor
I see what you are trying to do. Since you provided a dictionary, no
dictionary is constructed when you call WikiCorpus.
You need to use the constructed iterator to do something in order to
capture the debug information. I believe that if you add
MmCorpus.serialize('filename.mm', wikiids)
before the call to
dictionary.save_as_text('Wiki-LSI/wiki_en_wordids_26112018.txt.bz2')
then you will get something.
Out of interest, what is the running time of this? I have not tried it,
but it seems that a call to WikiCorpus(corpus, dictionary=dictionary) with a
dictionary should be very fast. The code does not do anything but construct
an iterator. That iterator must be used to produce the dictionary necessary
to save. That is going to take a while. The use of a hash function means
that this requires only one pass. I would save the dictionary once PRIOR to
filtering and then again after filtering.
Yours,
Alistair
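The lazy-evaluation point above is worth seeing concretely: constructing the corpus
wrapper touches no data, and the dictionary only fills as the stream is consumed. A
toy sketch (LazyCorpus and ToyDictionary are illustrative names, not gensim
classes):

```python
class ToyDictionary:
    # Minimal word -> id store, updated as documents stream past.
    def __init__(self):
        self.token2id = {}

    def add_document(self, tokens):
        for tok in tokens:
            self.token2id.setdefault(tok, len(self.token2id))

class LazyCorpus:
    # Like WikiCorpus(corpus, dictionary=dictionary): the constructor just
    # wires things together; nothing is read until the corpus is iterated.
    def __init__(self, texts, dictionary):
        self.texts = texts
        self.dictionary = dictionary

    def __iter__(self):
        for doc in self.texts:
            self.dictionary.add_document(doc)
            yield doc

d = ToyDictionary()
corpus = LazyCorpus([["wiki", "lsi"], ["wiki", "lda"]], d)
before = len(d.token2id)   # still empty: nothing has been consumed yet
list(corpus)               # serializing (e.g. MmCorpus.serialize) iterates
after = len(d.token2id)    # now populated
```

This is why saving the dictionary before the corpus has been iterated yields an
empty file: the saving itself is fine, there is simply nothing in the dictionary yet.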
Yours,
Alistair
Post by omar mustafa
I have a problem related to the extraction of word2ids from the
Wikipedia corpus.
import bz2
from gensim.test.utils import datapath
from gensim.corpora import Dictionary, MmCorpus, HashDictionary, WikiCorpus

DEFAULT_DICT_SIZE = 100000
keep_words = DEFAULT_DICT_SIZE

dictionary = HashDictionary(id_range=keep_words, debug=True)
dictionary.allow_update = True
corpus = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
wikiids = WikiCorpus(corpus, dictionary=dictionary)
dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)
dictionary.save_as_text('Wiki-LSI/wiki_en_wordids_26112018.txt.bz2')
wikiids.save('wiki_en_wordids_pkl_26112018.pkl.bz2')
dictionary.allow_update = False
The text file that is supposed to contain the word2ids from the wiki corpus
is written to my PC when I run the above code, but it is an empty file!
I would be thankful if you could help.
Regards
Omar
Post by Alistair Windsor
You are using a multistream bz2 archive under Python 2. The page you
link to has a big red box saying you cannot use multistream archives
under Python 2.7 due to limitations of the bz2 package under Python 2.
Download the non-multistream archive or switch to Python 3.
Alistair
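The multistream issue can be demonstrated with the standard library alone: a
multistream archive is just several independently compressed bz2 members
concatenated, and Python 3's bz2 module decompresses across the stream boundaries,
whereas Python 2's stopped after the first member. A small sketch (function names
and file paths are illustrative):

```python
import bz2

def write_multistream(path, chunks):
    # A multistream bz2 archive (the format of Wikipedia's *-multistream
    # dumps) is just independently compressed members back to back.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(bz2.compress(chunk))

def read_archive(path):
    # Under Python 3, bz2.open reads across stream boundaries, so the
    # whole archive comes back; Python 2's bz2 only saw the first stream.
    with bz2.open(path, "rb") as f:
        return f.read()
```

Reading only the first stream is exactly why the extracted text file came out
effectively empty under Python 2.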
--
You received this message because you are subscribed to the Google
Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send
For more options, visit https://groups.google.com/d/optout.