[gensim:7976] Word2Vec with phrases : train() called with an empty iterator

Discussion:

e***@gmail.com

2017-03-27 13:43:52 UTC

sentences = Text8Corpus('/home/prakhar/text8')
phrases = Phrases(Text8Corpus('/home/prakhar/text8'), min_count=1, threshold=2)
bigram = Phraser(phrases)
model = models.word2vec.Word2Vec(bigram[sentences], size=200, workers=4, min_count=1)

The logger output while running this code:

2017-03-27 18:33:23,366 : INFO : training model with 4 workers on 677776 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-03-27 18:33:24,319 : INFO : expecting 1701 sentences, matching count from corpus used for vocabulary survey
2017-03-27 18:33:25,170 : WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).

Clearly this is not desirable, as the results below show:

model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

[(u'davies_welsh', 0.3605641722679138),
(u'add_ins', 0.3399544656276703),
(u'kings_landing', 0.3140672445297241),
(u'the_cordillera', 0.30870741605758667),
(u'giant_anteater', 0.30382204055786133),
(u'analog_clocks', 0.30148613452911377),
(u'back_together', 0.30050382018089294),
(u'ionych', 0.2958505153656006),
(u'be_true', 0.29267528653144836),
(u'particle_physicists', 0.2917472720146179)]


Gordon Mohr

2017-03-27 18:50:43 UTC

The `bigram[sentences]` syntax from Phraser (or even Phrases) only creates
an iterator for a single phrase-combining pass over `sentences`.

Word2Vec needs an iterable object that can be iterated over multiple times:
once for vocabulary discovery, then again for multiple (default 5)
training passes. You'll get this warning if, after the first pass, the
iterator you passed in has been exhausted and can't restart for another
pass.
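
To illustrate the difference in plain Python, here is a minimal sketch that does not use gensim at all. The names `one_pass` and `count_passes` are hypothetical stand-ins: the first mimics a single-pass stream, the second mimics a consumer that, like Word2Vec, reads its input more than once.

```python
def one_pass(texts):
    """Hypothetical stand-in for a single-pass stream: a generator."""
    for tokens in texts:
        yield tokens


def count_passes(corpus, passes=3):
    """Hypothetical consumer: iterate over `corpus` several times and
    record how many items each pass saw."""
    return [sum(1 for _ in corpus) for _ in range(passes)]


texts = [["new", "york"], ["machine", "learning"]]

# A generator is exhausted after the first pass, so later passes see nothing.
exhausted = count_passes(one_pass(texts))

# A list is a restartable iterable: every pass sees all items.
restartable = count_passes(list(one_pass(texts)))

print(exhausted)    # [2, 0, 0]
print(restartable)  # [2, 2, 2]
```

The empty later passes are the plain-Python analogue of the "empty iterator" warning in the log above.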

Some options:

(1) For smaller corpuses that fit in memory, you can turn the single
iteration into an in-memory list:

corpus = list(bigram[sentences])

This has the added benefit of only doing the phrase-combining calculations
once, which might speed later passes.

(2) For larger corpuses, you might want to write your own iterable wrapper
that re-executes the `bigram[sentences]` code to create a single-pass
iterator every time a new iteration is requested. Roughly the following
should work:

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts
    def __iter__(self):
        return self.phrasifier[self.texts]

Then you'd pass Word2Vec a corpus of `PhrasingIterable(bigram, sentences)`.

(3) Similarly for larger corpuses, you might want to write the
phrase-combined texts to a new text file or files, which are then re-read
with a proper IO-based iterable (such as Text8Corpus itself, or the
LineSentence class defined alongside it in gensim.models.word2vec). This
also has the benefit of only doing the phrase-combining once.
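
A rough sketch of option (3) in plain Python, with no gensim dependency. Both `combine_phrases` and `LineCorpus` are hypothetical stand-ins: the first takes the place of a trained phraser, and the second plays the role that a file-backed iterable such as LineSentence would play in a real pipeline. The one-text-per-line, space-separated file layout is also an assumption.

```python
import os
import tempfile


def combine_phrases(tokens):
    """Hypothetical stand-in for a trained phraser: joins 'new york'."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == ["new", "york"]:
            out.append("new_york")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


class LineCorpus(object):
    """Restartable iterable: re-opens the file on every iteration and
    yields one whitespace-split token list per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                yield line.split()


texts = [["i", "love", "new", "york"], ["new", "york", "is", "big"]]

# Do the phrase-combining once, writing one space-joined text per line.
path = os.path.join(tempfile.mkdtemp(), "phrased.txt")
with open(path, "w", encoding="utf-8") as f:
    for tokens in texts:
        f.write(" ".join(combine_phrases(tokens)) + "\n")

corpus = LineCorpus(path)
first_pass = list(corpus)
second_pass = list(corpus)  # works again: the file is re-read each time
print(first_pass)
```

Because the file is re-opened in `__iter__`, every pass starts from the beginning, and the phrase-combining cost is paid only when the file is written.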

- Gordon

Abhishek Dubey

2017-08-23 19:45:06 UTC

Hey Gordon,

But when I use the class below:

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        return self.phrasifier[self.texts]

with Python 3.x, I get:

TypeError: iter() returned non-iterator of type 'TransformedCorpus'

I know about the difference between `next` in Python 2.x and `__next__` in
Python 3.x, but how do we fix it here?

Gordon Mohr

2017-08-23 20:25:14 UTC

Note that it's usually better to follow the approach numbered (3) above:
write the phrase-ified corpus somewhere, then read that for efficiency and
simplicity.

I'd need a lot more context about what you're attempting, or a simple
fully self-contained example of how to trigger it, to know how to
interpret the new, different error you're reporting.

- Gordon

Abhishek Dubey

2017-08-24 07:15:33 UTC

from __future__ import unicode_literals, print_function
from gensim.parsing import PorterStemmer
from spacy.en import English
from gensim.models import Word2Vec, Phrases, phrases, KeyedVectors
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import tokenize
import string
import re
import os

stemmer = PorterStemmer()
stopwords = stopwords.words('english')
nlp = English() #nlp = spacy.load("en")
data_dir_path = "full_path"

base_dir = os.path.dirname(data_dir_path)
os.chdir(base_dir)

class Stemming(object):
    word_lookup = {}

    @classmethod
    def stem(cls, word):
        stemmed = stemmer.stem(word)
        if stemmed not in cls.word_lookup:
            cls.word_lookup[stemmed] = {}
        cls.word_lookup[stemmed][word] = (
            cls.word_lookup[stemmed].get(word, 0) + 1)
        return stemmed

    @classmethod
    def original_form(cls, word):
        if word in cls.word_lookup:
            return max(cls.word_lookup[word].keys(),
                       key=lambda x: cls.word_lookup[word][x])
        else:
            return word

class SentenceClass(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), 'r') as myfile:
                doc = myfile.read().replace('\n', ' ')
                for sent in tokenize.sent_tokenize(doc.lower()):
                    yield [Stemming.stem(word)
                           for word in word_tokenize(re.sub("[^A-Za-z]", " ", sent))
                           if word not in stopwords]

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts
    def __iter__(self):
        yield self.phrasifier[self.texts]

my_sentences = SentenceClass(data_dir_path)

my_phrases = Phrases(my_sentences, min_count=1)
my_corpus = PhrasingIterable(my_phrases,my_sentences)
model = Word2Vec(my_corpus, size=100, window=2, min_count=1, workers=2)

Hey Gordon,
Above is my complete code. The error I am currently getting is below; the
code seems to be passing a list somewhere where it is supposed to pass words.

  File "C:/Users/Adubey4/Desktop/rasagit/mycode/error_bigram.py", line 65, in <module>
    model = Word2Vec(my_corpus, size=100, window=2, min_count=1, workers=2)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 503, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 577, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 601, in scan_vocab
    vocab[word] += 1

TypeError: unhashable type: 'list'
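
One possible reading of this traceback, sketched in plain Python without gensim: because `__iter__` uses `yield` on the whole transformed corpus, the wrapper produces exactly one item, and that item is the entire corpus. A consumer that treats each item as a sentence then sees token lists where it expects words, and using a list as a dict key fails. `YieldsWholeCorpus` and `scan` are hypothetical stand-ins, not gensim code.

```python
from collections import defaultdict


class YieldsWholeCorpus(object):
    """Hypothetical wrapper with the same shape as the `yield` version."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        yield self.texts  # one item: the whole corpus, not one sentence


def scan(sentences):
    """Hypothetical vocabulary survey: count every word of every sentence."""
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
    return vocab


texts = [["hello", "world"], ["hello", "again"]]

items = list(YieldsWholeCorpus(texts))
print(len(items))  # 1

try:
    scan(YieldsWholeCorpus(texts))
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```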

Abhishek Dubey

2017-08-24 07:19:16 UTC

Just to update on the actual query: when I replace *yield* with *return* in
the *PhrasingIterable* class, I get the error I mentioned earlier.

Updated class:

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts
    def __iter__(self):
        return self.phrasifier[self.texts]

Error:

  File "<ipython-input-146-8c8b59b0c842>", line 1, in <module>
    model = Word2Vec(my_corpus, size=100, window=2, min_count=1, workers=4)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 503, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 577, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 589, in scan_vocab
    for sentence_no, sentence in enumerate(sentences):

TypeError: iter() returned non-iterator of type 'TransformedCorpus'

Gordon Mohr

2017-08-25 00:15:59 UTC

A good minimal example would trigger the error without recourse to any
outside dataset, or even other libraries/steps (like the stemming you're
doing).

Additionally, it could help to add code that checks, and prints, that each
step has done what's expected before continuing with the next. (As one
example, does the `my_phrases` object behave as expected before wrapping it
up for later steps?)
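
As a rough illustration of such a check, here is a small plain-Python helper. The name `check_corpus` and its messages are hypothetical, and it assumes Python 3 and that a valid corpus yields lists of string tokens. Note that it iterates over the corpus three times, so it is meant for small test inputs.

```python
def check_corpus(corpus, name="corpus"):
    """Hypothetical sanity check: the corpus must be non-empty, must give
    the same number of items on a second pass, and each item must be a
    list of string tokens."""
    first = sum(1 for _ in corpus)
    second = sum(1 for _ in corpus)
    if first == 0:
        return "%s: empty on first pass" % name
    if first != second:
        return "%s: not restartable (%d items, then %d)" % (name, first, second)
    for sentence in corpus:
        if not isinstance(sentence, list) or not all(
                isinstance(token, str) for token in sentence):
            return "%s: items are not lists of string tokens" % name
    return "%s: ok (%d sentences)" % (name, first)


texts = [["hello", "world"], ["hello", "again"]]

ok_report = check_corpus(texts, "list")
iterator_report = check_corpus(iter(texts), "iterator")
nested_report = check_corpus([texts], "nested")

print(ok_report)
print(iterator_report)
print(nested_report)
```

Running a check like this on each intermediate object, before handing it to the next step, separates "the corpus is exhausted" from "the corpus has the wrong shape".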

In your original message, you mentioned Python 2 vs 3 differences. Are you
suggesting this code worked in Python 2 but not Python 3? Or have all your
tests been in 3?

- Gordon

Mahmood Kohansal

2017-09-26 07:18:13 UTC

Hey Abhishek,

Did you find any solution for this error?
Like you, I want to train a model by first applying phrases and then
training word2vec.

Gordon Mohr

2017-09-26 17:48:14 UTC

From your other post describing the same `TypeError: iter() returned
non-iterator of type 'TransformedCorpus'` error, after working from the
example in this thread, I now see that my example code earlier in this
thread does the wrong thing with *its* `__iter__()` return line.

It should not be `return`ing the raw phrasifier result, but one that has
already been started as an iterator object, by use of the `iter()` built-in
function. That is, the `PhrasingIterable` example up-thread should have read:

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts
    def __iter__(self):
        return iter(self.phrasifier[self.texts])  # <-- this line fixed
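
The underlying rule can be reproduced in plain Python 3 without gensim: `__iter__` has to return an iterator (an object with `__next__`), not merely something iterable. In this sketch `Stream` is a hypothetical stand-in for an object like the transformed corpus, and the two wrapper classes are named only for illustration.

```python
class Stream(object):
    """Hypothetical stand-in for an object like the transformed corpus:
    it is iterable (has __iter__) but is not itself an iterator
    (has no __next__)."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for tokens in self.texts:
            yield tokens


class ReturnsIterable(object):
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        return Stream(self.texts)  # wrong: not an iterator


class ReturnsIterator(object):
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        return iter(Stream(self.texts))  # right: iter() starts the iteration


texts = [["hello", "world"], ["hello", "again"]]

try:
    list(ReturnsIterable(texts))
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

good = ReturnsIterator(texts)
passes = [list(good), list(good)]  # a fresh iterator on each pass
print(passes[0] == texts and passes[1] == texts)  # True
```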

- Gordon

e***@gmail.com

2017-03-28 00:23:11 UTC

Thanks for clarifying