[gensim:11742] Document with no words in dictionary

Discussion:

Frastxxx

2018-11-01 17:12:27 UTC

Hello,

I prepared some documents for LDA training. Before I build the LDA model I
filter the documents, removing common words and rare words. One of my
documents contains only rare words, so after filtering it ends up as an
empty vector: none of that document's words are present in the dictionary.
I am surprised because when I iterate over the LDA model (built with the
dictionary mentioned above) I see that this document is assigned some
percentage for some topics. That is weird, because the topics were generated
from a dictionary that does not contain a single word from this document.
Because of that I get misleading results when I look for similarities with
completely different documents.

Is that correct behavior, or am I missing something here (which is
possible, I just started using gensim)?

The other thing is that when I use the previously generated LDA model with
a new document that also does not contain a single word from the
dictionary, it gets the highest similarity to the previously mentioned
document.

Frastxxx

2018-11-01 17:17:51 UTC

Hello.

I have a set of documents for LDA training.
I remove rare and common words from those documents to improve the results.
One of those documents contains only rare words, so after filtering I
get an empty vector.
The dictionary generated from those documents also does not contain any
words from that document (which is of course expected).

Surprisingly, after the LDA model has been generated I can see that this
document has some association with some topics, even though none of the
topics contains a single word from that document.

The other thing is that when I use the generated model with a new
document that also does not contain any word from the dictionary, it gets
the highest similarity to the previously mentioned document.

Is that expected behaviour, or am I missing something?

Alistair Windsor

2018-11-13 20:02:58 UTC

It is probably expected behavior. Empty documents do not match the
generative model. The code should probably be revised to deal more
gracefully with this edge case. For now, simply drop the "empty" document.
Of course all empty documents are similar: they are identical! You will
want to catch these cases yourself.
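
A minimal sketch of the advice above, assuming the corpus is a list of
gensim-style bag-of-words vectors (lists of (token_id, count) pairs, as
returned by Dictionary.doc2bow). The helper names split_empty,
prior_topic_mix and cosine are hypothetical and not part of gensim. The
second helper illustrates one plausible explanation of what was observed:
with no words to update on, inference has only the Dirichlet prior alpha to
fall back on, so every empty document would receive the same prior-shaped
topic mix. That explanation is an assumption about the model, not something
confirmed in this thread.

```python
from math import sqrt


def split_empty(bow_corpus):
    """Separate non-empty bag-of-words vectors from empty ones.

    Returns (kept, kept_positions, empty_positions), where the positions
    are indices into the original corpus so results can be mapped back.
    """
    kept, kept_positions, empty_positions = [], [], []
    for pos, bow in enumerate(bow_corpus):
        if bow:  # at least one (token_id, count) pair survived filtering
            kept.append(bow)
            kept_positions.append(pos)
        else:
            empty_positions.append(pos)
    return kept, kept_positions, empty_positions


def prior_topic_mix(alpha):
    """Expected topic proportions under a Dirichlet(alpha) prior alone.

    Assumption: this is what a document with no known words falls back to.
    """
    total = sum(alpha)
    return [a / total for a in alpha]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


if __name__ == "__main__":
    # Toy corpus: documents 1 and 3 lost every word during filtering.
    corpus = [[(0, 1), (2, 3)], [], [(1, 2)], []]
    kept, kept_positions, empty_positions = split_empty(corpus)
    print("train on:", kept)
    print("skipped documents:", empty_positions)

    # Two empty documents get the same prior-only mix, so they look
    # maximally similar to each other regardless of their original text.
    mix_a = prior_topic_mix([0.25, 0.25, 0.25, 0.25])
    mix_b = prior_topic_mix([0.25, 0.25, 0.25, 0.25])
    print("similarity of two empty documents:", cosine(mix_a, mix_b))
```

The vectors in kept can then be passed to gensim.models.LdaModel as usual,
with kept_positions mapping rows back to the original documents. At query
time, an empty doc2bow result is probably best reported as "no known words"
rather than scored for similarity.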
