[gensim:11743] Question on Coherence API

Alistair Windsor

2018-11-02 15:25:35 UTC

My understanding of the Gensim models.coherencemodel API is that for the
coherence measures c_v, c_uci, c_pmi, and c_npmi we have to pass in texts,
which takes a list of list of str.

However, these coherence measures try to validate the topics against some
external corpus (such as a Wikipedia dump), and as such the external
"validation" corpus could be larger than the corpus that is being analyzed.
The corpus is accepted as an iterable returning (id, freq) pairs for each
document, but texts seems to accept only a list of lists. I have two
questions:

1. Does texts really not accept an iterable? I understand that it would
have to be an iterable of list of str or list of id (see below), since word
order matters for the PMI calculations.
2. Why use the raw text? Why not use the id-mapped text (with a
placeholder for out-of-dictionary words)? This is much more compact (and
obviously is what is being used internally anyway).
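
To make the second question concrete, here is a minimal sketch of the kind of
streamed, id-mapped input I have in mind. It is plain Python and does not call
Gensim; stream_id_texts, token2id, and OOV_ID are hypothetical names of my own,
and whether CoherenceModel would accept such an input is exactly what I am
asking.

```python
from typing import Dict, Iterable, Iterator, List

OOV_ID = -1  # hypothetical placeholder id for out-of-dictionary words


def stream_id_texts(
    texts: Iterable[List[str]], token2id: Dict[str, int]
) -> Iterator[List[int]]:
    """Lazily map each tokenized document to a list of ids, keeping word order."""
    for tokens in texts:
        # One id per token, so sliding windows still see the original positions.
        yield [token2id.get(tok, OOV_ID) for tok in tokens]


# Toy vocabulary and a generator standing in for a large on-disk corpus.
token2id = {"topic": 0, "model": 1, "coherence": 2}
docs = (line.split() for line in ["topic model coherence", "a topic model"])
id_docs = list(stream_id_texts(docs, token2id))
```

Because the mapping is a generator, only one document is held in memory at a
time, and the placeholder keeps window positions aligned with the raw text.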

I am also not sure if there is a way to initialize a Coherence model
without some topics. This would calculate the pointwise mutual information
measures from the external corpus once, so that they do not have to be
recomputed. Hopefully, if we take all our models and feed them in at once to
for_models(models, dictionary, topn=20, **kwargs), then we will only
calculate the PMI once.
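
The "count once, score many" idea can be sketched as follows. This is an
illustration in plain Python under my own assumptions, not Gensim's
implementation: count_windows and mean_pmi are hypothetical helpers, the
window is a simple boolean sliding window, eps is an arbitrary smoothing
constant, and pairs containing a word absent from the reference corpus are
skipped.

```python
import math
from collections import Counter
from itertools import combinations


def count_windows(texts, window_size):
    """One pass over the reference corpus: count windows per word and per pair."""
    word_counts, pair_counts, num_windows = Counter(), Counter(), 0
    for tokens in texts:
        # A document shorter than the window contributes a single window.
        last_start = max(len(tokens) - window_size, 0)
        for start in range(last_start + 1):
            seen = sorted(set(tokens[start:start + window_size]))
            word_counts.update(seen)
            pair_counts.update(combinations(seen, 2))  # keys are sorted pairs
            num_windows += 1
    return word_counts, pair_counts, num_windows


def mean_pmi(topic, word_counts, pair_counts, num_windows, eps=1e-12):
    """Average PMI over all word pairs of one topic, using the cached counts."""
    scores = []
    for w1, w2 in combinations(sorted(set(topic)), 2):
        if word_counts[w1] == 0 or word_counts[w2] == 0:
            continue  # assumption: ignore pairs with an unseen word
        p1 = word_counts[w1] / num_windows
        p2 = word_counts[w2] / num_windows
        p12 = pair_counts[(w1, w2)] / num_windows
        scores.append(math.log((p12 + eps) / (p1 * p2)))
    return sum(scores) / len(scores) if scores else float("nan")


reference_texts = [
    ["topic", "model", "coherence"],
    ["topic", "model"],
    ["coherence", "corpus"],
]
# The expensive pass happens once...
word_counts, pair_counts, num_windows = count_windows(reference_texts, 2)
# ...and any number of topic sets can then be scored from the same counts.
score_a = mean_pmi(["topic", "model"], word_counts, pair_counts, num_windows)
score_b = mean_pmi(["model", "coherence"], word_counts, pair_counts, num_windows)
```

Counting every pair in the vocabulary can be memory-hungry for a large
reference corpus. I assume that is why the topics are wanted up front, so that
counting can be restricted to the words that actually appear in them.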

There are some scripts out there that repeatedly call CoherenceModel but
don't reuse it. In my experience this is very slow (even without using a
huge external corpus).

At some point I need to crawl through the code myself but for now I hope
some more knowledgeable folks can point me in the right direction.
