Discussion:
[gensim:11829] How to Doc2Vec prune/trim vocabulary based on document frequency
q***@gmail.com
2018-11-28 09:50:56 UTC
Permalink
Is there any way to filter the vocabulary based on the document frequency
of the words? We have this ability in the `Dictionary` object, where we can
call `filter_extremes` to do it.

`Doc2Vec` does have the `min_count` parameter, which I think refers to term
frequency. Additionally there is `trim_rule`, which I think could be a way,
but it may have performance issues.

Please suggest whether a `Dictionary` object can be passed to `Doc2Vec` for
building the vocabulary, or whether there are any other methods.
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Gordon Mohr
2018-11-28 19:08:51 UTC
Permalink
`Doc2Vec` doesn't use the `Dictionary` class. The main ways of slimming the
vocabulary are the original `min_count`, which ignores words below a certain
frequency, and the newer `max_final_vocab`, which automatically discards
enough of the lowest-frequency words to keep the surviving vocabulary under
the specified size.
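The effect of `max_final_vocab` can be sketched in plain Python. This is an illustration of the pruning idea, not gensim's actual implementation (gensim's tie-breaking among equal-frequency words may differ):

```python
from collections import Counter

def cap_vocab(word_counts, max_final_vocab):
    """Keep only the highest-frequency words so that at most
    `max_final_vocab` survive (a sketch of what gensim's
    `max_final_vocab` pruning achieves)."""
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_final_vocab])

counts = Counter({"the": 50, "model": 20, "vector": 12, "corpus": 3, "rare": 1})
print(cap_vocab(counts, 3))  # only the three most frequent words survive
```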

The `trim_rule` parameter is also a way to implement more sophisticated
policies, such as ones that exempt certain words, by providing a function
that decides whether to keep each candidate word.
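Since the original question is about document frequency, a `trim_rule` could close over a precomputed document-frequency table (gensim only passes the word, its count, and `min_count` to the rule, so the document frequencies must be captured from outside). A sketch, with the `RULE_*` constants defined locally to mirror `gensim.utils` (their values are assumed here; use the real `gensim.utils` constants in practice):

```python
# Constants mirroring gensim.utils.RULE_DEFAULT / RULE_DISCARD / RULE_KEEP
# (values assumed for this sketch).
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

def make_doc_freq_trim_rule(doc_freq, n_docs, no_below=2, no_above=0.5):
    """Build a trim_rule that discards words appearing in too few or too
    many documents, similar in spirit to Dictionary.filter_extremes.
    `doc_freq` maps word -> number of documents containing it."""
    def trim_rule(word, count, min_count):
        df = doc_freq.get(word, 0)
        if df < no_below or df > no_above * n_docs:
            return RULE_DISCARD
        return RULE_DEFAULT  # otherwise fall back to the normal min_count test
    return trim_rule

docs = [["cat", "sat"], ["cat", "mat"], ["cat", "dog"], ["dog", "ran"]]
doc_freq = {}
for doc in docs:
    for w in set(doc):
        doc_freq[w] = doc_freq.get(w, 0) + 1
rule = make_doc_freq_trim_rule(doc_freq, len(docs), no_below=2, no_above=0.6)
print(rule("cat", 3, 1))  # "cat" is in 3 of 4 docs, above the 0.6 ceiling
```

The resulting function would be passed as `trim_rule=rule` to the `Doc2Vec` constructor or to `build_vocab()`.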

Finally, rather than simply providing your corpus to the constructor (so
all steps run automatically), or even calling `build_vocab()` yourself
(which does all the vocabulary-scanning, trimming, and
data-structure allocation in one call), you can look at the `build_vocab()`
source and call its internal steps individually, directly, and insert your
own extra vocabulary-mutating logic before the late steps that actually
allocate the final dict/arrays. (This used to go after `scan_vocab()` and
`scale_vocab()` but before `finalize_vocab()`; the paths may be slightly
different in the latest versions, so you should familiarize yourself with
the source to see where to jump in.)
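Because those internal method names are version-dependent, a simpler and more portable alternative is to filter the corpus itself by document frequency before handing it to `Doc2Vec`. A dependency-free sketch of that pre-filtering step (the `no_below`/`no_above` names are borrowed from `Dictionary.filter_extremes` for familiarity):

```python
def filter_by_doc_freq(tokenized_docs, no_below=2, no_above=0.5):
    """Drop tokens whose document frequency falls outside the given
    bounds, before the corpus ever reaches Doc2Vec. A plain-Python
    sketch of what Dictionary.filter_extremes does, applied at the
    corpus level."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for w in set(doc):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    keep = {w for w, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs}
    return [[w for w in doc if w in keep] for doc in tokenized_docs]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"],
        ["the", "cat", "ran"], ["the", "dog", "ran"]]
# "the" appears in all 4 docs (above the 0.6 ceiling) and is dropped.
print(filter_by_doc_freq(docs, no_below=2, no_above=0.6))
```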

Note that more aggressive use of the `sample` feature, which randomly skips
higher-frequency words during training in escalating proportion to their
frequency, may be preferable to fully discarding very-frequent words. In
larger corpuses, `sample` can be made much more aggressive, which is to say
smaller, than its default of `1e-03` – say `1e-04`, `1e-05`, or even
smaller – to save a lot of training time on overrepresented words, while
also generally improving overall vector quality.
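To get a feel for how aggressive those `sample` values are, here is the discard probability from the original word2vec paper's subsampling formula, `p = 1 - sqrt(sample / f)`; gensim's exact expression differs slightly, but it behaves the same way:

```python
import math

def discard_probability(word_count, total_words, sample=1e-3):
    """Probability of skipping an occurrence of a word under frequency
    subsampling, per the word2vec paper's formula p = 1 - sqrt(sample/f),
    where f is the word's share of the corpus."""
    f = word_count / total_words
    return max(0.0, 1.0 - math.sqrt(sample / f))

# A word making up 5% of the corpus is skipped ~86% of the time at the
# default sample=1e-3, and ~99% of the time at a more aggressive 1e-5.
print(round(discard_probability(50_000, 1_000_000, sample=1e-3), 2))
print(round(discard_probability(50_000, 1_000_000, sample=1e-5), 2))
```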

- Gordon
q***@gmail.com
2018-11-30 14:11:38 UTC
Permalink
Thanks, I will definitely try `sample` and share my thoughts. I think it
will do what I am trying to achieve.