prat
2018-11-26 07:54:00 UTC
Dear Sirs,
I have been wondering about the correct approach to updating an LDA model
when tf-idf is used, and I cannot figure it out.
So, let's say that I have some documents (initial_docs), and in the future I
expect new ones (new_docs) that may differ (e.g. topics may be different).
Let's say that the LDA model takes a long time to train, so I would like to
avoid re-training on all docs if possible.
My pipeline for the initial_docs looks like:
1. create the dictionary (*initial_dictionary*)
2. run tf-idf using the *initial_dictionary* to get the transformed corpus (
*initial_tfidf_corpus*)
3. run LDA on *initial_tfidf_corpus* to discover topics (*initial_lda*)
But how should I go about updating the model? What is correct for sure is
re-running the above with initial_docs + new_docs, but can I run the
pipeline on new_docs only? In other words, would the below be correct (let's
call it *version A*)?
1. create a new dictionary using new_docs only (*new_dictionary*)
2. run tf-idf on the new_docs to get the new corpus (*new_tfidf_corpus*)
3. run *initial_lda.update* on *new_tfidf_corpus*
Or should I rather do the following (*version B*)? But here I still
re-train tfidf...
1. add new documents to the *initial_dictionary* to get an extended one (
*extended_dictionary*)
2. calculate tf-idf anew using the *extended_dictionary* to get the
corpus (*extended_tfidf_corpus*)
3. run *initial_lda.update* with the *extended_tfidf_corpus*
Or maybe I should keep the old tfidf transformer? (*version C*) This seems
wrong to me...
1. add new documents to the *initial_dictionary* to get an extended one (
*extended_dictionary*)
2. calculate tf-idf using the *initial_tfidf* transformer with the
*extended_dictionary* to get the corpus (
*extended_tfidf_corpus_with_old_tfidf_transformer*)
3. run *initial_lda.update* with the
*extended_tfidf_corpus_with_old_tfidf_transformer*
Or maybe all of the above is wrong and a different approach is recommended?
Please help, I'm at my wits' end.
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.