hans mohan
2018-10-28 05:55:20 UTC
Hi,
I have trained an LSI model to find the similarity of a new
document/query against the training corpus.
Total training set: 73K records (each record is an English text sentence).
The pipeline used to construct the similarity index for the 73K training
records is as follows:
1. Create the dictionary. We use the filter_extremes method with
no_below = 2 and no_above = 1.0, which results in a dictionary of 34,523
unique tokens: Dictionary(34523 unique tokens).
2. A BoW corpus is then created and transformed to TF-IDF.
3. The TF-IDF corpus is transformed to an LSI representation (num_topics = 500).
4. Finally, the similarity index is created (num_features = 500). A code
sketch of these steps follows below.
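For reference, here is a minimal sketch of the pipeline described above,
using the standard Gensim calls. Variable names such as `texts` (our list
of tokenized training records) are placeholders:

from gensim import corpora, models, similarities

# texts: list of token lists, one per training record (already preprocessed)
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=2, no_above=1.0)  # -> 34523 unique tokens

# Step 2: BoW corpus, then TF-IDF transformation
bow_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(bow_corpus)

# Step 3: LSI on top of TF-IDF
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=500)

# Step 4: cosine similarity index over the 500-dimensional LSI vectors
index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]], num_features=500)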
Now, for testing purposes, we calculated the similarity scores for 1,000
queries that were already part of the training set (i.e. we selected
1,000 records from the 73K training corpus).
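Each test query is scored roughly like this (again a sketch; `query_tokens`
is a placeholder for one tokenized record taken from the training set):

# Run the query through the same dictionary -> TF-IDF -> LSI pipeline
query_bow = dictionary.doc2bow(query_tokens)
query_lsi = lsi[tfidf[query_bow]]

# Cosine similarities of the query against all 73K training documents
sims = index[query_lsi]
print(sims.max())  # the best match, i.e. the score we report per query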
Questions:
For some of the queries we received scores around 0.88, while for the
rest the scores lie in the range 0.9 to 0.99/1.0.
A) For test queries that are already part of the training corpus, must
the score necessarily come out as 0.98/0.99/1.0?
B) What could be the possible reasons for scores in the 0.88 to 0.99
range for these 1,000 queries, which are word-for-word identical to
records in the training corpus?
I need to understand the reasons, as they have to be explained to the
client and to our team's senior management; an answer from the Gensim
team would serve as an authoritative reference.
I hope to receive your reply soon.
Thanks and Regards,
Hans Mohan