Discussion:
[gensim:11720] LSI Similarity Score for queries which are already trained
hans mohan
2018-10-28 05:55:20 UTC
Hi,

I have trained an LSI model to find the similarity of new documents/queries against the training corpus.

Total training set: 73K records (each record is an English text sentence).

The pipeline used to construct the similarity index for the 73K training records is as follows:
1. Create the dictionary. We use the filter_extremes method with no_below=2, no_above=1.0. Resulting dictionary: Dictionary(34523 unique tokens).
2. A BoW corpus is created and then transformed to TF-IDF.
3. The TF-IDF corpus is transformed to an LSI representation (with num_topics=500).
4. Finally, the similarity index is created (with num_features=500).
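
For reference, a rough sketch of this pipeline in gensim is below (the variable texts stands for the list of tokenised training sentences; names are only illustrative):

    from gensim import corpora, models, similarities

    # 1. Dictionary with the filter_extremes settings described above
    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=2, no_above=1.0)

    # 2. BoW corpus, then TF-IDF
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    tfidf = models.TfidfModel(bow_corpus)
    tfidf_corpus = tfidf[bow_corpus]

    # 3. TF-IDF -> LSI with 500 topics
    lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=500)
    lsi_corpus = lsi[tfidf_corpus]

    # 4. Similarity index over the LSI vectors
    index = similarities.MatrixSimilarity(lsi_corpus, num_features=500)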

Now, for testing purposes, we calculated the similarity score for 1000 queries that were part of the training set (i.e. we selected 1000 records from the 73K training corpus).

Questions:
For some of the queries we received scores around 0.88, while for the rest the scores lie in the range 0.9 to 0.99/1.0.
A) For test queries that are already part of the training set, should the score necessarily come out as 0.98/0.99/1.0?
B) What could be the possible reasons for receiving scores in the range 0.88 to 0.99 for the 1000 queries, which are word-for-word identical to records in the training corpus?

I need to understand the reasons, as they have to be explained to the client and to our team's senior management; answers from the Gensim team would serve as an authoritative reference.

Hope to receive your reply soon.

Thanks and Regards,
Hans Mohan
Radim Řehůřek
2018-10-28 11:27:45 UTC
Hi Hans,
Post by hans mohan
A) For test queries that are already part of the training set, should the score necessarily come out as 0.98/0.99/1.0?
No. Indexed documents must always have a score of 1.0 (or maybe 0.9999 due
to rounding errors) -- something's wrong with your pipeline.
Post by hans mohan
B) What could be the possible reasons for receiving scores in the range 0.88 to 0.99 for the 1000 queries, which are word-for-word identical to records in the training corpus?
Some bug in how you process/index documents. Common culprits are processing your training data differently from how you process your queries, or a vocabulary / vector model mismatch.
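
For example, a quick sanity check is to push one of the indexed training documents through the exact same stacked transformation and confirm its self-similarity is ~1.0 (sketch below, assuming the training-time dictionary, tfidf, lsi and index objects are still in scope):

    def query_scores(tokens):
        vec_bow = dictionary.doc2bow(tokens)   # same dictionary as training
        vec_lsi = lsi[tfidf[vec_bow]]          # same bow -> tf-idf -> lsi chain as training
        return index[vec_lsi]                  # similarities against all indexed documents

    sims = query_scores(texts[0])              # texts[0] was indexed as document 0
    print(sims[0])                             # expect ~1.0, up to rounding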


Post by hans mohan
I need to understand the reasons, as they have to be explained to the client and to our team's senior management; answers from the Gensim team would serve as an authoritative reference.
We built a robust commercial engine, scaletext.com, for semantically
analyzing, indexing and searching large volumes of documents. It might be
an easier sell to management / clients than an open source lib.

Best regards,
Radim
h***@gmail.com
2018-10-28 13:16:41 UTC
Thanks a lot for your response. I will check through the pipeline and then get back to you with details.

Regards,
Hans Mohan


Sent from my iPhone
hans mohan
2018-11-01 08:05:39 UTC
Hi Radim,

I have gone through the entire code pipeline, and as you suspected, a vector model mismatch turned out to be the culprit.

I performed the stacked transformation while training (bow -> tf-idf -> lsi), but while forming the query vector I did (bow -> lsi) and skipped the tf-idf transformation.
After fixing this, I can see an immediate improvement in the similarity scores for the indexed query set.
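
In code, the change was essentially this (a sketch with the same dictionary, tfidf, lsi and index objects as at training time):

    vec_bow = dictionary.doc2bow(query_tokens)

    # before: tf-idf was skipped, so query vectors lived in a different space
    # vec_lsi = lsi[vec_bow]

    # after: the same stacked transformation as training (bow -> tf-idf -> lsi)
    vec_lsi = lsi[tfidf[vec_bow]]
    sims = index[vec_lsi]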

I tested 500 indexed queries and received a score of 1.0 or 0.99 for 490 of them, but for the remaining 10 or so queries I still received scores like [0.91, 0.93, 0.95, 0.96, 0.97, 0.98].

As per your other suggestion, I verified that pre-processing is handled in the same way for both the training and test phases.

Do you see any other possibility I could try?

In any case, many thanks for the previous suggestions.

Thanks and Regards,
Hans Mohan