Discussion:
[gensim:11665] Gensim Word2Vec
Rachit Garg
2018-10-09 12:31:31 UTC
I have a model created with the text data of a user using gensim word2vec, and I am
using it to find whether any news item is useful/relevant to that user.

Suppose a news item: *"there may be a technology related talk on tuesday"*.

I find the news item's similarity with my user (a model trained from the user's text
data).

But while finding similarity I am also getting high similarity with some irrelevant
data, like a cricket news item *"there may be a cricket match on tuesday between
india and england"*, even though I don't have any cricket-related data in my user's
text data.

Please help me with how to separate or discard such news that is not relevant to the
user. I tried appending such (non-relevant) data with a dummy tag, thinking it would
increase the similarity with the dummy and reduce it with the user.
Gordon Mohr
2018-10-09 19:23:56 UTC
What method of doing text-similarity checks from word-vectors are you
using?

If the words 'technology', 'cricket', 'england', 'india' aren't in your
training set, then they're likely no-ops in your comparison, and then your
two comparison texts are:

there may be ... related talk on tuesday
... news there may be ... ... on tuesday between ... and ...

And those are a pretty good match!
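For instance, here is a rough sketch of why that happens (hypothetical model path and texts, gensim-3.x-style calls; if you filter to in-vocabulary words before calling something like n_similarity(), the unknown words simply vanish from the comparison):

    # Rough sketch, not your exact code: with words like 'technology' or 'cricket'
    # missing from the vocabulary, both headlines shrink to nearly the same words.
    from gensim.models import Word2Vec

    model = Word2Vec.load("user_model.w2v")  # hypothetical path to your trained model

    def known(tokens, wv):
        # keep only tokens that actually have trained vectors
        return [t for t in tokens if t in wv.vocab]

    news_a = "there may be a technology related talk on tuesday".split()
    news_b = "cricket news there may be cricket match on tuesday between india and england".split()

    a, b = known(news_a, model.wv), known(news_b, model.wv)
    print(a)
    print(b)  # the leftover words in both lists look very alike
    print(model.wv.n_similarity(a, b))  # so this comes out unsurprisingly high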

You could:

* double-check that the training you're doing is working right for any purpose
* get more training data to ensure more words have useful representations
* do more preprocessing if there are rules-of-thumb about words that are rarely
relevant (like 'tuesday')
* try alternative text-comparisons, which might be any or all of:
  - word2vec-average-based but weighted by word relevance, which might make words
like "there" "may" "be" "on" "tuesday" less important
  - a more sophisticated/expensive comparison still based on word2vec, like
"Word Mover's Distance" (a rough sketch follows below)
  - Doc2Vec or other algorithms that make a vector for a full text (without
necessarily composing it from word2vec-vectors)
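For example, a minimal sketch of trying Word Mover's Distance with gensim (illustrative only; assumes a trained Word2Vec model, already-tokenized texts, and the pyemd package installed):

    # Illustrative sketch of one alternative comparison: Word Mover's Distance.
    # Lower distance = more similar (note: it's a distance, not a similarity).
    from gensim.models import Word2Vec

    model = Word2Vec.load("user_model.w2v")  # hypothetical path

    user_text = "nokia mobile phone network technology".split()  # made-up user profile text
    headline = "there may be a technology related talk on tuesday".split()

    print(model.wv.wmdistance(user_text, headline))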

- Gordon
Rachit Garg
2018-10-10 05:44:17 UTC
Hello Gordon
Greetings

With reference to the previous post:

For example, I have trained on the Nokia and NASA text data plus news data from
websites and other news agencies.

Now when I try the similarity of something related to agriculture news that has some
words like technology, for instance *"new wireless technology for agriculture is
coming that uses space oriented techniques"* or *"new wireless technology in cricket
for decision makers to make a decision on a catch-out of a player"*, it is true that
both news items are not related to NASA or Nokia. But when I check the similarity
with Nokia or NASA, it shows more than 70% similarity because of matches like
*technology, wireless, space, decision, catch, player* (maybe these words are
present in various news items related to NASA and Nokia).

I am using *model.n_similarity(s1, s2)*, where s1 is, let's say, Nokia/NASA and s2
is the news (the list of the news words that are present in the vocab).
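Roughly, what I am doing looks like this (a simplified sketch; the real corpus and preprocessing are not shown):

    # Simplified version of my comparison (actual training data not shown).
    from gensim.models import Word2Vec

    model = Word2Vec.load("nokia_nasa_model.w2v")  # model trained on Nokia/NASA text + news data

    s1 = ["nokia"]  # the "user" side, a single word
    news = "new wireless technology for agriculture is coming that uses space oriented techniques".split()
    s2 = [w for w in news if w in model.wv.vocab]  # only the news words present in the vocab

    print(model.wv.n_similarity(s1, s2))  # this is the ~0.7 value I am asking about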


How can I discard or lower this similarity index from 70% to, say, less than 50%?

I hope you are getting my query.

Thanks in advance
Gordon Mohr
2018-10-10 17:35:31 UTC
I'm not really sure what you mean when you say something like "check the
similarity with Nokia or NASA". (Do you mean, check a many-word-headline
against a single word like 'Nokia'?)

Comparing such dissimilar word-sets – a many-word realistic sentence vs a
single word – might not ever give great results. And, it could be very
fragile to any other problems in your setup, like undertraining or poor
choice of training parameters. Without seeing more of your code, there
could be many sorts of other problems.

But also: you shouldn't view the similarity-values as absolute "similarity
percentages". They're only meaningful within a certain model, compared to
other similarity values from the same model. If the similarity is 0.7, but
the result is properly ranked as less-similar than texts that you agree are
better matches, and more-similar than texts that you agree are poor
matches, that's as good as you should expect from the model. If for
aesthetic reasons you want the number to be 0.5 instead of 0.7, you can
scale the results by 5/7ths, or some dynamic shift/scale based on the
actual range of similarities, from -1.0 to 1.0, that you're seeing.
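For example, a purely cosmetic rescaling might look something like this (a sketch; raw_sims here is just a hypothetical list of cosine similarities from one model):

    # Purely cosmetic: shift/scale raw cosine similarities into a display range.
    # This changes none of the rankings - only the numbers shown.
    def rescale(raw_sims, lo=0.0, hi=1.0):
        mn, mx = min(raw_sims), max(raw_sims)
        if mx == mn:
            return [lo] * len(raw_sims)
        return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in raw_sims]

    raw_sims = [0.42, 0.58, 0.71, 0.80]   # hypothetical values from one model
    print(rescale(raw_sims))              # same ordering, different numbers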

On the other hand, if the *relative* similarities seem wildly off – texts
with a cosine-similarity 0.7 don't seem less-similar than those with 0.8,
and don't seem more-similar than those with 0.6 – then you'd want to focus
on gathering more data, preprocessing it differently, or tuning other
training parameters.

- Gordon
Rachit Garg
2018-10-10 18:49:11 UTC
Hello Gordon
Greetings

Yes, actually I am checking the similarity of a many-word headline with a single
word, i.e. my user, say Nokia.

My objective is to check whether any news headline or news text is relevant to my
user or not.

Let's say I have the news *"Motorola is launching a new mobile phone with 8G
features"*. This news is relevant/important for Nokia, whereas news like *"new
wireless technology for agriculture is coming out"* is not relevant for Nokia.

*I need to find out whether a news item is relevant for my user or not.*


For that, let me brief you on what I have done:

1. With the text data of Nokia I have built a vector space model using word2vec (a
rough sketch of this step follows below).
2. Now I am using the similarity of a many-word headline with my user (Nokia in our
example) to decide whether that headline is relevant to my user or not.
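Roughly, step 1 looks like this (simplified; my real preprocessing, corpus size and parameters differ):

    # Simplified version of step 1: build a word2vec model from the Nokia text data.
    from gensim.models import Word2Vec

    # nokia_sentences: tokenized sentences from the Nokia-related text data
    nokia_sentences = [
        "nokia is launching a new mobile phone".split(),
        "nokia announces a new network equipment deal".split(),
        # ... many more sentences in the real data
    ]

    model = Word2Vec(nokia_sentences, size=100, window=5, min_count=1, workers=4)
    model.save("nokia_model.w2v")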


If this approach is not correct, please feel free to tell me and suggest a better
approach for the same objective.


Thanks in advance
Rachit Garg
2018-10-12 14:30:29 UTC
Hello Gordon
Greetings
I hope you got my objective from the last discussion.
Waiting for your reply.
Gordon Mohr
2018-10-12 20:02:48 UTC
As previously mentioned, comparing such dissimilar word-sets – a many-word
realistic sentence vs a single word – might not ever give great results.
And, it could be very fragile to any other problems in your setup, like
undertraining or poor choice of training parameters. Without seeing more of
your code, there could be many sorts of other problems.

Without code, it's still not clear what you're doing. When you say "with
text data of nokia i have build a vector space model using word2vec",
that sounds misguided: you'd want to train any word or text models on large
datasets with many domain-words in the same training set, not just the data
related to one topic (like "text data of nokia" alone).

As previously mentioned, if your training dataset isn't good – with no or
few examples of key words – then those words add nothing (or randomness) to
the comparisons, so the matches you've said are bad might not be recognized
as bad. Also, as far as I know, 'Nokia' might in fact make 'wireless
technology for agriculture', so I'm not sure that's an absolutely "not
relevant" result – just that it's probably less relevant than pure 'mobile
phone' results. So again, you shouldn't count on these algorithms for
*absolute* measures of appropriateness, just *relative* measures. Whatever
"new wireless technology for agriculture is coming out" scores, the
absolute number doesn't matter. Only whether it's properly higher or lower
than some preferred, more-similar statement.

Your approach could work - you haven't shown enough details to convince me
that it's not. If it's not working well, the issue could be a bug in your
code, underoptimized parameters, or an inadequate corpus (which hasn't yet
been described in terms of size or contents).

- Gordon
Rachit Garg
2018-10-13 06:53:22 UTC
Hello Gordon
Greetings

Thanks for the reply, I understood what you said.

I am struggling with the objective of finding the relevancy of any news to my user
(Nokia).

What approach should I follow, starting from the basic steps? I have already
discussed the way I started; I want to make a new start. What approach will help me
achieve the target?

Objective: find out the relevancy of any upcoming news to my user (Nokia).

1. Should I get multiple-domain data, or will only Nokia data be sufficient?

I may be asking a silly question, but you have cooperated with me in previous
discussions and helped me a lot (hearty thanks to you).

Please show me the direction for achieving the objective, as I have tried so many
model builds and failed every time.
Gordon Mohr
2018-10-13 22:30:01 UTC
It's unclear what you mean by "my user (nokia)".

You should use as much domain-relevant training data as possible, so that
as many words as possible are known, and have good representations.

You haven't reported the *relative* similarity-scores that you've seen, so
it's not yet clear to me that it "fails everytime". Be specific about what
code you've run, what exact results you've seen, and why you think those
are insufficient – and then it might be possible to make further
suggestions.

(The approach you seem to be describing – train word-vectors, use
averages-of-word-vectors as the single vector for headlines, compare those
averages with other averages or single word-vectors – is something that
could give OK baseline results, as a start. And then once you have that
kinda-sorta working, you can tune it or compare it against other
methods. But there's no one way to recommend/explain, only things worth
trying, *if* you can be much more specific about your data and goals.)
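As a rough illustration of that baseline (not a recommendation of exact steps or parameters; assumes a word2vec model trained on plenty of news text and trivial whitespace tokenization):

    # Baseline sketch: average the word-vectors of a text, compare averages by cosine.
    import numpy as np
    from gensim.models import Word2Vec

    model = Word2Vec.load("news_model.w2v")  # hypothetical model trained on a large news corpus

    def avg_vector(tokens, wv):
        # average the vectors of in-vocabulary tokens; None if no token is known
        vecs = [wv[t] for t in tokens if t in wv.vocab]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    user_vec = avg_vector("nokia mobile phones networks".split(), model.wv)
    headline_vec = avg_vector("motorola is launching a new mobile phone".split(), model.wv)

    if user_vec is not None and headline_vec is not None:
        # judge this number only *relative* to the same calculation on other headlines
        print(cosine(user_vec, headline_vec))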

- Gordon
Rachit Garg
2018-11-20 07:44:58 UTC
I am still confused about how to get proper results...

In simple words, my query is:

I need to find out whether any upcoming news is useful for me or not.

Show me the way to find it out... I have tried many models but have not been
successful.