Post by Gordon Mohr
I'm not really sure what you mean when you say something like "check the
similarity with Nokia or NASA". (Do you mean, check a many-word-headline
against a single word like 'Nokia'?)
Comparing such dissimilar word-sets (a many-word realistic sentence vs a
single word) might not ever give great results. And, it could be very
fragile to any other problems in your setup, like undertraining or poor
choice of training parameters. Without seeing more of your code, there
could be many sorts of other problems.
But also: you shouldn't view the similarity-values as absolute "similarity
percentages". They're only meaningful within a certain model, compared to
other similarity values from the same model. If the similarity is 0.7, but
the result is properly ranked as less-similar than texts that you agree are
better matches, and more-similar than texts that you agree are poor
matches, that's as good as you should expect from the model. If for
aesthetic reasons you want the number to be 0.5 instead of 0.7, you can
scale the results by 5/7ths, or some dynamic shift/scale based on the
actual range of similarities, from -1.0 to 1.0, that you're seeing.
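
As a minimal sketch of the two rescaling options just described (a fixed
5/7 factor, or a shift/scale based on the observed range), assuming the
similarities are already collected in a plain list. The scores and the
`rescale` helper are made up for illustration; neither is part of gensim.

```python
def rescale(similarities, new_min=0.0, new_max=1.0):
    """Min-max rescale similarity scores onto [new_min, new_max]."""
    lo, hi = min(similarities), max(similarities)
    if hi == lo:
        # No spread to stretch: map every score to the midpoint.
        return [(new_min + new_max) / 2.0 for _ in similarities]
    span = hi - lo
    return [new_min + (s - lo) * (new_max - new_min) / span
            for s in similarities]


# Hypothetical similarity values, all from the same model.
observed = [0.62, 0.70, 0.81, 0.93]

# Option 1: fixed scale, so that 0.7 is displayed as roughly 0.5.
fixed = [s * 5 / 7 for s in observed]

# Option 2: dynamic shift/scale based on the range actually observed.
scaled = rescale(observed)

# Either way the ordering of the texts is unchanged.
order_before = sorted(range(len(observed)), key=observed.__getitem__)
order_after = sorted(range(len(scaled)), key=scaled.__getitem__)
```

Both options are purely cosmetic: they change the displayed number, not
which texts rank above which.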
On the other hand, if the *relative* similarities seem wildly off (texts
with a cosine-similarity 0.7 don't seem less-similar than those with 0.8,
and don't seem more-similar than those with 0.6), then you'd want to focus
on gathering more data, preprocessing it differently, or tuning other
training parameters.
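
One possible way to check whether the relative similarities are sensible
is to hand-label a few texts as good or poor matches and count how often
a good match outscores a poor one. This is only a sketch: the scores
below are invented, and `pairwise_ranking_accuracy` is a hypothetical
helper, not a gensim function.

```python
def pairwise_ranking_accuracy(good_scores, poor_scores):
    """Fraction of (good, poor) pairs where the good match scores higher."""
    pairs = [(g, p) for g in good_scores for p in poor_scores]
    if not pairs:
        raise ValueError("need at least one good and one poor score")
    correct = sum(1 for g, p in pairs if g > p)
    return correct / len(pairs)


# Hypothetical similarities for texts you judged by hand.
good = [0.82, 0.78, 0.74]   # texts you agree are good matches
poor = [0.71, 0.69, 0.76]   # texts you agree are poor matches

accuracy = pairwise_ranking_accuracy(good, poor)
print(f"good match ranked above poor match in {accuracy:.0%} of pairs")
```

A value near 1.0 suggests the ranking is usable even if the absolute
numbers look high; a value near 0.5 suggests the ordering is close to
random and the data or training parameters need attention.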
- Gordon
Post by Rachit Garg
Hello Gordon,
Greetings. With reference to my previous post:
For example, I have trained the model on Nokia and NASA text data, plus
news data from websites and other news agencies.
Now suppose I check the similarity of some agriculture-related news that
contains words like "technology", for instance *"new wireless technology
for agriculture is coming that uses space oriented techniques"* or *"new
wireless technology in cricket for decision makers to make decision on
catch out of a player"*. It is true that neither news item is related to
NASA or Nokia, but when I check the similarity with Nokia or NASA it shows
more than 70% similarity, because of matching words like *technology,
wireless, space, decision, catch, player* (maybe these words are present
in various other news items related to NASA and Nokia).
I am using *model.n_similarity(s1, s2)*, where s1 is, let's say,
Nokia/NASA and s2 is the news (the list of words in the news that are
present in the vocab).
How can I discard such matches, or bring this similarity index down from
70% to, say, less than 50%?
I hope you understand my query.
Thanks in advance
Post by Gordon Mohr
What method of doing text-similarity checks from word-vectors are you
using?
If the words 'technology', 'cricket', 'england', 'india' aren't in your
training set, then they're likely no-ops in your comparison, and then your
two texts are effectively:
there may be ... related talk on tuesday
... news there may be ... ... on tuesday between ... and ...
And those are a pretty good match!
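
To make the no-op effect concrete, here is a small sketch that drops
out-of-vocabulary words from the two example headlines and shows what is
left to compare. The vocabulary is a stand-in built by hand (it assumes
'match' is also unknown); with a real gensim 4 model the membership test
would more likely be `word in model.wv.key_to_index`.

```python
def keep_known(tokens, vocab):
    """Drop tokens the model has no vector for (they cannot contribute)."""
    return [t for t in tokens if t in vocab]


news_a = "there may be a technology related talk on tuesday".split()
news_b = ("cricket news there may be cricket match on tuesday "
          "between india and england").split()

# Assumption for illustration: these words never appeared in training.
out_of_vocab = {"technology", "cricket", "match", "india", "england"}
vocab = (set(news_a) | set(news_b)) - out_of_vocab

kept_a = keep_known(news_a, vocab)
kept_b = keep_known(news_b, vocab)

# What the two texts still have in common once the topical words are gone.
shared = sorted(set(kept_a) & set(kept_b))
print(kept_a)
print(kept_b)
print(shared)
```

The topical words that would tell the two headlines apart are exactly
the ones removed, so the comparison is driven by generic filler words.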
* double-check that the training you're doing is working right for any
purposes
* get more training data to ensure more words have useful representations
* do more preprocessing if there are rules-of-thumb about words that are
rarely relevant (like 'tuesday')
- word2vec-average-based but weighted by word relevance, which might
make words like "there" "may" "be" "on" "tuesday" less important
- a more sophisticated/expensive comparison still based on word2vec
like "Word Mover's Distance"
- Doc2Vec or other algorithms that make a vector for a full-text
(without necessarily composing it from word2vec-vectors)
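
A rough sketch of the first alternative above (a word-vector average
weighted by word relevance), using tiny hand-made 2-d vectors rather
than a trained model. The vectors, the weights, and the helper names are
all assumptions for illustration; in practice the weights might come
from something like inverse document frequency.

```python
import math


def weighted_mean(tokens, vectors, weights):
    """Average the word vectors of `tokens`, scaling each by its weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for t in tokens:
        if t not in vectors:
            continue  # unknown words contribute nothing
        w = weights.get(t, 1.0)  # unlisted words get full weight
        for i, x in enumerate(vectors[t]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no known, non-zero-weight words in text")
    return [x / weight_sum for x in total]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: filler words point one way, topical words differ.
vectors = {
    "there": [1.0, 0.0], "on": [1.0, 0.0], "tuesday": [1.0, 0.0],
    "talk": [0.0, 1.0], "match": [0.0, -1.0],
}
# Hypothetical relevance weights: down-weight the filler words.
weights = {"there": 0.1, "on": 0.1, "tuesday": 0.1}

text_a = ["there", "talk", "on", "tuesday"]
text_b = ["there", "match", "on", "tuesday"]

plain = cosine(weighted_mean(text_a, vectors, {}),
               weighted_mean(text_b, vectors, {}))
weighted = cosine(weighted_mean(text_a, vectors, weights),
                  weighted_mean(text_b, vectors, weights))
print(plain, weighted)
```

With equal weights the shared filler words dominate and the two texts
look similar; once they are down-weighted, the differing topical words
decide the score.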
- Gordon
Post by Rachit Garg
I have a model created from a user's text data using gensim word2vec, and
I am using it to find the usefulness/relevancy of any news item to that
user.
Suppose there is a news item *"there may be a technology related talk on
tuesday"*. I am finding the similarity of the news with my user (the
model trained on the user's text data), but I am also getting similarity
with some irrelevant items, like the cricket news *"there may be cricket
match on tuesday between india and england"*, even though I don't have
cricket-related data in my user text data.
Please help me separate or discard this kind of news that is not relevant
to the user. I tried appending such (non-relevant) data with a dummy tag,
thinking it would increase the similarity with the dummy and reduce it
with the user.