Discussion:
[gensim:6111] Doc2Vec parameters for Wikipedia.
jose ipc
2016-06-01 13:20:58 UTC
Hello everyone.

I'm working on my thesis, using word2vec and doc2vec for topic detection in
the TC-STAR corpus. I have already experimented with word2vec, obtaining an
accuracy of 64% after pre-training with Spanish Wikipedia articles.
Now I want to repeat the experiment with doc2vec, but I am
confused by its parameters. Should I use PV-DM or PV-DBOW, hierarchical
softmax or negative sampling, concatenation or sum?

The script I'm using to train the doc2vec model on the Spanish wiki corpus
is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys

from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from random import shuffle

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    train_labeled_sentences = []
    with open(inp, "r") as fd:
        for i, line in enumerate(fd):
            # LabeledSentence expects a list of word tokens, not a raw string
            train_labeled_sentences.append(
                LabeledSentence(line.split(), tags=[str(i)]))
            if i % 10000 == 0:
                logger.info("Loaded %i articles" % i)
        # no explicit fd.close() needed; the 'with' block closes the file

    model = Doc2Vec(size=400, window=8, min_count=3, workers=8,
                    dm=1, hs=0, dbow_words=0, dm_concat=1)
    model.build_vocab(train_labeled_sentences)
    for epoch in range(10):
        shuffle(train_labeled_sentences)
        model.train(train_labeled_sentences)
        model.alpha -= 0.002            # decrease the learning rate
        model.min_alpha = model.alpha   # fix the learning rate, no decay
    model.save("wiki.recommend.mikolov.doc2vec")


In this script, PV-DM with concatenation is used, as Mikolov recommends
in [Distributed Representations of Sentences and Documents]. Is it possible
to obtain good results with this configuration and the wiki corpus?

Thanks for your answers.
Gordon Mohr
2016-06-01 21:00:05 UTC
There's no set of options best for all corpora and purposes – you have to
experiment with your own data and goals.

You may be interested in the paper "Document Embeddings with Paragraph
Vectors" (http://arxiv.org/abs/1507.07998), which trains PV-DBOW vectors
along with word-vectors on Wikipedia, and gets interesting results. (In
gensim Doc2Vec, this corresponds to the two non-default options `dm=0,
dbow_words=1`.) Unfortunately, as with some other papers, they don't seem to
completely specify their choice of options. (For example, I can't find them
ever saying what 'window' size they're using.)
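
For reference, setting up that mode in gensim might look like the following
sketch; only `dm=0, dbow_words=1` come from the paper's description, while
the numeric values are placeholder guesses, since the paper doesn't state
its settings:

from gensim.models import Doc2Vec

# PV-DBOW doc-vectors plus interleaved skip-gram word training; all
# numeric values are illustrative, not the paper's (unstated) choices
model = Doc2Vec(dm=0, dbow_words=1, size=300, window=8, min_count=40,
                negative=5, iter=10, workers=8)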

A few observations on your existing code/choices:

* DM-with-concatenation results in the largest, slowest models and I
haven't yet found a demo dataset/problem where it gives the best results
(as is implied in the Mikolov/Le paper). So, I'd try it last, if you've got
lots of free time and RAM.

* The 'window' is the maximum number of context words used on either side of
the 'target' word, so a value of 8 actually uses up to 16 words. It's not
automatically the case that larger is better; I've seen datasets where
`window=2` resulted in the best analogies-scores.

* It's tough to learn much from words that only appear 3 times; the paper
above says they used a cutoff that resulted in a 915,000-word vocabulary,
which, for the Wikipedia dumps I've worked with, I think corresponds to a
`min_count` closer to 40 or 50.

* If specifying `hs=0`, it's good to be explicit about the number of
negative-samples used (though your code is OK in the latest gensim
versions, where a default of `negative=5` applies). As with 'window',
though, sometimes even fewer negative-samples are sufficient or even
best-performing (on larger training sets).

* Because the default 'min_alpha' is 0.001, and the default 'iter'
(controlling the class's own multiple passes per `train()` call) is 5, your
first epoch will actually be 5 passes over the data, *and* will descend the
effective alpha to 0.001. Then, on the next epoch, you'll do another 5
passes, but now at the fixed new alpha/min_alpha value. You probably don't
want this behavior. You can just let the class do the iterations and
alpha-management – set 'iter' to 10 – OR, if you want to manage it manually,
set 'iter' to 1 (so each `train()` does one pass) and set 'min_alpha' to be
equal to 'alpha' (or whatever you want it to be at the end of the next
epoch), so it doesn't zig-zag across your iterations. (A sketch of both
options follows.)
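
As a rough sketch of those two alternatives, using the same old-style gensim
calls as your script (parameter values carried over or illustrative):

# Option 1: let the class manage the passes and the alpha decay itself
model = Doc2Vec(size=400, window=8, min_count=3, workers=8,
                dm=1, hs=0, negative=5, iter=10)
model.build_vocab(train_labeled_sentences)
model.train(train_labeled_sentences)  # 10 passes, smooth alpha decay

# Option 2: manage epochs manually, one pass per train() call
model = Doc2Vec(size=400, window=8, min_count=3, workers=8,
                dm=1, hs=0, negative=5, iter=1,
                alpha=0.025, min_alpha=0.025)
model.build_vocab(train_labeled_sentences)
for epoch in range(10):
    shuffle(train_labeled_sentences)
    model.train(train_labeled_sentences)
    model.alpha -= 0.002             # linear decay across the 10 epochs
    model.min_alpha = model.alpha    # so alpha doesn't zig-zag within a pass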

- Gordon
Kamal Garg
2018-04-25 06:29:40 UTC
Hi Gordon,
I used doc2vec PV-DBOW in two ways:
1) Doc2Vec(dm=0, dbow_words=1, size=200, window=8, min_count=20, iter=5,
workers=cores),
2) Doc2Vec(dm=0, dbow_words=1, size=200, window=5, min_count=12, iter=8,
workers=cores),

to train on my wiki corpus (14GB).

The model trained successfully. It gave relevant results for many phrases,
but I ran into trouble when I tried 'artificial intelligence'.
With the first model, I got the following suggestions:
1) [('Existential risk from artificial general intelligence', 0.7284922003746033),
('Ethics of artificial intelligence', 0.7267584800720215),
("Turing's Wager", 0.7224212884902954),
('Oracle (AI)', 0.7094788551330566),
('AI aftermath scenarios', 0.703824520111084),
('AI control problem', 0.6999846696853638),
('Superintelligence: Paths, Dangers, Strategies', 0.691785454750061),
('Murray Shanahan', 0.6860222220420837),
('Artificial empathy', 0.6842677593231201),
('Explainable Artificial Intelligence', 0.682081937789917),
('Iyad Rahwan', 0.681956946849823),
('Moral Machine', 0.6816681027412415),
('Timeline of artificial intelligence', 0.676627516746521),
('Susan Schneider (philosopher)', 0.6764435768127441),
('From Bacteria to Bach and Back', 0.6752616167068481),
('AI-complete', 0.6739200353622437),
('David A. McAllester', 0.673627495765686),
('Knowledge acquisition', 0.6730433702468872),
('OpenAI', 0.6718262434005737),
('Open Letter on Artificial Intelligence', 0.6698791980743408)]


With the second model, I got:

2) [('Existential risk from artificial general intelligence', 0.7561817765235901),
('History of artificial intelligence', 0.734763503074646),
('Ethics of artificial intelligence', 0.7274946570396423),
('Oracle (AI)', 0.7165532112121582),
("Turing's Wager", 0.7119142413139343),
('Artificial general intelligence', 0.7059307098388672),
('Deep learning', 0.7024167776107788),
('AI takeover', 0.701856791973114),
('AI aftermath scenarios', 0.6950700879096985),
('Cognitive science', 0.6925462484359741),
('Symbolic artificial intelligence', 0.6894776821136475),
('AI-complete', 0.6873871088027954),
("Hubert Dreyfus's views on artificial intelligence", 0.6849253177642822),
('Moral Machine', 0.6835113167762756),
('Artificial neural network', 0.6826612949371338),
('Mind uploading', 0.6812909841537476),
('Cognitive bias mitigation', 0.6788017749786377),
('Explainable Artificial Intelligence', 0.6765998601913452),
('Bayesian cognitive science', 0.6736477017402649),
('Intelligence explosion', 0.671064019203186)]


Model two seems to be giving better results, but I want to eliminate suggestions like 'Existential risk from artificial general intelligence' and 'History of artificial intelligence'.

Is there a way I can tune the parameters to get better results? Also, should I try PV-DM with averaging to get better phrases, and if so, what window size and min_count should I use?

Any help will be appreciated. Thank you in advance.
Gordon Mohr
2018-04-25 17:43:43 UTC
You can always try different values for the meta-parameters to see if they
give better results for your purposes. This works best if you create a
repeatable, automated evaluation that scores each model (rather than just
manually eyeballing results), then use that score to pick from among many
parameter combinations.
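
For example, a bare-bones sweep might look like the sketch below, where
`read_corpus()` and `score_model()` are hypothetical stand-ins for your own
data-loading and task-specific scoring:

import itertools
from gensim.models import Doc2Vec

corpus = read_corpus()  # hypothetical: yields your TaggedDocument list

def score_model(model):
    # hypothetical: return a task-specific number, e.g. how often
    # known-related article pairs land in each other's top-20 neighbors
    raise NotImplementedError

results = {}
for window, min_count, iters in itertools.product([2, 5, 8],
                                                  [12, 20, 40],
                                                  [10, 20]):
    model = Doc2Vec(corpus, dm=0, dbow_words=1, size=200, window=window,
                    min_count=min_count, iter=iters, workers=4)
    results[(window, min_count, iters)] = score_model(model)

best_params = max(results, key=results.get)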

Published `Doc2Vec` work tends to use 10-20 (or more) training iterations,
so your current choices of 5 and 8 are on the low side.

Note that a larger `window` means relatively more word-to-word training,
and thus slower training overall and proportionately less tag-to-word
(doc-vector) training. If your main interest is the article-title
(doc-tag) vector quality, you *might* find smaller windows but more
iterations a useful tradeoff.

If you want to eliminate results like 'Existential risk from artificial
general intelligence', you will likely have to devise your own heuristics
for eliminating those kinds of articles from your training, or filter those
titles from your results. The `Doc2Vec` algorithm looks at text content,
and I would expect an article like 'Existential risk from artificial
general intelligence' to be an excellent match, by text content, with the
article 'Artificial intelligence'.
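
As a deliberately crude sketch of the filter-the-results option (note this
heuristic will also drop legitimate close variants, so you'd want to refine
it for your purposes):

query = 'Artificial intelligence'
raw = model.docvecs.most_similar(query, topn=50)
# drop any result whose title contains the query as a substring,
# e.g. 'History of artificial intelligence'
filtered = [(title, sim) for title, sim in raw
            if query.lower() not in title.lower()]
print(filtered[:20])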

- Gordon
Kamal Garg
2018-05-01 12:54:38 UTC
Thank you for the reply, Gordon. I have worked around the problem with
'artificial intelligence'. But I am stuck on another problem. For example, I
tried 'clay mineral' on the Wikipedia-trained doc2vec model and got no
results. But when I tried to find similar words for 'clay minerals', it
worked and showed me results, because Wikipedia has an article on 'Clay
minerals', not 'clay mineral'. Is there a way, when a user searches for
similar words for a term like 'clay mineral' that is not present, to fall
back to the closest string in the doc2vec dictionary and return its results?
For example, in this case I searched for 'clay mineral'; if no similar words
are found, it should give me the results for 'clay minerals'.
Thank you for the help in advance.
Gordon Mohr
2018-05-01 19:00:21 UTC
Do you mean that you're looking among the `Doc2Vec` model's `docvecs` tags
for the exact string 'clay minerals', and getting results because you
trained with a lowercased article title ('Clay minerals') – but then not
finding anything when you look for the exact string 'clay mineral', because
you didn't train any documents with that exact string as a tag? (It'd be
clearer if you used precise quoting of the strings & code you're trying –
helpful to me, but also a good habit for thinking about the problem, because
such absolute precision is required for working code.)

`Doc2Vec` only offers exact tag lookup of doc-vectors - if a string wasn't
offered as a trained tag, it's just not present.
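
For instance (assuming the model was trained with lowercased article titles
as its tags):

'clay minerals' in model.docvecs            # True: this exact tag was trained
'clay mineral' in model.docvecs             # False: never offered as a tag
model.docvecs.most_similar('clay mineral')  # raises KeyError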

But there are lots of techniques, mostly outside the purview of gensim,
that can help in such situations. No single one is best. Many go under
the name 'query expansion'. Some might involve extra
preprocessing/stemming/lemmatization before training, or might leverage
word-vectors in other ways. For example:

* if you added auto-complete or live similar-string-matching, seeded with
the set of known tags, someone typing 'clay mi...' would see good
continuations or small-edit-distance variations of what they've typed, and
be able to select/self-correct to something that's known.

* similarly, in the case of 'no results', you could run extra code to try
things like – (1) listing small edit-distance variations of what was typed,
from among the known set; or (2) tokenizing the full string (into ['clay',
'mineral']) and falling back to traditional keyword- or pattern/substring-
matching against known tags – and then offer those matches as suggestions,
or just compose results from some blend of those transformations, as if
they were what was originally queried.

* you might go beyond just lowercasing titles, to more aggressive removal
of plurals/suffixes/etc (stemming) or coercing terms into unified forms
(lemmatization). Even if you still use the original unique titles as
training-tags (or the display-titles), you'd calculate the 'canonical name'
of a tag via this extra processing, and also do it to any queries, to force
more variations of "the same or similar idea" to collapse to the same
lookup keys

* since you're using a mode that also creates compatible word-vectors,
breaking an unknown multi-word string (like 'clay mineral') into its words
(['clay', 'mineral']), then using each word, or some average of the words,
as a search might yield usefully-related tags; a rough sketch follows this
list.
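
For instance, here's a minimal sketch of that averaged-word-vector fallback,
assuming a model trained with `dbow_words=1` (so compatible word-vectors
exist) and lowercased tags; `tag_or_words_lookup` is just an illustrative
name, not a gensim function:

import numpy as np

def tag_or_words_lookup(model, query, topn=20):
    # try an exact doc-tag match first
    if query in model.docvecs:
        return model.docvecs.most_similar(query, topn=topn)
    # otherwise, average the in-vocabulary word-vectors of the query's
    # tokens and search for doc-tags near that point
    words = [w for w in query.lower().split() if w in model.wv]
    if not words:
        return []
    avg = np.mean([model.wv[w] for w in words], axis=0)
    return model.docvecs.most_similar(positive=[avg], topn=topn)

tag_or_words_lookup(model, 'clay mineral')  # falls back to word-vectors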

These are just a few of the tricks used to improve
search/information-retrieval beyond "perfect string matches" – there are
many more provided by other IR/text libraries; which of them are practical
or worthwhile for you will depend on the specifics of your project/goals.

- Gordon