jose ipc
2016-06-01 13:20:58 UTC
Hello everyone.
I'm working on my thesis, using word2vec and doc2vec for topic detection on
the TC-STAR corpus. I have already experimented with word2vec, obtaining an
accuracy of 64% after pre-training on Spanish Wikipedia articles.
Now I want to repeat the experiment with doc2vec, but I am confused
about its parameters. Should I use PV-DM or PV-DBOW, hierarchical
softmax or negative sampling, concatenation or sum?
The script I'm using to train the doc2vec model on the Spanish wiki corpus
is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys

from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from random import shuffle

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    # one wiki article per line; LabeledSentence expects a list of tokens,
    # so split the line instead of passing the raw string
    train_labeled_sentences = []
    with open(inp, "r") as fd:  # the with-block closes the file, no fd.close() needed
        for i, line in enumerate(fd):
            train_labeled_sentences.append(LabeledSentence(line.split(), tags=[str(i)]))
            if i % 10000 == 0:
                logger.info("Loaded %d articles" % i)

    model = Doc2Vec(size=400, window=8, min_count=3, workers=8,
                    dm=1, hs=0, dbow_words=0, dm_concat=1)
    model.build_vocab(train_labeled_sentences)

    for epoch in range(10):
        shuffle(train_labeled_sentences)
        model.train(train_labeled_sentences)
        model.alpha -= 0.002           # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay

    model.save(outp)  # save to the output path from argv instead of a hardcoded name
In this script, PV-DM with concatenation is used, as Mikolov recommends
in [Distributed Representations of Sentences and Documents]. Is it possible
to obtain good results with this configuration and the wiki corpus?
Thanks for your answers.
--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.