Discussion:
[gensim:11866] Doc2vec hyperparameter tuning with custom classifier
R. M.
2018-12-04 11:56:41 UTC
Hi,

I've written a script to automate the tuning of hyperparameters for a
doc2vec model. However, I wanted to check whether I'm doing things right,
since my results are decent but not great.

Basically, I define a number of values to test for each hyperparameter of
interest, and the script then trains a doc2vec model on a training corpus
(85% of my total corpus) for every combination of these hyperparameter
values. Each resulting model is tested on a testing corpus (the remaining
15%) by trying to guess the tag of each document: I infer a vector for the
document, then look for the most similar tag vector. The overall accuracy
and F1 score of each model is computed.

Here is the relevant part of my code:

from itertools import product
import multiprocessing

import numpy as np
import gensim
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics.pairwise import cosine_similarity

cores = multiprocessing.cpu_count()

# df_restricted (a DataFrame with tokenized "words" and label "tags" columns)
# and genre_list (the list of the ~60 unique labels) are defined earlier.

# Set values of interest for hyperparameters
hyperparams = {
    'vector_size': [50, 200],
    'min_count': [2, 10],
    'epochs': [20, 50],
    'window': [2, 10],
    'alpha': [0.025, 0.01],
}

# Split data into training corpus and testing corpus
x_data = df_restricted["words"]
y_data = df_restricted["tags"]
X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.15, random_state=42)

# Tag each training document with its genre label
train_tagged_docs = [TaggedDocument(t, [label])
                     for t, label in zip(X_train, y_train)]

# Train and evaluate models for every combination of hyperparameters
for param_size, param_count, param_epochs, param_window, param_alpha in product(
        hyperparams['vector_size'], hyperparams['min_count'],
        hyperparams['epochs'], hyperparams['window'], hyperparams['alpha']):

    # Train model
    np.random.shuffle(train_tagged_docs)
    model = gensim.models.doc2vec.Doc2Vec(
        dm=1, vector_size=param_size, min_count=param_count,
        epochs=param_epochs, workers=cores, window=param_window,
        alpha=param_alpha)
    model.random.seed(0)
    model.build_vocab(train_tagged_docs)
    model.train(train_tagged_docs, total_examples=model.corpus_count,
                epochs=model.epochs)

    # Evaluate model: infer a vector for each test doc and assign the label
    # whose trained tag-vector is most similar (cosine similarity)
    X_val = np.array([model.infer_vector(t) for t in X_test])
    genre_vectors = np.array([model.docvecs[x] for x in genre_list])
    sims = cosine_similarity(X_val, genre_vectors)
    y_val_pred = np.array(genre_list)[sims.argmax(axis=1)]
    acc = accuracy_score(y_test, y_val_pred).round(4)
    score = f1_score(y_test, y_val_pred, average='weighted').round(4)

Here are the results I get:


f1_score accuracy size min_count epochs window alpha
0 0.332 0.3419 50 2 20 2 0.025
1 0.1729 0.2233 50 2 20 2 0.01
2 0.3273 0.3373 50 2 20 10 0.025
3 0.1786 0.2279 50 2 20 10 0.01
4 0.3131 0.3044 50 2 50 2 0.025
5 0.2956 0.326 50 2 50 2 0.01
6 0.3243 0.3142 50 2 50 10 0.025
7 0.2942 0.3249 50 2 50 10 0.01
8 0.3483 0.3511 50 10 20 2 0.025
9 0.2477 0.2849 50 10 20 2 0.01
10 0.3413 0.3522 50 10 20 10 0.025
11 0.2425 0.2787 50 10 20 10 0.01
12 0.3211 0.307 50 10 50 2 0.025
13 0.321 0.3522 50 10 50 2 0.01
14 0.3251 0.3157 50 10 50 10 0.025
15 0.3149 0.346 50 10 50 10 0.01
16 0.3173 0.3563 200 2 20 2 0.025
17 0.1893 0.2474 200 2 20 2 0.01
18 0.3223 0.3645 200 2 20 10 0.025
19 0.1847 0.2413 200 2 20 10 0.01
20 0.3053 0.3219 200 2 50 2 0.025
21 0.2915 0.3311 200 2 50 2 0.01
22 0.3069 0.3249 200 2 50 10 0.025
23 0.2914 0.3337 200 2 50 10 0.01
24 0.3352 0.3634 200 10 20 2 0.025
25 0.2547 0.3054 200 10 20 2 0.01
26 0.3459 0.3753 200 10 20 10 0.025
27 0.2475 0.3008 200 10 20 10 0.01
28 0.3125 0.3244 200 10 50 2 0.025
29 0.3035 0.3496 200 10 50 2 0.01
30 0.315 0.326 200 10 50 10 0.025
31 0.3077 0.3516 200 10 50 10 0.01

Am I doing something wrong? The f1 score and accuracy score don't seem very
good. But bear in mind that there are around 60 different labels, so the
classifier doesn't have an easy job.

Many thanks in advance!
R. M.
2018-12-04 12:13:38 UTC
I have another question.

The results above are obtained when I remove from my corpus (before
splitting) all documents for labels which have fewer than 50 docs (each doc
in my corpus has only a single label; some labels have only a couple of
docs while others have more than a thousand).

The reason I've removed docs whose label has fewer than 50 docs is that
otherwise I get the following warnings:

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in
labels with no predicted samples.
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in
labels with no true samples.
In other words, when I split the corpus with 85/15 training/testing, some
docs in the training corpus have labels that are absent in the testing
corpus, and vice versa. The problem is that if I only keep docs whose label
has more than 50 docs, I significantly shrink my overall corpus.

How could I make it so that models are evaluated only by testing docs whose
label is present both in the training and in the testing corpora? Or is
there a better way to get around this problem?

Thanks a lot!

RM
Gordon Mohr
2018-12-04 23:37:46 UTC
Look into 'stratified' train-test split options, such as the `stratify`
option of the `train_test_split()` function you're using, or the
`StratifiedShuffleSplit` class. These can help ensure as much
balance-across-labels as possible, so you don't inadvertently discard all
examples of some labels from either data subset. (However, classes with so
few examples are likely to be nearly noise, with poor performance in your
final model, anyway.)
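
For example, here's a minimal sketch of the stratified split, reusing the
`x_data`/`y_data` names from your code (note sklearn will still raise an
error if any label has only a single document):

from sklearn.model_selection import train_test_split

# `stratify=y_data` keeps label proportions roughly equal in both subsets,
# so rare labels aren't accidentally missing from the train or test side
X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.15, random_state=42, stratify=y_data)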

- Gordon
Gordon Mohr
2018-12-04 23:33:10 UTC
You're on the right track: trying all permutations of a set of possible
parameters is a typical meta-optimization strategy, sometimes called "grid
search". But, you're using a very crude classification technique, and ther
are other things worth trying with respect to Doc2Vec training.

By supplying only the desired labels as the document-tags, you're
essentially training the Doc2Vec model with just 60 virtual mega-documents,
and it's learning just 60 unique doc-vectors. That might be OK, especially
if the docs are short or you lack the memory to give every doc its own
vector, but also limits the model expressiveness quite a bit. (And, trying
to train just 60 unique vectors in a 200-dimensional space could risk
severe overfitting – each of the vectors could trend towards being a
one-hot-like vector that just optimizes itself, with no useful
trading-off-against-related-peers. But, the mode you're using, PV-DM, by
including word-training for likely tens of thousands of other words
alongside the 60 doc-vectors, probably mitigates that.)

But that means: each of your labels gets summarized as a single doc-vector.
And future test data will be classified by the single nearest label-vector,
reducing all label-volumes to be roughly
spheres-around-those-summary-points. If in fact the labeled data populates
"lumpy" 200-dimensional regions of your space, this classification method –
roughly "K-Nearest-Neighbors where K=1 and there's only 1 example of each
class" – this process can't capture that.

There are lots of potential ways to train Doc2Vec vectors from labeled data
for use in a downstream classifier. A few would be:

(1) Train to give each doc a unique doc-vector (such as an integer doc-id),
oblivious to the known-labels. This uses Doc2Vec entirely in its original
'unsupervised' definition – so it's also plausibly principled to do this
training without any holdout test set. (It's just learning the patterns of
the texts, and it can do this without known labels.) Then, use the
resulting (vector, label) data as input to a separately-chosen classifier.

(2) Give each doc both a unique doc-id *and* the known-label, where
available, using the ability of `TaggedDocument` to assign multiple tags.
If I recall correctly, I've seen cases where mixing those
many-document-tags in helped make the resulting model more sensitive to the
label-distinctions. Still just use the (per-doc-vector, label) pairs to
train the downstream classifier. (This tagging scheme is sketched in code a
bit further below.)

(3) Train each doc with the known-label as its only doc-tag. (This is what
you've shown in your code so far.) This might streamline some steps but
also lose some expressiveness.

Further, especially for methods (2) and (3), you could consider
re-calculating the per-document vectors via inference at the end of
training. During training, a lot of effort is spent making the per-label
vectors predictive, even though those may not be what's interesting with
respect to individual texts. So this final re-inference ensures each
document has a vector optimized to represent its text, with respect to the
final trained model. (In case (3), there would have been no per-document
vectors until this step, and thus no way for a doc to get a vector that
reflects things like, "this text is somewhere between the A-label, D-label,
and R-label neighborhoods".)

However you do the doc-vector training, with or without known-labels being
mixed-in, you then need to do classification as a separate step.

Using a KNearestNeighbor-style classification may make sense if you don't
have too many data points. (Its performance drops rapidly as the number of
remembered training points grows.) But even with K=1, you might get better
evaluation scores than you get now – a test text wouldn't have to be
closest to the centroid of all examples of a label, just one close
neighbor, to be assigned a label. And higher Ks might smooth out the
effects of outlier examples, and do even better.

You could also try other classifiers. One of sklearn's linear classifiers
(`SGDClassifier`, `LinearSVC`, `SVC(kernel='linear')`) or its
`RandomForestClassifier` would be two options that contrast with a simple
nearest-neighbor approach. (Though each of these has its own
metaparameters to tweak.)
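
As a sketch of that separate classification step, using the re-inferred
`doc_vectors`/`labels` from the earlier sketch and inferred vectors for the
held-out test texts (the parameter choices here are arbitrary):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X_test_vecs = np.array([model.infer_vector(words) for words in X_test])

# K-nearest-neighbors over individual doc-vectors; K>1 smooths outliers
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
knn.fit(doc_vectors, labels)
print(accuracy_score(y_test, knn.predict(X_test_vecs)))

# A linear classifier, as a contrast to the nearest-neighbor approach
svc = LinearSVC()
svc.fit(doc_vectors, labels)
print(accuracy_score(y_test, svc.predict(X_test_vecs)))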

Other observations/suggestions:

* especially if your documents are short and your main task is
classification, the PV-DBOW mode (`dm=0`) is often fast and a top-performer.
(Beware, though: if you pursued the "only 60 docs" tag-assignment method,
and used 60 vector dimensions, it'd be severely prone to overfitting, and
there'd be no mixed-in word-training to help offset that.) In pure PV-DBOW,
the `window` size doesn't matter, but you can also add in skip-gram
word-training (with `dbow_words=1`), where `window` again matters. That's
slower, and can sometimes help or hurt the doc-vector quality via the
interplay with word-training. (A configuration sketch follows these bullets.)

* I haven't seen a lower-than-0.025 alpha (`0.01`) tried as often as a
higher alpha (0.05).

* if your dataset is small enough that it's quick and easy to run all these
permutations, then this might be a bad fit for Doc2Vec, or some of your
parameters might be missized (like a vector-size too large for a small
dataset). OTOH, if the runtime is a concern, note that one of the
time-consuming steps on a large corpus, the initial vocabulary-scan &
model-allocation, is actually only affected by one of the parameters you're
varying: `min_count`. You could plausibly optimize your code to prep a
model with a single `min_count` value, save that half-initialized model to
disk, then re-load it for variants of the other parameters, which you would
directly tamper-modify on the model before the substantive `train()`
occurs.
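
To make those last two bullets concrete, here's a hedged sketch: a PV-DBOW
model with added word-training, whose vocabulary scan is done once per
`min_count` and then re-used for variants of the other parameters. (I'm
assuming `window`/`alpha`/`epochs` are safe to adjust after `build_vocab()`;
`vector_size` is not, since the arrays are already allocated.)

from gensim.models.doc2vec import Doc2Vec

# PV-DBOW plus skip-gram word-training; `window` matters again here
base = Doc2Vec(dm=0, dbow_words=1, vector_size=100, window=5,
               min_count=5, epochs=20, workers=cores)

# One vocabulary scan & allocation per min_count value, saved for re-use
base.build_vocab(train_tagged_docs)
base.save('base_mincount5.model')

# Later, for each variant of the other parameters: re-load, tweak, train
variant = Doc2Vec.load('base_mincount5.model')
variant.window = 2
variant.alpha = 0.05
variant.epochs = 50
variant.train(train_tagged_docs, total_examples=variant.corpus_count,
              epochs=variant.epochs)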

- Gordon
R. M.
2018-12-05 12:01:41 UTC
Hi Gordon,

Thanks a lot for your detailed reply -- this is all very helpful.

Let me tell you a bit more about my corpus and my goals; hopefully this can
help determine what the best approach would be.

My corpus, as you guessed, is small-ish for doc2vec: it contains a little
over 15,000 documents. These docs are neither tiny nor huge -- at least
several paragraphs, a lot more in some cases. Each doc is tagged with a
single label (every doc has exactly one label).

Here are my goals with this project:

1. Obtain a similarity matrix of the labels. The way I have done this
previously is by training the doc2vec model with the labels (rather than
doc IDs) as tags, and then creating a heatmap representing the cosine
similarity of vectors for labels.
2. Obtain words that are characteristic of a given label. The way I've
previously done this is by training the model with the labels (& dm=1), and
plotting the top *n* most similar word vectors to each label vector.
3. Obtain labels that are most related to a given word. The way I've
previously done this is by training the model with the labels (& dm=1), and
plotting the top *n* most similar label vectors to specific word vectors.

As you can see, I presumably need to train word vectors, which on the face
of it rules out the basic dbow method. I've previously tried your solution
#2 (double tagging docs with doc ID & label), and the results for the three
tasks above seemed worse with the hyperparameters I tried.

There is also the question of how I could perform the three tasks above if
I don't train vectors for labels, as in your solution #1. I suppose what I
could do is calculate the average vector of all document vectors for docs
that have the same label. Would you recommend that?
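
To be concrete, here is the kind of thing I have in mind for goal 1 under
your solution #1, assuming `doc_vectors` and `labels` come from a
per-doc-id training run (just a sketch of the averaging idea I'm asking
about):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Average the doc-vectors of all docs sharing a label...
unique_labels = sorted(set(labels))
label_means = np.array([
    np.mean([v for v, lab in zip(doc_vectors, labels) if lab == label], axis=0)
    for label in unique_labels])

# ...then a label-by-label cosine-similarity matrix for the heatmap
label_sims = cosine_similarity(label_means)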

I would be very grateful if you could give me some further advice based on
the nature of my corpus and my goals.

Many thanks in advance!

RM
Gordon Mohr
2018-12-06 19:43:19 UTC
Your docs are reasonably-sized, though it would be good if there were more
of them. I suspect your overall classification-accuracy will improve as
soon as you use a more sophisticated classifier than "reduce each label to
a single summary vector & assign each doc to the nearest label".

For optimizing your other outcomes – like the label->label, word->label,
and label->word rankings – it's just a matter of tinkering, ideally via an
automated search, if you have an automated evaluation of the desirability
of the final results.

With a smaller dataset, you may want to explore smaller dimensionalities
and more training `epochs`.

You can add word-training to PV-DBOW with `dm=0, dbow_words=1` – and this
would be worth trying against the plain PV-DM modes.

Training vectors together is necessary for them to be comparable, and it is
the tug-of-war between making different examples, and different vectors,
predictive that gives rise to the useful spacings/orientations of the final
vectors. But when doc-vectors and word-vectors are being co-trained, that
means certain parameter choices may effectively give more
weight/training-attention to one or the other.

In particular, larger `window` values tend to mean relatively more
training-attention to words. So especially in cases where the power of the
label-vectors is of primary interest, you should be sure to try many
small-window values – even 1, 2, 3. I think it'd even be possible to try to
force a relative overweighting of labels by repeating them more than once
in the `tags` of a document, so that could be worth trying. Similarly, if
some labels have few examples but are still important to model on an equal
basis with larger categories, it may be worth trying an artificial
expansion of their proportion of the corpus by repeating their documents.
(Such non-varied repetition isn't ideal for creating subtle vectors – real
variety of examples does that – but could help offset the numerical
domination of other labels in influencing the final model state.)
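
Mechanically, both of those tweaks are simple; whether they help is an
empirical question (`tokens`, `doc_id`, `label`, and `rare_labels` here are
just placeholder names):

from gensim.models.doc2vec import TaggedDocument

# Repeat the label tag to give it relatively more training-attention
doc = TaggedDocument(words=tokens, tags=[doc_id, label, label])

# Repeat whole documents of rare labels to boost their share of the corpus
oversampled = train_tagged_docs + 2 * [d for d in train_tagged_docs
                                       if d.tags[-1] in rare_labels]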

You wouldn't necessarily need to reduce every label to a single summary
vector to calculate label->label similarities. You could instead do a
census of document neighbors: "For all documents labeled A, looking at their
n nearest neighbors' labels, which other labels appear most often?"
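
A rough sketch of that census, assuming you have `doc_vectors` and `labels`
arrays covering all documents (the neighbor count of 10 is arbitrary, and
'A' stands for whatever label you're examining):

from collections import Counter
import numpy as np
from sklearn.neighbors import NearestNeighbors

labels = np.array(labels)
nn = NearestNeighbors(n_neighbors=11, metric='cosine').fit(doc_vectors)
_, neighbor_idx = nn.kneighbors(doc_vectors)  # column 0 is the doc itself

census = Counter()
for i, neighbors in enumerate(neighbor_idx):
    if labels[i] == 'A':
        census.update(lab for lab in labels[neighbors[1:]] if lab != 'A')

print(census.most_common(10))  # other labels most often near label-A docs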

Or similarly, if you create a multi-class classifier based on just the
(doc-vector, label) training-data that offers ranked predictions, there's
no single vector for an individual label. But you could check: "For all
documents with a most-predicted label of A, what other labels most-often
appear as 2nd, 3rd, etc. predictions?" Or, "for any known-label-A docs the
classifier gets wrong, what are the most common errors?" (Compare the idea
of the "confusion matrix", once you have a classifier, even a weak one, to
test.)
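
For the confusion-matrix idea, a minimal sketch (with `y_val_pred` being
whatever classifier's predictions you end up with, and 'A' a label of
interest):

import pandas as pd
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_val_pred, labels=genre_list)
cm_df = pd.DataFrame(cm, index=genre_list, columns=genre_list)

# For true label A, the off-diagonal counts in row A show which other
# labels the classifier most often mistakes it for
print(cm_df.loc['A'].drop('A').sort_values(ascending=False).head())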

Mainly: note that collapsing a complete category to a single summary vector
(such as an average of all examples) can be very crude in a
high-dimensional space, missing the potential variety of
boundaries-of-distinction that actually appear in the data, and won't just
be defined by "which label's center-point is closest?"

- Gordon
R. M.
2018-12-07 13:39:20 UTC
Thanks Gordon, as always this is very helpful.

Your suggestions are really interesting, and it makes sense that collapsing
a whole category into a single vector would lose a lot of relevant
information.

I just have a small clarificatory question about your three suggestions for
label-label similarity. The three questions you suggest asking are:



1. For all docs with label A, which other labels appear most often in
their *n* nearest neighbour docs?
2. For all docs with a most-predicted label of A, what other labels most
often appear as 2nd, 3rd, etc. predictions?
3. For any known-label-A docs the classifier gets wrong, what are the
most common errors?

I get that such methods would give me a ranking of the *n* most similar
labels to a given label A. However, what kind of metric could be used to
measure the degree of similarity to A? Previously, I have used cosine
similarity between label-vectors. With your three suggestions, it is less
immediately obvious to me what the similarity metric would be. Maybe I'm
missing an obvious answer -- could you give me some pointers?

- RM
Post by Gordon Mohr
Your docs are reasonably-sized, though it would be good if there were more
of them. I suspect your overall classification-accuracy will improve as
soon as you use a more sophisticated classifier than "reduce each label to
a single summary vector & assign each doc to the nearest label".
For optimizing your other outcomes – like the label->label, word->label,
label->word rankings, it's just a matter of tinkering – ideally via an
automated search, if you have an automated evaluation of the desirability
of the final results.
With a smaller dataset, you may want to explore smaller dimensionalities
and more training `epochs`.
You can add word-training to PV-DBOW with `dm=0, dbow_words=1` – and this
would be worth trying against the plain PV-DM modes.
Training vectors together is necessary for them to be comparable, and it
is the tug-of-war between making different examples, and different vectors,
predictive that gives rise to the useful spacings/orientations of the final
vectors. But when doc-vectors and word-vectors are being co-trained, that
means certain parameter choices may effectively give more
weight/training-attention to one or the other.
In particular, larger `window` values tend to mean relatively more
training-attention to words. So especially in cases where the power of the
label-vectors is of primary interest, you should be sure to try many
small-window values – even 1, 2, 3. I think it'd even be possible to try to
force a relative overweighting of labels by repeating them more than once
in the `tags` of a document, so that could be worth trying. Similarly, if
some labels have few examples but are still important to model on an equal
basis with larger categories, it may be worth trying an artificial
expansion of their proportion of the corpus by repeating their documents.
(Such non-varied repetition isn't ideal for creating subtle vectors – real
variety of examples does that – but could help offset the numerical
domination of other labels in influencing the final model state.)
You wouldn't necessarily need to reduce every label to a single summary
vector to calculate label->label similarities. You could instead do a
census of document neighbors: "For all documents labeled A, looking at its
n nearest neighbors' labels, which other labels appear most often?"
Or similarly, if you create a multi-class classifier based on just the
(doc-vector, label) training-data that offers ranked predictions, there's
no single vector for an individual label. But you could check: "For all
documents with a most-predicted label of A, what other labels most-often
appear as 2nd. 3rd, etc predictions?" Or, "for any known-label-A docs the
classifier gets wrong, what are the most common errors?" (Compare the idea
of the "confusion matrix", once you have a classifier, even a weak one, to
test.)
Mainly: note that collapsing a complete category to a single summary
vector (such as an average of all examples) can be very crude in a
high-dimensional space, missing the potential variety of
boundaries-of-distinction that actually appear in the data, and won't just
be defined by "which label's center-point is closest?"
- Gordon
Post by R. M.
Hi Gordon,
Thanks a lot for your detailed reply -- this is all very helpful.
Let me tell you a bit more about my corpus and my goals, hopefully this
can help determine what the best approach would be.
My corpus, as you guessed, is small-ish for doc2vec: it contains a little
over 15,000 documents. These docs are neither tiny nor huge -- at least
several paragraphs, a lot more in some cases. Each doc is tagged with a
unique label (every doc has a label).
1. Obtain a similarity matrix of the labels. The way I have done this
previously is by training the doc2vec model with the labels (rather than
doc IDs) as tags, and then creating a heatmap representing the cosine
similarity of vectors for labels.
2. Obtain words that are characteristic of a given label. The way
I've previously done this is by training the model with the labels (&
dm=1), and plotting the top *n* most similar word vectors to each
label vector.
3. Obtain labels that are most related to a given word. The way I've
previously done this is by training the model with the labels (& dm=1), and
plotting the top *n* most similar label vectors to specific word vectors.
As you can see, I presumably need to train word vectors, which on the
face of it rules out the basic dbow method. I've previously tried your
solution #2 (double tagging docs with doc ID & label), and the results for
the three tasks above seemed worse with the hyperparameters I tried.
There is also the question of how I could perform the three tasks above
if I don't train vectors for labels, as in your solution #1. I suppose what
I could do is calculate the average vector of all document vectors for docs
that have the same label. Would you recommend that?
I would be very grateful if you could give me some further advice based
on the nature of my corpus and my goals.
Many thanks in advance!
RM
Post by Gordon Mohr
You're on the right track: trying all permutations of a set of possible
parameters is a typical meta-optimization strategy, sometimes called "grid
search". But, you're using a very crude classification technique, and ther
are other things worth trying with respect to Doc2Vec training.
By supplying only the desired labels as the document-tags, you're
essentially training the Doc2Vec model with just 60 virtual mega-documents,
and it's learning just 60 unique doc-vectors. That might be OK, especially
if the docs are short or you lack the memory to give every doc its own
vector, but also limits the model expressiveness quite a bit. (And, trying
to train just 60 unique vectors in a 200-dimensional space could risk
severe overfitting – each of the vectors could trend towards being a
one-hot-like vector that just optimizes itself, with no useful
trading-off-against-related-peers. But, the mode you're using, PV-DM, by
including word-training for likely tens of thousands of other words
alongside the 60 doc-vectors, probably mitigates that.)
But that means: each of your labels gets summarized as a single
doc-vector. And future test data will be classified by the single nearest
label-vector, reducing all label-volumes to be roughly
spheres-around-those-summary-points. If in fact the labeled data populates
"lumpy" 200-dimensional regions of your space, this classification method –
roughly "K-Nearest-Neighbors where K=1 and there's only 1 example of each
class" – can't capture that.
There are lots of potential ways to train Doc2Vec vectors from labeled data:
(1) Train to give each doc a unique doc-vector (such as an integer
doc-id), oblivious to the known-labels. This uses Doc2Vec entirely in its
original 'unsupervised' definition – so it's also plausibly principled to
do this training without any holdout test set. (It's just learning the
patterns of the texts, and it can do this without known labels.) Then, use
the resulting (vector, label) data as input to a separately-chosen
classifier.
(2) Give each doc both a unique doc-id *and* the known-label, where
available, using the ability of `TaggedDocument` to assign multiple tags.
If I recall correctly, I've seen cases where mixing those
many-document-tags in helped make the resulting model more sensitive to the
label-distinctions. Still just use the (per-doc-vector, label) pairs to train the
downstream classifier (a rough sketch of this tagging appears after option (3) below).
(3) Train each doc with the known-label as its only doc-tag. (This is
what you've shown in your code so far.) This might streamline some steps
but also lose some expressiveness.
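For option (2), the tagging is only a small change to the earlier
list-comprehension – a rough sketch, reusing the `X_train`/`y_train` names from
the posted code and an arbitrary 'DOC_%d' id format:

from gensim.models.doc2vec import TaggedDocument

# each document gets a unique doc-id tag *and* its known label as a second tag
train_tagged_docs = [TaggedDocument(words=words, tags=['DOC_%d' % i, label])
                     for i, (words, label) in enumerate(zip(X_train, y_train))]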
Further, especially for methods (2) and (3), you could consider
re-calculating the per-document vectors via inference at the end of
training. During training, a lot of effort is spent making the per-label
vectors predictive, even though those may not be what's interesting with
respect to individual texts. So this final re-inference ensures each
document has a vector optimized to represent its text, with respect to the
final trained model. (In case (3), there would have been no per-document
vectors until this step, and thus no way for a doc to get a vector that
reflects things like, "this text is somewhere between the A-label, D-label,
and R-label neighborhoods".)
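A minimal sketch of that final re-inference, reusing names from the posted code
(each entry of `X_train` assumed to be a list of tokens, as `infer_vector()`
expects):

# re-infer one vector per training text against the fully-trained model,
# then pair it with the known label for the downstream classifier
doc_vectors = [model.infer_vector(words) for words in X_train]
train_pairs = list(zip(doc_vectors, y_train))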
However you do the doc-vector training, with or without known-labels
being mixed-in, you then need to do classification as a separate step.
Using a KNearestNeighbor-style classification may make sense if you
don't have too many data points. (Its cost grows quickly, since it must
remember and compare against all the known training points.) But even with K=1, you might
get better evaluation scores than you get now – a test text wouldn't have
to be closest to the centroid of all examples of a label, just one close
neighbor, to be assigned a label. And higher Ks might smooth out the
effects of outlier examples, and do even better.
You could also try other classifiers. Trying one of sklearn's
linear-classifiers (`SGDClassifier`, `LinearSVC`, `SVC(kernel='linear')`) and
its `RandomForestClassifier` would be two options that contrast with a
simple nearest-neighbor approach. (Though each of these has its own
metaparameters to tweak.)
* especially if your documents are short and your main task is
classification, the PV-DBOW mode (`dm=0`) is often fast and a
top-performer. (Beware, though: if you pursued the "only 60 docs"
tag-assignment method, and use >60 vector dimensions, it'd be severely
prone to overfitting and there'd be no mixed-in word-training to help
offset that.) In pure PV-DBOW, the `window` size doesn't matter, but you
can also add in skip-gram word-training (with `dbow_words=1`), where
`window` again matters. That's slower, and can sometimes help or hurt the
doc-vector quality via the interplay with word-training.
* I haven't seen a lower-than-0.025 alpha (`0.01`) tried as often as a
higher alpha (0.05).
* if your dataset is small enough that it's quick and easy to run all these
permutations, then it might be a bad fit for Doc2Vec, or some of your
parameters might be mis-sized (like a vector-size too large for a small
dataset). OTOH, if the runtime is a concern, note that one of the
time-consuming steps on a large corpus, the initial vocabulary-scan &
model-allocation, is actually only affected by one of the parameters you're
varying: `min_count`. You could plausibly optimize your code to prep a
model with a single `min_count` value, then save that half-initialized
model to disk, then re-load it for variants of the other parameters - which
you would modify directly on the loaded model before the substantive
`train()` occurs. (A rough sketch of this reuse appears below.)
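Something along these lines – illustrative only, with `min_count` and
`vector_size` both pinned in the saved base model (the allocated arrays depend
on the size as well), and only `window`, `alpha`, and `epochs` varied after
reloading; `cores` and `train_tagged_docs` as in the posted script:

import gensim
from itertools import product

base = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=200, min_count=10, workers=cores)
base.build_vocab(train_tagged_docs)      # the expensive vocabulary scan, done once
base.save('d2v_base_min10.model')        # half-initialized model, ready to train

for param_window, param_alpha, param_epochs in product([2, 10], [0.025, 0.01], [20, 50]):
    model = gensim.models.doc2vec.Doc2Vec.load('d2v_base_min10.model')
    model.window = param_window          # these don't affect the allocated arrays
    model.alpha = param_alpha
    model.train(train_tagged_docs, total_examples=model.corpus_count, epochs=param_epochs)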
- Gordon
R. M.
2018-12-07 19:11:48 UTC
Permalink
I have a further question about your previous suggestions.
You suggested trying the following:
Post by Gordon Mohr
Train to give each doc a unique doc-vector (such as an integer doc-id),
oblivious to the known-labels. This uses Doc2Vec entirely in its original
'unsupervised' definition – so it's also plausibly principled to do this
training without any holdout test set. (It's just learning the patterns of
the texts, and it can do this without known labels.) Then, use the
resulting (vector, label) data as input to a separately-chosen classifier.
I was wondering what you mean by your last sentence. If I do this "without
any holdout test set", what kind of downstream classifier would I confirm
the model with?

What I have done is the following:
1. Training a model on the whole corpus with *doc IDs* as tags (no
labels)
2. Looking up, for each document (vector) in the model, the labels of the
top 10 most similar docs. Then comparing the most common label in this list
with the actual label of the target doc, and computing the accuracy of this
comparison across all documents in the model.
Presumably, this is just testing whether vectors for docs that have the
same label tend to be close in the vector space. Is this what you had in
mind in the last sentence of the passage above? And is this really a good
way to determine the best hyperparameters of the model?
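In code, the check in step 2 above is roughly the following (a sketch only –
`label_of` is a hypothetical dict from doc-ID tag to known label):

from collections import Counter

correct = 0
for doc_id, true_label in label_of.items():
    # 10 most similar trained doc-vectors (model.dv in newer gensim versions)
    neighbors = model.docvecs.most_similar(doc_id, topn=10)
    votes = Counter(label_of[n_id] for n_id, _sim in neighbors)
    if votes.most_common(1)[0][0] == true_label:
        correct += 1
accuracy = correct / len(label_of)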

Many thanks for your help.

- RM
Gordon Mohr
2018-12-07 21:27:38 UTC
Permalink
You could choose any supervised classifier to use the Doc2Vec vectors as
its input-features, and check the quality of its predictions. A few
previously mentioned possibilities, by their names in scikit-learn:
KNeighborsClassifier (which is somewhat similar to your prior "classify as
the same label as the nearest label-vector"); SGDClassifier;
RandomForestClassifier.
For training and evaluating the classifier, you *would* use a held-out test
set – even if you hadn't done so for the unsupervised creation of Doc2Vec
features. (If you weren't applying the known-labels as document-tags, there
was no "peeking" at desired results in that Doc2Vec phase, and using all
examples plausibly simulates an actual deployment where the Doc2Vec model
could learn from a lot of unlabeled data as well.)
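As a rough sketch of that separate train/evaluate step – `doc_vectors` (one
vector per document, e.g. from per-document inference) and the matching `labels`
are placeholder names, not anything from the earlier code:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    doc_vectors, labels, test_size=0.15, random_state=42)

for clf in (KNeighborsClassifier(n_neighbors=10),
            RandomForestClassifier(n_estimators=100, random_state=0)):
    clf.fit(Xc_train, yc_train)
    pred = clf.predict(Xc_test)
    print(type(clf).__name__,
          accuracy_score(yc_test, pred),
          f1_score(yc_test, pred, average='weighted'))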

Your "what I have done" steps 1-2 are vaguely similar to a
KNeighborsClassifier with k=10, checking each document individually (as if
it were a single held-out item) to see whether labeling it the same as the
plurality of its neighbors would label it correctly. So sure, accuracy (or
other evaluations) on that full system might be a good thing to target with
your Doc2Vec tuning.
But you can vary k, and you can try totally different non-neighbor-based
classifiers that might better discover the shapes of the
label-volumes/boundaries in the dataset. And such alternates might do
better. (Or with larger datasets, work faster or in less memory than the
KNeighbors-style evaluation, which requires calculating-and-sorting
distances-to-all-known-datapoints to find the top-K).

- Gordon
Gordon Mohr
2018-12-07 21:12:33 UTC
Permalink
Let's say you have 300 documents with label-A. For each, you could look up
the nearest document that isn't labeled A. (This is yet another slight
variant on the prior suggestions.) That might give you a ranked tally like:
125 label-F
75 label-B
40 label-M
30 label-X
20 label-H
10 label-D
0 [all other labels]

You could just read those tallies as a "similarity". You could scale the
numbers to be -1.0-to-1.0 or 0.0-to-1.0 or whatever proves useful.
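As a tiny illustration of that scaling, using the hypothetical counts above:

tally = {'F': 125, 'B': 75, 'M': 40, 'X': 30, 'H': 20, 'D': 10}
total = sum(tally.values())   # the 300 label-A documents in this example
similarity_to_A = {label: count / total for label, count in tally.items()}
# e.g. label-F -> ~0.42, label-D -> ~0.03, all other labels -> 0.0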

There are many potential ways to calculate a pairwise tally between labels
– from neighbor-labels, or classifier errors, or
classifier-ranked-predictions, and among just the next-best, or top-5, or
top-n, or weighted-list-of-all neighbors/predictions, etc. And then many
potential ways to scale those tallies into pairwise similarity numbers. I'm
not recommending any particular one – just highlighting that it doesn't
necessarily have to be a "single summary point to single summary point
comparison", and that despite the attractive simplicity of "single summary
points" that model could hide the real shapes of the categories, and
borders between them.

- Gordon
Post by R. M.
Thanks Gordon, as always this is very helpful.
Your suggestions are really interesting, and it makes sense that
collapsing a whole category into a single vector would lose a lot of
relevant information.
I just have a small clarificatory question about your three suggestions:
1. For all docs with label A, which other labels appear most often in
its *n* nearest neighbour docs?
2. For all docs with a most-predicted label of A, what other labels
most often appear as 2nd, 3rd, etc. predictions?
3. For any known-label-A docs the classifier gets wrong, what are the
most common errors?
I get that such methods would give me a ranking of the *n* most similar
labels to a given label A. However, what kind of metric could be used to
measure the degree of similarity to A? Previously, I have used cosine
similarity between label-vectors. With your three suggestions, it is less
immediately obvious to me what the similarity metric would be. Maybe I'm
missing an obvious answer -- could you give me some pointers?
- RM
Post by Gordon Mohr
Your docs are reasonably-sized, though it would be good if there were
more of them. I suspect your overall classification-accuracy will improve
as soon as you use a more sophisticated classifier than "reduce each label
to a single summary vector & assign each doc to the nearest label".
For optimizing your other outcomes – like the label->label, word->label,
label->word rankings, it's just a matter of tinkering – ideally via an
automated search, if you have an automated evaluation of the desirability
of the final results.
With a smaller dataset, you may want to explore smaller dimensionalities
and more training `epochs`.
You can add word-training to PV-DBOW with `dm=0, dbow_words=1` – and this
would be worth trying against the plain PV-DM modes.
Training vectors together is necessary for them to be comparable, and it
is the tug-of-war between making different examples, and different vectors,
predictive that gives rise to the useful spacings/orientations of the final
vectors. But when doc-vectors and word-vectors are being co-trained, that
means certain parameter choices may effectively give more
weight/training-attention to one or the other.
In particular, larger `window` values tend to mean relatively more
training-attention to words. So especially in cases where the power of the
label-vectors is of primary interest, you should be sure to try many
small-window values – even 1, 2, 3. I think it'd even be possible to try to
force a relative overweighting of labels by repeating them more than once
in the `tags` of a document, so that could be worth trying. Similarly, if
some labels have few examples but are still important to model on an equal
basis with larger categories, it may be worth trying an artificial
expansion of their proportion of the corpus by repeating their documents.
(Such non-varied repetition isn't ideal for creating subtle vectors – real
variety of examples does that – but could help offset the numerical
domination of other labels in influencing the final model state.)
You wouldn't necessarily need to reduce every label to a single summary
vector to calculate label->label similarities. You could instead do a
census of document neighbors: "For all documents labeled A, looking at its
n nearest neighbors' labels, which other labels appear most often?"
Or similarly, if you create a multi-class classifier based on just the
(doc-vector, label) training-data that offers ranked predictions, there's
no single vector for an individual label. But you could check: "For all
documents with a most-predicted label of A, what other labels most-often
appear as 2nd. 3rd, etc predictions?" Or, "for any known-label-A docs the
classifier gets wrong, what are the most common errors?" (Compare the idea
of the "confusion matrix", once you have a classifier, even a weak one, to
test.)
Mainly: note that collapsing a complete category to a single summary
vector (such as an average of all examples) can be very crude in a
high-dimensional space, missing the potential variety of
boundaries-of-distinction that actually appear in the data, and won't just
be defined by "which label's center-point is closest?"
- Gordon
Post by R. M.
Hi Gordon,
Thanks a lot for your detailed reply -- this is all very helpful.
Let me tell you a bit more about my corpus and my goals, hopefully this
can help determine what the best approach would be.
My corpus, as you guessed, is small-ish for doc2vec: it contains a
little over 15,000 documents. These docs are neither tiny nor huge -- at
least several paragraphs, a lot more in some cases. Each doc is tagged with
a unique label (every doc has a label).
1. Obtain a similarity matrix of the labels. The way I have done
this previously is by training the doc2vec model with the labels (rather
than doc IDs) as tags, and then creating a heatmap representing the cosine
similarity of vectors for labels.
2. Obtain words that are characteristic of a given label. The way
I've previously done this is by training the model with the labels (&
dm=1), and plotting the top *n* most similar word vectors to each
label vector.
3. Obtain labels that are most related to a given word. The way I've
previously done this is by training the model with the labels (& dm=1), and
plotting the top *n* most similar label vectors to specific word vectors.
As you can see, I presumably need to train word vectors, which on the
face of it rules out the basic dbow method. I've previously tried your
solution #2 (double tagging docs with doc ID & label), and the results for
the three tasks above seemed worse with the hyperparameters I tried.
There is also the question of how I could perform the three tasks above
if I don't train vectors for labels, as in your solution #1. I suppose what
I could do is calculate the average vector of all document vectors for docs
that have the same label. Would you recommend that?
I would be very grateful if you could give me some further advice based
on the nature of my corpus and my goals.
Many thanks in advance!
RM
Post by Gordon Mohr
You're on the right track: trying all permutations of a set of possible
parameters is a typical meta-optimization strategy, sometimes called "grid
search". But, you're using a very crude classification technique, and ther
are other things worth trying with respect to Doc2Vec training.
By supplying only the desired labels as the document-tags, you're
essentially training the Doc2Vec model with just 60 virtual mega-documents,
and it's learning just 60 unique doc-vectors. That might be OK, especially
if the docs are short or you lack the memory to give every doc its own
vector, but also limits the model expressiveness quite a bit. (And, trying
to train just 60 unique vectors in a 200-dimensional space could risk
severe overfitting – each of the vectors could trend towards being a
one-hot-like vector that just optimizes itself, with no useful
trading-off-against-related-peers. But, the mode you're using, PV-DM, by
including word-training for likely tens of thousands of other words
alongside the 60 doc-vectors, probably mitigates that.
But that means: each of your labels gets summarized as a single
doc-vector. And future test data will be classified by the single nearest
label-vector, reducing all label-volumes to be roughly
spheres-around-those-summary-points. If in fact the labeled data populates
"lumpy" 200-dimensional regions of your space, this classification method –
roughly "K-Nearest-Neighbors where K=1 and there's only 1 example of each
class" – this process can't capture that.
There are lots of potential ways to train Doc2Vec vectors from labeled
(1) Train to give each doc a unique doc-vector (such as an integer
doc-id), oblivious to the known-labels. This uses Doc2Vec entirely in its
original 'unsupervised' definition – so it's also plausibly principled to
do this training without any holdout test set. (It's just learning the
patterns of the texts, and it can do this without known labels.) Then, use
the resulting (vector, label) data as input to a separately-chosen
classifier.
(2) Give each doc both a unique doc-id *and* the known-label, where
available, using the ability of `TaggedDocument` to assign multiple tags.
If I recall correctly, I've seen cases where mixing those
many-document-tags in helped make the resulting model more sensitive to the
label-distinctions. Still just use the (per-doc-vector, label) to train the
downstream-classifier
(3) Train each doc with the known-label as its only doc-tag. (This is
what you've shown in your code so far.) This might streamline some steps
but also lose some expressiveness.
Further, especially for methods (2) and (3), you could consider
re-calculating the per-document vectors via inference at the end of
training. During training, a lot of effort is spent making the per-label
vectors predictive, even though those may not be what's interesting with
respect to individual texts. So this final re-inference ensures each
document has a vector optimized to represent its text, with respect to the
final trained model. (In case (3), there would have been no per-document
vectors until this step, and thus no way for a doc to get a vector that
reflects things like, "this text is somewhere between the A-label, D-label,
and R-label neighborhoods".)
However you do the doc-vector training, with or without known-labels
being mixed-in, you then need to do classification as a separate step.
Using a KNearestNeighbor-style classification may make sense if you
don't have too many data points. (Its performance drops rapidly with
remembering all the known training points.) But even with K=1, you might
get better evaluation scores than you get now – a test text wouldn't have
to be closest to the centroid of all examples of a label, just one close
neighbor, to be assigned a label. And higher Ks might smooth out the
effects of outlier examples, and do even better.
You could also try other classifiers. Trying one of sklearn's
linear-classifiers (`SGDClassifier`, `LinearSVC`, `SVC(kernel='linear') and
its RandomForestClassifier would be two options that constrast with a
simple nearest-neighbor approach. (Though each of these has its own
metaparameters to tweak.)
* especially if your documents are short and main task is
classification, the PV-DBOW mode (`dm=0`) is often fast and a
top-performer. (Beware, though: if you pursued the "only 60 docs"
tag-assignment method, and use >60 vector dimensions, it'd be severely
prone to overfitting and there'd be no mixed-in word-training to help
offset that.) In pure PV-DBOW, the `window` size doesn't matter, but you
can also add in skip-gram word-training (with `dbow_words=1`), where
`window` again matters. That's slower, and can sometimes help or hurt the
doc-vector quality via the interplay with word-training.
* I haven't seen a lower-than-0.025 alpha (`0.01`) tried as often as a
higher alpha (0.05).
* If your dataset is small enough that running all these permutations is
quick and easy, it might be a poor fit for Doc2Vec, or some of your
parameters might be mis-sized (like a vector size too large for a small
dataset). OTOH, if the runtime is a concern, note that one of the
time-consuming steps on a large corpus, the initial vocabulary-scan &
model-allocation, is actually only affected by one of the parameters you're
varying: `min_count`. You could plausibly optimize your code to prep a
model with a single `min_count` value, save that half-initialized
model to disk, then re-load it for variants of the other parameters, which
you would directly modify on the model before the substantive
`train()` call (see the sketch below).
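Here's a rough sketch tying those last two notes together. The values are
illustrative; it assumes a vocabulary-built-but-untrained model survives a
save/load round-trip (it should), and it deliberately leaves `vector_size`
out of the post-load tweaking, since that governs the arrays allocated at
`build_vocab()` time:

import gensim
from itertools import product

# One expensive vocabulary scan + allocation per min_count value, here in
# PV-DBOW mode with added skip-gram word-training (so `window` still matters).
base = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=100,
                                     min_count=10, workers=cores)
base.build_vocab(train_tagged_docs)
base.save('base_mincount10.model')

# Re-load the half-initialized model and tweak the cheap-to-vary parameters
# before each substantive train().
for window, alpha, epochs in product([2, 10], [0.025, 0.05], [20, 50]):
    model = gensim.models.doc2vec.Doc2Vec.load('base_mincount10.model')
    model.window = window
    model.alpha = alpha
    model.train(train_tagged_docs, total_examples=model.corpus_count,
                epochs=epochs)
    # ...then evaluate as before...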
- Gordon