Train a spaCy model for semantic similarity

I'm attempting to train a spaCy model to compute semantic similarity, but I'm not getting the results I would expect.

I have created two text files that contain many sentences using a new term, PROJ123456. For example: "PROJ123456 is on track."

I've added each to a DocBin and saved them to disk as train.spacy and dev.spacy.
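Roughly, I build the files like this (simplified; the real files contain many more sentences):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # only a tokenizer is needed to build the docs

# Placeholder sentences; the real files contain many more examples
train_texts = [
    "PROJ123456 is on track.",
    "The team reviewed PROJ123456 yesterday.",
]

db = DocBin()
for text in train_texts:
    db.add(nlp(text))
db.to_disk("./train.spacy")  # repeated with held-out sentences for dev.spacy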

I'm then running:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The config.cfg file contains:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","parser"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH","SHAPE"]
rows = [5000,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
dep_uas = 0.5
dep_las = 0.5
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0

[pretraining]

[initialize]
vectors = "en_core_web_lg"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

I get a new model in output/model-last.

I then run the following script:

import spacy
nlp = spacy.load("./output/model-last")
print(nlp("PROJ123456").vector)

I'm expecting a vector with some non-zero values, but instead I get a vector of 300 zeros. I take that to mean PROJ123456 hasn't been added to the vocab, but I'm not sure why.


If you have word vectors, the .vector attribute uses them to calculate values. Training a pipeline doesn't modify the word vectors. It looks like you're just reusing the vectors from the large English model (en_core_web_lg, via your [initialize] block), and those won't contain your special term. The fix is to train your own word vectors and add them to the model.
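You can confirm this on the loaded model (a minimal check using standard Token attributes):

import spacy

nlp = spacy.load("./output/model-last")
token = nlp("PROJ123456")[0]

# Both come back falsy: the static vector table was copied unchanged
# from en_core_web_lg, which has no row for the new term
print(token.has_vector)   # False
print(token.vector_norm)  # 0.0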


After vectorizing your custom text, you need to do one of two things in spaCy:

  1. Load those binary vectors into nlp.vocab, for example with nlp.vocab.load_rep_vectors.
  2. Or simply replace the vec.bin file at "data/vocab/vec.bin".

(Both of these refer to older spaCy releases; a v3-era sketch follows below.)

Detailed information here: https://stackoverflow.com/questions/43524301/update-spacy-vocabulary .
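For current spaCy (v3), a rough equivalent is sketched below; gensim, the file names, and the toy sentences are assumptions for illustration:

# Train word2vec-format vectors on the custom corpus (any tool that
# writes word2vec text format works; gensim is just one option)
from gensim.models import Word2Vec

sentences = [
    ["PROJ123456", "is", "on", "track"],
    ["the", "team", "reviewed", "PROJ123456", "yesterday"],
]
w2v = Word2Vec(sentences, vector_size=300, min_count=1)
w2v.wv.save_word2vec_format("./custom_vectors.txt")

Then convert the vectors into a spaCy pipeline directory and point training at it instead of en_core_web_lg:

python -m spacy init vectors en ./custom_vectors.txt ./custom_vectors_model
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy --initialize.vectors ./custom_vectors_model

With include_static_vectors = true in the embedding layer, the model then trains against a vector table that actually contains PROJ123456.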
