Pytorch build_vocab_from_iterator giving vocabulary with very few words
I am trying to build a translation model in PyTorch. Following this post on PyTorch, I downloaded the Multi30k dataset and the spacy models for English and German.
python -m spacy download en
python -m spacy download de
import torchtext
import torch
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab, build_vocab_from_iterator
from torchtext.utils import download_from_url, extract_archive
import io
url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')
train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]
de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')
def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding='utf-8') as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)
Which gave me the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-66-c669b7554322> in <module>()
     20     return vocab
     21
---> 22 de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
     23 en_vocab = build_vocab(train_filepaths[1], en_tokenizer)
     24

<ipython-input-66-c669b7554322> in build_vocab(filepath, tokenizer)
     16         for string_ in f:
     17             counter.update(tokenizer(string_))
---> 18     vocab = Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
     19     vocab.set_default_index(vocab['<unk>'])
     20     return vocab

TypeError: __init__() got an unexpected keyword argument 'specials'
After a Google search I tried to modify the build_vocab function:
def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding='utf-8') as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    vocab = build_vocab_from_iterator(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
    vocab.set_default_index(vocab['<unk>'])
    return vocab
This ran without errors, but there are very few words in the vocabulary: len(en_vocab) is 85, while the counter for English has len(counter) equal to 10836.
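To check what build_vocab_from_iterator is actually iterating over, I tried a small sketch. My guess is that iterating the Counter directly yields single token strings, which are then read character by character, which would explain the ~85 entries (this is only my understanding of the behaviour):

from collections import Counter
counter = Counter(['Two', 'young', 'males', 'are', 'outside'])
# Iterating a Counter yields its keys, i.e. plain token strings.
print(list(counter))  # ['Two', 'young', 'males', 'are', 'outside']
# build_vocab_from_iterator expects each yielded item to be a list of tokens,
# so each string is itself iterated, producing single characters:
chars = set()
for token in counter:    # token is a string here, not a list of tokens
    chars.update(token)  # adds individual characters
print(sorted(chars))     # only letters -> a tiny character-level "vocab"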
If I create the vocab object without the specials keyword, I get a vocab object of length 10836:
en_vocab = Vocab(en_counter)  # len(en_vocab) == 10836
But now I do not have a way of including the specials in the vocab.
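If the Vocab object returned here is the newer torchtext Vocab class, I am guessing the specials could be inserted after construction, roughly like the sketch below (assuming torchtext >= 0.11, untested):

# Assumption: en_vocab is the torchtext >= 0.11 Vocab class
for i, tok in enumerate(['<unk>', '<pad>', '<bos>', '<eos>']):
    en_vocab.insert_token(tok, i)  # place the specials at the front
en_vocab.set_default_index(en_vocab['<unk>'])

But I am not sure this is the intended way to do it.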
Using an example from the official PyTorch documentation for build_vocab_from_iterator, I was able to create another vocab object:
def new_builder(file_path):
    with io.open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()

new_vocab = build_vocab_from_iterator(new_builder(train_filepaths[1]), specials=['<unk>', '<pad>', '<bos>', '<eos>'])
With this method, len(new_vocab) comes out to be 15460. Which of these methods (if any) is correct, and why do the other methods give incorrect results?
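For reference, the variant I am leaning towards is to feed the spacy tokenizer through a generator so that each yielded item is a list of tokens (a sketch, assuming torchtext >= 0.11):

import io
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(filepath, tokenizer):
    with io.open(filepath, encoding='utf-8') as f:
        for line in f:
            yield tokenizer(line.rstrip('\n'))  # one list of tokens per line

en_vocab = build_vocab_from_iterator(
    yield_tokens(train_filepaths[1], en_tokenizer),
    specials=['<unk>', '<pad>', '<bos>', '<eos>'])
en_vocab.set_default_index(en_vocab['<unk>'])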
Also, I have noticed that the spacy tokenizer does not appear to change capitalization or otherwise normalize the words in the input sentences.
sentence = 'Two young, White males are outside near many bushes.\n'
print(en_tokenizer(sentence))
Output:
['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.', '\n']
Should there be additional preprocessing steps before creating the vocabulary so that tree and Tree are not considered different words?
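If lowercasing is the right call, I assume it could be done on the tokens inside the generator before building the vocabulary (just a sketch, not sure whether this is the recommended preprocessing for translation models):

def yield_lowercased_tokens(filepath, tokenizer):
    with io.open(filepath, encoding='utf-8') as f:
        for line in f:
            # lowercase tokens so 'Tree' and 'tree' map to the same entry
            yield [tok.lower() for tok in tokenizer(line.rstrip('\n'))]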