PyTorch build_vocab_from_iterator giving a vocabulary with very few words

I am trying to build a translation model in PyTorch. Following this post on PyTorch, I downloaded the Multi30k dataset and the spaCy models for English and German.

python -m spacy download en
python -m spacy download de

import torchtext
import torch
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab, build_vocab_from_iterator
from torchtext.utils import download_from_url, extract_archive
import io

url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')

train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')
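# Aside: if this spaCy version rejects the 'en' / 'de' shorthands above, the full
# pipeline names can be passed instead (assuming the small models are what was installed):
#   de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
#   en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')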

def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding='utf-8') as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    return Vocab(counter, specials=['unk', 'pad', 'bos', 'eos'])

de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

This gave me the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-66-c669b7554322> in <module>()
     20     return vocab
     21 
---> 22 de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
     23 en_vocab = build_vocab(train_filepaths[1], en_tokenizer)
     24 

<ipython-input-66-c669b7554322> in build_vocab(filepath, tokenizer)
     16         for string_ in f:
     17             counter.update(tokenizer(string_))
---> 18     vocab = Vocab(counter, specials=['unk', 'pad', 'bos', 'eos'])
     19     vocab.set_default_index(vocab['unk'])
     20     return vocab
TypeError: __init__() got an unexpected keyword argument 'specials'

After a Google search, I tried to modify the build_vocab function:

def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding='utf-8') as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    vocab = build_vocab_from_iterator(counter, specials=['unk', 'pad', 'bos', 'eos'])
    vocab.set_default_index(vocab['unk'])
    return vocab

This ran without errors, but there are very few words in the vocabulary (len(en_vocab) is 85), while the counter for English has len(counter) of 10836.
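
In case it helps with diagnosing this, here is how I am inspecting what actually ends up in the small vocabulary (I am assuming get_itos is the right way to list the tokens in this torchtext version):

en_vocab = build_vocab(train_filepaths[1], en_tokenizer)  # rebuilt with the modified build_vocab above
print(len(en_vocab))             # 85
print(en_vocab.get_itos()[:20])  # first few entries, to see what kind of tokens are in there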

If I create the Vocab object without the specials keyword, I get a vocab object of length 10836.

en_vocab = Vocab(en_counter)  # len(en_vocab) is 10836

But now I do not have a way of including the specials in the vocab.
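
For what it's worth, the Vocab object does seem to expose insert_token and set_default_index, so one thing I am considering is adding the specials after the fact, though I am not sure this is the intended approach (the insertion positions below are just a guess):

en_vocab = Vocab(en_counter)
for i, tok in enumerate(['unk', 'pad', 'bos', 'eos']):
    en_vocab.insert_token(tok, i)   # put the specials at the front of the vocab
en_vocab.set_default_index(en_vocab['unk'])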

Using an example from the official PyTorch documentation for build_vocab_from_iterator, I was able to create another vocab object:

def new_builder(file_path):
    with io.open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()

new_vocab = build_vocab_from_iterator(new_builder(train_filepaths[1]), specials=['unk', 'pad', 'bos', 'eos'])

With this method, len(new_vocab) comes out to 15460. Which of these methods (if any) is correct, and why do the other methods give incorrect results?
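
For comparison, this is the same generator pattern but with the spaCy tokenizer in place of split() (spacy_builder is just a hypothetical name; I have not checked whether the resulting size matches the counter-based numbers):

def spacy_builder(file_path, tokenizer):
    # same generator pattern as new_builder, but yielding spaCy tokens per line
    with io.open(file_path, encoding='utf-8') as f:
        for line in f:
            yield tokenizer(line.strip())

spacy_vocab = build_vocab_from_iterator(spacy_builder(train_filepaths[1], en_tokenizer),
                                        specials=['unk', 'pad', 'bos', 'eos'])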

Also, I have noticed that the spaCy tokenizer does not appear to change capitalization or apply any other normalization to the words in the input sentences.

sentence = 'Two young, White males are outside near many bushes.\n'
print(en_tokenizer(sentence))

Output:

['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.', '\n']

Should there be additional preprocessing steps before creating the vocabulary, so that "tree" and "Tree" are not considered different words?
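
In case it matters, this is the kind of preprocessing I have in mind, simply lowercasing each line before tokenizing (lower_tokenizer is just an illustrative wrapper), but I do not know whether that is standard practice for translation models:

def lower_tokenizer(line):
    # lowercase the raw line before handing it to spaCy, so 'Tree' and 'tree' collapse
    return en_tokenizer(line.lower())

print(lower_tokenizer(sentence))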
