How to perform tokenization for tweets in XLNet?

X_train has only one column, which contains all the tweets.

import numpy as np
from transformers import XLNetTokenizer

xlnet_model = 'xlnet-large-cased'
xlnet_tokenizer = XLNetTokenizer.from_pretrained(xlnet_model)

def get_inputs(tweets, tokenizer, max_len=120):
    """Gets tensors from text using the tokenizer provided."""
    inps = [tokenizer.encode_plus(t, max_length=max_len, pad_to_max_length=True, add_special_tokens=True) for t in tweets]
    inp_tok = np.array([a['input_ids'] for a in inps])
    ids = np.array([a['attention_mask'] for a in inps])
    segments = np.array([a['token_type_ids'] for a in inps])
    return inp_tok, ids, segments

inp_tok, ids, segments = get_inputs(X_train, xlnet_tokenizer)

Running this raises:

AttributeError: 'NoneType' object has no attribute 'encode_plus'

Topic: tokenization, tensorflow, sentiment-analysis, nlp, python

Category: Data Science


You need to pip install sentencepiece for it to work. The XLNet tokenizer is built on SentencePiece, and without that package xlnet_tokenizer never gets created properly, which is why you end up calling encode_plus on None.
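If you want a quick sanity check after installing it, you can confirm that the tokenizer object was actually created (a minimal sketch, assuming the same model name as in the question):

# pip install sentencepiece transformers
from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
print(type(xlnet_tokenizer))  # expect an XLNetTokenizer instance, not NoneType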

By the way, you can also give the tweets as a list to the tokenizer. You don't need to tokenize them one by one.

tokenizer(tweets, max_length=max_len, padding='max_length', add_special_tokens=True)
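Applied to the function in the question, a batched version could look like the sketch below. It assumes a recent transformers release where the tokenizer is callable and supports return_tensors='np', and it adds truncation=True so tweets longer than max_len are cut instead of raising an error. Since X_train appears to be a one-column DataFrame, its single column is selected with iloc.

from transformers import XLNetTokenizer

def get_inputs(tweets, tokenizer, max_len=120):
    """Tokenize all tweets in a single call and return NumPy arrays."""
    enc = tokenizer(
        list(tweets),              # accepts a list of strings
        max_length=max_len,
        padding='max_length',
        truncation=True,
        add_special_tokens=True,
        return_tensors='np',       # return NumPy arrays directly
    )
    return enc['input_ids'], enc['attention_mask'], enc['token_type_ids']

xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
inp_tok, ids, segments = get_inputs(X_train.iloc[:, 0], xlnet_tokenizer)  # the single tweet column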
