How to tokenize tweets with XLNet?
X_train has a single column that contains all the tweets. I am trying to tokenize them with the following code:
import numpy as np
from transformers import XLNetTokenizer

xlnet_model = 'xlnet-large-cased'
xlnet_tokenizer = XLNetTokenizer.from_pretrained(xlnet_model)

def get_inputs(tweets, tokenizer, max_len=120):
    """Get NumPy arrays of input IDs, attention masks and segment IDs from text using the tokenizer provided."""
    # Encode each tweet separately, padding it to max_len tokens
    inps = [tokenizer.encode_plus(t, max_length=max_len, pad_to_max_length=True, add_special_tokens=True) for t in tweets]
    inp_tok = np.array([a['input_ids'] for a in inps])
    ids = np.array([a['attention_mask'] for a in inps])  # attention masks, despite the name
    segments = np.array([a['token_type_ids'] for a in inps])
    return inp_tok, ids, segments

inp_tok, ids, segments = get_inputs(X_train, xlnet_tokenizer)
Running it fails with this error:
AttributeError: 'NoneType' object has no attribute 'encode_plus'
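The error message means that the tokenizer object itself is None, i.e. the `from_pretrained` call did not return a tokenizer. The traceback alone does not say why. One commonly reported cause, stated here as an assumption rather than a confirmed diagnosis, is that the `sentencepiece` package XLNet's tokenizer depends on is missing, or was installed after `transformers` had already been imported (in which case restarting the runtime is usually suggested). A minimal sketch of an early check follows; `sentencepiece_installed` and `check_tokenizer` are hypothetical helper names, not part of any library.

```python
import importlib.util


def sentencepiece_installed():
    """Return True if the import system can locate the sentencepiece package."""
    return importlib.util.find_spec("sentencepiece") is not None


def check_tokenizer(tokenizer):
    """Hypothetical helper: fail early with a readable hint if the tokenizer is None."""
    if tokenizer is None:
        if sentencepiece_installed():
            # Assumption: the package was installed after transformers was imported
            hint = "sentencepiece is installed; try restarting the runtime/kernel"
        else:
            hint = "sentencepiece is not installed; try `pip install sentencepiece`"
        raise RuntimeError(f"Tokenizer failed to load ({hint}).")
    return tokenizer
```

It could be used as `xlnet_tokenizer = check_tokenizer(XLNetTokenizer.from_pretrained(xlnet_model))`, so the failure shows up where the tokenizer is created rather than inside `get_inputs`. Separately, if X_train is a pandas DataFrame, iterating over it yields column names rather than rows, so passing the column itself (for example `X_train.iloc[:, 0]`) is probably what is intended.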