How to use is_split_into_words with Huggingface NER pipeline

I am using Huggingface transformers for NER, following this excellent guide: https://huggingface.co/blog/how-to-train.

My incoming text has already been split into words. When tokenizing during training/fine-tuning I can use tokenizer(text, is_split_into_words=True) to tokenize the pre-split text. However, I can't figure out how to do the same when using a pipeline for prediction.

For example, the following works (but requires incoming text to be a string):

s1 = "Here is a sentence"
p1 = pipeline("ner", model=model, tokenizer=tokenizer)
p1(s1)

But the following raises an error: Exception: Impossible to guess which tokenizer to use. Please provide a PreTrainedTokenizer class or a path/identifier to a pretrained tokenizer.

s2 = "Here is a sentence".split()
toks = tokenizer(s2, is_split_into_words=True)
p2 = pipeline("ner", model=model)
p2(toks)

I don't want to join the incoming words into one string, because whitespace is significant in my use case: post-processing the pipeline's outputs becomes complicated if I pass in a single string rather than a list of words.

Any advice on how I can use the is_split_into_words=True functionality in a pipeline?

Topic: huggingface, transformer, named-entity-recognition

Category: Data Science


If you are not set on this particular model for NER, there are models that handle multi-sentence texts straight away, without any manual splitting.
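If you do stay with this model, one common workaround (not an official pipeline feature) is to join the words yourself while recording each word's character span, run the pipeline on the joined string, and map the character offsets in the pipeline's output back to word indices. The helper names below (join_with_offsets, char_span_to_word) are hypothetical, and this is only a minimal sketch of the idea:

```python
def join_with_offsets(words, sep=" "):
    # Join pre-split words into one string, recording each word's
    # (start, end) character span in the joined text.
    spans, parts, pos = [], [], 0
    for w in words:
        if parts:
            parts.append(sep)
            pos += len(sep)
        spans.append((pos, pos + len(w)))
        parts.append(w)
        pos += len(w)
    return "".join(parts), spans

def char_span_to_word(spans, start, end):
    # Map a character range (e.g. an entity's start/end offsets from
    # the pipeline output) back to the index of the containing word.
    for i, (s, e) in enumerate(spans):
        if s <= start and end <= e:
            return i
    return None

words = ["Here", "is", "a", "sentence"]
text, spans = join_with_offsets(words)
# entities = p1(text)  # each entity dict carries "start"/"end" offsets
# word_idx = char_span_to_word(spans, entities[0]["start"], entities[0]["end"])
```

If the original whitespace matters, record the actual separators between words instead of a fixed single space; the mapping logic stays the same, since the NER pipeline reports character offsets into whatever string you give it.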
