How to use is_split_into_words with Huggingface NER pipeline
I am using Huggingface transformers for NER, following this excellent guide: https://huggingface.co/blog/how-to-train.
My incoming text has already been split into words. During training/fine-tuning I can tokenize it with tokenizer(text, is_split_into_words=True), but I can't figure out how to do the same in a pipeline for predictions.
For example, the following works (but requires incoming text to be a string):
s1 = "Here is a sentence"
p1 = pipeline("ner", model=model, tokenizer=tokenizer)
p1(s1)
But the following raises Exception: Impossible to guess which tokenizer to use. Please provide a PreTrainedTokenizer class or a path/identifier to a pretrained tokenizer.
s2 = "Here is a sentence".split()
toks = tokenizer(s2, is_split_into_words=True)
p2 = pipeline("ner", model=model)
p2(toks)
I don't want to join the incoming text into one sentence because whitespace is significant in my use case. Post-processing the outputs of the pipeline will be complicated if I just pass in one string rather than a list of words.
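To illustrate what that post-processing would look like: I'd have to join the words, run the pipeline, and then map the character offsets in its output (the start/end fields) back onto my word list. A rough sketch of that mapping, using made-up entity dicts rather than real model output, and assuming a single-space join (which already loses my original whitespace):

```python
# Sketch: map character-offset entity spans back to word indices,
# assuming the text was rebuilt as " ".join(words) and that each
# entity dict carries "start"/"end" offsets like the ner pipeline's
# output does.

def entities_to_word_indices(words, entities):
    # Character span of each word inside " ".join(words)
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w) + 1  # +1 for the joining space

    results = []
    for ent in entities:
        # A word belongs to the entity if its span overlaps [start, end)
        idxs = [i for i, (s, e) in enumerate(spans)
                if s < ent["end"] and e > ent["start"]]
        results.append({"entity": ent["entity"], "word_indices": idxs})
    return results

words = ["Angela", "Merkel", "visited", "Paris"]
# Hypothetical pipeline-style output for " ".join(words)
entities = [
    {"entity": "PER", "start": 0, "end": 13},
    {"entity": "LOC", "start": 22, "end": 27},
]
print(entities_to_word_indices(words, entities))
# [{'entity': 'PER', 'word_indices': [0, 1]}, {'entity': 'LOC', 'word_indices': [3]}]
```

This is exactly the bookkeeping I'm hoping is_split_into_words=True would let me avoid.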
Any advice on how I can use the is_split_into_words=True functionality in the pipeline?
Topic huggingface transformer named-entity-recognition
Category Data Science