Reducing false positives with an annotated named entity recognition model

I am training an NER model to detect specific phrases and slang words for a bias study of court cases.

Essentially, I have packets of scanned text containing the complete court proceedings.

The model is good at detecting the phrases I want, based on annotations I created from the many cases I have already scanned. However, I am seeing false positives for certain phrases.

Here is an example of a phrase I want to tag: "Your honor, my client, the def., pleads guilty." Here is a false positive it has detected: "You are def guilty," said the judge.

It seems that in many cases "def" gets tagged incorrectly. I have not fed the model any training documents where this kind of shortened "def" could appear, and my guess is that this is where my problem lies. I have only trained the model on annotated data; I have not provided it any other data, text documents, or readings.

What do you think I can do to reduce false positives?



In general, providing the model with more negative examples would indeed help. In your example, if the model had seen multiple negative instances containing "def", it would learn not to rely on that token alone and would look for other clues instead. Note that overdoing this could have the opposite effect and cause false negative errors.
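For illustration, here is a minimal sketch of what a mixed positive/negative training set could look like. It assumes a spaCy v3 pipeline and a hypothetical `DEFENDANT` label, since the post does not name the library or label scheme; the key point is simply that the slang usage appears with an empty entity list.

```python
# Sketch: mixing positive and negative examples in NER training data.
# Assumes spaCy v3; the label name "DEFENDANT" is hypothetical.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("DEFENDANT")

train_data = [
    # Positive example: the abbreviation "def." is annotated.
    ("Your honor, my client, the def., pleads guilty.",
     {"entities": [(27, 31, "DEFENDANT")]}),
    # Negative example: slang "def" with an empty entity list, so the
    # model learns that the token by itself is not a reliable clue.
    ("You are def guilty, said the judge.",
     {"entities": []}),
]

optimizer = nlp.initialize(
    lambda: (Example.from_dict(nlp.make_doc(t), a) for t, a in train_data)
)
for _ in range(20):
    for text, ann in train_data:
        nlp.update([Example.from_dict(nlp.make_doc(text), ann)], sgd=optimizer)
```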

Additionally, you could try to improve the features or the preprocessing to help the model distinguish the two usages. For example, if "def." is usually written with the trailing period when used as an abbreviation, keeping that punctuation makes the string distinct from any other use of "def". The period was probably stripped at the tokenization stage, which may be what is confusing the model.
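If you control the tokenizer, one way to sketch this (again assuming spaCy; the post does not say which tokenizer is used) is a special-case rule that keeps "def." as a single token instead of splitting off the period:

```python
# Sketch: preserving the trailing period of the abbreviation "def."
# at tokenization, assuming a spaCy tokenizer.
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Special-case rule: treat "def." as one token rather than "def" + ".",
# so the abbreviation stays distinct from the slang "def".
nlp.tokenizer.add_special_case("def.", [{ORTH: "def."}])

print([t.text for t in nlp("Your honor, my client, the def., pleads guilty.")])
# ['Your', 'honor', ',', 'my', 'client', ',', 'the', 'def.', ',', 'pleads', 'guilty', '.']

print([t.text for t in nlp("You are def guilty, said the judge.")])
# ['You', 'are', 'def', 'guilty', ',', 'said', 'the', 'judge', '.']
```

Of course, this only helps if the scanning/OCR step reliably preserves the period; if it does not, the negative-examples approach above is the more robust fix.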
