Best Approach for this Entity Extraction Problem?

Question

AndrewJaeyoung

April 14, 2022, 03:05

Context

I have searched at length for a similar question but haven't found one, so hopefully someone can offer some insight.

I have a task where I'm given a set of employees, each with an alphanumeric ID number. My inputs and labels look like this (this is idealized; the existing entries need a TON of cleaning, but this is how they would look after cleaning):

The Task:

I need to extract the ID number from the Full ID using an entity extraction model, but I am not sure how to go about building it. Deep learning feels like overkill for this problem, yet I have no clue how to extract the ID with a statistical model. I want to use machine learning because a regex is too strict a rule for parsing out the ID number: if a user enters a Full ID incorrectly, the regex will not return the correct ID. A machine learning model that learns rules for extracting an ID would be more robust and more scalable in the long term.
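To make the regex concern concrete, here is a minimal sketch. The Full ID format (a name followed by one uppercase letter and four digits) and the sample strings are assumptions made up for illustration, since the real format is not shown here.

```python
import re

# Hypothetical clean format (an assumption): name, then one uppercase
# letter followed by exactly four digits, e.g. "Jane Doe A1234".
STRICT_ID = re.compile(r"\b[A-Z]\d{4}\b")


def strict_extract(full_id):
    """Return the ID if the input matches the assumed format exactly, else None."""
    match = STRICT_ID.search(full_id)
    return match.group(0) if match else None


# Made-up inputs: one clean entry and three plausible typing mistakes.
samples = ["Jane Doe A1234", "Jane Doe A 1234", "Jane Doe a1234", "Jane Doe A12345"]
for sample in samples:
    print(repr(sample), "->", strict_extract(sample))
```

Only the first sample matches. A stray space, a lowercase letter, or an extra digit each makes the strict pattern return nothing, and loosening the pattern to cover every such case quickly becomes hard to maintain.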

My Attempts:

My initial approach was to vectorize the inputs (Full ID) and treat this as a classification problem with two classes, name and id. I would then keep only the tokens predicted as id as my prediction for the ID. In practice this has not worked as well as I'd hoped, so I am wondering if anyone has ideas on how to approach this. Does it seem feasible to train a model from scratch to do this, or would I have to resort to a predefined workflow (such as spaCy's)?

Note: If possible, I would like to stick to TensorFlow (Keras) and scikit-learn as my only machine learning tools. I am using Pandas, NumPy, and re for cleaning and preprocessing. Thank you in advance.
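For reference, the token-level classification idea can be sketched as per-token feature extraction. This is not the exact vectorization used above; the feature names, the whitespace tokenization, and the sample Full ID are all assumptions chosen for illustration.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i]; the feature choices are illustrative."""
    tok = tokens[i]
    digits = sum(c.isdigit() for c in tok)
    return {
        "has_digit": digits > 0,
        "is_alpha": tok.isalpha(),
        "digit_ratio": digits / len(tok),
        "length": len(tok),
        # Relative position in the string: 0.0 for the first token, 1.0 for the last.
        "position": i / max(len(tokens) - 1, 1),
        "is_last": i == len(tokens) - 1,
    }


def featurize(full_id):
    """Split a Full ID on whitespace and return (tokens, list of feature dicts)."""
    tokens = full_id.split()
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


# Made-up example input.
tokens, feats = featurize("Jane Doe A1234")
for tok, feat in zip(tokens, feats):
    print(tok, feat)
```

Feature dicts like these could be passed to scikit-learn's DictVectorizer and then to a classifier such as LogisticRegression, with one name/id label per token. Note that features computed on each token in isolation ignore neighbouring tokens, which may be one reason a plain classifier struggles here.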

Update 2022-04-13

I figured out a way to do this task in case anyone ever needs to do something like this.

I installed spaCy and started from a blank English pipeline. Put the training data in the exact format spaCy expects, save it as a .spacy file, and then fill out the config file from the website (follow the steps here: https://spacy.io/usage/training). Make sure the config points to the training and validation data, then run the training loop. When training finishes, it writes the model to a new directory, and you can then run inference with that model. I can put code up if anyone requests the specifics of how to do this.
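A minimal sketch of the data-preparation step, assuming spaCy v3. The helper names (find_entity, write_docbin), the sample pairs, and the file name train.spacy are hypothetical; the store_number label is the one used in this post.

```python
def find_entity(text, entity, label="store_number"):
    """Return (start, end, label) character offsets of entity in text, or None."""
    start = text.find(entity)
    if start == -1:
        return None
    return (start, start + len(entity), label)


def write_docbin(pairs, path, label="store_number"):
    """Serialize (full_id, id) pairs to a .spacy file and return the doc count."""
    # spaCy is imported here so that find_entity works without it installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # blank English pipeline: tokenizer only
    doc_bin = DocBin()
    for text, entity in pairs:
        doc = nlp.make_doc(text)
        offsets = find_entity(text, entity, label)
        if offsets is not None:
            span = doc.char_span(offsets[0], offsets[1], label=offsets[2])
            # char_span returns None if the offsets do not align with token boundaries.
            if span is not None:
                doc.ents = [span]
        doc_bin.add(doc)
    doc_bin.to_disk(path)
    return len(doc_bin)


# Made-up example; the real (Full ID, ID) pairs would come from the cleaned data.
print(find_entity("Jane Doe A1234", "A1234"))
# write_docbin([("Jane Doe A1234", "A1234")], "train.spacy")
```

With train.spacy and a matching dev.spacy written this way, a config can be generated with `python -m spacy init config config.cfg --lang en --pipeline ner` and training started with `python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy`. The paths here are placeholders.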

spaCy's model not only classifies the tokens in a document, it also lets us extract the tokens that were given a particular label. I simply labeled the entity in every training instance as store_number, and with the correct preprocessing applied the model is extremely powerful. I ended up with about 90% accuracy and precision on a small number of samples, so a spaCy model is worth trying if you ever need to perform an entity extraction task like this.
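Extraction with the trained model can be sketched as follows. The function names and the output/model-best path are assumptions: spaCy's train command writes model-best and model-last folders into whatever output directory it is given, and the directory name here is a placeholder.

```python
def extract_entities(nlp, text, label="store_number"):
    """Run a pipeline on text and return the text of every entity with the given label."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == label]


def load_model(model_dir="output/model-best"):
    """Load a trained spaCy pipeline from disk; the default path is a placeholder."""
    # Imported here so extract_entities can be used with any compatible pipeline object.
    import spacy

    return spacy.load(model_dir)


# Usage, assuming a trained pipeline exists at the placeholder path:
# nlp = load_model()
# print(extract_entities(nlp, "Jane Doe A1234"))
```

Filtering on ent.label_ keeps the function usable if more entity labels are added to the training data later.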

Topic named-entity-recognition nlp machine-learning

Category Data Science
