Data Entry Automation with ML

I am working on a data entry task with approximately 6000 entries to go over.

The source comes in the form of a string and can look something like this:

Air Canada B737 FFS

From this I can extract the following information:

Company: Air Canada

Model: B737

Technology: FFS

For my initial plan of attack I iterated over the source strings using Regular Expression to extract as many keywords as possible, the problem is there are so many different Companies, Models and Technologies in my source, and not every source is as clean as the one above so it would take forever to write out all the possible regular expressions. Also, there is usually text within the source that isn't important, a red herring if you will.

As a result I have manually filled out 2000 of the entries myself.

My question is, do you think I could train a ML model with my manually filled dataset to do the rest of this task for me? I would consider myself to have intermediate skills in python, but what type of ML algo should I use for this task? Any help to point me in the right direction would be very much appreciated!

Topic text-mining machine-learning

Category Data Science


So, you can build a pretty good starter set of airlines from this wikipedia page. If you scraped the page, extracted the Airline column, and lowercased the text, you would be able to have some seed data to work with.

This is also a shot in the dark, but you could scrape the model names from this toy/model planes website, also generating some base set of strings you can use to automatically validate the first two fields in your output.

The trick to reducing your time here is to figure out the 80/20 (get 80% of the work done with 20% of the effort). I have no answers for how you would be able to differentiate the last field. But these are the steps I would take to reduce the effort altogether

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.