Analysing misreads of 5-digit numbers

This has nothing to do with number recognition in the classical 'hand-written' sense.

I'm stating that up front so this isn't counted as a duplicate.

I have a set of 96 serial numbers, and a separate set of more than 220 serial numbers. The larger set typically contains the numbers of the smaller set (though not always), but also about 120 incorrect numbers.

See below for an example. For the record, I have matched things up as best I can: the correct number comes first, and the 'possibles' are in parentheses to the right:

21490 (21490, 21400, 21498, 21499, 21480, 21488)

21491 (21401, 21481, 1401)

21492 (21492, 21402)

This set gives a good example of the type of thing I'm seeing:

  1. A digit being misread the same way (0 -> 9 and 0 -> 8)

  2. Sometimes a digit is missed entirely

  3. Sometimes the right number isn't read at all...

It's not limited to 0s, 8s and 9s, but these are the worst. I'd like to understand which digits are problematic (and give each of them a score), then build a model that takes a number and a list of the numbers it CAN be, and tells me which number it should be, ideally with a confidence metric.

Anyone done this before and have any ideas?

Topic jupyter numerical python

Category Data Science


The first step would be to measure how similar a candidate number is to each number in the reference list. I think this is a perfect case for a character-based string similarity measure, typically the Levenshtein edit distance.
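As a rough illustration of that first step, here is a minimal sketch in plain Python. It assumes the serial numbers are handled as strings, so that a dropped digit stays visible. The function names `levenshtein` and `rank_references` are made up for this example and do not come from any library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        prev = curr
    return prev[-1]


def rank_references(read, references):
    """Return (reference, distance) pairs, closest reference first."""
    scored = [(ref, levenshtein(read, ref)) for ref in references]
    return sorted(scored, key=lambda pair: pair[1])


# The three correct numbers from the question.
references = ["21490", "21491", "21492"]
print(rank_references("1401", references))
print(rank_references("21498", references))
```

With these three references, the read 1401 is closest to 21491 (distance 2), whereas 21498 is at distance 1 from all three. Plain edit distance treats every substitution as equally likely, so ties like this one are common.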

If several matches are possible, there could be a second step that predicts the most likely match, maybe based on the frequencies of the numbers.
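One possible reading of that second step, sketched below, is to use the frequency of digit confusions rather than of whole numbers. It counts how often each true digit was read as each other digit in pairs that have already been matched by hand, then uses those counts to score the candidates. This is only an assumption about what "frequencies" could mean here. The function names, the add-one smoothing and the restriction to equal-length reads are all choices made for the illustration. The training pairs are the equal-length examples from the question.

```python
from collections import Counter

DIGITS = "0123456789"


def confusion_counts(labelled_pairs):
    """Count (true_digit, read_digit) occurrences over (correct, read) pairs."""
    counts = Counter()
    for correct, read in labelled_pairs:
        if len(correct) != len(read):
            continue  # dropped digits would need an alignment step; skipped here
        counts.update(zip(correct, read))
    return counts


def likelihood(correct, read, counts, smoothing=1.0):
    """Rough estimate of P(read | correct) as a product of per-digit
    probabilities, smoothed so unseen confusions are not impossible."""
    p = 1.0
    for c, r in zip(correct, read):
        total = sum(counts[(c, d)] for d in DIGITS)
        p *= (counts[(c, r)] + smoothing) / (total + len(DIGITS) * smoothing)
    return p


def best_match(read, references, counts):
    """Return (best_reference, confidence); confidence is the best score
    divided by the sum of scores over all same-length references."""
    scores = {ref: likelihood(ref, read, counts)
              for ref in references if len(ref) == len(read)}
    norm = sum(scores.values())
    if not scores or norm == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / norm


# (correct, read) pairs taken from the example in the question.
pairs = [
    ("21490", "21490"), ("21490", "21400"), ("21490", "21498"),
    ("21490", "21499"), ("21490", "21480"), ("21490", "21488"),
    ("21491", "21401"), ("21491", "21481"),
    ("21492", "21492"), ("21492", "21402"),
]
counts = confusion_counts(pairs)
print(best_match("21498", ["21490", "21491", "21492"], counts))
```

A read such as 21498 is one substitution away from all three references, but because 0 was seen read as 8 in the matched pairs and 1 or 2 never were, this scoring prefers 21490. The confidence is only relative to the supplied reference list, and with so few training pairs it should be treated as indicative at best.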
