Check similarity of table/csv of Product Names
We've got a list of approximately 18,000 product names (they're from 80-90 sources, so quite a few that are similar but not duplicates - these were picked as DISTINCT from a table) unfortunately there are different ways of expressing these names. We have to try and normalize the dataset so we present our users with more meaningful names.
For example, a list like this:
Canon EOS 5D Mark III
Canon EOS 5D mk III
Canon EOS 5DMK3
Canon EF 70-200mm f/2.8L IS II USM Lens
Canon EF 70-200mm f/2.8L IS II USM Telephoto Zoom Lens
Canon EF 70-200mm f/2.8L IS USM Lens
Canon EF 70-200mm f/4L USM Lens
I'd like to be able to assess those strings and collapse them into something like this:
Canon EOS 5D Mark III
Canon EF 70-200mm f/2.8L IS II USM
Canon EF 70-200mm f/2.8L IS USM Lens
Canon EF 70-200mm f/4L USM Lens
But I'd like to know how similar two strings are to be able to determine this. I do realise that the F2.8 IS II and IS USM may be a bit hard, but thought I'd throw it in.
The real product names are far less exciting (they're parts for the farm equipment we stock).
We also store these names in a Postgres (9.5) database table. Examples i've seen compare two lists, but we don't have a master product list to do that unfortunately.
Topic text-filter fuzzy-logic similarity
Category Data Science