Efficient way to compare one record to millions of rows

We have a production table that contains customer data. The same customer/person can appear as one record at location A and another at location B. The records differ in how the name is spelled, in address formatting ("Lane" vs. "Ln"), and ultimately in the customer ID (the PK/UID).

We built a query that pulls the customer data into a staging table, then runs a similarity-coefficient library to check each record in the staging table against every record in the production table. When the similarity coefficient meets the threshold we have set, we feel confident the two records are the same customer; that customer is written to a golden table as a single record, and we move on to the next staging record and repeat the check.
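For context, this is roughly the shape of the comparison we are running today (a minimal sketch, assuming Python and the rapidfuzz library; the column names and the threshold are illustrative, not our actual values):

```python
# Minimal sketch of the pairwise check described above.
# Assumes Python + rapidfuzz; column names and threshold are illustrative.
from rapidfuzz import fuzz

THRESHOLD = 90  # illustrative similarity cutoff (0-100 scale)

def same_customer(staging_row, prod_row):
    """Compare the name and address text of two records with a similarity coefficient."""
    name_score = fuzz.token_sort_ratio(staging_row["name"], prod_row["name"])
    addr_score = fuzz.token_sort_ratio(staging_row["address"], prod_row["address"])
    return name_score >= THRESHOLD and addr_score >= THRESHOLD

def match_staging(staging_rows, production_rows):
    """Naive loop: every staging record is checked against every production record."""
    golden = []
    for s in staging_rows:
        for p in production_rows:
            if same_customer(s, p):
                # First match above the threshold is treated as the same customer.
                golden.append((s["customer_id"], p["customer_id"]))
                break
    return golden
```

Because each staging record is compared against every production record, the number of comparisons grows as (staging rows) x (production rows), which is where the time goes.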

Our current process has handled about 300 rows in a day. As you can tell, this is far too slow to be workable.

Has anyone solved a similar problem, and if so, how did you handle it?

Thank you!

Topic: data-cleaning, bigdata

Category: Data Science
