Efficient way to compare one record to millions of rows

We have a production table that contains customer data. The same customer/person can appear as one record at location A and another at location B. The records differ in how the name is spelled, in address formatting ("Lane" vs. "Ln"), and ultimately in the customer ID (the PK/UID).

We built a query that pulls the customer data into a staging table, then runs a similarity-coefficient library to check each record in the staging table against every record in the production table. When the similarity coefficient meets the threshold we have set, we feel confident the two records are the same customer; that customer is written to a golden table as a single record, and we move on to the next staging record and repeat the check.
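For context, this is roughly the shape of the comparison we are running today (a minimal sketch, assuming Python and the rapidfuzz library; the column names and the threshold are illustrative, not our actual values):

```python
# Minimal sketch of the pairwise check described above.
# Assumes Python + rapidfuzz; column names and threshold are illustrative.
from rapidfuzz import fuzz

THRESHOLD = 90  # illustrative similarity cutoff (0-100 scale)

def same_customer(staging_row, prod_row):
    """Compare the name and address text of two records with a similarity coefficient."""
    name_score = fuzz.token_sort_ratio(staging_row["name"], prod_row["name"])
    addr_score = fuzz.token_sort_ratio(staging_row["address"], prod_row["address"])
    return name_score >= THRESHOLD and addr_score >= THRESHOLD

def match_staging(staging_rows, production_rows):
    """Naive loop: every staging record is checked against every production record."""
    golden = []
    for s in staging_rows:
        for p in production_rows:
            if same_customer(s, p):
                # First match above the threshold is treated as the same customer.
                golden.append((s["customer_id"], p["customer_id"]))
                break
    return golden
```

Because each staging record is compared against every production record, the number of comparisons grows as (staging rows) x (production rows), which is where the time goes.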

Our current process has handled about 300 rows in a day. As you can tell, this is far too slow to be workable.

Has anyone solved a similar problem, and if so, how did you handle it?

Thank you!

Topic: data-cleaning, bigdata

Category: Data Science
