Fuzzily join two large sets of postal addresses
I have two tables of postal address information - the one is about 2 million records, the other roughly 40 million. They have quite bad quality, and also are not quite compatible with each other (different conventions in both sets, some fields cut off in an impractical way... - in other words, Real World Data).
They may not be the largest ones around, but compared to the available hardware, they are non-trivial (I cannot simply spin up a lot of worker nodes, or do a simple $O(n²)$ loop...).
I wish to join-update the two tables on the postal address. Something like
select table1.id, table2.id
from table1, table2
where fuzzy_match([table1.street, table1.city, ...],
[table1.street, table1.city, ...]) $MINIMUM_DISTANCE;
The match is then used to store table2.id
as a reference in table1
.
Obviously this is only pseudo-code. I'm looking for a tool to do this (either some existing solution, or a library to use in our application, or hints towards a non-naive algorithm or data structure to "DIY" it).
I need to do this at least once/initially over the full data, and then periodically for deltas/updates.
This is the technical background:
- Linux server
- Oracle RDBMS
- The programming environment is primarily Ruby-based, but other tools and languages can be installed if open source.
I can think of the following options:
- Oracle Text contains a fuzzy search. But this cannot be used for non-technical reasons outside of the scope of this question.
- Elasticsearch has a fuzzy query based on Levenshtein. I'm in the process of getting my data into that and trying it for fitness and performance.
- Program it myself in Ruby or other languages, although doing it from (almost) first principles does not seem too appealing to me - I'd hope that problem has been solved somewhere already.
- Online solutions or closed source non-free solutions are not applicable, e.g. I cannot upload any of the data to Google Maps or whatever along that line.
Do you have any hint on how to approach this? I'm happy for tool/library suggestions, or a suggestion for some data structure, an algorithm which I can then research/implement myself.
I am aware of How to do postal addresses fuzzy matching? and will probably also try out https://github.com/openvenues/libpostal as a base, but am afraid that that is mostly meant to be used for individual (?) addresses, so I'll run into $O(n^2)$ problems again...
Topic oracle fuzzy-logic dataset
Category Data Science