How to train a model to predict if 2 samples refer to the same thing?

I have two databases with around 60,000 samples each. Both have the same features (same column names), which describe particular things with text or categories (turned into numbers). Each sample in a database is assumed to refer to a different thing. But some objects are represented in both databases, with somewhat different values in the same-name column (for example, a different free-text description, or a different assigned category).

The aim is to train a machine learning model that recognizes when the two “descriptions” refer to the same thing.

We have manually identified a few thousand duplicated cases and labelled them accordingly. This label is the target variable to learn. Yet the supervised classification examples I have seen work with a single data frame and/or predict something about a single row of features.

How does it work when we are not trying to predict what a sample represents, but the relation between two samples (specifically, whether they refer to the same object or not)?

I do not even know how to feed two dataframes into scikit-learn or auto-sklearn (or similar tools) so that it can recognize whether a sample from one represents the same thing as a sample in the other. Maybe I should concatenate them so that there is just one df, but then it would have to compare every row with all the rest. Or does that not make a big difference?

Any idea or hint about how to proceed here or how to frame the problem better?

Topic automl text-classification feature-engineering supervised-learning

Category Data Science


If you want to take a supervised approach, you can treat this as a binary classification problem where the input is two rows that have been concatenated into one, and the target output is the label indicating whether they are duplicates or not.

You can construct this dataset by:

  • Using the rows that you have identified manually as duplicates as the positive class. You mentioned having "manually recognized a few thousand duplicated cases that we have labelled correspondingly" so I am assuming that you have a way to find pairs of rows that are duplicates.
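  • Using rows that you know are not duplicates as the negative class. You can identify pairs of rows that are not duplicates by, for example, randomly sampling a row from one database and a row from the other, and skipping any pair already labelled as a duplicate. With around 60,000 rows on each side, a random pair is very unlikely to be a true duplicate.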

Then you can fall back on training a binary classifier.
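The construction above can be sketched as follows. This is a minimal illustration, assuming each row is already a list of numbers (as in "categories turned into numbers") and that the manual labels are available as `(index_in_a, index_in_b)` pairs. The function name `build_pair_dataset` and the toy data are hypothetical.

```python
import random

def build_pair_dataset(rows_a, rows_b, known_matches, n_negatives, seed=0):
    """Return (X, y): concatenated row pairs and 1/0 duplicate labels."""
    rng = random.Random(seed)
    matches = set(known_matches)
    X, y = [], []

    # Positive class: the manually labelled duplicate pairs.
    for i, j in sorted(matches):
        X.append(list(rows_a[i]) + list(rows_b[j]))
        y.append(1)

    # Negative class: random cross-database pairs that are not labelled
    # as duplicates. Assumption: an unlabelled random pair is a non-duplicate.
    max_negatives = len(rows_a) * len(rows_b) - len(matches)
    n_negatives = min(n_negatives, max_negatives)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.randrange(len(rows_a)), rng.randrange(len(rows_b)))
        if pair not in matches:
            negatives.add(pair)
    for i, j in sorted(negatives):
        X.append(list(rows_a[i]) + list(rows_b[j]))
        y.append(0)
    return X, y

# Toy numeric rows (hypothetical data, two features per row).
rows_a = [[1, 0.5], [2, 0.1], [3, 0.9]]
rows_b = [[1, 0.6], [4, 0.2], [3, 0.8]]
known_matches = [(0, 0), (2, 2)]

X, y = build_pair_dataset(rows_a, rows_b, known_matches, n_negatives=4)
```

The resulting `X` and `y` have the usual single-table shape, so they can be passed to any scikit-learn classifier, for example `LogisticRegression().fit(X, y)`. Free-text columns would need to be vectorized before concatenation.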

If you want to take an unsupervised approach, you would need a similarity (or distance) function that you can use to compare two rows, for example their Euclidean distance. Then, if two rows have a similarity above a specific threshold (or, equivalently, a distance below one), they could be flagged as possible duplicates.
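A minimal sketch of the thresholding idea, assuming numeric rows of equal length. Euclidean distance measures dissimilarity, so the sketch flags pairs whose distance falls at or below a cutoff. The function name and the `max_distance` value are hypothetical and would need tuning on real data.

```python
import math

def flag_possible_duplicates(rows_a, rows_b, max_distance):
    """Return (i, j, distance) for every cross-database pair within the cutoff."""
    flagged = []
    for i, a in enumerate(rows_a):
        for j, b in enumerate(rows_b):
            d = math.dist(a, b)  # Euclidean distance (Python 3.8+)
            if d <= max_distance:
                flagged.append((i, j, d))
    return flagged

# Toy numeric rows (hypothetical data).
rows_a = [[0.0, 0.0], [5.0, 5.0]]
rows_b = [[0.0, 1.0], [9.0, 9.0]]

flagged = flag_possible_duplicates(rows_a, rows_b, max_distance=1.5)
```

This brute-force loop compares every row with every other row, which for two sets of about 60,000 rows means roughly 3.6 billion comparisons. On data of that size it would likely help to first restrict comparisons to rows that share some cheap key, such as the same category.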


If I understand correctly, it sounds somewhat like an NLP problem. For each row and each column, you would compute the similarity between the corresponding entries in the two dataframes, for example:

Your two dataframes:

Row182, ColumnA, DataFrame1 = XYZ_ABC_123
Row182, ColumnA, DataFrame2 = XYZ_ABC_152

New data frame:

Row182, ColumnA = 0.8

Repeat for all rows and columns. You would then set a threshold on the resulting similarity matrix, chosen according to how accurate your similarity metric is, to highlight the most likely overlaps.

You could have a look at https://www.nltk.org/, but depending on the data entries you could also just use a pure Python solution, e.g. flagging string entries that match by 80% or more.
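As a rough sketch of the pure Python route, the standard library's `difflib.SequenceMatcher` gives a similarity ratio between 0 and 1 for two strings. The helper `column_similarities`, the extra `ColumnB` column and the 0.8 cutoff are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def column_similarities(row_a, row_b):
    """Similarity in [0, 1] for each column shared by two dict rows."""
    return {
        col: SequenceMatcher(None, str(row_a[col]), str(row_b[col])).ratio()
        for col in row_a
        if col in row_b
    }

# One row from each dataframe, using the ColumnA values from the example above.
row_df1 = {"ColumnA": "XYZ_ABC_123", "ColumnB": "red"}
row_df2 = {"ColumnA": "XYZ_ABC_152", "ColumnB": "red"}

scores = column_similarities(row_df1, row_df2)

# Flag the pair only if every column clears the (hypothetical) threshold.
THRESHOLD = 0.8
is_candidate = all(score >= THRESHOLD for score in scores.values())
```

Requiring every column to clear the threshold is only one possible rule. Averaging the per-column scores, or weighting the more reliable columns, may suit some data better.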

I'm guessing at what your data looks like, so some example rows would help.
