Performing actual deduplication using LSH

I have a huge dataset (10M text files) that I am trying to de-duplicate - not only exact duplicates, but also near-duplicates, given some similarity threshold. I know that the LSH (locality-sensitive hashing) algorithm would be a good option, but I don't know how to tackle the last phase of the processing.

Currently, I have the following:

  1. Generate signatures for all of the text files
  2. Compute hashes (perform the LSH)
  3. Group documents that fall into the same hash bucket
  4. Within each group, find near-duplicate documents by computing their pairwise similarity and applying the threshold (a rough sketch of these steps follows below).
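
To make this concrete, here is roughly what the pipeline looks like, sketched with the datasketch library (the shingle size, num_perm and the 0.8 threshold below are just placeholder values, not my real settings):

```python
# Rough sketch of steps 1-4 with the datasketch library.
# Shingle size, num_perm and the 0.8 threshold are placeholder values.
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def signature(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf8"))
    return m

docs = {
    "doc1": "the quick brown fox",
    "doc2": "the quick brown foxes",
    "doc3": "something completely different",
}

# Step 1: signatures for every document
sigs = {doc_id: signature(text) for doc_id, text in docs.items()}

# Steps 2-3: banded LSH; documents whose bands collide share a bucket
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for doc_id, sig in sigs.items():
    lsh.insert(doc_id, sig)

# Step 4: query each document's buckets and keep pairs above the threshold
pairs = set()
for doc_id, sig in sigs.items():
    for other in lsh.query(sig):
        if other != doc_id and sig.jaccard(sigs[other]) >= 0.8:
            pairs.add(tuple(sorted((doc_id, other))))

print(pairs)  # pairs of near-duplicate document ids
```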

The steps described above give me a large list of file pairs that were marked as near-duplicates. The next step is to actually remove those duplicates.

I have tried 3 approaches:

#1 Use a naive approach and drop every document that appears in at least one pair, but that means I drop a lot of data (more than necessary).

#2 Arbitrarily remove one file from each pair, but that leaves some of the actual duplicates in place - see the example:

Input pairs (marked as duplicates):

A, B
A, C
C, D
E, F

After keeping one file from each pair (e.g. the first) and dropping the other, I would be left with:

A, C, E

which leaves both A and C in the dataset, even though they were marked as duplicates.
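
The inconsistency is easy to reproduce with a few lines of Python on the example pairs above:

```python
pairs = [("A", "B"), ("A", "C"), ("C", "D"), ("E", "F")]

# Keep the first file of each pair, drop the second.
kept = {first for first, second in pairs}      # {'A', 'C', 'E'}

# Both members of ('A', 'C') survive although the pair was marked as duplicates.
print([p for p in pairs if p[0] in kept and p[1] in kept])  # [('A', 'C')]
```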

#3 Build a graph, treating the pairs as edges, find the connected components, and keep one file from each component. This gives the correct result, but it is computationally expensive (in both CPU and RAM) and does not scale to my dataset size.
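
For illustration, the grouping step of #3 boils down to something like this union-find sketch over the example pairs (my real pair list is of course far larger):

```python
# Sketch of approach #3: pairs as edges, connected components via union-find,
# keep one representative file per component.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:                  # path halving
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

pairs = [("A", "B"), ("A", "C"), ("C", "D"), ("E", "F")]
for a, b in pairs:
    union(a, b)

keep = {find(x) for x in parent}           # one file per component
drop = set(parent) - keep                  # everything else is removed
print(keep, drop)
```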

How can I efficiently remove near-duplicate files from my dataset while preserving as much data as possible?

Tags: similar-documents, preprocessing, similarity, data-mining
