How to evaluate the similarity of two columns containing strings?
I am new to text processing and am stuck on the problem of measuring how similar two columns are. To illustrate, consider two columns of string values:
Column A | Column B
---------+---------
abcd     | xyz
foo      | bar
xyzzy    | acct
xyz      | world
onex     | foo
...      | ...
...      | ...
Each column can contain on the order of thousands of values. Is there a way to measure how similar the two columns are?
Currently, I create MinHash signatures for both columns and compute the Jaccard similarity between the signatures. The problem is that the similarity scores come out too low, even for columns that have a considerable overlap of values.
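To make the setup concrete, here is a minimal sketch of this kind of comparison. It is an illustration under assumptions, not a definitive implementation: each column is treated as a set of distinct string values, the helper names (`jaccard`, `minhash_signature`, `minhash_similarity`) are hypothetical, and salted BLAKE2b hashes from the standard library stand in for the random permutations. One detail worth checking in any such setup is that the MinHash estimate is the fraction of signature positions that agree, which is not the same as the Jaccard similarity of the two signatures treated as sets.

```python
import hashlib


def jaccard(col_a, col_b):
    """Exact Jaccard similarity of the distinct values in two columns."""
    set_a, set_b = set(col_a), set(col_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty columns are identical
    return len(set_a & set_b) / len(set_a | set_b)


def _hash(value, seed):
    """64-bit hash of a string; the seed selects one hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(column, num_hashes=128):
    """MinHash signature: the minimum hash value under each hash function.

    Assumes the column contains at least one value.
    """
    distinct = set(column)
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of positions that agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# The five example rows from the table above.
col_a = ["abcd", "foo", "xyzzy", "xyz", "onex"]
col_b = ["xyz", "bar", "acct", "world", "foo"]

exact = jaccard(col_a, col_b)  # 2 shared values out of 8 distinct -> 0.25
estimate = minhash_similarity(minhash_signature(col_a), minhash_signature(col_b))
print(f"exact Jaccard: {exact:.3f}, MinHash estimate: {estimate:.3f}")
```

With only a few thousand values per column, the exact set-based Jaccard is cheap to compute and can serve as a baseline against which the MinHash estimate is checked.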
I then tried building the signatures from only a fraction of the values, namely the most frequently occurring ones, but that does not seem to help either.
Is there any other approach to work on this?
Topic text-processing text
Category Data Science