How to evaluate the similarity of two columns containing strings?

I am new to text processing and am stuck on the problem of measuring how similar two columns are. Concretely, suppose we have two columns of string values:

Column A      |        Column B
-------------------------------
abcd          |          xyz
foo           |          bar
xyzzy         |          acct
xyz           |          world
onex          |          foo
...           |          ...
...           |          ...

Each column can contain on the order of thousands of values. Is there an approach to measure how similar the columns are?

Currently, I am creating MinHash signatures for both columns and computing the Jaccard similarity between the signatures. The problem is that the similarity scores come out too low, even for columns with a considerable overlap of values.
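For reference, the quantity that a MinHash signature approximates is the exact Jaccard similarity of the two columns' sets of distinct values. With only thousands of values it is cheap to compute directly, which gives a baseline to compare the MinHash estimate against. A minimal sketch, assuming each column is a plain Python list of strings (the names `jaccard`, `column_a` and `column_b` are hypothetical and not from any library):

```python
def jaccard(col_a, col_b):
    """Exact Jaccard similarity of the distinct values in two columns."""
    set_a, set_b = set(col_a), set(col_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty columns count as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Sample rows from the table above (the elided "..." rows are left out).
column_a = ["abcd", "foo", "xyzzy", "xyz", "onex"]
column_b = ["xyz", "bar", "acct", "world", "foo"]

print(jaccard(column_a, column_b))  # 2 shared values / 8 distinct values = 0.25
```

Note that Jaccard divides by the size of the union, so two of five values being shared already yields only 0.25. This may be one reason the scores look low even when the overlap seems substantial.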

Then I tried creating the signatures from only a fraction of the values, taking the most frequently occurring ones, but that does not seem to help either.

Is there any other approach to this problem?



You could use similarity metrics for strings. There are a number of "off the shelf" packages to compare string similarity, such as stringdist for R.

For instance, its stringsim function returns a similarity score between 0 and 1 for a pair of strings, and its method argument selects the metric to use.

Example (in R):

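library(stringdist)

# Default method is "osa" (optimal string alignment distance)
stringsim("cat", "catfish")
#> [1] 0.4285714

# Also works with vectors: elements are compared pairwise
df = data.frame(a=c("cat","dog","tree"),b=c("catfish","hotdog","forest"))

# Jaccard similarity on the sets of single characters (q = 1 by default)
stringsim(df$a,df$b, method="jaccard")
#> [1] 0.4285714 0.6000000 0.5000000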
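The R example above compares strings row by row. To get a single score for two whole columns, one option is to take, for each distinct value in one column, its best fuzzy match in the other column and average those scores. The sketch below is only an illustration of that idea in Python: it uses the standard library's `difflib.SequenceMatcher` ratio, which is a different metric from the ones in stringdist, and the function names `best_match_score` and `column_similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def best_match_score(value, candidates):
    """Highest similarity ratio between `value` and any candidate string."""
    return max(
        (SequenceMatcher(None, value, c).ratio() for c in candidates),
        default=0.0,  # no candidates at all means no match
    )


def column_similarity(col_a, col_b):
    """Mean best-match score of each distinct value in col_a against col_b."""
    distinct_a, distinct_b = set(col_a), set(col_b)
    if not distinct_a:
        return 0.0
    total = sum(best_match_score(v, distinct_b) for v in distinct_a)
    return total / len(distinct_a)


col_a = ["cat", "dog", "tree"]
col_b = ["catfish", "hotdog", "forest"]
print(column_similarity(col_a, col_b))
```

The score is asymmetric (it measures how well col_a is covered by col_b), and it compares every distinct value in one column with every distinct value in the other, so for columns with thousands of values it can take noticeably longer than a set-based measure.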

Also see this github-repo for fuzzy-matching etc.
