Vectorized String Distance

I am looking for a way to calculate the string distance between two Pandas dataframe columns in a vectorized way. I tried the distance and textdistance libraries, but they require df.apply, which is incredibly slow. Do you know of any way to compute a string distance using only column operations?

Thanks

Topic: numpy, distance, preprocessing, pandas

Category: Data Science


I found that performance varies greatly across string-distance libraries: https://github.com/life4/textdistance#benchmarks

The python-Levenshtein library is lightning fast compared to the others, so I will use that one. If it's not sufficient, I will use parallelism as suggested by @Peter.
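For the "without df.apply" part, a plain list comprehension over the zipped columns already avoids the per-row apply overhead. Here is a minimal sketch; a pure-Python Levenshtein is included only so the example is self-contained, and with python-Levenshtein installed you would swap it for `Levenshtein.distance` to get the C-speed version:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Two "columns" (for a DataFrame, use df['text1'].tolist(), df['text2'].tolist())
text1 = ["some text", "foo", "bar", "new text", "more words"]
text2 = ["text", "hello", "bar", "bar", "move words"]

# One comprehension over the zipped columns - no per-row df.apply machinery
dists = [levenshtein(a, b) for a, b in zip(text1, text2)]
print(dists)  # [5, 4, 0, 8, 1]
```

The comprehension is the same pattern whichever distance function you plug in, so benchmarking libraries (as in the link above) and swapping the inner function is all that changes.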


I had a similar problem and tried parallel computing with joblib. In terms of performance this approach seems okay. However, joblib appears to hold on to RAM when the parallel call is repeated many times, so I'm open to alternatives (or suggestions on how to terminate the parallel jobs properly).

from joblib import Parallel, delayed
import distance
import pandas as pd
# https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

# Define some distance measure
def calc_dist(myrow):
    return distance.levenshtein(myrow[0], myrow[1])

# Some fake data
df = pd.DataFrame({
     "text1":["some text","foo","bar","new text","more words"], 
     "text2":["text","hello","bar","bar","move words"]})

# Zip the two columns into a list of (text1, text2) pairs
pairs = list(zip(df['text1'], df['text2']))

# Calculate distances in parallel
dist_vec = Parallel(n_jobs=2)(delayed(calc_dist)(p) for p in pairs)

print(dist_vec)
> [5, 4, 0, 8, 1]
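On the cleanup question: the standard library's concurrent.futures offers the same map-over-pairs pattern, and its `with` block guarantees the worker pool is shut down when the block exits, so memory is not held across repeated calls. A sketch under those assumptions; a length-difference stand-in replaces `distance.levenshtein` so the snippet runs without third-party packages:

```python
from concurrent.futures import ThreadPoolExecutor

def calc_dist(pair):
    # Stand-in for distance.levenshtein(pair[0], pair[1]); swap in the
    # real measure in practice.
    a, b = pair
    return abs(len(a) - len(b))

pairs = [("some text", "text"), ("foo", "hello"), ("bar", "bar")]

# The with-block calls executor.shutdown(wait=True) on exit, so the
# workers (and their memory) are released even when this runs in a loop.
with ThreadPoolExecutor(max_workers=2) as executor:
    dist_vec = list(executor.map(calc_dist, pairs))

print(dist_vec)  # [5, 2, 0]
```

For a CPU-bound pure-Python distance function, `ProcessPoolExecutor` is the drop-in replacement (same API); threads are shown here only to keep the sketch self-contained.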
