Vectorized String Distance

I am looking for a way to calculate the string distance between two Pandas dataframe columns in a vectorized way. I tried the distance and textdistance libraries, but they require df.apply, which is incredibly slow. Do you know of any way to compute a string distance using only column operations?

Thanks

Topic: numpy, distance, preprocessing, pandas

Category: Data Science


I found that performance varies greatly across string-distance libraries: https://github.com/life4/textdistance#benchmarks

The python-Levenshtein library is lightning fast compared to the others, so I will use that one. If it's not sufficient, I will use parallelism as suggested by @Peter.
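For the "without df.apply" part, a plain list comprehension over the zipped columns already avoids the per-row apply overhead. Here is a minimal sketch; a pure-Python Levenshtein is included only so the example is self-contained, and with python-Levenshtein installed you would swap it for `Levenshtein.distance` to get the C-speed version:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Two "columns" (for a DataFrame, use df['text1'].tolist(), df['text2'].tolist())
text1 = ["some text", "foo", "bar", "new text", "more words"]
text2 = ["text", "hello", "bar", "bar", "move words"]

# One comprehension over the zipped columns - no per-row df.apply machinery
dists = [levenshtein(a, b) for a, b in zip(text1, text2)]
print(dists)  # [5, 4, 0, 8, 1]
```

The comprehension is the same pattern whichever distance function you plug in, so benchmarking libraries (as in the link above) and swapping the inner function is all that changes.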


I had a similar problem and tried parallel computing with joblib. In terms of performance this approach seems okay. However, joblib appears to hold on to RAM when the parallel call is repeated many times, so I'm open to alternatives (or suggestions on how to terminate the parallel jobs properly).

from joblib import Parallel, delayed
import distance
import pandas as pd
# https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

# Define some distance measure
def calc_dist(myrow):
    return distance.levenshtein(myrow[0], myrow[1])

# Some fake data
df = pd.DataFrame({
     "text1":["some text","foo","bar","new text","more words"], 
     "text2":["text","hello","bar","bar","move words"]})

# Zip the two columns into a list of (text1, text2) pairs
pairs = list(zip(df['text1'], df['text2']))

# Calculate distances in parallel
dist_vec = Parallel(n_jobs=2)(delayed(calc_dist)(p) for p in pairs)

print(dist_vec)
> [5, 4, 0, 8, 1]
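On the cleanup question: the standard library's concurrent.futures offers the same map-over-pairs pattern, and its `with` block guarantees the worker pool is shut down when the block exits, so memory is not held across repeated calls. A sketch under those assumptions; a length-difference stand-in replaces `distance.levenshtein` so the snippet runs without third-party packages:

```python
from concurrent.futures import ThreadPoolExecutor

def calc_dist(pair):
    # Stand-in for distance.levenshtein(pair[0], pair[1]); swap in the
    # real measure in practice.
    a, b = pair
    return abs(len(a) - len(b))

pairs = [("some text", "text"), ("foo", "hello"), ("bar", "bar")]

# The with-block calls executor.shutdown(wait=True) on exit, so the
# workers (and their memory) are released even when this runs in a loop.
with ThreadPoolExecutor(max_workers=2) as executor:
    dist_vec = list(executor.map(calc_dist, pairs))

print(dist_vec)  # [5, 2, 0]
```

For a CPU-bound pure-Python distance function, `ProcessPoolExecutor` is the drop-in replacement (same API); threads are shown here only to keep the sketch self-contained.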
