Efficiently Sending Two Series to a Function For Strings with an application to String Matching (Dice Coefficient)
I am using a Dice Coefficient based function to calculate the similarity of two strings:
def dice_coefficient(a, b):
    # Anything without a length (e.g. NaN) or empty scores 0.0.
    try:
        if not len(a) or not len(b): return 0.0
    except TypeError:
        return 0.0
    if a == b: return 1.0
    if len(a) == 1 or len(b) == 1: return 0.0
    # Sorted lists of adjacent character pairs (bigrams).
    a_bigram_list = [a[i:i+2] for i in range(len(a)-1)]
    b_bigram_list = [b[i:i+2] for i in range(len(b)-1)]
    a_bigram_list.sort()
    b_bigram_list.sort()
    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    # Walk both sorted lists in step, counting shared bigrams.
    matches = i = j = 0
    while (i < lena and j < lenb):
        if a_bigram_list[i] == b_bigram_list[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigram_list[i] < b_bigram_list[j]:
            i += 1
        else:
            j += 1
    score = float(matches)/float(lena + lenb)
    return score
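For reference, the sorted-merge loop above counts each shared bigram as many times as it occurs in both strings, which amounts to a multiset intersection. A minimal sketch of what I believe is the same score using `collections.Counter` follows. The names `bigrams` and `dice_counter` are hypothetical, and the sketch assumes that any non-string input (such as NaN) should score 0.0:

```python
from collections import Counter


def bigrams(s):
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_counter(a, b):
    # Assumption: non-string or empty inputs score 0.0.
    if not isinstance(a, str) or not isinstance(b, str) or not a or not b:
        return 0.0
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:
        return 0.0
    ca, cb = bigrams(a), bigrams(b)
    # `&` keeps the minimum count of each bigram present in both multisets.
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / (sum(ca.values()) + sum(cb.values()))
```

This does not change the overall cost of comparing every pair, but it avoids the two sorts per call.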
However, I am trying to find the best match out of a large list of candidates, and I want to use a list comprehension, map, or vectorization to apply the function to a whole series of strings efficiently. So far I am having difficulty getting the run time into a reasonable ballpark, even for medium-sized series (10K-100K elements).
I want to pass two input series through the function and get, for each entry in dflist1, the best match among all candidates in a second series, dflist2. Ideally, but not necessarily, the result would be a new column in the dflist1 dataframe, with the best score returned in another column. I have a working implementation of this (below), but it is incredibly slow. Is it also possible to parallelize this? I think this would be a hugely valuable problem to solve, as it would perform the same function that reconcile csv currently does.
import time
import pandas as pd

dflist1 = pd.read_csv('\\list1.csv', header=0, encoding="ISO-8859-1")
# Note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there.
dflist2 = pd.read_csv('\\list2.csv', header=0, error_bad_lines=False)
dflist1['Best Match'] = 'NA'
dflist1['Best Score'] = 0.0
start = time.time()
for index, row in dflist1.iterrows():
    # Score this master string against every target string.
    d = [dice_coefficient(row['MasterList'], target) for target in dflist2['TargetList']]
    best = d.index(max(d))
    # Use .loc to avoid chained-assignment problems.
    dflist1.loc[index, 'Best Match'] = dflist2['TargetList'].iloc[best]
    dflist1.loc[index, 'Best Score'] = d[best]
    print('Finished '+str(index)+' out of '+str(len(dflist1.index))+' matches after '+str(round(time.time() - start))+' seconds.')
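One direction that might reduce the run time is to build the bigram multisets for the target list once, rather than rebuilding and sorting them for every comparison. The sketch below is only an illustration under stated assumptions: `bigram_counts` and `best_matches` are hypothetical names, non-string values are treated as having no bigrams, and a master string with no positive score gets `None` as its match (the loop above would instead return the first target).

```python
from collections import Counter


def bigram_counts(s):
    """Return (bigram multiset, bigram count); empty for non-strings or short strings."""
    if not isinstance(s, str) or len(s) < 2:
        return Counter(), 0
    return Counter(s[i:i + 2] for i in range(len(s) - 1)), len(s) - 1


def best_matches(masters, targets):
    """For each master string, return (best target, best Dice score)."""
    # Precompute target bigrams once; they are reused for every master string.
    prepared = [(t, *bigram_counts(t)) for t in targets]
    results = []
    for m in masters:
        cm, nm = bigram_counts(m)
        best_t, best_s = None, 0.0
        for t, ct, nt in prepared:
            if nm == 0 or nt == 0:
                # No bigrams on one side: only identical non-empty strings match.
                score = 1.0 if isinstance(m, str) and m and m == t else 0.0
            else:
                score = 2.0 * sum((cm & ct).values()) / (nm + nt)
            if score > best_s:
                best_t, best_s = t, score
        results.append((best_t, best_s))
    return results
```

With the dataframes above, something like `dflist1['Best Match'], dflist1['Best Score'] = zip(*best_matches(dflist1['MasterList'], dflist2['TargetList']))` should fill both columns. The work is still proportional to the product of the two list lengths, but each master string is scored independently, so chunks of the master list could in principle be handed to separate worker processes (for example with `multiprocessing.Pool`).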
Any help would be appreciated very much!
Topic jaccard-coefficient pandas python parallel efficiency
Category Data Science