Efficiently Applying a String Function to Two Series, with an Application to String Matching (Dice Coefficient)
I am using a Dice Coefficient based function to calculate the similarity of two strings:
def dice_coefficient(a,b):
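    try:
        # Empty strings score 0.0.
        if not len(a) or not len(b): return 0.0
    except TypeError:
        # Non-string input such as NaN has no len(); treat it as no match.
        return 0.0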
    if a == b: return 1.0
    if len(a) == 1 or len(b) == 1: return 0.0
    a_bigram_list = [a[i:i+2] for i in range(len(a)-1)]
    b_bigram_list = [b[i:i+2] for i in range(len(b)-1)]
    a_bigram_list.sort()
    b_bigram_list.sort()
    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    matches = i = j = 0
    while (i < lena and j < lenb):
        if a_bigram_list[i] == b_bigram_list[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigram_list[i] < b_bigram_list[j]:
            i += 1
        else:
            j += 1
    score = float(matches)/float(lena + lenb)
    return score
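For reference, the same score can be written with multisets of bigrams: it is twice the number of shared bigrams divided by the total number of bigrams in both strings. The sketch below is a minimal illustration, not a drop-in replacement. The helper names bigrams and dice_from_counts are hypothetical, and unlike dice_coefficient it returns 0.0 for two identical single-character strings.

```python
from collections import Counter


def bigrams(s):
    """Multiset of adjacent character pairs in s (hypothetical helper)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_from_counts(ca, cb):
    """Dice score from two bigram multisets (hypothetical helper)."""
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        # Neither string has any bigram (both shorter than 2 characters).
        return 0.0
    # Counter & Counter keeps the minimum count of each shared bigram,
    # which matches what the sorted merge loop counts.
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / total


# "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4)
print(dice_from_counts(bigrams("night"), bigrams("nacht")))  # 0.25
```

Splitting the bigram extraction from the scoring means each string's bigrams only need to be computed once, however many comparisons it takes part in.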
What I actually need is to find the best match out of a large list of candidates, and I want to use a list comprehension, map, or a vectorized call over a whole series of strings to make this computationally efficient. So far I am having difficulty getting the run time into a reasonable range even for medium-sized series (10K-100K elements).
I want to pass two input series through the function and, for each entry in dflist1, get the best possible match from a second series, dflist2. Ideally, but not necessarily, the result would be stored as new columns in the dflist1 dataframe holding the best match and its score. I have an implementation of this working (below), but it is incredibly slow. Is it also possible to parallelize this? I think this would be a hugely valuable problem to solve, as it would perform the same function that reconcile csv currently does.
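import time
import pandas as pd

dflist1 = pd.read_csv('\\list1.csv', header=0, encoding="ISO-8859-1")
# Note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there.
dflist2 = pd.read_csv('\\list2.csv', header=0, error_bad_lines=False)
dflist1['Best Match'] = 'NA'
dflist1['Best Score'] = 0.0
# Pull the target strings out once instead of indexing the dataframe in the inner loop.
targets = dflist2['TargetList'].tolist()
start = time.time()
for index, row in dflist1.iterrows():
    # Score this master string against every target string.
    d = [dice_coefficient(row['MasterList'], target) for target in targets]
    best = max(d)
    # .at writes to the dataframe itself and avoids chained assignment.
    dflist1.at[index, 'Best Match'] = targets[d.index(best)]
    dflist1.at[index, 'Best Score'] = best
    print('Finished '+str(index)+' out of '+str(len(dflist1.index))+' matches after '+str(round(time.time() - start))+' seconds.')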
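One direction that might help, sketched here under assumptions rather than measured: compute the bigram multiset of every target string once, then spread the master strings over several processes. The names bigram_counts, best_match, best_matches and the workers parameter are hypothetical, and the sample strings are made up. The total work is still one comparison per master/target pair, so this only reduces the constant factor. Unlike the loop above, it returns None as the match when every score is 0.0.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def bigram_counts(s):
    """Multiset of bigrams; non-strings (e.g. NaN) give an empty multiset."""
    if not isinstance(s, str):
        return Counter()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def best_match(master, targets, target_counts):
    """Return (best_target, best_score) for one master string."""
    mc = bigram_counts(master)
    m_total = sum(mc.values())
    best_target, best_score = None, 0.0
    for target, (tc, t_total) in zip(targets, target_counts):
        if m_total + t_total == 0:
            continue  # neither string has any bigram
        score = 2.0 * sum((mc & tc).values()) / (m_total + t_total)
        if score > best_score:  # strict '>' keeps the first best target
            best_target, best_score = target, score
    return best_target, best_score


def best_matches(masters, targets, workers=1):
    """Best target for every master; workers > 1 uses separate processes."""
    # Precompute each target's bigram multiset and size exactly once.
    target_counts = [(c, sum(c.values())) for c in map(bigram_counts, targets)]
    worker = partial(best_match, targets=targets, target_counts=target_counts)
    if workers == 1:
        return [worker(m) for m in masters]
    # The partial is pickled per chunk, so a large chunksize keeps overhead down.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, masters, chunksize=256))


if __name__ == "__main__":
    # Made-up sample data; the guard is needed for multiprocessing on Windows.
    print(best_matches(["night", "apple"], ["nacht", "nigh", "apples"]))
```

With the dataframes above, the inputs would presumably be dflist1['MasterList'].tolist() and dflist2['TargetList'].tolist(), and the returned pairs could be written back to the 'Best Match' and 'Best Score' columns in one step.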
Any help would be appreciated very much!