Applying a matching function for string and substring with missing values on a python dataframe

I have programmed the following functionality:

The function returns True when the two strings have the same length and match character by character, treating * as a wildcard, and False when they differ in at least one character.

def matching(row1, row2):
    string = row1['number']
    sub_string = row2['number']
    flag = True
    i = 0
    if len(string) == len(sub_string):
        while i < len(string) and flag:
            # '*' is a wildcard: only compare positions where neither string has it
            if string[i] != '*' and sub_string[i] != '*':
                if string[i] != sub_string[i]:
                    flag = False
            i += 1
    else:
        # strings of different lengths never match
        flag = False

    return flag

Assume I have a dataframe with a column 'number'. I want to apply this function to the dataframe in order to obtain the following result:

| number | unique_id |
| ------ | --------- |
| 178*A8 |     0     |
| 13**B4 |     1     |
| 17***8 |     0     |
| 82819B |     2     |
| 13**B4 |     1     |

I managed to compute unique_id with the following code, but it only groups numbers that are exactly equal. I would like to do the same thing, but grouping with the matching function above instead.

df['unique_id'] = pd.factorize(df['number'])[0]

| number | unique_id |
| ------ | --------- |
| 178*A8 |     0     |
| 13**B4 |     1     |
| 17***8 |     2     |
| 82819B |     3     |
| 13**B4 |     1     |

Edited: We will assume that matching will be done on a first-come, first-served basis. If the first value is 123*, all numbers matching('123*',X) == True will be assigned the same id.



This corresponds to a deduplication or record linkage problem.

There are various ways to compare records (numbers in your case), but the main cost almost always comes from the double loop: in the general problem, every possible pair of records must be compared. If there are too many numbers for the double loop, you could implement the blocking technique described here.
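As a rough illustration of blocking, here is a minimal sketch that assumes string length as the blocking key: since the matching rule in the question rejects strings of different lengths, only numbers of the same length ever need to be compared. The names `wildcard_match` and `candidate_pairs` are hypothetical helpers, and `wildcard_match` is a compact rewrite of the question's `matching` that works on plain strings rather than rows.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_match(a, b):
    """True if a and b have equal length and agree wherever neither has '*'."""
    if len(a) != len(b):
        return False
    return all(x == y or "*" in (x, y) for x, y in zip(a, b))


def candidate_pairs(numbers):
    """Yield index pairs (i, j) only for numbers sharing a block (same length)."""
    blocks = defaultdict(list)
    for idx, value in enumerate(numbers):
        blocks[len(value)].append(idx)
    for indices in blocks.values():
        yield from combinations(indices, 2)


numbers = ["178*A8", "13**B4", "17***8", "82819B", "13**B4", "12*", "1**"]
pairs = list(candidate_pairs(numbers))
matches = [(i, j) for i, j in pairs if wildcard_match(numbers[i], numbers[j])]
print(len(pairs), matches)
```

With these seven values the blocked loop checks 11 pairs instead of the 21 pairs of the full double loop. Length is only one possible key; how much it helps depends on how varied the lengths in the real data are.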

Your design may have an additional issue: your matching method is not transitive, i.e. you can have cases where $a$ matches $b$ and $b$ matches $c$, but $a$ doesn't match $c$. Apparently you plan to solve this by picking the first match. This might not be optimal if the goal is to match as many numbers as possible.
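A small example of the non-transitivity, using made-up two-character values and a hypothetical `wildcard_match` helper that mirrors the question's `matching` on plain strings:

```python
def wildcard_match(a, b):
    """True if a and b have equal length and agree wherever neither has '*'."""
    if len(a) != len(b):
        return False
    return all(x == y or "*" in (x, y) for x, y in zip(a, b))


a, b, c = "12", "1*", "13"
a_b = wildcard_match(a, b)  # '2' vs '*' is a wildcard position
b_c = wildcard_match(b, c)  # '*' vs '3' is a wildcard position
a_c = wildcard_match(a, c)  # '2' vs '3' differ
print(a_b, b_c, a_c)  # True True False
```

One consequence is that the grouping depends on row order. If each new value is compared with the first value of each group, the order "1*", "12", "13" puts all three in one group, while "12", "1*", "13" leaves "13" in a group of its own.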

I'm no expert in pandas, but I doubt there is any predefined function which does what you need. factorize relies on strict equality, which is much simpler because it can collect all the unique values in one pass.
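Without a built-in, one option is a plain loop. The following is a minimal sketch of the first-come, first-served rule from the question's edit, assuming each new value is compared against the first value seen in every existing group. `wildcard_match` and `assign_ids` are hypothetical names, not pandas functions.

```python
def wildcard_match(a, b):
    """True if a and b have equal length and agree wherever neither has '*'."""
    if len(a) != len(b):
        return False
    return all(x == y or "*" in (x, y) for x, y in zip(a, b))


def assign_ids(numbers):
    """Give each value the id of the first group representative it matches."""
    representatives = []  # first value seen for each group; list index = id
    ids = []
    for value in numbers:
        for group_id, rep in enumerate(representatives):
            if wildcard_match(rep, value):
                ids.append(group_id)
                break
        else:
            # no existing group matched: this value starts a new group
            representatives.append(value)
            ids.append(len(representatives) - 1)
    return ids


numbers = ["178*A8", "13**B4", "17***8", "82819B", "13**B4"]
ids = assign_ids(numbers)
print(ids)  # [0, 1, 0, 2, 1]
```

On a dataframe this could be applied as `df['unique_id'] = assign_ids(df['number'])`. The loop is still quadratic in the worst case (every value against every representative), so for large inputs it would need to be combined with some form of blocking.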
