Applying a matching function for string and substring with missing values on a python dataframe

Question

Carola

May 31, 2022 02:07

I have programmed the following functionality:

The function returns True when the two strings match position by position, with * treated as a wildcard, and False when they differ in at least one character or have different lengths.

def matching(row1, row2):
    # Compare the 'number' fields position by position; '*' matches any character.
    string = row1['number']
    sub_string = row2['number']
    flag = True
    i = 0
    if len(string) == len(sub_string):
        while i < len(string) and flag:
            if string[i] != '*' and sub_string[i] != '*':
                if string[i] != sub_string[i]:
                    flag = False
            i += 1
    else:
        # Strings of different lengths never match.
        flag = False

    return flag

Assume I have a dataframe with the column 'number'. I want to apply this function to the dataframe in order to obtain the following result:

| number | unique_id |
| ------ | --------- |
| 178*A8 |     0     |
| 13**B4 |     1     |
| 17***8 |     0     |
| 82819B |     2     |
| 13**B4 |     1     |

I managed to create the unique_id column with the following code, but it only groups numbers that are exactly equal. I would like to get the same behaviour using the matching function above instead of strict equality.

df['unique_id'] = pd.factorize(df['number'])[0]

| number | unique_id |
| ------ | --------- |
| 178*A8 |     0     |
| 13**B4 |     1     |
| 17***8 |     2     |
| 82819B |     3     |
| 13**B4 |     1     |

Edited: We will assume that matching will be done on a first-come, first-served basis. If the first value is 123*, all numbers matching('123*',X) == True will be assigned the same id.

Topic pandas python

Category Data Science

Erwan · Accepted Answer · March 7, 2022 16:05

This corresponds to a deduplication or record linkage problem.

There are various ways to compare records (numbers in your case), but the main cost almost always comes from the double loop: in the general problem, every possible pair of records must be compared. If there are too many numbers for the double loop, you could implement the blocking technique described here.
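As a rough illustration of blocking (not necessarily the exact technique linked above), one could use the string length as the blocking key, since the matching function in the question never matches strings of different lengths. Only pairs inside the same block are then compared. The names `wildcard_match` and `candidate_pairs` are made up for this sketch, and `wildcard_match` works on plain strings rather than dataframe rows.

```python
from collections import defaultdict
from itertools import combinations

def wildcard_match(a, b):
    """Position-by-position comparison where '*' matches any character."""
    if len(a) != len(b):
        return False
    return all(x == y or x == '*' or y == '*' for x, y in zip(a, b))

def candidate_pairs(values):
    """Yield index pairs that fall in the same block (here: same length)."""
    blocks = defaultdict(list)
    for idx, value in enumerate(values):
        blocks[len(value)].append(idx)
    for indices in blocks.values():
        yield from combinations(indices, 2)

# Hypothetical input: the numbers from the question plus one shorter value.
numbers = ['178*A8', '13**B4', '17***8', '82819B', '13**B4', '12*']

# Only pairs within a block are passed to the (expensive) comparison.
matches = [(i, j) for i, j in candidate_pairs(numbers)
           if wildcard_match(numbers[i], numbers[j])]
print(matches)  # [(0, 2), (1, 4)]
```

Length alone is a weak key when all numbers have the same length, as in the question. It only shows the principle, and a stronger key would have to be chosen so that wildcards cannot put two matching numbers into different blocks.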

Your design may have an additional issue: your matching method is not transitive, i.e. you can have cases where $a$ matches $b$ and $b$ matches $c$, but $a$ doesn't match $c$. Apparently you plan to solve this by picking the first match. This might not be optimal if the goal is to match as many numbers as possible.
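A small check makes the non-transitivity concrete. This is a minimal sketch: `wildcard_match` is a hypothetical string-based version of the question's matching function, and the three values are made up for illustration.

```python
def wildcard_match(a, b):
    """Position-by-position comparison where '*' matches any character."""
    if len(a) != len(b):
        return False
    return all(x == y or x == '*' or y == '*' for x, y in zip(a, b))

a, b, c = '123', '1*3', '143'

print(wildcard_match(a, b))  # True: the wildcard in b absorbs the '2'
print(wildcard_match(b, c))  # True: the wildcard in b absorbs the '4'
print(wildcard_match(a, c))  # False: '2' and '4' differ
```

Depending on the order in which these three values appear, a first-match rule can group them differently.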

I'm no expert with pandas, but I doubt that there is any predefined function which does what you need. factorize relies on strict equality, which is much simpler because it can collect all the unique values in one pass.
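One way to write the first-come, first-served rule from the question by hand is sketched below. It is an assumption about the intended behaviour, not a pandas feature: each value receives the id of the first earlier representative it matches, and otherwise becomes a new representative. The names `wildcard_match` and `assign_ids` are hypothetical.

```python
def wildcard_match(a, b):
    """Position-by-position comparison where '*' matches any character."""
    if len(a) != len(b):
        return False
    return all(x == y or x == '*' or y == '*' for x, y in zip(a, b))

def assign_ids(values):
    """Assign ids on a first-come, first-served basis."""
    representatives = []  # first value seen for each id
    ids = []
    for value in values:
        for uid, rep in enumerate(representatives):
            if wildcard_match(rep, value):
                ids.append(uid)
                break
        else:
            # No earlier representative matched: open a new id.
            representatives.append(value)
            ids.append(len(representatives) - 1)
    return ids

numbers = ['178*A8', '13**B4', '17***8', '82819B', '13**B4']
print(assign_ids(numbers))  # [0, 1, 0, 2, 1]
```

With a dataframe, the result could be stored with `df['unique_id'] = assign_ids(df['number'])`. The loop compares each value against every representative found so far, so it is still quadratic in the worst case.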
