How to improve regex performance while fetching records dynamically

The data is as follows:

COL1    COL2

 12    :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh
 57    :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh

I am creating a third column field_info like:

 COL1  COL2                                                  field_info

 12    :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh  fgh,ghg
 57    :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh  hghgh,g

I am using a regex function as follows:

df.loc[:, 'field_info'] = df.col2.replace(regex=r'.*' + df.col1.astype('str') + ':(.{15}).*', value='\\1')

I have two columns, col1 and col2. col1 holds a value that I search for in col2 dynamically, extracting the 15 characters that follow it. However, this is taking a lot of time. Can anyone suggest a faster way of doing it?
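For a single row, the dynamic pattern being built is equivalent to something like the following (the variable names here are just for illustration, using the first sample row):

```python
import re

col1_value = 12  # hypothetical single-row values taken from the sample data
col2_value = ':402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh'

# the dynamic pattern: the col1 value, a colon, then capture the next 15 characters
match = re.search(str(col1_value) + r':(.{15})', col2_value)
print(match.group(1))  # fgh,ghgh,:22:hh
```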



def extract_substring(string, num):
    # find the value followed by ':' and return the next 15 characters
    needle = str(num) + ':'
    starting_point = string.find(needle) + len(needle)
    return string[starting_point:starting_point + 15]

df['field_info'] = df.apply(lambda row: extract_substring(row['col2'], row['col1']), axis=1)

%timeit df.loc[:,'field_data'] = df.col2.replace(regex=r'.*' + df.col1.astype('str') + ':(.{15}).*', value='\\1')

This should work as well, and it doesn't use a regex.
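A minimal self-contained sketch of the same idea, assuming the columns are named col1/col2 as in the question (searching for the value followed by ':' so that, say, 120 is not mistaken for 12):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [12, 57],
    'col2': [':402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh',
             ':42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh'],
})

# build the needle "<col1>:" per row, find it, and slice the next 15 characters
df['field_info'] = [
    c2[c2.find(f'{c1}:') + len(f'{c1}:'):][:15]
    for c1, c2 in zip(df.col1, df.col2)
]
print(df.field_info.tolist())  # ['fgh,ghgh,:22:hh', 'hghgh,gruru,:12']
```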


Based on your sample data, I replicated it 50,000 times, and the result is as follows -

>>> df = pd.DataFrame({'COL1': [12, 57],
...                    'COL2': [':402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh',
...                             ':42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh']})

>>> # DataFrame.append was removed in pandas 2.0; concatenating copies is faster anyway
>>> df = pd.concat([df] * 50001, ignore_index=True)

>>> df.shape 
(100002, 2)

Then I defined a custom function and applied it to the columns -

>>> def somefunc(x, y):
...     res = []
...     # walk both columns together; find the key, skip its trailing ':',
...     # and keep the next 7 characters
...     for key, text in zip(x, y):
...         ix = text.find(key) + len(key)
...         res.append(text[ix + 1:ix + 8])
...     return res

>>> df['col3'] = somefunc(df['COL1'].astype(str),df['COL2'])
>>> df.head()
    COL1                                   COL2             col3 
0    12  :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:2...  fgh,ghg
1    57  :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:...  hghgh,g
2    12  :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:2...  fgh,ghg
3    57  :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:...  hghgh,g

This avoids regex entirely, and the function took nearly 5 seconds to complete on 100,000 rows.
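If COL1 takes only a few distinct values, another option worth trying (a sketch of my own, not benchmarked here) is to run one vectorised str.extract per group instead of one regex per row:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'COL1': [12, 57],
    'COL2': [':402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh',
             ':42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh'],
})

# one vectorised extract per distinct COL1 value instead of one pattern per row
for value, group in df.groupby('COL1'):
    pattern = re.escape(str(value)) + r':(.{7})'
    df.loc[group.index, 'col3'] = group['COL2'].str.extract(pattern, expand=False)

print(df.col3.tolist())  # ['fgh,ghg', 'hghgh,g']
```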
