Fastest way to replace a value in a pandas DataFrame?

I am loading 1.5 million images across 80,000 classes (or will be, once I eventually train) into a Keras generator, using a pandas DataFrame to hold the file names. With so many images, my code takes a long time to run. The specific bottleneck is replacing a value in the DataFrame, which takes too long:

# df is a pandas DataFrame holding the names of all the files


# Code to change the names into absolute paths so Keras can load the entire df

for index, row in df.iterrows():
   j = row.values[0]    # 720 ns
   path = "my path" + "specific values" + ".jpg"    # 1290 ns; the strings are placeholders

   df['id'].replace(to_replace=[j], value=path, inplace=True)    # 281000000 ns

My issue is clearly with the last line, hence the title of the question. I managed to speed this up by a factor of about 4 with the code below, but it is still too long:

df.loc[index, 'id'] = path    # 69100000 nano seconds

For your interest, with 1.5m entries in the dataframe, this will take:

# FORMULA
(time in ns * no. of rows) / 1000000000 = no. of seconds
no. of seconds / 3600 = no. of hours

(281000000 * 1500000) / 1000000000 = 421,500 seconds
421500 / 3600 ≈ 117 hours

(69100000 * 1500000) / 1000000000 = 103,650 seconds
103650 / 3600 ≈ 28.8 hours

As you can see, that is a great improvement, but still too long, and I haven't even begun training yet. Does anyone know a faster way to do this?

Additionally, since this is my first project concerning big data, can anyone offer me tips about how to deal with so many images?

Many thanks,

Finn Williams

Topic time-complexity computer-vision pandas bigdata

Category Data Science


I think you are trying to change every value in a certain column to an absolute path. For that you can simply use the apply function:

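def full_path(x):
    # x is the current value in the column; "my path" and "specific values"
    # are placeholders for the real directory and whatever is derived from x
    return "my path" + "specific values" + ".jpg"
first_col = df.columns[0]
df[first_col] = df[first_col].apply(full_path)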

If not, I have one more solution.

You can collect the replacement values in a list and then assign that list to the column in one go. The column that row.values[0] reads from is the first column, which can also be addressed as df[df.columns[0]]. Here is a working example:

# df is a pandas DataFrame holding the names of all the files


# Code to change the names into absolute paths so Keras can load the entire df
values = []
for index, row in df.iterrows():
   j = row.values[0]    # 720 ns
   path = "my path" + "specific values" + ".jpg"    # 1290 ns; the strings are placeholders

   values.append(path)
# Assign the whole column once instead of replacing values row by row
df[df.columns[0]] = values
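A vectorized variant of the same idea, which is not part of the answer above: if each path is simply a fixed prefix, the existing file name, and an extension, pandas can build the whole column with one string operation and no Python-level loop. This is only a minimal sketch under assumptions: the column is called 'id' as in the question, while `base_dir` and the sample file names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the real file-name column.
df = pd.DataFrame({"id": ["img_0001", "img_0002", "img_0003"]})

# Hypothetical directory; replace it with the real location of the images.
base_dir = "/data/images/"

# Element-wise string concatenation over the whole column at once.
df["id"] = base_dir + df["id"].astype(str) + ".jpg"

print(df["id"].tolist())
```

If the path depends on more than the file name, the apply or list approaches above are the more general option.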

It seems I have immediately found an answer, so I shall post it here for the sake of others in the same situation.

I found on a website that the following code achieves the same result:

df.at[index, 'id'] = path    # 26300 nanoseconds

Calculations:

# FORMULA
(time in ns * no. of rows) / 1000000000 = no. of seconds
no. of seconds / 3600 = no. of hours

(26300 * 1500000) / 1000000000 = 39.45 seconds
39.45 / 3600 ≈ 0.011 hours
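The same arithmetic can be checked with a few lines of Python. The sketch below just applies the formula above to the three per-row timings quoted in this thread; the helper name `total_seconds` and the dictionary keys are made up for illustration.

```python
ROWS = 1_500_000  # number of rows quoted in the question

# Per-row timings in nanoseconds, as measured in the question and this answer.
timings_ns = {
    "replace": 281_000_000,
    "loc": 69_100_000,
    "at": 26_300,
}


def total_seconds(ns_per_row, rows=ROWS):
    """Convert a per-row cost in nanoseconds to a total in seconds."""
    return ns_per_row * rows / 1_000_000_000


for name, ns in timings_ns.items():
    seconds = total_seconds(ns)
    print(f"{name}: {seconds:,.2f} s = {seconds / 3600:.3f} h")
```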

I went off to clean my teeth, and sure enough, by the time I got back my code had finished.

Lessons learnt:

For others whose code is taking a long time to run, I recommend looking for similar functions and working out how long each one takes using the timeit module. It saved me 117 hours of compute time.
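As a rough illustration of that kind of measurement, here is a minimal sketch that times .loc against .at on a small frame with timeit. The frame size, the row label 42, and the replacement path are all hypothetical, and the numbers it prints will differ from the ones quoted above depending on the machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical small frame; the real one has about 1.5 million rows.
df = pd.DataFrame({"id": [f"img_{i:05d}" for i in range(10_000)]})
new_path = "/data/images/img_00042.jpg"  # hypothetical replacement value


def with_loc():
    df.loc[42, "id"] = new_path


def with_at():
    df.at[42, "id"] = new_path


# timeit.timeit returns the total time in seconds for `number` calls.
n = 1_000
for func in (with_loc, with_at):
    per_call_ns = timeit.timeit(func, number=n) / n * 1e9
    print(f"{func.__name__}: {per_call_ns:,.0f} ns per call")
```

The Series.replace call from the question can be timed the same way by wrapping it in its own function.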
