Fastest way to replace a value in a pandas DataFrame?
I am loading 1.5 million images across 80,000 classes into a Keras generator (or at least I will have to when I eventually train), and I am using a pandas DataFrame to do so. The problem is that, with so many images, my code takes a long time to run. The specific bottleneck is replacing a value in the DataFrame, which takes far too long:
# df = a pandas DataFrame with all the names of the files in it
# Code to change the names into absolute paths so Keras can load the entire df
for index, row in df.iterrows():
    j = row.values[0]  # 720 ns
    path = my path + specific values + .jpg  # 1290 ns (pseudocode for the string concatenation)
    df['id'].replace(to_replace=[j], value=path, inplace=True)  # 281000000 ns
My issue is clearly with the last line, hence the title of the question. I managed to speed it up by a factor of about 4 by replacing that line with the one below, but it is still too slow:
    df.loc[index, 'id'] = path  # 69100000 ns
For reference, with 1.5 million entries in the DataFrame, the two versions will take:
# FORMULA
(time per row in ns * no. of rows) / 1000000000 = no. of seconds
no. of seconds / 3600 = no. of hours
# replace() version
(281000000 * 1500000) / 1000000000 = 421,500 seconds
421500 / 3600 ≈ 117 hours
# df.loc version
(69100000 * 1500000) / 1000000000 = 103,650 seconds
103650 / 3600 ≈ 28.8 hours
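For anyone who wants to check the arithmetic, here is a minimal sketch of the same estimate in Python. The function name `estimate_runtime` is just an illustrative helper, and the per-row timings are the ones measured above.

```python
def estimate_runtime(ns_per_row, n_rows):
    """Return (seconds, hours) for a loop that spends ns_per_row on each row."""
    seconds = ns_per_row * n_rows / 1_000_000_000  # nanoseconds -> seconds
    hours = seconds / 3600                         # seconds -> hours
    return seconds, hours

N_ROWS = 1_500_000

# Timings measured in the loop above, in nanoseconds per row.
replace_seconds, replace_hours = estimate_runtime(281_000_000, N_ROWS)
loc_seconds, loc_hours = estimate_runtime(69_100_000, N_ROWS)

print(f"replace(): {replace_seconds:,.0f} s = {replace_hours:.1f} h")
print(f"df.loc:    {loc_seconds:,.0f} s = {loc_hours:.1f} h")
```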
As you can see, that is a great improvement, but it is still too long, and I haven't even begun training yet. Does anyone know a faster way to do this?
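To make the goal concrete, here is a small sketch of the same transformation written as a single column-wide operation instead of a row-by-row loop. I have not timed it on the full 1.5 million rows, so treat it as an assumption rather than a measured fix. `base_dir` and the sample file names are hypothetical placeholders, since my real path pieces are not shown here.

```python
import pandas as pd

# Hypothetical stand-ins for the real directory and file names.
base_dir = "/data/images/"
df = pd.DataFrame({"id": ["img_0001", "img_0002", "img_0003"]})

# Build every absolute path at once: pandas applies the string
# concatenation to the whole column, so no Python-level loop is needed.
df["id"] = base_dir + df["id"].astype(str) + ".jpg"

print(df["id"].tolist())
```

The idea is that each call to replace() inside the loop has to scan the whole column to find one value, so the loop does work proportional to the square of the row count, whereas the column-wide version touches each row once.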
Additionally, since this is my first project involving big data, can anyone offer tips on how to deal with so many images?
Many thanks,
Finn Williams