You need to apply one-hot-encoding before you split your data. Otherwise you will run into problems if there is a categorical attribute whose values are not all present in both the train and the test data.
It is a bit of guessing since I do not know what your data looks like but it might be what happened in your case. Here is a simple example. Suppose you have the following data sets obtained from your split before one-hot-encoding:
Train data:
attribute_1
1 a
2 b
Test data:
attribute_1
1 a
2 b
3 c
If you apply one-hot-encoding to these data sets separately you will end up with the following:
Train data:
attribute_1_a attribute_1_b
1 1 0
2 0 1
Test data:
attribute_1_a attribute_1_b attribute_1_c
1 1 0 0
2 0 1 0
3 0 0 1
As you can see, the columns of your train and test data do not match anymore. This can be solved by one-hot-encoding before splitting into train and test data.
As for data leakage, I do not see any problem with one-hot-encoding before the split: the encoded values of a row depend only on that row's own category, not on the other rows.
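The same effect can be reproduced in a few lines of pandas. This is only a minimal sketch: the toy dataframe, the positional split with `iloc`, and all variable names are assumptions made for illustration, and your real split will most likely be random.

```python
import pandas as pd

# Hypothetical toy data: the category "c" occurs only once.
df = pd.DataFrame({"attribute_1": ["a", "b", "a", "b", "c"]})

# Illustrative positional split: rows 0-1 for training, rows 2-4 for testing.
train_raw, test_raw = df.iloc[:2], df.iloc[2:]

# Encoding after the split: each part only sees its own categories.
train_sep = pd.get_dummies(train_raw, columns=["attribute_1"])
test_sep = pd.get_dummies(test_raw, columns=["attribute_1"])
columns_match_separate = list(train_sep.columns) == list(test_sep.columns)

# Encoding before the split: both parts share the full set of columns.
encoded = pd.get_dummies(df, columns=["attribute_1"])
train_enc, test_enc = encoded.iloc[:2], encoded.iloc[2:]
columns_match_joint = list(train_enc.columns) == list(test_enc.columns)

print(columns_match_separate)  # False
print(columns_match_joint)     # True
```

Here the train part never sees "c", so encoding it separately leaves out `attribute_1_c`, while encoding the full dataframe first gives both parts identical columns.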
EDIT (based on your comment)
Alternatively, e.g. if you have missing data which you want to impute before one-hot-encoding, you can split the data first and then "manually" make sure that both datasets have the same attributes.
For example like this:
import pandas as pd

# create example dataframes (already one-hot-encoded separately)
df_train = pd.DataFrame({
    "attribute_1_a": [1, 0],
    "attribute_1_b": [0, 1]
})
df_test = pd.DataFrame({
    "attribute_1_a": [1, 0, 0],
    "attribute_1_b": [0, 1, 0],
    "attribute_1_c": [0, 0, 1]
})
# add missing columns to test dataset with all values being 0
for col in df_train.columns:
    if col not in df_test.columns:
        df_test[col] = 0
# add missing columns to train dataset with all values being 0
for col in df_test.columns:
    if col not in df_train.columns:
        df_train[col] = 0
# use the same column order for the test set as for train
df_test = df_test.reindex(df_train.columns, axis=1)
Now the dataframes will look like this and have the same attributes:
In: df_train
Out:
attribute_1_a attribute_1_b attribute_1_c
0 1 0 0
1 0 1 0
In: df_test
Out:
attribute_1_a attribute_1_b attribute_1_c
0 1 0 0
1 0 1 0
2 0 0 1
However, check your datasets after this manipulation to make sure it went through properly and you do not have any inconsistencies!
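If you prefer to avoid the explicit loops, the same alignment can probably be written more compactly with `DataFrame.reindex` and its `fill_value` argument. This is just a sketch on the example dataframes from above; the name `all_columns` is my own, and the final assertions are one possible way to do the consistency check.

```python
import pandas as pd

# Same example dataframes as above (already one-hot-encoded separately).
df_train = pd.DataFrame({
    "attribute_1_a": [1, 0],
    "attribute_1_b": [0, 1],
})
df_test = pd.DataFrame({
    "attribute_1_a": [1, 0, 0],
    "attribute_1_b": [0, 1, 0],
    "attribute_1_c": [0, 0, 1],
})

# All columns of train, followed by those that only exist in test.
all_columns = list(df_train.columns) + [
    col for col in df_test.columns if col not in df_train.columns
]

# Reindex both dataframes to the same columns; missing ones are filled with 0.
df_train = df_train.reindex(columns=all_columns, fill_value=0)
df_test = df_test.reindex(columns=all_columns, fill_value=0)

# Consistency check: identical columns in identical order, no missing values.
assert list(df_train.columns) == list(df_test.columns)
assert not df_train.isna().any().any()
assert not df_test.isna().any().any()
```

With `fill_value=0` the newly created columns contain zeros instead of NaN, which matches what the loops above produce.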