How can I split data in R using dplyr so that all rows of the same group end up in the same split?
I suspect there is data leakage in my current pipeline: the same person, with slightly different values, appears in both the training and testing sets. As a result, my model is overfitting.
For example, my data looks like this:
PID       Var_1  Var_2
Person A  0      1
Person B  0      1
Person C  0      0
Person A  1      3
Person B  1      2
Person D  0      1
Person C  0      1
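For reproducibility, the sample above can be built as a data frame roughly like this. The column types (character for PID, numeric for the two variables) and the name `df` are assumptions made for illustration, not something fixed by my real data.

```r
# Sample data from the table above; column types are assumed
df <- data.frame(
  PID   = c("Person A", "Person B", "Person C", "Person A",
            "Person B", "Person D", "Person C"),
  Var_1 = c(0, 0, 0, 1, 1, 0, 0),
  Var_2 = c(1, 1, 0, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Number of rows per person, to confirm that some people repeat
table(df$PID)
```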
I want to split this data so that all rows for a given person fall in either the training set or the testing set, never both. That is, I want the split to look like this:
Training:
PID       Var_1  Var_2
Person A  0      1
Person B  0      1
Person A  1      3
Person B  1      2
Testing:
PID       Var_1  Var_2
Person C  0      0
Person D  0      1
Person C  0      1
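A minimal sketch of the kind of split I am after, assuming the sampling is done on distinct PID values rather than on rows. The 50/50 proportion, the seed, and the names `df`, `train_ids`, `train`, and `test` are arbitrary choices for illustration. It relies on `slice_sample()`, which needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample data from the question; column types are assumed
df <- data.frame(
  PID   = c("Person A", "Person B", "Person C", "Person A",
            "Person B", "Person D", "Person C"),
  Var_1 = c(0, 0, 0, 1, 1, 0, 0),
  Var_2 = c(1, 1, 0, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only to make the split repeatable

# Draw people, not rows: one row per PID, then sample half of them
train_ids <- df %>%
  distinct(PID) %>%
  slice_sample(prop = 0.5) %>%
  pull(PID)

# Every row of a sampled person goes to training, the rest to testing
train <- df %>% filter(PID %in% train_ids)
test  <- df %>% filter(!PID %in% train_ids)

# No person should appear in both sets
intersect(train$PID, test$PID)
```

Which people land in training depends on the seed, so the result will not necessarily match the Person A/B versus Person C/D split shown above. Because the proportion applies to people rather than rows, the row counts of the two sets can also be uneven when some people have more rows than others.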
Topic dplyr data-cleaning r
Category Data Science