How to split data in R using dplyr so that all rows of the same group end up in the same split?

I suspect there is data leakage in my current pipeline: the same person, with slightly different values, appears in both the training and the testing set. As a result, my model is overfitting.

For example, my data looks like this:

PID       Var_1   Var_2
Person A     0      1
Person B     0      1
Person C     0      0
Person A     1      3
Person B     1      2
Person D     0      1 
Person C     0      1    

I want to split this data so that all rows of the same person are in either the training set or the testing set, never both. That is, I want the split to look like this:

Training:

PID       Var_1   Var_2
Person A     0      1
Person B     0      1
Person A     1      3
Person B     1      2    

Testing:

PID       Var_1   Var_2
Person C     0      0
Person D     0      1  
Person C     0      1   



Figured out an easy way to do this.

  1. First, select only the unique PID values from the full data.
  2. Then sample 75% of these PIDs and save them as the training PIDs; the remaining PIDs are the testing PIDs.
  3. Finally, match each PID list back to the full data on PID to get the training rows and the testing rows.
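
The steps above can be sketched roughly as follows. This is only a minimal illustration, not necessarily the exact code the answer had in mind: the data frame name `df`, the seed value, and the choice of `slice_sample()`, `semi_join()` and `anti_join()` (which assume dplyr 1.0.0 or later) are assumptions; the toy data is copied from the question.

```r
library(dplyr)

# Toy data copied from the question (the name `df` is an assumption)
df <- data.frame(
  PID   = c("Person A", "Person B", "Person C", "Person A",
            "Person B", "Person D", "Person C"),
  Var_1 = c(0, 0, 0, 1, 1, 0, 0),
  Var_2 = c(1, 1, 0, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, used only to make the split reproducible

# Step 1: keep one row per person
pids <- df %>% distinct(PID)

# Step 2: sample 75% of the persons (not 75% of the rows) for training
train_pids <- pids %>% slice_sample(prop = 0.75)

# Step 3: match back to the full data on PID
train <- df %>% semi_join(train_pids, by = "PID")  # rows whose PID was sampled
test  <- df %>% anti_join(train_pids, by = "PID")  # all remaining rows
```

Because the sampling is done on persons rather than rows, the row-level proportions of `train` and `test` will only be approximately 75/25 when people have different numbers of rows.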
