How to split data in R using dplyr so that rows of the same group stay in the same split?

Question

Dee

June 17, 2020, 17:21

I suspect there is data leakage in my current pipeline: the same person, with slightly different values, appears in both the training and testing sets. As a result, my model is overfitting.

For example, my data looks like this:

PID       Var_1   Var_2
Person A     0      1
Person B     0      1
Person C     0      0
Person A     1      3
Person B     1      2
Person D     0      1 
Person C     0      1

I want to split this data so that all rows for the same person end up in either the training set or the testing set, never both. That is, I want the split to look like this:

Training:

PID       Var_1   Var_2
Person A     0      1
Person B     0      1
Person A     1      3
Person B     1      2

Testing:

PID       Var_1   Var_2
Person C     0      0
Person D     0      1  
Person C     0      1

Topic dplyr data-cleaning r

Category Data Science

Dee · Accepted Answer · June 17, 2020, 17:21

Figured out an easy way to do this.

First, select just the unique PID values from the full data.
Then sample 75% of these PIDs and save them as the training PIDs; the rest become the testing PIDs.
Finally, match each PID list back to the full data on PID to get the training and testing rows.
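
The steps above could be sketched in dplyr roughly as follows. This is a minimal illustration rather than the exact code used: the object names (df, pids, train_pids, train, test) and the seed are assumptions, the toy data is copied from the question, and slice_sample() requires dplyr 1.0.0 or later. The "match back on PID" step is done here with semi_join() and anti_join(), which keep or drop rows of the full data depending on whether their PID was sampled.

```r
library(dplyr)

# Toy data copied from the question
df <- data.frame(
  PID   = c("Person A", "Person B", "Person C", "Person A",
            "Person B", "Person D", "Person C"),
  Var_1 = c(0, 0, 0, 1, 1, 0, 0),
  Var_2 = c(1, 1, 0, 3, 2, 1, 1)
)

set.seed(42)  # arbitrary seed (assumption), only for reproducibility

# Step 1: one row per person
pids <- df %>% distinct(PID)

# Step 2: sample 75% of the persons (not of the rows) for training
train_pids <- pids %>% slice_sample(prop = 0.75)

# Step 3: match back to the full data on PID
train <- df %>% semi_join(train_pids, by = "PID")  # rows whose PID was sampled
test  <- df %>% anti_join(train_pids, by = "PID")  # all remaining rows
```

Because the sampling is done on persons rather than rows, the row-level proportion of the split will usually differ somewhat from 75%, especially when people have different numbers of rows.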
