How to combine and separate test and train data for data cleaning?

Question


Ishan Dutta

May 2, 2022, 13:28

I am working on an ML model for which I have been given the data in two files, test.csv and train.csv. I want to perform data cleaning on both files together by concatenating them and then separating them again.

I know how to concatenate two dataframes, but after data cleaning how do I separate them back into the two sets? Please help me complete the code.

CODE

import pandas as pd

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

df = pd.concat([test, train])

# Data cleaning steps

# Separate them back into train and test sets to provide input to the model

Topic dataframe python-3.x pandas dataset python

Category Data Science

amaresh hiremani · Accepted Answer · May 2, 2022, 13:28

Before concatenating the test and train data, add a new column called type to each of them. After preprocessing, separate the rows again based on that column. Here is some sample code.

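import pandas as pd

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

# Mark where each row came from
test['type'] = "test"
train['type'] = "train"

df = pd.concat([test, train])

# Placeholder for your own cleaning steps (assumed to modify df in place)
preprocess(df)

# Split on 'type' first, then drop the helper column from each part
train = df[df['type'] == "train"].drop(columns=['type'])
test = df[df['type'] == "test"].drop(columns=['type'])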

Erfan · Accepted Answer · September 14, 2020, 10:24

Add an indicator column while concatenating the two dataframes, so you can separate them again later:

df = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

Then later you can split them again:

test, train = df[df["ind"].eq("test")], df[df["ind"].eq("train")]
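As a rough end-to-end sketch of the indicator-column round trip, the snippet below uses two tiny made-up frames in place of test.csv and train.csv. The column names (age, city) and the cleaning steps are assumptions chosen only for illustration; the ind column follows the answer above.

```python
import pandas as pd

# Made-up stand-ins for test.csv and train.csv.
test = pd.DataFrame({"age": [25.0, None], "city": [" paris", "Rome "]})
train = pd.DataFrame({"age": [31.0, 40.0, None], "city": ["rome", " Paris ", "Oslo"]})

# Tag each row with its origin, then stack the two frames.
df = pd.concat([test.assign(ind="test"), train.assign(ind="train")], ignore_index=True)

# Illustrative cleaning steps applied once to the combined frame.
df["city"] = df["city"].str.strip().str.title()
df["age"] = df["age"].fillna(df["age"].median())

# Split on the indicator, then drop it so it is not passed to the model.
test_clean = df[df["ind"].eq("test")].drop(columns="ind").reset_index(drop=True)
train_clean = df[df["ind"].eq("train")].drop(columns="ind").reset_index(drop=True)
```

Dropping ind only after the split matters: once the column is gone, there is nothing left to filter on.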

Amar nayak · Accepted Answer · September 13, 2020, 05:38

Method 1: Write a function that performs your set of data cleaning operations, then pass the train set, the test set, or anything else you want to clean through that function. The result will be consistent.

Method 2: If you want to concatenate, one way is to add a column (here called type) set to "test" for the test dataset and "train" for the train dataset. Perform your operations, then filter on that column to divide the data into two dataframes again:

data[data['type']=="test"]
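Method 1 might look something like the sketch below. The function name clean, the sample columns, and the individual steps are hypothetical; the point is only that the same function is applied to both frames without concatenating them.

```python
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning routine; swap in your own steps."""
    out = frame.copy()  # leave the caller's frame untouched
    out.columns = out.columns.str.strip().str.lower()  # tidy column names
    out = out.drop_duplicates()  # remove exact duplicate rows
    out["name"] = out["name"].str.strip()  # trim whitespace in a text column
    return out.reset_index(drop=True)

# Made-up stand-ins for train.csv and test.csv.
train = pd.DataFrame({" Name": [" ann", " ann", "bob "], "Score ": [1, 1, 2]})
test = pd.DataFrame({" Name": ["cy "], "Score ": [3]})

# The same steps run on each set, so no splitting is needed afterwards.
train_clean = clean(train)
test_clean = clean(test)
```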

fswings · Accepted Answer · September 13, 2020, 00:23

There are several methods to choose from. If you insist on concatenating the two dataframes, then first add a new column to each DataFrame called source. Make the value for test.csv 'test' and likewise for the training set.

When you have finished cleaning the combined df, use the source column to split the data again.

An alternative method is to record all the operations you perform on the training set and simply repeat them for the test set. This won't work if you normalise values based on the population, because each set would then be normalised with its own statistics.
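The caveat about population-based normalisation can be seen in a small sketch. The frames, the column x, and the helper centre are made up for illustration; mean-centring stands in for any step that depends on statistics of the whole population.

```python
import pandas as pd

# Made-up stand-ins for train.csv and test.csv.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [10.0, 20.0]})

def centre(frame: pd.DataFrame) -> pd.Series:
    # Subtracts the mean of whichever frame it is given.
    return frame["x"] - frame["x"].mean()

# Repeating the step on the test set alone uses the test set's own mean (15).
separate = centre(test)

# On the combined data, one shared mean (7.2) is used for every row.
combined = pd.concat([train, test], ignore_index=True)
together = centre(combined).iloc[len(train):].reset_index(drop=True)

# The same test rows end up with different values under the two approaches.
print(separate.tolist())
print(together.round(1).tolist())
```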
