How to combine and separate test and train data for data cleaning?

I am working on an ML model for which I have been given the data in 2 files, test.csv and train.csv. I want to perform data cleaning on both files together by concatenating them and then separating them again.

I know how to concatenate 2 dataframes, but after data cleaning how will I separate the two files? Please help me complete the code.

CODE

import pandas as pd

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

df = pd.concat([test, train])

# Data cleaning steps

# Separate them back into train and test sets to provide input to the model



Before concatenating the test and train data, add a new column called type to each. After preprocessing, separate them again based on that column. Here is some sample code.

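test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

test['type'] = "test"
train['type'] = "train"

df = pd.concat([test, train])

# preprocess stands in for your own cleaning steps (assumed to modify df in place)
preprocess(df)

# Split on 'type' first, then drop the helper column;
# dropping it before filtering would raise a KeyError
train = df[df['type'] == "train"].drop(columns='type')

test = df[df['type'] == "test"].drop(columns='type')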

Add an indicator column while concatenating the two dataframes, so you can separate them again later:

df = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

Then later you can split them again:

test, train = df[df["ind"].eq("test")], df[df["ind"].eq("train")]
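To see the round trip end to end, here is a minimal sketch that uses small in-memory frames in place of test.csv and train.csv. The column name x, the sample values, and the median fill are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Made-up stand-ins for test.csv and train.csv
test = pd.DataFrame({"x": [1.0, None]})
train = pd.DataFrame({"x": [3.0, 4.0, None]})

# Tag each frame, then stack them; ignore_index avoids duplicate row labels
df = pd.concat(
    [test.assign(ind="test"), train.assign(ind="train")],
    ignore_index=True,
)

# Example cleaning step (an assumption): fill missing x with the combined median
df["x"] = df["x"].fillna(df["x"].median())

# Split on the indicator, then drop it so the model never sees it
test_clean = df[df["ind"].eq("test")].drop(columns="ind")
train_clean = df[df["ind"].eq("train")].drop(columns="ind")
```

Dropping the indicator only after the split matters: once the column is gone there is nothing left to filter on.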

Method 1: Write a function that performs your set of data cleaning operations, then pass train, test, or whatever else you want to clean through that function. The result will be consistent.
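A rough sketch of Method 1 might look like the following. The function name clean, the column name, and the specific cleaning steps are hypothetical; swap in whatever your data actually needs.

```python
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning steps; replace them with your own."""
    out = frame.copy()  # leave the caller's frame untouched
    out.columns = out.columns.str.strip().str.lower()  # tidy column names
    out = out.drop_duplicates()  # remove exact duplicate rows
    out["name"] = out["name"].str.strip()  # trim whitespace in a text column
    return out

# Made-up stand-ins for train.csv and test.csv
train = pd.DataFrame({" Name ": [" a", "b ", "b "]})
test = pd.DataFrame({" Name ": ["c "]})

train_clean = clean(train)
test_clean = clean(test)
```

Because both frames go through the same function, there is nothing to split afterwards. Note that this only suits steps that do not depend on statistics of the other frame.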

Method 2: If you want to concatenate, one way to do it is to add a column called type, set to "test" for the test dataset and "train" for the train dataset. Perform your operations, then filter on that column to divide the data into 2 dataframes again, for example:

data[data['type']=="test"]

There are several methods to choose from. If you insist on concatenating the two dataframes, then first add a new column to each DataFrame called source. Make the value for test.csv 'test' and likewise for the training set.

When you have finished cleaning the combined df, then use the source column to split the data again.

An alternative method is to record all the operations you perform on the training set and simply repeat them for the test set. This won't work if you normalise values based on statistics of the whole population, since the test set on its own would give different statistics.
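One way to make the "record and repeat" approach cope with normalisation is to record the statistics themselves from the training set and reuse them on the test set. This is a sketch of that variation, not of normalising over the combined population; the column name x and the values are made up.

```python
import pandas as pd

# Made-up stand-ins for train.csv and test.csv
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [2.0, 4.0]})

# Record the statistics once, from the training set only
mean = train["x"].mean()
std = train["x"].std()  # pandas uses the sample standard deviation (ddof=1)

# Repeat the same operation, with the same numbers, on both sets
train_scaled = train.assign(x=(train["x"] - mean) / std)
test_scaled = test.assign(x=(test["x"] - mean) / std)
```

Here both sets are scaled with identical parameters, so the values stay comparable even though the test set was never concatenated with the training set.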
