Division of data into training and validation sets

I have a multi-sensor dataset for activities of daily living. It contains data from 10 volunteers, each performing 9 activities. Each volunteer wears 6 body-mounted sensors that record quaternions, acceleration, and angular velocity. For each volunteer I have a total of 7 CSV files, i.e. one per sensor (6 files) plus one annotation file.

Now, I would like to divide the data of 7 volunteers into training and validation sets and keep the remaining 3 volunteers for testing. For those 7 volunteers I have a total of 49 CSV files.

What would be the right approach to divide these into training and validation sets? I can find a lot of information about working with a single CSV file, but not about a bunch of them.

I am looking forward to your advice.

Topic activity-recognition machine-learning-model deep-learning

Category Data Science


If the data for each volunteer have the same format, you can proceed in the following manner.

Step I

Combine the 7 CSVs of each volunteer into a single CSV by merging them column-wise (do not append rows here). I assume that the sensor data columns are your features (X) and the annotation column(s) are your target (y). After this step you have 10 CSVs, one per volunteer. A sketch of this merge is given below.
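A minimal sketch of Step I with pandas, assuming hypothetical paths such as data/volunteer_01/sensor_1.csv ... sensor_6.csv plus annotations.csv, and that every file shares a timestamp column to align on (adjust the names to your actual data):

```python
import pandas as pd

def load_volunteer(volunteer_dir):
    """Column-wise merge of one volunteer's 6 sensor CSVs plus the annotation CSV."""
    sensor_frames = []
    for i in range(1, 7):
        df = pd.read_csv(f"{volunteer_dir}/sensor_{i}.csv")
        # Prefix columns so the 6 sensors do not collide after the merge
        df = df.rename(columns={c: f"s{i}_{c}" for c in df.columns if c != "timestamp"})
        sensor_frames.append(df)

    # Merge the 6 sensor files column-wise on the shared timestamp
    merged = sensor_frames[0]
    for df in sensor_frames[1:]:
        merged = merged.merge(df, on="timestamp", how="inner")

    # Attach the annotation (target) column(s)
    annotations = pd.read_csv(f"{volunteer_dir}/annotations.csv")
    return merged.merge(annotations, on="timestamp", how="inner")

# Example usage for one volunteer:
# load_volunteer("data/volunteer_01").to_csv("data/volunteer_01.csv", index=False)
```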

Step II

Merge the per-volunteer CSVs to create the final sets. Append the 7 CSVs (of the 7 training/validation volunteers) to each other row-wise, and append the other 3 CSVs (of the 3 test volunteers) to each other row-wise. This gives you a single CSV for training/validation and a single CSV for testing. To split the 7 volunteers further, hold out one or two whole volunteers for validation so that no volunteer appears in both training and validation; a sketch follows below.
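A minimal sketch of Step II, again with pandas, assuming Step I saved one combined CSV per volunteer under hypothetical names like data/volunteer_01.csv ... data/volunteer_10.csv (the volunteer_id column and the choice of volunteers 6 and 7 for validation are illustrative assumptions):

```python
import pandas as pd

train_val_ids = [1, 2, 3, 4, 5, 6, 7]   # 7 volunteers for training + validation
test_ids = [8, 9, 10]                   # 3 held-out volunteers for testing

def stack_volunteers(volunteer_ids):
    """Row-wise append of the per-volunteer CSVs, tagging each row with its volunteer."""
    frames = []
    for i in volunteer_ids:
        df = pd.read_csv(f"data/volunteer_{i:02d}.csv")
        df["volunteer_id"] = i  # keep track of the source volunteer
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

train_val_df = stack_volunteers(train_val_ids)
test_df = stack_volunteers(test_ids)

# Further split: hold out volunteers 6 and 7 for validation so that
# no volunteer appears in both the training and the validation set.
train_df = train_val_df[~train_val_df["volunteer_id"].isin([6, 7])]
val_df = train_val_df[train_val_df["volunteer_id"].isin([6, 7])]

train_df.to_csv("train.csv", index=False)
val_df.to_csv("validation.csv", index=False)
test_df.to_csv("test.csv", index=False)
```

Splitting by whole volunteers (rather than by random rows) keeps the evaluation realistic, since the model is tested on people it has never seen.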

Both of these steps can easily be done with functions available in the Python pandas library (e.g. read_csv, merge, and concat).
