Performing EDA on a dataset with missing features

I'm new to DS.

I want to perform EDA on a dataset that has missing values. These are the missing-value counts per feature for my train and test sets:

  • train:

    Test_0        0
    Test_1       31
    Test_2        0
    Test_3      141
    Test_4        0
    Test_5        0
    Test_6        0
    Test_7        0
    Test_8     1045
    Test_9        0
    Test_10       0
    Test_11       0
    Test_12       0
    Test_13       0
    Test_14       0
    Test_15    2967
    Class         0
    dtype: int64

  • test:

    Test_0       0
    Test_1       7
    Test_2       0
    Test_3      46
    Test_4       0
    Test_5       0
    Test_6       0
    Test_7       0
    Test_8     279
    Test_9       0
    Test_10      0
    Test_11      0
    Test_12      0
    Test_13      0
    Test_14      0
    Test_15    738
    dtype: int64

My train set has 3616 rows in total and my test set has 905. How can I decide which features to drop and which to fill in artificially? And how should I fill them in? I have read a bit about mean filling and similar methods.

If anyone can also point me to a guide that explains these issues, I would appreciate it.

Thanks!



There are many techniques for filling in missing values. Some of them are:

1.) Replacing with the mean, median, or mode, as you correctly pointed out.

2.) Replacing with a constant value such as 0

3.) KNN Imputer

4.) Iterative Imputer
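As a rough sketch of the four options above, assuming scikit-learn is available and the features are numeric. The toy frame and its column names "a", "b", "c" are made up for illustration and are not the data from the question:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and needs this import first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Hypothetical toy data: 50 rows, 3 numeric columns, 10 gaps in column "b"
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
missing_rows = rng.choice(50, size=10, replace=False)
X.loc[missing_rows, "b"] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    # mode; usually more meaningful for discrete or categorical features
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant_0": SimpleImputer(strategy="constant", fill_value=0),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

# Fit each imputer and keep the filled copy as a DataFrame
filled = {
    name: pd.DataFrame(imp.fit_transform(X), columns=X.columns)
    for name, imp in imputers.items()
}

for name, df in filled.items():
    print(name, "remaining NaNs:", int(df.isna().sum().sum()))
```

With a real train/test split, it is usually safer to call fit on the train set only and then transform on both sets, so that statistics from the test set do not leak into the imputed values.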

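Which one to use depends on what kind of data you have. For example, the mean and median only make sense for numeric features, while the mode also works for categorical ones. Alternatively, you can try them all and see which gives you the best results.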
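One possible way to "try them all" is to put each imputer in a pipeline with a model and compare cross-validated scores. This is only a sketch under assumptions: the data below is synthetic, and the logistic regression model and the 5-fold accuracy score are stand-ins picked for illustration, not something taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data, with about 10% of values knocked out
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "constant_0": SimpleImputer(strategy="constant", fill_value=0),
    "knn": KNNImputer(n_neighbors=5),
}

scores = {}
for name, imputer in candidates.items():
    # The pipeline refits the imputer inside each fold, so the held-out
    # fold never influences the fill values
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
print("best on this toy data:", best)
```

Differences between imputers can be small and noisy, so it may be worth checking the spread across folds before settling on one.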

Cheers!
