Optimization of a simple M x N dataset

I have a dataset consisting of M questionnaires and N students. Each student replied to some of the questionnaires.

I would like to make the dataset better, by removing some questionnaires and/or some students. The goal is to optimize the dataset so we have as few holes as possible. To be clear, a hole in the dataset is when a student did not reply to a questionnaire.

Let's say the number of holes in the dataset is H. We want H as low as possible, while M and N are as high as possible.

How would one go about optimizing such a problem?

Topic missing-data optimization dataset

Category Data Science


It would be helpful to clarify how and why the dataset should be "optimized".

If I'm understanding the question correctly, we can think of this problem as simply an M x N array, where the rows indicate questionnaires and the columns indicate students. Each entry is 1 if the student answered the questionnaire and 0 otherwise.
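As a minimal sketch of that representation, the snippet below builds a small array and counts the holes H. The toy data and the names `answered` and `count_holes` are made up for illustration; they do not come from the question.

```python
# Hypothetical toy data: rows are questionnaires, columns are students.
# 1 = the student answered the questionnaire, 0 = a hole.
answered = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

def count_holes(matrix):
    """Return H, the number of 0 entries in the matrix."""
    return sum(row.count(0) for row in matrix)

M = len(answered)       # number of questionnaires (rows)
N = len(answered[0])    # number of students (columns)
H = count_holes(answered)
print(M, N, H)
```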

If you want no holes at all (H=0), simply drop any rows/columns with missing data. Obviously, if every questionnaire was skipped by at least one student, then dropping incomplete rows removes the whole dataset.
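The H=0 approach could be sketched roughly as follows, assuming the 0/1 array described above is stored as a list of rows. The helper names and the toy data are hypothetical.

```python
def drop_incomplete_rows(matrix):
    """Keep only questionnaires (rows) that every student answered."""
    return [row for row in matrix if all(row)]

def drop_incomplete_columns(matrix):
    """Keep only students (columns) who answered every questionnaire."""
    if not matrix:
        return []
    keep = [j for j in range(len(matrix[0])) if all(row[j] for row in matrix)]
    return [[row[j] for j in keep] for row in matrix]

# Hypothetical toy data: rows are questionnaires, columns are students.
answered = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

complete_rows = drop_incomplete_rows(answered)
complete_columns = drop_incomplete_columns(answered)

# Degenerate case: every questionnaire was skipped by at least one student,
# so dropping incomplete rows leaves nothing.
nothing_left = drop_incomplete_rows([[1, 0], [0, 1]])
```

On this toy data, dropping rows keeps a single questionnaire with all four students, while dropping columns keeps all three questionnaires with two students, so the two choices trade M against N differently.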

If you want to selectively drop some rows (questionnaires) or columns (students), you would need some reason to do so. For instance, if a questionnaire was answered by no student, or if a student did not answer any questionnaires, then maybe you could justify dropping that row or column. Otherwise I don't really understand the goal you are trying to achieve.
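A sketch of that selective case, dropping only the rows and columns that contain no answers at all, might look like this. The function name `drop_empty` and the toy data are assumptions for illustration.

```python
def drop_empty(matrix):
    """Remove questionnaires nobody answered and students who answered nothing."""
    # Drop rows (questionnaires) with no answers at all.
    rows = [row for row in matrix if any(row)]
    if not rows:
        return []
    # Drop columns (students) with no answers in the remaining rows.
    keep = [j for j in range(len(rows[0])) if any(row[j] for row in rows)]
    return [[row[j] for j in keep] for row in rows]

# Hypothetical toy data: the second questionnaire was answered by nobody,
# and the second and fourth students answered nothing.
answered = [
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

trimmed = drop_empty(answered)
holes_left = sum(row.count(0) for row in trimmed)
```

This removes only entries that carry no information, so some holes can remain in the trimmed array.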
