Optimization of a simple M x N dataset

Question

Samuel Faure

April 16, 2022 05:16

I have a dataset consisting of M questionnaires and N students. Each student replied to some of the questionnaires.

I would like to improve the dataset by removing some questionnaires and/or some students. The goal is to end up with as few holes as possible. To be clear, a hole in the dataset is a case where a student did not reply to a questionnaire.

Let's say the number of holes in the dataset is H. We want H as low as possible, while M and N are as high as possible.

How would one go about solving such an optimization problem?

Topic missing-data optimization dataset

Category Data Science

Adam · Accepted Answer · April 16, 2022 05:16

It would be helpful to clarify how and why the dataset should be "optimized".

If I'm understanding the question correctly, we can think of this problem as simply an M x N array, where the rows indicate questionnaires and the columns indicate students. An entry is 1 if the student answered the questionnaire and 0 otherwise.
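To make that representation concrete, here is a minimal sketch using NumPy. The function names and the toy list of (questionnaire, student) pairs are made up for illustration; the real data may well be stored differently.

```python
import numpy as np

def build_answer_matrix(responses, n_questionnaires, n_students):
    """Return an M x N array with 1 where a student answered, else 0.

    `responses` is assumed to be an iterable of
    (questionnaire_index, student_index) pairs.
    """
    answered = np.zeros((n_questionnaires, n_students), dtype=int)
    for q, s in responses:
        answered[q, s] = 1
    return answered

def count_holes(answered):
    """H = number of (questionnaire, student) pairs with no answer."""
    return int(answered.size - answered.sum())

# Toy data: 3 questionnaires (rows), 3 students (columns).
responses = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 1)]
answered = build_answer_matrix(responses, 3, 3)
holes = count_holes(answered)
print(answered)
print("H =", holes)
```

With this array in hand, M, N and H are just `answered.shape[0]`, `answered.shape[1]` and `count_holes(answered)`.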

If you want no holes at all (H=0), simply drop every row (or every column) that contains missing data. Obviously, if every questionnaire is missing an answer from at least one student, then dropping incomplete rows drops the whole dataset.
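As a rough sketch of the H=0 case, assuming the 0/1 array described above is held in a NumPy array (the helper names are hypothetical), dropping incomplete rows or incomplete columns could look like this:

```python
import numpy as np

def drop_incomplete_rows(answered):
    """Keep only questionnaires (rows) answered by every student."""
    keep = answered.all(axis=1)
    return answered[keep, :]

def drop_incomplete_columns(answered):
    """Keep only students (columns) who answered every questionnaire."""
    keep = answered.all(axis=0)
    return answered[:, keep]

# Toy data: rows are questionnaires, columns are students.
answered = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
])

rows_kept = drop_incomplete_rows(answered)
cols_kept = drop_incomplete_columns(answered)
print(rows_kept.shape)  # questionnaires x students left after dropping rows
print(cols_kept.shape)  # questionnaires x students left after dropping columns
```

In this toy example only one questionnaire survives the row-wise drop, and no student survives the column-wise drop, which is the "whole dataset is dropped" situation mentioned above.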

If you want to selectively drop some rows (questionnaires) or columns (students), you would need some reason to do so. For instance if a questionnaire was answered by no student, or if a student did not answer any questionnaires, then maybe you could justify dropping it. Otherwise I don't really understand the goal you are trying to achieve.
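One justifiable selective drop, removing questionnaires nobody answered and students who answered nothing, might be sketched as follows. This again assumes a 0/1 NumPy array, and `drop_empty` is a made-up helper name, not an existing library function.

```python
import numpy as np

def drop_empty(answered):
    """Remove all-zero rows (unanswered questionnaires) and
    all-zero columns (students who answered nothing)."""
    row_keep = answered.any(axis=1)
    col_keep = answered.any(axis=0)
    return answered[row_keep][:, col_keep]

# Toy data: questionnaire 1 has no answers, student 1 answered nothing.
answered = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
])

trimmed = drop_empty(answered)
holes_before = int(answered.size - answered.sum())
holes_after = int(trimmed.size - trimmed.sum())
print(trimmed)
print(holes_before, "->", holes_after)
```

This removes only rows and columns that carry no information at all, so H goes down without losing a single answer. Anything more aggressive needs an explicit trade-off between H, M and N.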
