Correlations with NA or with Zeros?

When calculating correlations in R, e.g. via cor, is it better to treat missing data as NAs or as zeros? Zeros would be treated as valid numerical values, so I'd guess NA would be better?

Topic missing-data correlation r

Category Data Science


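Normally, statistical software excludes NAs from the estimation procedure (in R, cor returns NA by default when any value is missing, and you request the exclusion through its use argument, e.g. use = "complete.obs"). That's not the case if you replace an NA with 0: the zero is treated as a real observation, which can distort the result considerably.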
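To make the distortion concrete, here is a minimal sketch on made-up toy data (the vectors x and y_na are hypothetical, not from the question). y is exactly 2 * x wherever it is observed, so the correlation on the observed pairs should be 1.

```r
# Hypothetical toy data: y equals 2 * x, with two values of y missing
x    <- c(1, 2, 3, 4, 5, 6)
y_na <- c(2, 4, NA, 8, NA, 12)

# Default behaviour (use = "everything"): any NA makes the result NA
r_default <- cor(x, y_na)

# Drop the incomplete pairs before computing the correlation
r_complete <- cor(x, y_na, use = "complete.obs")

# Replace the NAs with zeros and treat them as real observations
y_zero <- ifelse(is.na(y_na), 0, y_na)
r_zero <- cor(x, y_zero)

print(c(default = r_default, complete = r_complete, zero = r_zero))
```

With these numbers the complete-case correlation is 1, while the zero-filled version drops to roughly 0.51, even though nothing about the observed relationship changed.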

The reasons for the NAs are also important. For example, if it's a cross-sectional study and the person simply didn't want to answer some of the questions (arbitrarily), then excluding those values is perhaps OK. However, if the question didn't apply to him/her and that's why it's an NA, then maybe it's a different type of subject that should be treated separately from its peers (i.e. by grouping the data).


Imputing missing values with 0 is definitely wrong; keeping NAs as NAs is certainly less wrong. More correctly, though, you should figure out why you have missing values in the first place, as different reasons for the missing values might require different approaches.
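As a rough starting point for that investigation, the sketch below counts the NAs per column and checks whether missingness in one variable goes along with another variable. The data frame df and its columns (age, income, score) are invented for illustration only.

```r
# Hypothetical survey data; the column names are illustrative only
df <- data.frame(
  age    = c(23, 35, 41, 29, 52, 60),
  income = c(30, NA, 55, NA, 70, 80),
  score  = c(5, 7, NA, 6, 8, 9)
)

# Number of missing values in each column
na_counts <- colSums(is.na(df))

# Is a missing income related to age? Compare the mean age of
# rows with (FALSE) and without (TRUE) an income value.
age_by_missing <- tapply(df$age, is.na(df$income), mean)

# Correlation matrix using, for each pair of columns,
# all rows where both values are present
r_pairwise <- cor(df, use = "pairwise.complete.obs")

print(na_counts)
print(age_by_missing)
```

If the groups differ clearly, as they do in this toy data, the values are probably not missing at random, and simply dropping them may bias the correlation too.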
