How to deal with missing data for Bernoulli Naive Bayes?
I am dealing with a dataset of categorical data that looks like this:
content_1 content_2 content_4 content_5 content_6
0 NaN 0.0 0.0 0.0 NaN
1 NaN 0.0 0.0 0.0 NaN
2 NaN NaN NaN NaN NaN
3 0.0 NaN 0.0 NaN 0.0
These represent user downloads from an intranet, where a user is shown the opportunity to download a particular piece of content. 1 indicates a user seeing content and downloading it, 0 indicates a user seeing content and not downloading it, and NaN means the user did not see/was not shown that piece of content.
I am trying to use the scikit-learn Bernoulli Naive Bayes model to predict the probability of a user downloading content_1, given whether they have downloaded / not downloaded content_2-7.
I have removed all rows where content_1 is equal to NaN, as I'm only interested in data points where a decision was actively made by the user. This gives data as:
content_1 content_2 content_3 content_4 content_5 content_6
0 1.0 NaN 1.0 NaN NaN 1.0
1 0.0 NaN NaN 0.0 1.0 0.0
2 1.0 0.0 NaN NaN NaN 1.0
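The filtering step described above could be sketched roughly as follows. The frame name raw and its toy values are hypothetical stand-ins for my real data; only the dropna call on the target column matters here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in the same shape as the unfiltered table
raw = pd.DataFrame({
    "content_1": [np.nan, 1.0, 0.0, np.nan],
    "content_2": [0.0, np.nan, np.nan, 1.0],
    "content_3": [np.nan, 1.0, 0.0, 0.0],
})

# Keep only rows where content_1 was actually shown to the user;
# NaNs in the other columns are left untouched
df = raw.dropna(subset=["content_1"]).reset_index(drop=True)
```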
In the above framework, NaN is a missing value. For data points where a NaN is present, I want the algorithm to ignore that category and use only the categories that are present in the calculation.
I know from these questions: 1, that there are essentially 3 options when dealing with missing values:
- Ignore the data point if any categories contain a NaN (i.e. remove the row),
- Impute some other placeholder value (e.g. -1 etc.), or
- Impute some average value corresponding to the overall dataset distribution.
However, none of these is a good option here, for the following reasons:
- Every single row contains at least one NaN. This means that under this arrangement I would discard the entire dataset. Obviously a no go.
- I do not want the missing value to add to the probability calculation, which will happen if I replace NaN with, say, -1. I'm also using a Bernoulli Naive Bayes, which, as I understand it, requires solely 0 or 1 values.
- As this is categorical data, it does not make sense for me to impute an average (the content was either seen or not, and if not, it is not needed).
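To make these objections concrete, here is a small sketch using the three filtered rows shown earlier. The variable names are my own. The comparison against 0 imitates what I understand BernoulliNB does with its default binarize=0.0 (values greater than the threshold become 1, everything else 0), so treat that part as an illustration rather than the library's exact code path.

```python
import numpy as np
import pandas as pd

# The three filtered rows from the table above: every row has at least one NaN
df = pd.DataFrame({
    "content_1": [1.0, 0.0, 1.0],
    "content_2": [np.nan, np.nan, 0.0],
    "content_3": [1.0, np.nan, np.nan],
    "content_4": [np.nan, 0.0, np.nan],
    "content_5": [np.nan, 1.0, np.nan],
    "content_6": [1.0, 0.0, 1.0],
})

# Option 1: dropping incomplete rows leaves nothing to train on
complete_rows = df.dropna()

# Option 2: a -1 placeholder is not a valid Bernoulli value; thresholding at 0
# turns it into 0, which is indistinguishable from "seen, not downloaded"
placeholder = df.fillna(-1)
binarized = (placeholder > 0).astype(int)

# Option 3: column means stand in for a decision the user never made
mean_filled = df.fillna(df.mean())
```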
The answer here indicated that the best way to do this, is, when calculating probabilities, to ignore that category if it is a missing value (essentially you are saying: only compute a probability based on the specific categories I have provided with non missing values).
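As I understand that answer, the idea could be sketched roughly as below. This is plain NumPy, not anything from scikit-learn: the class name NanAwareBernoulliNB and its attributes are hypothetical, and it assumes Laplace smoothing with alpha > 0 and features that are 0, 1 or NaN. Per class, each feature's download probability is estimated only from rows where that feature was observed, and at prediction time NaN features contribute nothing to the log-likelihood.

```python
import numpy as np


class NanAwareBernoulliNB:
    """Bernoulli naive Bayes that skips NaN features (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing; assumed to be > 0

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        observed = ~np.isnan(X)
        ones = np.where(observed, X, 0.0)  # NaN adds nothing to the "1" counts

        log_prior, log_p1, log_p0 = [], [], []
        for c in self.classes_:
            rows = y == c
            n_seen = observed[rows].sum(axis=0)  # times feature j was shown
            n_ones = ones[rows].sum(axis=0)      # times it was downloaded
            p1 = (n_ones + self.alpha) / (n_seen + 2 * self.alpha)
            log_prior.append(np.log(rows.mean()))
            log_p1.append(np.log(p1))
            log_p0.append(np.log1p(-p1))
        self.class_log_prior_ = np.array(log_prior)
        self.log_p1_ = np.array(log_p1)
        self.log_p0_ = np.array(log_p0)
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        is_one = (X == 1).astype(float)   # NaN compares False, so it is skipped
        is_zero = (X == 0).astype(float)
        # Unobserved features add 0 to the joint log-likelihood
        joint = (self.class_log_prior_
                 + is_one @ self.log_p1_.T
                 + is_zero @ self.log_p0_.T)
        joint -= joint.max(axis=1, keepdims=True)  # numerical stability
        proba = np.exp(joint)
        return proba / proba.sum(axis=1, keepdims=True)


# Toy rows from the filtered table above: columns content_2..content_6
X_train = np.array([[np.nan, 1, np.nan, np.nan, 1],
                    [np.nan, np.nan, 0, 1, 0],
                    [0, np.nan, np.nan, np.nan, 1]])
y_train = np.array([1, 0, 1])
clf = NanAwareBernoulliNB().fit(X_train, y_train)
proba = clf.predict_proba(X_train)  # columns follow clf.classes_
```

A row with every feature missing falls back to the class prior, which seems like the behaviour I am after.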
I do not know how to encode this when using the scikit-learn Naive Bayes model, or whether it should be encoded as a missing value at all.
Here's what I have so far:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Read the table copied above from the clipboard
df = pd.read_clipboard()

# Create train input / output data
y_train = df['content_1'].values
X_train = df.drop('content_1', axis=1).values

# Fit the Bernoulli Naive Bayes model (raises ValueError while X_train contains NaN)
clf = BernoulliNB()
clf.fit(X_train, y_train)
Obviously, this returns an error because of the NaNs present. So how can I adjust the scikit-learn Bernoulli model to automatically ignore the columns with NaNs, and instead take only those with 0 or 1?
I am aware this may not be possible with the stock model, and reviewing the documentation seems to suggest this. As such, this may require significant coding, so I'll say this: I am not asking for someone to go and code this (nor do I expect it); I'm looking to be pointed in the right direction, for instance if someone has faced this problem / how they approach it / relevant blog or tutorial posts (my searches have turned up nothing).
Thanks in advance - appreciate you reading.