Very low probability in naive Bayes classifier

I have some training data (TRAIN) and some test data (TEST). Each row of each table contains an observed class (X) and several binary columns (Y). I'm using a Python script that is intended to predict the probability (Pr) of X given Y in the test data, based on the training data. It uses a Bernoulli naive Bayes classifier. Here is my script:

https://stackoverflow.com/questions/55187516/look-up-bernoullinb-probability-in-dataframe

It works on the dummy data that is included with the script.

On the real data, I know from experience which class some of the Y columns indicate. However, my script gives probability predictions like "1" for classes that I don't think are correct and "6e-77" for correct classes.

Any advice on what I can try please?

Edit

There are two problems. The very low probability is caused by the naive assumption that the features are independent of one another given the class, so related features are counted as separate evidence. This is described here: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py
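To illustrate the first problem, here is a minimal sketch in plain Python (not my actual script). It assumes two equally likely classes and a single real signal that appears in the data as `n_copies` identical binary columns, each set to 1. The function name and the probabilities 0.8 and 0.2 are made up for the illustration. Because naive Bayes multiplies the per-column likelihoods as if the columns were independent, the same evidence is counted over and over and the posterior collapses towards 0 or 1.

```python
import math

def posterior_class_b(n_copies, p_a=0.8, p_b=0.2):
    """Naive Bayes posterior of class B when n_copies identical columns are all 1.

    p_a and p_b are the (hypothetical) probabilities that a column is 1
    in class A and class B. Both classes have a prior of 0.5.
    """
    # Work in log space, as naive Bayes implementations usually do.
    log_a = math.log(0.5) + n_copies * math.log(p_a)
    log_b = math.log(0.5) + n_copies * math.log(p_b)
    # Normalise with the max subtracted to avoid underflow.
    m = max(log_a, log_b)
    exp_a = math.exp(log_a - m)
    exp_b = math.exp(log_b - m)
    return exp_b / (exp_a + exp_b)

for n in (1, 10, 100):
    print(n, posterior_class_b(n))
```

With one column the posterior of class B is 0.2. With 100 copies of that same column it drops below 1e-50, even though no new information was added.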

The incorrect answers are caused by my code getting confused about which class is which, as described on my Stack Overflow post.

Topic prediction probability naive-bayes-classifier machine-learning

Category Data Science


Each binary column (Y) is a feature. The Bernoulli naive Bayes classifier could identify the class (X) when there were fewer than 17 features (Y). The real data had more features than that. I found that another method could classify it accurately. That was:

Training:

(1) Count which features (Y) are in each class (X) in the training data

Testing:

(2) Give each row a score (Z) with a starting value of 0.5

(3) For each row:

  • For each feature (Y) that is in the class (X) in the training data, add 1 to the score (Z).

  • For each feature (Y) that is not in the class (X) in the training data, subtract 1 from the score (Z).

  • If the class (X) is not in the training data, leave the score (Z) unchanged.

The score (Z) was a good classifier for my data.
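The steps above could be sketched roughly like this in plain Python. This is only an illustration under some assumptions, not my exact code: each row is represented as a class label plus the set of feature names that are 1 in that row, a feature counts as "in the class" if it was 1 in at least one training row of that class, and the predicted class is the one with the highest score. The names `train`, `score` and `classify` are made up for the example.

```python
from collections import defaultdict

def train(rows):
    """Step (1): record which features occur in each class.

    rows is an iterable of (class_label, set_of_active_feature_names).
    """
    seen = defaultdict(set)
    for label, features in rows:
        seen[label] |= set(features)
    return dict(seen)

def score(features, label, seen, start=0.5):
    """Steps (2) and (3): score one test row against one class."""
    if label not in seen:
        # Class not in the training data: leave the score unchanged.
        return start
    z = start
    for feature in features:
        # +1 if the feature was seen in this class, otherwise -1.
        z += 1 if feature in seen[label] else -1
    return z

def classify(features, seen):
    """Assumption: pick the known class with the highest score."""
    return max(seen, key=lambda label: score(features, label, seen))

# Tiny made-up example.
train_rows = [
    ("a", {"f1", "f2"}),
    ("a", {"f2", "f3"}),
    ("b", {"f4", "f5"}),
]
seen = train(train_rows)
print(score({"f1", "f3"}, "a", seen))
print(score({"f1", "f3"}, "b", seen))
print(classify({"f4", "f5", "f1"}, seen))
```

Unlike the naive Bayes probabilities, the score only grows or shrinks by 1 per feature, so many related features cannot push it to extreme values like 6e-77.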
