Partial fitting: how to ensure one-hot encoding captures all features consistently
I'm doing some data science on ~4 million samples, and many of the columns are categorical.
One column has ~1000 categories, and my boss insists on including it in the analysis.
My output is also a class prediction (I'll use gnb.predict_proba()).
So I'm taking a random subset of my data, calling partial_fit on it, and repeating:
# train = ~3 million rows of data as a dataframe
import numpy as np
from sklearn import naive_bayes

gnb = naive_bayes.GaussianNB()
for i in range(10):
    dds = train.sample(n=10**4)
    dfX, dfY = makeXY(dds)  # gets one-hot-encoded X and Y dataframes
    labels = [getClass(x) for x in dfY.values]
    # classes is derived from this sample only, so it can differ between iterations
    gnb.partial_fit(dfX, labels, classes=np.unique(labels))
How can I ensure I get all the possible classes AND that they are in the same order every time?
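For reference, here is a minimal sketch of one way this might be approached: compute the full category list and the full class list once from all the data, then encode every sample against those fixed lists. The toy frame, the column names (color, size, label), and the helper make_xy are hypothetical stand-ins, since the real makeXY and getClass are not shown; this illustrates the idea under those assumptions and is not a drop-in replacement.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the real training frame (hypothetical column names).
train = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 2.5, 1.5, 3.5, 0.5, 2.2],
    "label": ["a", "b", "c", "b", "a", "c", "a", "b"],
})

# Compute the category levels and the class labels once, from ALL the data.
color_levels = sorted(train["color"].unique())
all_classes = np.unique(train["label"])  # np.unique returns sorted values


def make_xy(chunk):
    """One-hot encode a chunk against the fixed category list (hypothetical helper)."""
    chunk = chunk.copy()
    # A Categorical with explicit categories makes get_dummies emit one column
    # per known level, in that order, even when a level is absent from the chunk.
    chunk["color"] = pd.Categorical(chunk["color"], categories=color_levels)
    X = pd.get_dummies(chunk[["color", "size"]], columns=["color"], dtype=float)
    return X, chunk["label"].to_numpy()


gnb = GaussianNB()
column_orders = []
for i in range(3):
    sample = train.sample(n=4, random_state=i)
    X, y = make_xy(sample)
    column_orders.append(list(X.columns))
    # The full class list is required on the first call; repeating the same
    # list on later calls is accepted.
    gnb.partial_fit(X, y, classes=all_classes)

proba = gnb.predict_proba(X)  # columns follow the order of gnb.classes_
```

With this setup, every sample yields the same columns in the same order, and the columns of predict_proba line up with gnb.classes_, which is the sorted class list passed in. For the ~1000-category column, the same pattern applies: build its level list once from the full dataset before sampling.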