Partial fitting: how to ensure one-hot encoding captures all features consistently

I'm doing some data science on ~4 million samples, with many of the columns being categorical.

One column has ~1000 categories and my boss insists on including it in the analysis.

The output is also class predictions (I'll use gnb.predict_proba()).

So, I'm taking a random subset of my data for partial fitting, and repeating.

# train = ~3 million rows of data as a DataFrame
import numpy as np
from sklearn import naive_bayes

gnb = naive_bayes.GaussianNB()
for i in range(10):
    dds = train.sample(n=10**4)
    dfX, dfY = makeXY(dds)  # gets one-hot-encoded X and Y DataFrames
    y = [getClass(x) for x in dfY.values]
    # classes is recomputed from each 10k-row sample, so it can differ between batches
    gnb.partial_fit(dfX, y, classes=np.unique(y))

How can I ensure I get all the possible classes AND that they are in the same order every time?

Topic: pandas, python

Category: Data Science


If you're using sklearn, this is a great use of CountVectorizer as a workaround since you can specify a vocabulary.

To start, get a list of all ~1000 categories and set that as the vocabulary in the transformer. Then convert the column to a string data type and apply this transformer to each batch:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=pre_made_category_list)
# apply to the raw (string-typed) categorical column of each batch
encoded_variable_matrix = cv.fit_transform(dds[categorical_column].astype(str))
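For instance, pre_made_category_list can be built once, up front, from the full training data. A rough sketch, where train and categorical_column are placeholder names from this thread and the categories are assumed to be simple single-token strings (so CountVectorizer's default tokenizer and lowercasing behave as expected):

# build the fixed vocabulary once from the full training column; lowercase it to
# match CountVectorizer's default lowercasing
pre_made_category_list = sorted(train[categorical_column].astype(str).str.lower().unique())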

Even though it's technically a count vectorizer, since there is only one word in each string the counts for each row will be 1 for that row's category and 0 for everything else, so it's effectively one-hot encoding the variable. The order of the columns will be the order of the vocabulary, so the matrix will be consistent between batches.
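The same trick covers the class-ordering half of the question: compute the full class list once and reuse it, instead of recomputing np.unique per batch. Below is a rough sketch under some assumptions: cv is the fixed-vocabulary CountVectorizer from above, makeXY/getClass/train/categorical_column come from the question, and for brevity only the big categorical column is fed to the model (in practice you would stack it with the rest of your consistently encoded features):

import numpy as np
from sklearn import naive_bayes

# full class list, computed once so the order of gnb.classes_ (and of the columns
# returned by predict_proba) is identical no matter which rows get sampled
all_labels = [getClass(v) for v in makeXY(train)[1].values]
all_classes = np.unique(all_labels)

gnb = naive_bayes.GaussianNB()
for i in range(10):
    dds = train.sample(n=10**4)
    # fixed vocabulary -> same column order in every batch; GaussianNB wants a dense array
    X_cat = cv.fit_transform(dds[categorical_column].astype(str)).toarray()
    y = [getClass(v) for v in makeXY(dds)[1].values]
    # classes is only strictly required on the first partial_fit call, but passing the
    # same array every time is harmless and keeps the intent explicit
    gnb.partial_fit(X_cat, y, classes=all_classes)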
