One-hot encoding when not every sequence contains every value

Question

megamind

April 7, 2022, 14:05

Is there a way (other than manually creating dictionaries) to one-hot encode sequences when not every possible value appears in each sequence? scikit-learn's OneHotEncoder and Keras's to_categorical only account for the values present in the current sample, so, for example, the DNA sequences 'AT' and 'CG' would both be encoded as [[1, 0], [0, 1]]. However, I want A, T, C, and G to be accounted for in every sequence, so 'AT' should be [[1, 0, 0, 0], [0, 1, 0, 0]] and 'CG' should be [[0, 0, 1, 0], [0, 0, 0, 1]].

Topic one-hot-encoding encoding data-cleaning machine-learning

Category Data Science

Brian Spiering · Accepted Answer · November 2, 2021, 13:30

You can use scikit-learn's OneHotEncoder like this:

from sklearn.preprocessing import OneHotEncoder

# Each inner list is one sample; each position (column) is treated as a separate feature.
X = [['A', 'T'], ['C', 'G']]
enc = OneHotEncoder()
enc.fit_transform(X).toarray()

The result is

array([[1., 0., 0., 1.],
       [0., 1., 1., 0.]])
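
Note that the four columns above come from two features (sequence positions) with two observed categories each, not from a shared A/T/C/G alphabet. A minimal sketch of one way to get the layout asked for in the question, assuming each base is encoded as its own row and the alphabet is passed explicitly through OneHotEncoder's `categories` parameter. The names `ALPHABET` and `encode` are hypothetical, introduced only for illustration, and the column order A, T, C, G is taken from the question:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed fixed alphabet; the column order follows this list.
ALPHABET = ['A', 'T', 'C', 'G']

# Passing `categories` fixes the columns, so values missing from a
# given sequence still get a (zero) column.
enc = OneHotEncoder(categories=[ALPHABET])
enc.fit(np.array(ALPHABET).reshape(-1, 1))

def encode(seq):
    """One-hot encode a string, one row per character (hypothetical helper)."""
    column = np.array(list(seq)).reshape(-1, 1)  # shape (len(seq), 1)
    return enc.transform(column).toarray()

print(encode('AT'))
print(encode('CG'))
```

With this setup, every encoded sequence has four columns regardless of which bases it contains. By default the encoder raises an error on a character outside the alphabet, for example 'N'.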
