One-hot encoding when not every sequence contains every value

Is there a way (other than manually creating dictionaries) to one-hot encode sequences when not every possible value appears in every sequence? sklearn's OneHotEncoder and Keras's to_categorical only account for the values present in the current sample. For example, encoding the DNA sequences 'AT' and 'CG' separately would give [[1, 0], [0, 1]] for both. However, I want A, T, C, and G to be accounted for in all sequences, so 'AT' should be [[1, 0, 0, 0], [0, 1, 0, 0]] and 'CG' should be [[0, 0, 1, 0], [0, 0, 0, 1]].

Topic one-hot-encoding encoding data-cleaning machine-learning

Category Data Science


You can use scikit-learn's OneHotEncoder like this:

from sklearn.preprocessing import OneHotEncoder

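X = [['A', 'T'], ['C', 'G']]    # one row per sequence, one column per position
enc = OneHotEncoder()           # categories are inferred separately for each column
enc.fit_transform(X).toarray()  # convert the sparse result to a dense array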

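The result has one row per sequence. Each position is treated as its own feature, with categories inferred from the data (A/C for the first position, G/T for the second), so this is not the four-column-per-character layout asked for in the question: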

array([[1., 0., 0., 1.],
       [0., 1., 1., 0.]])
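
To get the layout from the question (one row per character, always four columns), one option is to fix the alphabet up front through OneHotEncoder's `categories` parameter and encode each sequence as a single column of characters. The following is a minimal sketch, not a definitive implementation: the names `ALPHABET` and `encode` are made up for illustration, and the column order A, T, C, G is an assumption chosen to match the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed column order, chosen to match the output requested in the question.
ALPHABET = ['A', 'T', 'C', 'G']

# Passing `categories` fixes the columns in advance instead of inferring
# them from whatever values happen to appear in the data.
enc = OneHotEncoder(categories=[ALPHABET])
enc.fit(np.array(ALPHABET).reshape(-1, 1))

def encode(seq):
    """Hypothetical helper: one-hot encode a string, one row per character."""
    column = np.array(list(seq)).reshape(-1, 1)  # shape (len(seq), 1)
    return enc.transform(column).toarray()

print(encode('AT'))
print(encode('CG'))
```

With this setup, 'AT' and 'CG' both produce arrays with four columns, and a character outside the alphabet raises an error under the encoder's default `handle_unknown='error'` setting.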
