One-hot encoding when not every sequence contains every value
Is there a way (other than manually creating dictionaries) to one-hot encode sequences when not every possible value appears in every sequence? sklearn's OneHotEncoder and Keras's to_categorical only account for the values present in the current sample, so, for example, the DNA sequences 'AT' and 'CG' would both be encoded as [[1, 0], [0, 1]]. However, I want A, T, C, and G to be accounted for in every sequence, so 'AT' should be [[1, 0, 0, 0], [0, 1, 0, 0]] and 'CG' should be [[0, 0, 1, 0], [0, 0, 0, 1]].
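One possible way to get the encoding described above is to fix the alphabet up front and compare each symbol against it with NumPy broadcasting. This is only a minimal sketch: the helper name `one_hot` and the constant `ALPHABET` are made up for illustration, and the column order A, T, C, G is assumed from the expected output in the question.

```python
import numpy as np

# Fixed alphabet; the column order (A, T, C, G) is assumed from the question.
ALPHABET = np.array(list("ATCG"))

def one_hot(seq, alphabet=ALPHABET):
    """Return a (len(seq), len(alphabet)) matrix of 0s and 1s for `seq`."""
    # Use the alphabet's dtype so that an empty sequence still compares cleanly.
    symbols = np.array(list(seq), dtype=alphabet.dtype)
    # Broadcasting compares every symbol against every alphabet entry.
    encoded = (symbols[:, None] == alphabet[None, :]).astype(int)
    # Every row needs exactly one 1; a row of zeros means an unknown symbol.
    if not (encoded.sum(axis=1) == 1).all():
        raise ValueError(f"sequence {seq!r} contains symbols outside the alphabet")
    return encoded

print(one_hot("AT").tolist())  # [[1, 0, 0, 0], [0, 1, 0, 0]]
print(one_hot("CG").tolist())  # [[0, 0, 1, 0], [0, 0, 0, 1]]
```

If staying within scikit-learn is preferable, OneHotEncoder also accepts a `categories` argument that fixes the category list ahead of time instead of inferring it from the data passed to `fit`, which should give the same fixed-width columns.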
Topic one-hot-encoding encoding data-cleaning machine-learning
Category Data Science