Encoding a very large dataset into a one-hot matrix
I have a text corpus with around 400 unique characters. The maximum row length is 3000 and there are 20000 rows, so the one-hot encoded data would be a $20000 \times 3000 \times 400$ tensor, which leads to a memory error since the required size exceeds 900 GB of RAM. There are dimensionality reduction techniques such as PCA, but apart from those, what would you recommend in my case to overcome this issue? The text is not natural language but source code for programs, so I am not sure whether word2vec and similar methods are suitable here for obtaining word embeddings, since again this is not natural language.
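For reference, a minimal sketch of what I am doing: the dense tensor size is estimated first, and the (commented-out) loop is the naive construction that runs out of memory. The names `corpus` and `char_to_idx` are placeholders for my list of code strings and my character-to-index map.

```python
import numpy as np

# Dimensions described above: rows x max row length x character vocabulary.
n_rows, max_len, vocab_size = 20000, 3000, 400

# Rough footprint of the dense one-hot tensor (already huge in float32,
# and larger still with float64 or intermediate copies).
bytes_needed = n_rows * max_len * vocab_size * np.dtype(np.float32).itemsize
print(f"Dense float32 tensor: {bytes_needed / 1e9:.1f} GB")

# Naive construction -- this is the step that fails with a memory error.
# `corpus` is a list of source-code strings, `char_to_idx` maps char -> index.
# X = np.zeros((n_rows, max_len, vocab_size), dtype=np.float32)
# for i, row in enumerate(corpus):
#     for j, ch in enumerate(row[:max_len]):
#         X[i, j, char_to_idx[ch]] = 1.0
```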
Topic: one-hot-encoding, dimensionality-reduction
Category: Data Science