Encoding a very large dataset into a one-hot matrix
I have a text corpus with around 400 unique characters. The maximum row length is 3000 and there are 20000 rows, so the one-hot encoded data would be a $20000 \times 3000 \times 400$ tensor, which leads to a memory error since the required size exceeds 900 GB of RAM. There are dimensionality reduction techniques such as PCA, but apart from those, what would you recommend in my case to overcome this issue? The text is not natural language but source code for programs, so I am not sure whether word2vec and similar methods are suitable here for obtaining word embeddings, since again this is not natural language.
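For reference, a minimal sketch of what I am doing: the dense tensor size is estimated first, and the (commented-out) loop is the naive construction that runs out of memory. The names `corpus` and `char_to_idx` are placeholders for my list of code strings and my character-to-index map.

```python
import numpy as np

# Dimensions described above: rows x max row length x character vocabulary.
n_rows, max_len, vocab_size = 20000, 3000, 400

# Rough footprint of the dense one-hot tensor (already huge in float32,
# and larger still with float64 or intermediate copies).
bytes_needed = n_rows * max_len * vocab_size * np.dtype(np.float32).itemsize
print(f"Dense float32 tensor: {bytes_needed / 1e9:.1f} GB")

# Naive construction -- this is the step that fails with a memory error.
# `corpus` is a list of source-code strings, `char_to_idx` maps char -> index.
# X = np.zeros((n_rows, max_len, vocab_size), dtype=np.float32)
# for i, row in enumerate(corpus):
#     for j, ch in enumerate(row[:max_len]):
#         X[i, j, char_to_idx[ch]] = 1.0
```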
Topic: one-hot-encoding, dimensionality-reduction
Category: Data Science