Trying to compress text with NLP

For a university project I need to send Spanish text via SMS. Since each message has a cost, I am trying to compress the text, although my current approach is inefficient. It works by first generating a set of two-character codes built from permutations of characters from several alphabets (Finnish, Cyrillic, etc.) and assigning each code to a word longer than two characters (so that the substitution actually compresses it). I then take each word in a sentence and replace it with its associated code. In tests this gives a compression rate of at least 40% (passing 60% in the best case), at the cost of storing roughly 90,000 base words of the Spanish dictionary with their respective codes; although that table weighs less than 10 MB, I think it can be done better.
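A minimal sketch of this word-level codebook idea (the symbol set and word list below are placeholders; in practice the table is built from the ~90,000-word dictionary mentioned above):

# Minimal sketch of the word -> two-character code idea.
# The symbol set and word list are placeholders; the real codebook
# would cover the ~90,000-word Spanish dictionary.
from itertools import product

SYMBOLS = "абвгдежзик" + "αβγδεζηθικ"   # sample Cyrillic + Greek symbols
WORDS = ["hola", "como", "queria", "preguntarte", "sabes", "acerca"]

# Every ordered pair of symbols yields one two-character code.
codes = ("".join(pair) for pair in product(SYMBOLS, repeat=2))
encode_map = {word: next(codes) for word in WORDS}
decode_map = {code: word for word, code in encode_map.items()}

def encode(sentence):
    # Words that have a code are replaced; everything else passes through.
    return " ".join(encode_map.get(w, w) for w in sentence.split())

def decode(sentence):
    return " ".join(decode_map.get(w, w) for w in sentence.split())

packed = encode("hola queria preguntarte si sabes algo acerca de nlp")
print(packed, "->", decode(packed))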

My second attempt was to look for the most common sub-phrases in Spanish text and assign a code to each of them. For this I inspect the dependencies between the words in the text with stanfordnlp and assign a code to each phrase that is formed. This raises the compression rate a bit, but I still think I can do better.
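A rough sketch of that phrase-codebook idea, using plain bigram counts as a stand-in for the dependency-based phrase selection (the corpus and code symbols here are placeholders):

# Rough sketch of the phrase-codebook idea. Frequent bigrams stand in
# for the dependency-based phrases extracted with stanfordnlp.
from collections import Counter

corpus = [
    "queria preguntarte si sabes algo acerca de nlp",
    "no se nada acerca de nlp pero queria preguntarte algo",
]

bigrams = Counter()
for line in corpus:
    tokens = line.split()
    bigrams.update(zip(tokens, tokens[1:]))

# Assign placeholder codes to the most frequent bigrams.
phrase_codes = {
    " ".join(pair): f"§{i}" for i, (pair, _) in enumerate(bigrams.most_common(3))
}

def encode_phrases(sentence):
    for phrase, code in phrase_codes.items():
        sentence = sentence.replace(phrase, code)
    return sentence

print(encode_phrases("queria preguntarte algo acerca de nlp"))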

Lately I have been trying other NLP techniques. My approach is to take a sentence and apply the SnowballStemmer from nltk.stem.snowball (or the lemmatizer from stanfordnlp). The idea is that by reducing words to their base form I can encode many words with a single code, store fewer codes and less text, and even add common proper names, achieving more compression. However, when I try to go back from the stemmer output of a compound sentence to the original sentence (or something close to it), I run into a problem: I have not found any information on how to do this so that the coherence of the output is acceptable. For example, I have the following:

sentence = Hola como estas? queria preguntarte si sabes algo acerca de nlp

Applying the SnowballStemmer:
Hola ---- hol
como ---- com
estas? ---- estas?
queria ---- queri
preguntarte ---- preguntart
si ---- si
sabes ---- sab
algo ---- algo
acerca ---- acerc
de ---- de
nlp ---- nlp

Joined back into a sentence:
outputStemmer = 'hol com estas? queri preguntart si sab algo acerc de nlp'
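For reference, the output above comes from applying nltk's Spanish Snowball stemmer to each whitespace token, roughly like this:

# Reproduces the stemmer output above with nltk's Spanish Snowball stemmer.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")
sentence = "Hola como estas? queria preguntarte si sabes algo acerca de nlp"

stems = [stemmer.stem(word) for word in sentence.split()]
print(" ".join(stems))
# -> 'hol com estas? queri preguntart si sab algo acerc de nlp'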

I have tried to frame this output as a regression or least-squares problem. The idea would be to give weight to prepositions (and other structures that give a sentence semantic coherence), assume that one of them must appear between two words of the output, and vary the choices so that some error measure (which I have not yet figured out how to define) is minimized, bringing the result closer to a meaningful sentence. The problem with this idea is that there are many details of the algorithm I still need to think through, and the stemmer only gives me the base of each word, not the variations that can be derived from that base (if I stored those variations I might raise my compression percentage, but also the amount of text to store).
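As a much simpler baseline than the regression idea, just to illustrate the reconstruction problem, one could map each stem back to the most frequent surface form seen in some reference corpus (the word list below is a tiny placeholder, and the result is only an approximation of the original sentence):

# Baseline reconstruction sketch: decode each stem to the most frequent
# surface form that produced it in a reference corpus. The word list is
# a tiny placeholder; real counts would come from a large Spanish corpus.
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")
corpus_words = ["hola", "como", "queria", "preguntarte", "si",
                "sabes", "algo", "acerca", "de", "sabemos", "quiero"]

surface_counts = defaultdict(Counter)
for word in corpus_words:
    surface_counts[stemmer.stem(word)][word] += 1

def unstem(stemmed_sentence):
    # Pick the most common surface form for each stem; unknown stems pass through.
    out = []
    for token in stemmed_sentence.split():
        forms = surface_counts.get(token)
        out.append(forms.most_common(1)[0][0] if forms else token)
    return " ".join(out)

print(unstem("hol com estas? queri preguntart si sab algo acerc de nlp"))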

I would be very grateful for any help and guidance, because the problem may be much easier to solve with some NLP or ML techniques that I am not familiar with.

Topics: stanford-nlp, nltk, regression, nlp, machine-learning

Category: Data Science


It sounds like you could just apply BPE (Byte-Pair Encoding) to arrive at a reasonably small, near-optimal vocabulary, then encode whatever subword sequences you are left with using your approach (or maybe even just a simple hashing scheme; I suspect that mapping to integers will always be more efficient than mapping to alphabetical characters).

For BPE encoding, I recommend checking the Subword Tokenization section of this pretty great blog post:

Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be segmented as low-er, smartest as smart-est, and so on.
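If you want to experiment before reaching for a library, a toy version of the BPE merge loop (in the style of the classic Sennrich et al. algorithm; the tiny vocabulary below is a placeholder) looks like this:

# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
# The vocabulary is a placeholder; in practice it would be counted from
# the SMS text you want to compress.
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# Each word is written as space-separated symbols, with its corpus frequency.
vocab = {"h o l a": 5, "h o l a s": 2, "s a b e s": 3, "s a b e r": 2}

merges = []
for _ in range(5):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)   # learned merge operations
print(vocab)    # words re-segmented into subword units

Each learned subword can then be assigned a short integer code, which is where your encoding step would come in.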

Hope this helps!
