State-of-the-art Python packages that can evaluate language similarity

I am trying to estimate how likely a specific sentence is, given a large set of sentences. As a starting point, I use a simple approach: training a custom n-gram language model and computing the perplexity of each sentence in a list.

I found that the package KenLM (https://www.aclweb.org/anthology/W11-2123/) is often used for this task. However, it is fairly old (the paper was published in 2011).

On the other hand, I noticed that the two most famous state-of-the-art NLP models, BERT and GPT-2, are both distributed as pre-trained models.

Is there any package newer than KenLM that is suitable for this kind of likelihood evaluation?

Topic language-model nlp similarity

Category Data Science


It seems what you need is a language model. You should train it with your "large set of sentences" and then use it to compute the likelihood of any given sentence.
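To make the train-then-score workflow concrete, here is a minimal sketch in pure Python. It is not how KenLM works internally: it assumes a bigram model with simple add-k smoothing and whitespace tokenization, both chosen only to keep the example short. The function names (`train_bigram`, `sentence_logprob`, `perplexity`) are made up for this illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (illustrative choice)


def tokenize(sentence):
    # Lowercased whitespace tokenization; a deliberate simplification.
    return sentence.lower().split()


def train_bigram(sentences):
    """Count bigrams and their left contexts over the training sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {EOS}
    for s in sentences:
        tokens = [BOS] + tokenize(s) + [EOS]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def sentence_logprob(sentence, model, k=1.0):
    """Natural-log probability of a sentence under add-k smoothing."""
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # one extra slot acts as a shared "unknown word" class
    tokens = [BOS] + tokenize(sentence) + [EOS]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = bigram_counts[(prev, cur)] + k
        den = context_counts[prev] + k * v
        logp += math.log(num / den)
    return logp


def perplexity(sentence, model, k=1.0):
    """exp of the negative mean log-probability per predicted token."""
    n = len(tokenize(sentence)) + 1  # words plus the end-of-sentence token
    return math.exp(-sentence_logprob(sentence, model, k) / n)


# Toy stand-in for the "large set of sentences".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]
model = train_bigram(corpus)
for candidate in ["the cat sat on the rug", "rug the on sat cat the"]:
    print(candidate, perplexity(candidate, model))
```

Lower perplexity means the sentence looks more like the training data. With a real corpus you would want a better smoothing method, which is exactly what dedicated toolkits provide.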

KenLM is a classic n-gram language modeling toolkit. It implements interpolated modified Kneser-Ney smoothing. The main publications describing it are this and this.

Modern neural language models may give you better performance. Depending on your requirements, you may try the simpler AWD-LSTM, which is based on a regularized long short-term memory network, or the more complex GPT-2 model, based on the Transformer architecture. For Transformer-based models, I suggest using the HuggingFace Transformers Python library, which makes it very easy to train and use such models, and has a repository of pre-trained models.
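As a rough sketch of scoring a sentence with a pre-trained GPT-2 through the Transformers library: the helper names below (`perplexity_from_mean_nll`, `gpt2_perplexity`) are made up for illustration, and the example assumes `torch` and `transformers` are installed and that the `gpt2` checkpoint can be downloaded. It uses the pre-trained model as-is, so fine-tuning on your own sentences would be a separate step.

```python
import math


def perplexity_from_mean_nll(mean_nll):
    """Perplexity is exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


def gpt2_perplexity(sentence, model_name="gpt2"):
    """Perplexity of one sentence under a pre-trained causal language model."""
    # Imported lazily so the helper above works without these heavy packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Loading on every call keeps the sketch short; cache these in real use.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels given, the model shifts them internally and returns the
        # mean cross-entropy over the predicted tokens. The first token is
        # only used as context, so the sentence needs at least two tokens.
        out = model(**enc, labels=enc["input_ids"])
    return perplexity_from_mean_nll(out.loss.item())
```

Calling `gpt2_perplexity("The cat sat on the mat.")` downloads the weights on first use. Perplexities from different models are only comparable when they use the same tokenization, so compare sentences against each other within one model rather than across models.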

BERT is quite different from a normal language model; it is a masked language model. It is not meant for the kind of likelihood estimation you are aiming at, but there are some proposals to use it that way too.


I suggest you use the Hugging Face implementation, which covers many state-of-the-art language models, and fine-tune one of them on your dataset. It provides easy-to-use fine-tuning APIs that are the same across all of its language models.
