Compare books using their category lists with NLP

I have a database of books. Each book has a list of categories that describe the genre/topics of the book (I use Python models).

Most categories in a list consist of one to three words.

Examples of a book category list:

['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'],
["Children's stories", 'Christian life'],
['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'],
['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', 'Cruelty']

I want to create/use an algorithm to compare the books and find similarity between 2 books using NLP/machine learning models.

The categories are not well defined and tend to change. For example, there could be a category 'story' and another called 'stories', because the system takes categories from an open text box rather than from a saved list. So far I have tried 2 algorithms:

  • Cosine similarity with WordNet - split each category into a bag of words and check whether each word has a synonym in the other book's list.
  • Similarity from the nlp model of the spaCy library (Python) - a distance algorithm.
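For reference, the bag-of-words step in the first bullet can be sketched as plain cosine similarity over word counts. This is only a rough illustration: it leaves out the WordNet synonym lookup, and the function names `bag_of_words` and `cosine_similarity` are made up for this example.

```python
import math
from collections import Counter


def bag_of_words(categories):
    """Split every category into lowercase words and count them."""
    words = []
    for category in categories:
        words.extend(category.lower().split())
    return Counter(words)


def cosine_similarity(categories_a, categories_b):
    """Cosine similarity between the word-count vectors of two category lists."""
    bag_a = bag_of_words(categories_a)
    bag_b = bag_of_words(categories_b)
    # Only words present in both bags contribute to the dot product.
    dot = sum(bag_a[word] * bag_b[word] for word in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


book_a = ['Children', 'Brothers and sisters', 'Conduct of life',
          'Cheerfulness', 'Christian life']
book_b = ["Children's stories", 'Christian life']
score = cosine_similarity(book_a, book_b)
print(round(score, 3))
```

Because the match is on exact tokens, 'Children' and "Children's" count as different words here, which is one way multi-word categories end up with a misleading score.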

So far I have used the WordNet model from the nltk package and spaCy. Both approaches gave inaccurate results when comparing categories that contain 2 or 3 words, and each of them had its own specific problems.

Which algorithms or Python models can handle strings of 2 or 3 words, so that I can use them to compare the books?

By the way, this is the first time I have asked here. If you need more details about the database or what I have done so far, please tell me.

Topic spacy nltk nlp python machine-learning

Category Data Science


Your problem could be framed as multi-label classification, where each instance can have multiple labels: for a given book, predict which labels are likely.
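As a minimal sketch of that framing, each book's category list can be turned into a binary indicator vector over all known labels, which is the usual target format for multi-label models. The helper names `build_label_index` and `to_indicator_vector` are assumptions for this illustration, and the sketch covers only the label side, not the features a model would predict from.

```python
def build_label_index(books):
    """Map every distinct label to a column position (sorted for stability)."""
    labels = sorted({label for book in books for label in book})
    return {label: position for position, label in enumerate(labels)}


def to_indicator_vector(book, label_index):
    """Return a 0/1 vector with a 1 in the column of each label the book has."""
    vector = [0] * len(label_index)
    for label in book:
        vector[label_index[label]] = 1
    return vector


books = [
    ["Children's stories", 'Christian life'],
    ['Children', 'Brothers and sisters', 'Conduct of life',
     'Cheerfulness', 'Christian life'],
]
label_index = build_label_index(books)
targets = [to_indicator_vector(book, label_index) for book in books]

for book, target in zip(books, targets):
    print(target, book)
```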

In Python, the scikit-multilearn library is designed for multi-label classification problems.

Additionally, you may want to consolidate labels that are similar (e.g., 'story' and 'stories'). The consolidation can be done with find and replace.
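A rough sketch of that find-and-replace consolidation is shown below. The `REPLACEMENTS` table and the `consolidate` function are hypothetical; in practice the table would have to be built by inspecting the labels that actually occur in the database.

```python
# Hypothetical find-and-replace table: variant label -> canonical label.
REPLACEMENTS = {
    'stories': 'story',
    'slaves': 'slavery',
}


def consolidate(categories, replacements=REPLACEMENTS):
    """Lowercase, collapse whitespace, apply the table, and drop duplicates."""
    seen = set()
    result = []
    for category in categories:
        key = ' '.join(category.lower().split())
        key = replacements.get(key, key)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(consolidate(['Story', 'stories', 'Slavery', 'Slaves', 'Christian life']))
# prints ['story', 'slavery', 'christian life']
```

Consolidating before any comparison means two books tagged 'story' and 'stories' share an identical label instead of relying on a similarity model to notice the connection.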
