Compare Books using book categories list NLP

Question

Eitan Rosati

February 27, 2022, 21:30

I have a database of books. Each book has a list of categories that describe the genre/topics of the book (I use Python models).

Most categories in a list consist of 1 to 3 words.

Examples of a book category list:

['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'],
["Children's stories", 'Christian life'],
['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'],
['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', 'Cruelty']

I want to create/use an algorithm to compare the books and find similarity between 2 books using NLP/machine learning models.

The categories are not well defined and tend to change. For example, there could be a category 'story' and another called 'stories', because the system takes categories from an open text box rather than from a saved list. So far I have tried 2 algorithms:

Cosine similarity with WordNet - split each category into a bag of words and check whether each word has a synonym in the other book's list.
Similarity using the nlp model of the spaCy library (Python) - distance algorithm.
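
For illustration only, the bag-of-words part of the first approach might look roughly like the sketch below. It is not the code actually used: the WordNet synonym lookup is left out, and the function names are hypothetical. It does show why multi-word categories are awkward, since 'Christian life' is scored as two unrelated words rather than as one phrase.

```python
import math
from collections import Counter

def bag_of_words(categories):
    """Split every category into lowercase words and count them."""
    words = []
    for category in categories:
        words.extend(category.lower().split())
    return Counter(words)

def cosine_similarity(categories_a, categories_b):
    """Cosine similarity between the word-count vectors of two category lists."""
    a = bag_of_words(categories_a)
    b = bag_of_words(categories_b)
    # Only words that occur in both lists contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty category list is similar to nothing
    return dot / (norm_a * norm_b)

book_1 = ["Children's stories", 'Christian life']
book_2 = ['Children', 'Brothers and sisters', 'Conduct of life',
          'Cheerfulness', 'Christian life']
score = cosine_similarity(book_1, book_2)
print(round(score, 3))
```

Note that "Children's" and 'Children' do not match here, which is the kind of near-miss a synonym or embedding step is meant to catch.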

So far I have used the WordNet model from the nltk package and spaCy. Both approaches had problems: when a category contains 2 or 3 words the results weren't accurate, and each approach had its own specific problems.

Which algorithms or Python models that can handle strings of 2 or 3 words can I use to compare the books?

By the way, this is the first time I have asked here. If you need more details about the database or what I did so far, please tell me.

Topic spacy nltk nlp python machine-learning

Category Data Science

Brian Spiering · Accepted Answer · February 27, 2022, 21:30

Your problem could be framed as multi-label classification, where each instance can have multiple labels. For a given book, predict which labels are likely.
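
As a rough sketch of what "multiple labels per instance" means in practice, each book's category list can be turned into a binary indicator vector, which is the target format multi-label classifiers usually expect. The helper names below are hypothetical and the example uses plain Python rather than any particular library.

```python
def build_label_index(books):
    """Map every distinct label to a column position (sorted for stability)."""
    labels = sorted({label for categories in books for label in categories})
    return {label: position for position, label in enumerate(labels)}

def to_indicator_vector(categories, label_index):
    """Return a 0/1 vector with a 1 in the column of each label the book has."""
    vector = [0] * len(label_index)
    for label in categories:
        vector[label_index[label]] = 1
    return vector

books = [
    ['Children', 'Conduct of life', 'Christian life'],
    ['Slavery', 'Christian life'],
]
label_index = build_label_index(books)
indicator_matrix = [to_indicator_vector(c, label_index) for c in books]

# Columns: Children, Christian life, Conduct of life, Slavery
for row in indicator_matrix:
    print(row)
```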

In Python, scikit-multilearn is designed for the multi-label classification problem.

Additionally, you may want to consolidate labels that are similar (e.g., 'story' and 'stories'). The consolidation can be done with find and replace.
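
A minimal sketch of that consolidation step, assuming a hand-maintained replacement table (the `REPLACEMENTS` entries and function names are hypothetical and would need to be filled in from the real data):

```python
# Hypothetical mapping from variant spellings to one canonical label.
REPLACEMENTS = {
    'stories': 'story',
}

def consolidate(label):
    """Lowercase, collapse whitespace, then apply the find-and-replace table."""
    key = ' '.join(label.lower().split())
    return REPLACEMENTS.get(key, key)

def consolidate_all(categories):
    """Consolidate every label and drop duplicates, keeping first-seen order."""
    seen = []
    for label in categories:
        canonical = consolidate(label)
        if canonical not in seen:
            seen.append(canonical)
    return seen

cleaned = consolidate_all(['Story', 'stories', ' Christian  life '])
print(cleaned)
```

Running the consolidation before any similarity or classification step means 'story' and 'stories' count as the same label.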
