Comparing books by their category lists using NLP
I have a database of books. Each book has a list of categories describing its genre/topics (I work in Python). Most categories are 1 to 3 words long.
Examples of book category lists:
['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'],
["Children's stories", 'Christian life'],
['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'],
['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', 'Cruelty']
I want to create or use an algorithm that compares two books and computes a similarity score from their category lists, using NLP/machine learning models.
The categories are not well defined and tend to vary, because the system uses an open text box rather than a fixed set of saved categories. For example, one book might have the category 'story' while another has 'stories'.
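To illustrate the normalization problem, here is a minimal sketch that collapses trivial variants like 'Stories' vs 'story'. The suffix-stripping rules are a crude stand-in for a real lemmatizer (e.g. nltk's WordNetLemmatizer), used here only to keep the example self-contained:

```python
def normalize(category: str) -> str:
    """Lowercase a free-text category and crudely singularize each word.

    The suffix rules below are a rough stand-in for a real lemmatizer
    such as nltk's WordNetLemmatizer.
    """
    out = []
    for w in category.lower().split():
        if w.endswith("ies"):
            w = w[:-3] + "y"      # 'stories' -> 'story'
        elif w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]            # 'mills' -> 'mill'
        out.append(w)
    return " ".join(out)

print(normalize("Stories"))      # story
print(normalize("Flour mills"))  # flour mill
```

After this kind of normalization, 'story' and 'stories' map to the same string and can be compared directly.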
So far I have tried two algorithms:
- Cosine similarity with WordNet (via the nltk package): split each category into a bag of words and check whether each word has a synonym in the other book's category list.
- The similarity method of spaCy's nlp model (Python), which is based on vector distance.
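For concreteness, the cosine-similarity part of the first approach can be sketched in pure Python, assuming exact word overlap instead of the WordNet synonym check (my actual code relaxes the match using WordNet synonyms):

```python
import math
from collections import Counter

def bag_of_words(categories):
    # Split every category into lowercase words and count them.
    return Counter(w for c in categories for w in c.lower().split())

def cosine_similarity(cats_a, cats_b):
    a, b = bag_of_words(cats_a), bag_of_words(cats_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

book1 = ['Children', 'Brothers and sisters', 'Conduct of life',
         'Cheerfulness', 'Christian life']
book2 = ["Children's stories", 'Christian life']
print(cosine_similarity(book1, book2))
```

The weakness this exposes is that multi-word categories are torn apart into individual words, so 'Conduct of life' and 'Christian life' overlap on 'life' even though they mean different things.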
I had problems with both: whenever a category contains 2 or 3 words, the results were not accurate, and each approach also had its own specific issues.
Which algorithms/Python models can handle strings of 2 or 3 words, so that I can use them to compare the books?
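To show the kind of phrase-level comparison I am after, here is a simple baseline that treats each 2-to-3-word category as a single unit, using only the standard library's difflib (character-level matching, not a semantic model):

```python
from difflib import SequenceMatcher

def category_similarity(a: str, b: str) -> float:
    # Compare whole categories as strings, so multi-word phrases
    # are handled as units rather than word by word.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def book_similarity(cats_a, cats_b):
    # For each category in book A, take its best match in book B,
    # then average the best-match scores.
    scores = [max(category_similarity(a, b) for b in cats_b)
              for a in cats_a]
    return sum(scores) / len(scores)

print(book_similarity(['Children', 'Conduct of life'],
                      ["Children's stories", 'Conduct of life']))
```

This handles surface variants like 'story'/'stories', but it is purely character-based; I am looking for something that also captures semantic similarity between such short phrases.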
By the way, this is the first time I am asking a question here. If you need more details about the database or about what I have done so far, please tell me.
Topic spacy nltk nlp python machine-learning
Category Data Science