Dataset and guidance for an occupation/role classification problem

I am working on a project where I need to find similar roles. For example, "Software Engineer", "Soft. Engineer", and "Software Eng" should all be marked as similar.

So far I have used the Standard Occupational Classification dataset and tried LSA, Levenshtein distance, and unsupervised fastText with Word Mover's Distance. The last option works, but not well.

Are there more comprehensive datasets or better ways to solve this problem? Any leads would be helpful!

Topic fasttext word2vec dataset nlp machine-learning

Category Data Science


You can compute text similarity with transformer-based sentence embeddings. These usually capture meaning better than edit distance or bag-of-words methods such as LSA. Try the following code:

pip install sentence-transformers==1.2.1

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pretrained DistilBERT model to produce the embeddings
model = SentenceTransformer('distilbert-base-uncased')

# Job titles to compare
sen = [
    "Software Engineer",
    "Soft. Engineer",
    "Software Eng",
    "Senior Software Engineer",
]

# Encode each title into a fixed-size embedding vector
sen_embeddings = model.encode(sen)

# Cosine similarity between title 0 and each of the remaining titles
similarities = cosine_similarity(
    [sen_embeddings[0]],
    sen_embeddings[1:]
)
print(similarities)

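If the similarity score is greater than about 0.6 (or 0.7), you can treat the two texts as similar. Take that threshold as a starting point and tune it on a few title pairs you have labelled yourself.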
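
To go from pairwise scores to groups of similar roles, one option is to link every pair of titles whose similarity exceeds the threshold and collect the connected titles. The following is only a rough sketch under a few assumptions. The helper names `cosine` and `group_similar` are made up for illustration. The 2-D vectors are toy stand-ins for the real output of `model.encode`. The linking is transitive, so a chain of near matches ends up in one group, which may or may not be what you want.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def group_similar(titles, embeddings, threshold=0.7):
    """Group titles linked by a pairwise cosine similarity above the threshold.

    Hypothetical helper: uses union-find, so groups are transitive.
    """
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if cosine(embeddings[i], embeddings[j]) > threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())

# Toy 2-D vectors stand in for real embeddings (assumption, for illustration only)
titles = ["Software Engineer", "Soft. Engineer", "Software Eng", "Registered Nurse"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
groups = group_similar(titles, toy_embeddings, threshold=0.7)
print(groups)
```

With real data you would pass `sen` and `sen_embeddings` from the code above instead of the toy values. The double loop compares every pair, so for a large list of titles you would probably want a nearest-neighbour index rather than this brute-force version.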
