Text to Text classification

Question

ahc

April 21, 2022, 09:43

I am a newcomer to the field of data science and have been struggling with a simple classification problem. It seems generic enough that I suspect there must be a better way to frame/model it. I would appreciate any help.

Background

In our system, we have millions of tickets (similar to JIRA tickets) where each ticket has attributes like title, description, tags etc.
A user can create a dashboard and add any number of these tickets to their dashboards. Each dashboard has a title and description.
Currently there are ~100k tickets in ~3k dashboards.

Problem Statement

Given a new ticket, I want to suggest which dashboards it can be added to.
Given a new dashboard, I want to suggest which tickets can be added to it.

My Attempts

In my first attempt, I tried multi-class text classification with Doc2Vec and logistic regression.
- I created vectors from ticket titles (using Doc2Vec) and then ran a logistic regression with these vectors as input and dashboard titles as labels.
- However, this approach only reached 2-3% accuracy.
- I think that is because logistic regression with ~3k labels is not a good choice.
In my second attempt, I created two vectors (using Doc2Vec), one for the ticket title and one for the dashboard title, and trained a neural network with the ticket title vector as input and the dashboard title vector as output.
- As before, I only achieved 2% accuracy with this approach.

Question

I would like to know from experts whether I am on the right track with these approaches. If so, should I continue tweaking my model to improve the accuracy?
Or am I on a completely wrong track? If so, are there better ways to model such a classification problem? I am a bit lost and would appreciate any pointers.

Topic multilabel-classification multiclass-classification machine-learning

Category Data Science

Kasra Manshaei · Accepted Answer · April 21, 2022, 09:43

"I think that is because logistic regression with ~3k labels is not a good choice"

You are right, but I would rephrase it slightly:

In general, classification with ~3k labels is not a good choice!

You basically have a Search/Recommendation problem. Given your input, you find the best fitting ticket/dashboard and assign it. It is a very interesting ML project actually!

Here is a starting point I am fairly confident in. If it does not work, please come back with results and I will update the answer:

If you want to go Unsupervised

Query-Document Matching

- Use a simple TF-IDF to vectorise your text.
- Apply dimensionality reduction to turn the high-dimensional sparse vectors into low-dimensional dense vectors. If you use matrix factorisation for this, you are basically doing classic LSA.
- In that vector space, find the label closest to your query and assign it to the query.
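The three steps above can be sketched with scikit-learn. This is only a minimal illustration, not a tuned solution: the dashboard texts, the `suggest_dashboards` helper and `n_components=3` are assumptions made up for this toy example, and cosine similarity is one reasonable choice of "closest".

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Hypothetical dashboard texts (title plus description); replace with real data.
dashboards = [
    "Mobile app crashes on login",
    "Mobile app startup crashes and freezes",
    "Payment failures and billing errors",
    "Billing invoices and payment refunds",
    "Search ranking quality",
    "Search relevance and ranking bugs",
]

# TF-IDF gives sparse vectors; TruncatedSVD projects them into a small dense
# space, which together is LSA. n_components=3 only suits this toy corpus;
# on a real corpus a few hundred components is a more usual starting point.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=3, random_state=0),
)
dashboard_vectors = lsa.fit_transform(dashboards)


def suggest_dashboards(ticket_text, top_k=2):
    """Return the top_k (dashboard text, cosine similarity) pairs for a ticket."""
    ticket_vector = lsa.transform([ticket_text])
    scores = cosine_similarity(ticket_vector, dashboard_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [(dashboards[i], float(scores[i])) for i in ranked]


suggestions = suggest_dashboards("App crashes right after login on Android")
for text, score in suggestions:
    print(f"{score:.3f}  {text}")
```

The same fitted pipeline also covers the other direction: vectorise the existing tickets with `lsa.transform` and rank them against a new dashboard's vector.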

Topic Modeling

- Apply a simple LDA to model the topics in the corpus.
- Given a query, find its best-matching topic and assign the query to that topic (cluster).
- Please note that LDA finds intrinsic topics. If your labels differ from the topics it finds, you need to rely on the labels and ignore this solution.
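As a rough sketch of the topic-modeling route, again with scikit-learn: the ticket titles, the `best_topic` helper and the choice of three topics are assumptions for illustration only, and the topics LDA finds on a corpus this small are not meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical ticket titles; replace with real titles and descriptions.
tickets = [
    "App crashes on login",
    "App freezes at startup on mobile",
    "Invoice total is wrong after refund",
    "Refund payment not processed",
    "Search results ranking looks random",
    "Search ignores exact title matches",
]

# LDA works on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tickets)

# n_topics is a guess for this toy corpus and needs tuning on real data.
n_topics = 3
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
ticket_topics = lda.fit_transform(counts)  # one topic distribution per ticket


def best_topic(text):
    """Return (index of the most probable topic, full topic distribution)."""
    distribution = lda.transform(vectorizer.transform([text]))[0]
    return int(distribution.argmax()), distribution


topic, distribution = best_topic("Payment refund is missing")
# Existing tickets whose dominant topic matches the query's topic.
neighbours = [t for t, d in zip(tickets, ticket_topics) if d.argmax() == topic]
print(topic, distribution.round(3), neighbours)
```

Comparing the dominant topics of tickets that already share a dashboard is a quick way to check whether the discovered topics line up with the dashboards at all.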

A little bit more Supervised

- Create a dataset from your corpus (or maybe you already have one) in which sentence pairs (titles, descriptions, etc.) that belong to the same topic/label have label $1$, sentence pairs that belong to different topics/classes/labels have label $-1$, and sentence pairs with a neutral relation have label $0$. I put an example as a PS at the end.
- Feed this data to S-Bert to fine-tune the pre-trained model.
- Read this, learn it and use it to find the ticket/dashboard most similar to the query.

PS: What the data for S-Bert looks like (I just made up some dummy examples! Hope you get the idea)

sentence1: He is a man
sentence2: He is male
label: 1

sentence1: programming is hard
sentence2: Maradona was a magician
label: -1

sentence1: don't know what to write here
sentence2: never mind, I think you got what I mean
label: 0
.
.
.
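For the ticket/dashboard setting, pairs like these could be derived from the existing dashboard memberships. The sketch below is one possible way to do it, under assumptions: the `dashboard_tickets` mapping and the `build_pairs` helper are hypothetical, tickets sharing a dashboard are treated as label $1$, tickets with no dashboard in common as label $-1$, and no neutral ($0$) pairs are generated.

```python
import random
from itertools import combinations

# Hypothetical mapping: dashboard title -> titles of the tickets it contains.
dashboard_tickets = {
    "Mobile stability": [
        "App crashes on login",
        "App freezes at startup",
        "Crash when rotating the screen",
    ],
    "Billing": [
        "Invoice total is wrong",
        "Refund not processed",
    ],
}


def build_pairs(dashboard_tickets, negatives_per_positive=1, seed=0):
    """Build (sentence1, sentence2, label) triples from dashboard membership."""
    rng = random.Random(seed)

    # A ticket may sit in several dashboards, so record all of them.
    membership = {}
    for name, tickets in dashboard_tickets.items():
        for ticket in tickets:
            membership.setdefault(ticket, set()).add(name)
    all_tickets = sorted(membership)

    pairs = []
    for tickets in dashboard_tickets.values():
        for a, b in combinations(tickets, 2):
            pairs.append((a, b, 1))  # same dashboard: positive pair
            # Negatives: tickets that share no dashboard at all with `a`.
            candidates = [t for t in all_tickets if not membership[t] & membership[a]]
            for _ in range(negatives_per_positive):
                if candidates:
                    pairs.append((a, rng.choice(candidates), -1))
    return pairs


pairs = build_pairs(dashboard_tickets)
for sentence1, sentence2, label in pairs:
    print(label, "|", sentence1, "|", sentence2)
```

Ticket/dashboard-title pairs can be built the same way, which matches the actual suggestion task more directly than ticket/ticket pairs.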
