Which ML method for multiclass (non-binary) text classification should I choose (from SparkML)?

Question

user83701

March 6, 2022 09:05

I am working with a fairly large dataset that will be processed on a cluster, which is why I am using PySpark.

A sample of the records looks like this:

+----------+------------+--------------------+--------------------+--------------------+
|         0|  07/29/2013|       Consumer Loan|        Vehicle loan|Managing the loan...|
|         1|  07/29/2013|Bank account or s...|    Checking account|Using a debit or ...|
|         2|  07/29/2013|Bank account or s...|    Checking account|Account opening, ...

After some preprocessing and data cleansing, I would like to build and train a model that classifies the issues (the Issue column) into categories that are not yet known. I have read some articles about TF-IDF, but I am not sure whether it is suitable for this case.

Topic pyspark classification machine-learning

Category Data Science

Uday · Accepted Answer · October 14, 2019 03:16

If you want to categorise your text using machine learning techniques, you first have to extract fixed-length features from the text in order to train any ML model. You can do that with bag of words, TF-IDF, or averaged word vectors. If you are using deep learning models, you can feed word vectors to an LSTM or a CNN instead.
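To make "fixed-length features" concrete, here is a minimal plain-Python sketch of TF-IDF, not a Spark implementation. The three issue texts are made-up examples, the helper names (`tokenize`, `fit_tfidf`, `transform`) are hypothetical, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real data would also need punctuation handling.
    return text.lower().split()

def fit_tfidf(docs):
    # Learn a vocabulary (token -> column index) and one IDF weight per token.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption of this sketch): log((N + 1) / (df + 1)).
    idf = [math.log((n_docs + 1) / (doc_freq[tok] + 1)) for tok in vocab]
    return index, idf

def transform(doc, index, idf):
    # Every document maps to a vector of the same length, whatever its word count.
    vec = [0.0] * len(index)
    for tok, count in Counter(doc).items():
        if tok in index:  # tokens unseen during fitting are ignored
            vec[index[tok]] = count * idf[index[tok]]
    return vec

# Hypothetical issue texts, for illustration only.
issues = [
    "managing the loan or lease",
    "using a debit or atm card",
    "account opening closing or management",
]
docs = [tokenize(text) for text in issues]
index, idf = fit_tfidf(docs)
vectors = [transform(doc, index, idf) for doc in docs]
```

Each entry of `vectors` has one slot per vocabulary word, so texts of any length become rows of equal width that a classifier can consume. In Spark ML the same steps are covered by `Tokenizer`, `HashingTF` or `CountVectorizer`, and `IDF` from `pyspark.ml.feature`, and the resulting features can be passed to a multiclass-capable classifier such as `LogisticRegression` or `NaiveBayes` from `pyspark.ml.classification`.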
