Classification using texts as features

Question

sgduran91

May 14, 2022 17:04

I want to build a classification model to match customers and products. I have a description of each product, a description of each customer, and a label: customer *i* bought / did not buy product *j*.

Each sample/row is a pair (customer, product), so Feature 1 is the customer's description, Feature 2 is the product's description, and the target variable y is 1 if the customer buys the product and 0 otherwise. The goal is to predict, for newly arriving products, whether each customer is going to buy them or not.

I want to use the Tf-Idf Vectorizer. I don't know at which step I should fit_transform the descriptions, or how to combine Feature 1 with Feature 2.

Should I concatenate the descriptions of each pair (customer, product) and fit_transform only once I have the concatenation?
Should I put the 2 columns together using ColumnTransformer? If so, will the classifier fit the resulting features correctly?
Should I transform both columns using a single shared vocabulary?

I found a reference describing three possible ways of working with two columns, but I can't tell which one fits my case.

Ps. Until now, I only got to build a similarity pairwise coefficient (using this), but there is no classification, and I know using labelled data can help. In particular, similarity measure gives the same weight to any text coincidence, but some coincidences should be more important than others.

Topic text-classification tfidf scikit-learn nlp machine-learning

Category Data Science

Erwan · Accepted Answer · March 15, 2021 16:59

A frequency or TFIDF representation works as features when the target depends directly on specific words. For example, in spam classification, words like "cheap, free, viagra, exclusive..." are direct indicators of the target label.

In your case the target doesn't depend directly on specific words; it depends on whether the same words appear in both the customer and product descriptions. This is an indirect relationship, and most regular ML algorithms cannot really deal with that. So your design is unlikely to work, in my opinion.
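For reference, here is a minimal sketch of the two-column design being discussed, assuming scikit-learn and pandas. The toy descriptions, the column names and the choice of LogisticRegression are all made up for illustration. It runs fine mechanically, but note what the classifier receives: one block of weights for customer words and a separate block for product words, with nothing that expresses "this word occurs in both descriptions".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (customer, product) pairs, invented for illustration only.
pairs = pd.DataFrame({
    "customer": [
        "enjoys hiking and camping in the mountains",
        "enjoys hiking and camping in the mountains",
        "professional photographer looking for camera gear",
        "professional photographer looking for camera gear",
    ],
    "product": [
        "lightweight camping tent for mountains",
        "mirrorless camera with zoom lens",
        "mirrorless camera with zoom lens",
        "lightweight camping tent for mountains",
    ],
})
y = [1, 0, 1, 0]

# One TF-IDF vectorizer per text column. The column is given as a plain
# string (not a list) so each vectorizer receives a 1-D series of documents.
features = ColumnTransformer([
    ("customer_tfidf", TfidfVectorizer(), "customer"),
    ("product_tfidf", TfidfVectorizer(), "product"),
])
model = make_pipeline(features, LogisticRegression())
model.fit(pairs, y)

# The two TF-IDF blocks are simply stacked side by side.
fitted = model[0]
X = fitted.transform(pairs)
n_customer_terms = len(fitted.named_transformers_["customer_tfidf"].vocabulary_)
n_product_terms = len(fitted.named_transformers_["product_tfidf"].vocabulary_)
print(X.shape, n_customer_terms, n_product_terms)
```

With a linear classifier on top, the customer word "camping" and the product word "camping" are two unrelated columns, which is the indirect-relationship problem described above.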

"Until now, I only got to build a similarity pairwise coefficient (using this), but there is no classification, and I know using labelled data can help. In particular, similarity measure gives the same weight to any text coincidence, but some coincidences should be more important than others."

This makes more sense for your purpose: use only the similarity score as a feature and train a model to predict the label. Technically the model will only learn the optimal threshold that separates the labels, so a simple linear model is enough (linear regression with a cutoff, for instance, or logistic regression). You could improve this method by computing several different similarity measures and providing all of them as features.
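A minimal sketch of the similarity-as-feature idea, assuming scikit-learn. The toy data is invented, the Jaccard token overlap is just one arbitrary example of a second similarity measure, and LogisticRegression is swapped in as the simple linear classifier; none of these choices come from the thread itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy (customer, product) pairs, invented for illustration only.
customers = [
    "enjoys hiking and camping in the mountains",
    "enjoys hiking and camping in the mountains",
    "professional photographer looking for camera gear",
    "professional photographer looking for camera gear",
]
products = [
    "lightweight camping tent for mountains",
    "mirrorless camera with zoom lens",
    "mirrorless camera with zoom lens",
    "lightweight camping tent for mountains",
]
y = np.array([1, 0, 1, 0])

# Fit ONE vocabulary on all texts so both sides share a vector space.
vectorizer = TfidfVectorizer()
vectorizer.fit(customers + products)
C = vectorizer.transform(customers)
P = vectorizer.transform(products)

# TF-IDF rows are L2-normalised by default, so the row-wise dot product
# of a customer vector and a product vector is their cosine similarity.
cosine = np.asarray(C.multiply(P).sum(axis=1)).ravel()

# A second, hypothetical similarity measure: Jaccard overlap of token sets.
analyzer = vectorizer.build_analyzer()

def jaccard(a, b):
    tokens_a, tokens_b = set(analyzer(a)), set(analyzer(b))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

jac = np.array([jaccard(c, p) for c, p in zip(customers, products)])

# Each pair is now described by a handful of similarity scores.
X = np.column_stack([cosine, jac])
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print(X.round(3))
```

For a new product, the already fitted vectorizer would transform its description, the same similarity scores would be computed against each customer, and clf.predict_proba would be applied to those scores.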

Note: if you use TFIDF vectors for measuring similarity, the measure doesn't give the same weight to every word. However, don't expect perfect results: a lot depends on the data itself. Are you sure that the customer description gives useful indications about the products they're interested in? For example, if a customer description contains the word "computer", it doesn't mean they're interested in every possible type of computer.
