Classification using texts as features
I want to build a classification model to match customers and products. I have a description of each product, a description of each customer, and a label: customer *i* bought / did not buy product *j*.
Each sample (row) is a (customer, product) pair, so Feature 1 is the customer's description, Feature 2 is the product's description, and the target variable y is 1 if the customer buys the product and 0 otherwise. The goal is to predict, for newly arriving products, whether each customer is going to buy them or not.
I want to use `TfidfVectorizer`, but I don't know at which step I should call `fit_transform` on the descriptions, or how to combine Feature 1 with Feature 2.
- Should I concatenate the descriptions of each (customer, product) pair and call `fit_transform` only once I have the concatenation?
- Should I put the two columns together using `ColumnTransformer`? If so, is the classifier going to fit the resulting features correctly?
- Should I transform both columns using a single shared vocabulary?
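For concreteness, the `ColumnTransformer` option could be sketched roughly as below. This is only a minimal illustration of the mechanics, not a recommendation: the column names (`customer_desc`, `product_desc`, `bought`), the toy data, and the choice of `LogisticRegression` are all assumptions made up for the example. Each text column gets its own `TfidfVectorizer` (so two separate vocabularies), and the two sparse blocks are stacked side by side before reaching the classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (customer, product) pairs, invented purely for illustration.
pairs = pd.DataFrame({
    "customer_desc": [
        "small bakery buying flour and sugar",
        "small bakery buying flour and sugar",
        "bike repair shop needing tyres and chains",
        "bike repair shop needing tyres and chains",
    ],
    "product_desc": [
        "organic wheat flour 25 kg bag",
        "steel bicycle chain 116 links",
        "organic wheat flour 25 kg bag",
        "steel bicycle chain 116 links",
    ],
    "bought": [1, 0, 0, 1],
})

# One vectorizer per column. The column is given as a plain string
# (not a list) so that each vectorizer receives a 1-D series of texts.
features = ColumnTransformer([
    ("customer", TfidfVectorizer(), "customer_desc"),
    ("product", TfidfVectorizer(), "product_desc"),
])

# Fitting the pipeline fits both vectorizers on the training pairs only.
model = make_pipeline(features, LogisticRegression())
X = pairs[["customer_desc", "product_desc"]]
model.fit(X, pairs["bought"])

# The stacked matrix has one block of columns per vocabulary.
fitted = model.named_steps["columntransformer"]
n_customer_terms = len(fitted.named_transformers_["customer"].vocabulary_)
n_product_terms = len(fitted.named_transformers_["product"].vocabulary_)
Xt = fitted.transform(X)

# A newly arriving product is paired with every known customer and only
# transformed (never re-fitted) before prediction.
new_pairs = pd.DataFrame({
    "customer_desc": pairs["customer_desc"].unique(),
    "product_desc": ["rye flour 10 kg bag"] * 2,
})
proba = model.predict_proba(new_pairs)[:, 1]
```

Because the vectorizers live inside the pipeline, `fit_transform` is never called by hand: `model.fit` fits them on the training rows, and `predict_proba` only applies `transform` to new pairs.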
I found here a reference describing three possible ways of working with two columns, but I can't tell which one fits my case.
P.S. So far, I have only built a pairwise similarity coefficient (using this), but that involves no classification, and I know that using labelled data can help. In particular, a similarity measure gives the same weight to every text match, whereas some matches should matter more than others.
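To make the unsupervised baseline concrete, here is a rough sketch of what such a pairwise coefficient might look like. It assumes the coefficient is a cosine similarity between TF-IDF vectors built on one shared vocabulary, which the post itself does not confirm; the example texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example descriptions.
customers = [
    "small bakery buying flour and sugar",
    "bike repair shop needing bicycle chain and tyres",
]
products = [
    "organic wheat flour 25 kg bag",
    "steel bicycle chain 116 links",
]

# A single shared vocabulary, so customer and product vectors
# live in the same feature space and can be compared directly.
vectorizer = TfidfVectorizer().fit(customers + products)
C = vectorizer.transform(customers)
P = vectorizer.transform(products)

# sim[i, j] is the similarity between customer i and product j.
# No purchase labels are used anywhere in this computation.
sim = cosine_similarity(C, P)
```

Nothing here learns from the buy / did not buy labels, which is the limitation described above: the score depends only on which terms the two texts share.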
Topic text-classification tfidf scikit-learn nlp machine-learning
Category Data Science