Find optimal feature combinations and ordering for a multi-class classification problem
We have a multi-class classification problem where the training data looks as follows:
name | A | B | C | brand |
---|---|---|---|---|
Snickers Ltd | company | huge | sales | Snickers |
Acme Intl | office stationary | commercial | | Acme |
Davidoff cigars | big | | | Davidoff |
Max Car Company | car repair | small | garage | MaxAuto |
As can be seen, we have one free-text feature column (name) and several categorical feature columns that may be empty. The brand column is the target to be predicted. Each categorical feature has a large number (1000+) of possible values. The above is only a sample; we have several more categorical features.
Our domain experts inform us that brand could be predicted from various feature combinations, e.g.:
(name, A) -> brand or
(name, C) -> brand or
(A, B, C) -> brand
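To make concrete what we mean by a sequence of rules, here is a minimal sketch. It assumes each rule is a majority-vote lookup table over one feature combination, and that rules are tried in order until one matches. The function names (`fit_rule`, `score_rule`, `order_rules`, `predict`) are hypothetical, and ordering by held-out precision (ties broken by coverage) is only one possible criterion that we picked for illustration, not something our domain experts prescribed.

```python
from collections import Counter, defaultdict


def fit_rule(rows, cols):
    """Map each observed value tuple of `cols` to its most frequent brand."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row.get(c) for c in cols)
        if all(key):  # skip rows where any feature in the combination is empty
            votes[key][row["brand"]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def score_rule(table, rows, cols):
    """Return (precision, coverage) of one lookup table on held-out rows."""
    hits = fired = 0
    for row in rows:
        key = tuple(row.get(c) for c in cols)
        if key in table:
            fired += 1
            hits += table[key] == row["brand"]
    precision = hits / fired if fired else 0.0
    coverage = fired / len(rows) if rows else 0.0
    return precision, coverage


def order_rules(train, held_out, combos):
    """Fit one table per combination; sort by held-out precision, then coverage."""
    scored = []
    for cols in combos:
        table = fit_rule(train, cols)
        precision, coverage = score_rule(table, held_out, cols)
        scored.append((precision, coverage, cols, table))
    # Sort on the two scores only, so the lookup tables are never compared.
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [(cols, table) for _, _, cols, table in scored]


def predict(rules, row):
    """Apply rules in order; the first combination with a known key wins."""
    for cols, table in rules:
        key = tuple(row.get(c) for c in cols)
        if key in table:
            return table[key]
    return None  # no rule matched
```

Here `rows` is assumed to be a list of dicts keyed by column name, with empty cells stored as `None` or `""`, and `combos` would be a list such as `[("name", "A"), ("name", "C"), ("A", "B", "C")]`. What we do not know is whether a greedy ordering like this is reasonable, or whether the candidate combinations themselves can be discovered rather than hand-picked.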
We have a well-known list of about 2500 brands that we are interested in. Our training data is comparatively small, with only 200k records. So far we have had poor results with a Random Forest approach, and we are open to rule-based classification as well.
Is it possible to algorithmically determine the best sequence of rules to predict the target?
Topic multiclass-classification association-rules random-forest
Category Data Science