How to generate a rule-based system based on binary data?

I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we have seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen it yet, but we might in the future. There are around $1000$ binary variables and around $200$k samples.

The target variable, $y$ is categorical.

What I'd like to do is find subsets of variables that predict some class $y_k$ with high precision.

For example, we could find the following rule:

$$\{ v_1: 1, v_8: 1, v_{12}: 0 \} \mapsto y_2$$

In words: if $v_1$ and $v_8$ were seen but $v_{12}$ was not, then predict $y_2$.

I think precision is more important than recall. That is, it matters more to me that a rule avoids misclassifications than that it has high recall (per rule).

What I have tried:

  1. Logistic regression: with this model I was able to rank the features (via `clf.coef_` in scikit-learn), but it is still unclear how to turn that ranking into a subset of rules.
  2. Decision tree: the idea was to train a decision tree and then collect all the paths to the leaves; each path can be interpreted as a rule. The data is highly imbalanced, and even though I tried different configurations (including `class_weight='balanced'`), most of the rules consisted of many "not seen" features and few "seen" features. Also, many of them suffered from low precision or very low recall.
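For context, the coefficient-ranking step from the first attempt looks roughly like this (the data and the AND-style target below are hypothetical stand-ins for the real $X$ and $y$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8))          # hypothetical binary features
y = (X[:, 1] & X[:, 4]).astype(int)            # hypothetical target: v1 AND v4

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by absolute coefficient magnitude for the positive class.
ranking = np.argsort(-np.abs(clf.coef_[0]))
print("top features:", ranking[:3])
```

As the question notes, this gives a per-feature ordering but no principled way to cut it into conjunctive rules.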
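The path-collection idea from the second attempt can be sketched against scikit-learn's tree internals; the toy `X`, `y` here are hypothetical stand-ins for the real data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real X and y.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([2, 2, 0, 1, 1, 0])

clf = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)

def extract_rules(tree, node=0, conditions=()):
    """Walk the fitted tree, yielding (conditions, class counts) per leaf.

    Counts are weighted when class_weight is set.
    """
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaf node
        yield conditions, tree.value[node][0]
        return
    f = tree.feature[node]
    # With 0/1 features the split threshold is 0.5,
    # so the left branch means "feature f == 0".
    yield from extract_rules(tree, left, conditions + ((f, 0),))
    yield from extract_rules(tree, right, conditions + ((f, 1),))

for conds, counts in extract_rules(clf.tree_):
    print({f"v{f}": val for f, val in conds}, "->", counts)
```

Each printed dict is a rule in the same $\{v_1: 1, \dots\} \mapsto y_k$ form as above.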

What do you think about my current approaches?

What would you do instead?

Thanks!

Topic decision-trees logistic-regression classification statistics machine-learning

Category Data Science


Decision trees should be able to give you good results, but you might need to increase the depth in order to reach very specific rules involving many variables. Each leaf of the tree represents the subset of the data that satisfies the path to it, together with the proportion of instances in that subset carrying each label. The leaves with a high proportion (probability) for one label therefore represent the most reliable rules, and by relying only on those leaves you obtain high precision.
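A minimal sketch of this leaf-filtering step, measuring per-leaf precision on held-out data; the synthetic `X`, `y` and the thresholds (precision ≥ 0.95, support ≥ 20) are hypothetical choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 10))        # hypothetical binary features
# Hypothetical target that depends on a few features.
y = np.where(X[:, 0] & X[:, 1], 2, np.where(X[:, 2], 1, 0))

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_tr, y_tr)

# Map validation samples to their leaves and keep only leaves whose
# majority class reaches the precision threshold on held-out data.
leaf_ids = clf.apply(X_va)
reliable = {}
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    labels, counts = np.unique(y_va[mask], return_counts=True)
    best = counts.argmax()
    precision = counts[best] / mask.sum()
    if precision >= 0.95 and mask.sum() >= 20:
        reliable[leaf] = (labels[best], precision, int(mask.sum()))

print(len(reliable), "high-precision leaves")
```

Each surviving leaf, combined with its root-to-leaf path, is a rule you can trust at the chosen precision level; samples falling in other leaves are simply left unclassified.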

However, note that a deep tree carries a risk of overfitting: if the training data contains patterns which don't appear in the test data, performance on the test set may be lower.

Note: I'm not very familiar with association rule mining, but it might be relevant here.
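In that spirit, a brute-force sketch of mining rules directly: enumerate short conjunctions of (feature, value) conditions and keep those with enough support and confidence (confidence is exactly the per-rule precision). The data and both thresholds below are hypothetical:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 6))          # hypothetical binary features
y = np.where(X[:, 0] & (1 - X[:, 3]), 1, 0)    # hypothetical target

min_support, min_confidence = 0.05, 0.95       # hypothetical thresholds
n = len(X)
rules = []
# All single (feature, value) conditions, then pairs of them.
conditions = [(j, v) for j in range(X.shape[1]) for v in (0, 1)]
for size in (1, 2):
    for conds in combinations(conditions, size):
        if len({j for j, _ in conds}) < size:  # skip contradictory pairs
            continue
        mask = np.ones(n, dtype=bool)
        for j, v in conds:
            mask &= X[:, j] == v
        support = mask.sum() / n
        if support < min_support:
            continue
        labels, counts = np.unique(y[mask], return_counts=True)
        best = counts.argmax()
        confidence = counts[best] / mask.sum()
        if confidence >= min_confidence:
            rules.append((dict(conds), labels[best], confidence, support))

print(len(rules), "rules found")
```

For $1000$ variables you would not enumerate naively like this; an Apriori- or FP-growth-style miner prunes the search by support, but the support/confidence bookkeeping is the same.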
