How to generate a rule-based system from binary data?
I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen it yet, but we might in the future. We have around $1000$ binary variables and around $200$k samples.
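For concreteness, here is a synthetic stand-in with the same encoding (the shapes, class labels, and feature density are made up for illustration, not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Smaller than the real 200k x 1000 matrix, just for the example.
n_samples, n_features = 10_000, 1000
# Binary feature matrix: 1 = feature was seen, 0 = not seen (yet).
X = (rng.random((n_samples, n_features)) < 0.05).astype(np.uint8)
# Categorical, highly imbalanced target.
y = rng.choice(["y_1", "y_2", "y_3"], size=n_samples, p=[0.90, 0.07, 0.03])
```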
The target variable, $y$ is categorical.
What I'd like to do is find subsets of variables that predict some class $y_k$ with high precision.
For example, we could find the following rule:
$$\{ v_1: 1, v_8: 1, v_{12}: 0 \} \mapsto y_2$$
In words: if $v_1$ and $v_8$ were seen but $v_{12}$ was not, then predict $y_2$.
I think precision is more important than recall. That is, it matters more to me to avoid misclassifications than to achieve high recall (per rule).
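To make the per-rule metrics concrete, this is how I would score a candidate rule against the data (a sketch using the synthetic `X`, `y` above; `rule_mask` and `rule_precision_recall` are hypothetical helpers of mine, not library functions):

```python
import numpy as np

def rule_mask(X, conditions):
    """Boolean mask of the samples matched by a rule.

    conditions: dict mapping column index -> required value (0 or 1),
    e.g. {0: 1, 7: 1, 11: 0} for {v_1: 1, v_8: 1, v_12: 0}.
    """
    mask = np.ones(X.shape[0], dtype=bool)
    for col, value in conditions.items():
        mask &= X[:, col] == value
    return mask

def rule_precision_recall(X, y, conditions, target):
    """Precision and recall of 'conditions -> target' on (X, y)."""
    fired = rule_mask(X, conditions)
    is_target = y == target
    hits = np.count_nonzero(fired & is_target)
    precision = hits / max(np.count_nonzero(fired), 1)
    recall = hits / max(np.count_nonzero(is_target), 1)
    return precision, recall

p, r = rule_precision_recall(X, y, {0: 1, 7: 1, 11: 0}, "y_2")
```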
What I have tried:
- Logistic regression: with this model I was able to rank the features (via `clf.coef_` in scikit-learn), but it remains unclear which subset of variables to pick for a rule.
- Decision tree: the idea was to train a tree and collect all root-to-leaf paths, each of which can be interpreted as a rule (see the sketch after this list). The data is highly imbalanced, and even though I tried different configurations (including `class_weight='balanced'`), most of the rules consisted mainly of absent features ($v_j = 0$) with only a few present features ($v_j = 1$). Many of them also suffered from low precision or very low recall.
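This is roughly how I extract the paths from a fitted tree (a sketch using scikit-learn's `tree_` internals on the synthetic data above; `extract_rules` is my own helper, not a library function):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Assumes X, y as in the synthetic example above.
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
clf.fit(X, y)

def extract_rules(tree, class_names):
    """Collect every root-to-leaf path as (conditions, predicted class).

    conditions: dict mapping column index -> required value (0 or 1).
    For binary features the split threshold is 0.5, so the left child
    means the feature is absent and the right child means it is present.
    """
    t = tree.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:  # leaf node
            label = class_names[np.argmax(t.value[node])]
            rules.append((dict(conditions), label))
            return
        f = t.feature[node]
        walk(t.children_left[node], {**conditions, f: 0})   # feature absent
        walk(t.children_right[node], {**conditions, f: 1})  # feature present

    walk(0, {})
    return rules

rules = extract_rules(clf, clf.classes_)
```

Each extracted rule can then be scored with `rule_precision_recall` above to filter for high-precision rules.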
What do you think about my current approaches?
What would you do instead?
Thanks!