Multi-label classification with nested features

I need to perform multi-label classification. I have three features, and they are nested. I am unsure how to combine them or which classification algorithm would suit this best. A multi-level neural network, as shown here, seems promising, but it does not appear to take the nested features into account.

I present the nested features (X) and labels (Y) in the two datasets below. One subject ID can have one or more features and one or more classes, and each feature or class can be shared by one or more subjects.

Note: I have about 100k subjects, 1k features (at the third level) and 200 classes.

data_features
       subject_id   feature1   feature2   feature3
               1          a         aa          aaa
               2          a         aa          aab
               3          a         ab          aba
               1          a         ab          abb
               2          b         ba          baa
               3          b         ba          bac
               1          b         ba          bad
               2          b         ba          bad
               3          c         ca          caa
               4          c         ca          caa
               5          c         cb          cba
               6          c         cb          cbb
  


data_labels
       subject_id   label1   label2   label3   label4
               1        0        1        0        0
               2        0        1        1        1
               3        0        1        1        0
               4        1        1        0        1
               5        1        0        0        0
               6        0        1        1        1
               7        0        0        0        1
               8        1        1        1        1
               9        0        0        1        1
              10        1        0        1        0
              11        0        1        0        1
              12        1        0        0        1

Which algorithm would combine these best? (I am skilled in R and SAS and decent in Python, but I will learn any other language that is needed.)

Topic multilabel-classification neural-network

Category Data Science


Based on the comments, it looks like the nested features need not stay nested and can be broken down into individual features.

E.g. if:

feature1 = brand
feature2 = brand + model
feature3 = brand + model + version

Then we should use the standard ML approach with features as:

feature1 = brand
feature2 = model
feature3 = version

A model that can learn feature interactions will still capture the association among the levels.
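To make the split concrete, here is a minimal sketch in Python. It assumes the nested codes are literal prefix concatenations, as in the toy data from the question (`a`, `aa`, `aaa`); if the real codes are built differently, the slicing would need to change. The helper name `split_nested` is hypothetical.

```python
def split_nested(feature1, feature2, feature3):
    """Strip the parent prefix from each nested code so each level stands alone.

    Assumption: feature2 starts with feature1 and feature3 starts with feature2.
    """
    if not feature2.startswith(feature1) or not feature3.startswith(feature2):
        raise ValueError("codes are not nested as prefix + suffix")
    return feature1, feature2[len(feature1):], feature3[len(feature2):]


# A few rows from data_features: (feature1, feature2, feature3)
rows = [("a", "aa", "aaa"), ("a", "ab", "abb"), ("b", "ba", "bad")]

flat = [split_nested(*row) for row in rows]
print(flat)  # [('a', 'a', 'a'), ('a', 'b', 'b'), ('b', 'a', 'd')]
```

Note that after the split, the same suffix can appear under different parents (the second-level value `a` occurs under both `a` and `b` here), so the model has to combine the levels to tell them apart.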


I think it's always a good idea to start simple, so I'd suggest simply trying all the features, including the different levels of "nesting", which apparently comes to around 2k features. Given that the dataset is large, I don't see any obstacle to trying it this way. For the same reason I would start with a very simple model such as a decision tree or a linear SVM, which have the additional advantage of being fast to train. As a first step, this should at least give you a decent baseline.
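As a rough sketch of that baseline, assuming scikit-learn is available: collect every code a subject has at all three levels, turn that into one multi-hot row per subject, and fit a decision tree on the full label matrix. The variable names and the `f1=`/`f2=`/`f3=` prefixes are my own choices for illustration, and only subjects 1 to 6 are used because they are the ones that appear in both toy tables.

```python
from collections import defaultdict

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Long-format rows from data_features: (subject_id, feature1, feature2, feature3)
feature_rows = [
    (1, "a", "aa", "aaa"), (2, "a", "aa", "aab"), (3, "a", "ab", "aba"),
    (1, "a", "ab", "abb"), (2, "b", "ba", "baa"), (3, "b", "ba", "bac"),
    (1, "b", "ba", "bad"), (2, "b", "ba", "bad"), (3, "c", "ca", "caa"),
    (4, "c", "ca", "caa"), (5, "c", "cb", "cba"), (6, "c", "cb", "cbb"),
]

# Rows from data_labels for the same subjects: [label1, label2, label3, label4]
labels = {
    1: [0, 1, 0, 0], 2: [0, 1, 1, 1], 3: [0, 1, 1, 0],
    4: [1, 1, 0, 1], 5: [1, 0, 0, 0], 6: [0, 1, 1, 1],
}

# Gather every code of every level per subject, tagged with its level
codes = defaultdict(set)
for sid, f1, f2, f3 in feature_rows:
    codes[sid].update({f"f1={f1}", f"f2={f2}", f"f3={f3}"})

subject_ids = sorted(codes)

# One row per subject, one 0/1 column per distinct code
binarizer = MultiLabelBinarizer()
X = binarizer.fit_transform([codes[sid] for sid in subject_ids])
Y = [labels[sid] for sid in subject_ids]

# DecisionTreeClassifier accepts a 2D label matrix, so one tree predicts all labels
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)
pred = clf.predict(X)

print(X.shape)     # (subjects, distinct codes across the three levels)
print(pred.shape)  # (subjects, labels)
```

On the real data you would hold out a test set and score with a multi-label metric rather than predicting on the training rows as done here.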

If the number of features turns out to be an issue for a more advanced model, I think this is a good case for feature extraction (for example PCA): it would reduce the number of features and also merge features that represent the same information.
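Here is a small illustration of that merging effect, again assuming scikit-learn and NumPy. The matrix is synthetic, not the asker's data: ten random indicator columns are repeated three times to mimic redundant indicators, and PCA recovers essentially all of the variance with ten components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical multi-hot matrix: 200 subjects, 10 independent indicator columns
base = rng.integers(0, 2, size=(200, 10))

# Repeating the columns three times mimics features that carry the same information
X = np.hstack([base, base, base]).astype(float)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: the copies add nothing
```

For a large, sparse indicator matrix, `sklearn.decomposition.TruncatedSVD` is a commonly used alternative because it works on sparse input without centering it.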
