How do I combine predictions from classifiers for two different problems?

I am working on a classification problem for predicting whether the shipment is going to be late or not.

I would say the classifier is mediocre at predicting the positive class at the moment. But the ambition is to improve it.

However, after some analysis, I found that one important component (customs) appears as the cause of shipment delay in the majority of false negatives (FN).

Currently, I don't have a feature directly associated with customs that could be used in the model. Moreover, I think the customs process may vary depending on the product being shipped.

My original problem was at the shipment level, so I had to exclude the products in the shipment. Now I want to include them. It is a many-to-many relation: a shipment can contain multiple products and vice versa.

Following is my thought:

Add a separate predictor, alongside the original one, that predicts whether a product in the shipment is going to be late or not based on the planned days for customs.

This is where I am struggling. If this is the right approach, how do I consolidate the predictions of both models into a single prediction of Late or Not Late?

In addition, I would like to know whether there is another way to tackle this.

Topic ensemble feature-engineering feature-construction class-imbalance classification

Category Data Science

My intuition would be to integrate the product information directly into the original model. Typically, the possible products in a shipment can be represented as boolean features (one-hot encoding), but this might need some feature engineering if there are too many different products:

  • simple option: only a small set of features representing types of products (I'm assuming that it's not the specific product which causes customs delays, it's the type of product)
  • advanced option: feature selection/extraction to reduce the number of features
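
The simple option above can be sketched in a few lines. This is only an illustration: the link table, the product IDs, and the product-to-type mapping are made-up placeholders, and `multi_hot_by_type` is a hypothetical helper, not part of any library. It collapses the many-to-many shipment/product relation into one row of boolean features per shipment, which could then be appended to the existing shipment-level features.

```python
from collections import defaultdict

# Hypothetical many-to-many link table: (shipment_id, product_id)
shipment_products = [
    ("S1", "P1"), ("S1", "P2"),
    ("S2", "P2"),
    ("S3", "P3"), ("S3", "P1"),
]

# Hypothetical mapping from product to a coarse product type
product_type = {"P1": "electronics", "P2": "textiles", "P3": "chemicals"}


def multi_hot_by_type(links, type_of):
    """Return (column names, {shipment_id: 0/1 vector}) with one column per product type."""
    types = sorted(set(type_of.values()))
    index = {t: i for i, t in enumerate(types)}
    rows = defaultdict(lambda: [0] * len(types))
    for shipment_id, product_id in links:
        # Set the flag for this product's type; repeated types stay at 1
        rows[shipment_id][index[type_of[product_id]]] = 1
    return types, dict(rows)


columns, features = multi_hot_by_type(shipment_products, product_type)
```

Because each shipment ends up with exactly one row, the many-to-many relation no longer forces the products to be excluded from a shipment-level model.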

Generally, a joint model (a single model that handles all the information at once) tends to perform better, in particular because with two separate models the errors of the first propagate to the second. The two-model option also prevents the second model from leveraging any feature specific to the first one.

Note that this is just my intuition, I could be wrong.
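
For comparison, if the two-model option were tried anyway, one simple way to consolidate the outputs is a noisy-OR rule. This is a sketch under a strong assumption that I am adding here: the shipment-level cause and each product-level customs cause are treated as independent, and both models are assumed to output calibrated probabilities. The function names and the 0.5 threshold are hypothetical.

```python
def combine_late_probabilities(p_shipment, p_products):
    """Noisy-OR: the shipment is on time only if no individual cause makes it late.

    Assumes (illustratively) that all causes are independent.
    """
    p_on_time = 1.0 - p_shipment
    for p in p_products:
        p_on_time *= 1.0 - p
    return 1.0 - p_on_time


def predict_label(p_late, threshold=0.5):
    """Turn a combined probability into the final Late / Not Late label."""
    return "Late" if p_late >= threshold else "Not Late"


# Shipment-level model says 0.2; product-level model says 0.3 and 0.1 for two products
p_late = combine_late_probabilities(0.2, [0.3, 0.1])
label = predict_label(p_late)
```

Note how a single overconfident product-level prediction is enough to push the combined probability up, which is one concrete form of the error propagation mentioned above.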

Side note: this is probably already taken into account, but I would guess that the value of the shipment is also an important factor in customs delays.

