Making predictions with limited user generated data

We've trained an ML model and deployed it to production. The model uses about 50-60 features, but a user entering information on our platform supplies nowhere near all of the features the model was trained on.

How do we make a prediction with an ML model that was trained on far more features than the test point provides?

A credit scoring example: the model is trained on thousands of users' credit history, demographics, location, income, expenses, and other financials. Based on this trained model, we'd like to predict the score of a new user on our platform. We can collect some basic information, but we cannot obtain all of the data.

Are there ways to make predictions when the test data point has limited information compared to what the model was trained on? It's also unrealistic to make assumptions about the missing test data, as we simply don't have enough information. What are some workarounds?

Topic: real-ml-usecase, machine-learning

Category: Data Science


The model here is trained on a different scenario than the one it sees at inference time. I would begin by trying two approaches:

  • retrain the model including these kinds of samples (clients whose unknown values are stored as missing, e.g. NaN/None), so the training dataset becomes sparser but more representative of the inputs you actually receive at prediction time
  • impute the unknown columns you consider most important (for attributes where this is feasible), e.g. by clustering clients on your training data and assigning each inference sample to a cluster, which can help you identify the client type; you can implement something similar with sklearn's KNNImputer
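A minimal sketch of the first approach, using synthetic stand-in data (the feature matrix and masking rate are illustrative assumptions, not the asker's real dataset). It relies on the fact that sklearn's histogram-based gradient boosting handles NaN natively by learning a default split direction for missing values, so the model can be trained directly on sparse, realistic samples:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real training data: 2000 clients, 8 features.
n, d = 2000, 8
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Simulate the production scenario: hide ~60% of values at random,
# mimicking users who only supply a handful of fields.
mask = rng.random(X.shape) < 0.6
X_sparse = X.copy()
X_sparse[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_sparse, y, random_state=0)

# Train directly on the sparse matrix: NaN entries are routed down a
# learned default branch at each split, so no imputation is needed.
clf = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```

The point is that accuracy degrades gracefully rather than failing: the model still beats chance even with most values hidden, and it sees the same missingness pattern in training as in production.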
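And a sketch of the second approach with sklearn's KNNImputer, again on made-up numbers (the column meanings and neighbor count are assumptions for illustration). Each missing field of the new user is filled with the average of that field over the most similar training clients, where similarity is measured on the columns the user did provide:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Hypothetical training matrix: rows are clients, columns are features
# such as income, expenses, age, tenure (names are illustrative).
X_train = rng.normal(loc=50, scale=10, size=(500, 4))

# A new user who only provided the first two fields.
x_new = np.array([[52.0, 48.0, np.nan, np.nan]])

# Fit on the (complete) training data, then fill the new user's gaps
# from their 5 nearest neighbors among training clients.
imputer = KNNImputer(n_neighbors=5).fit(X_train)
x_filled = imputer.transform(x_new)
print(x_filled)
```

The imputed row can then be passed to the trained model as a normal, fully populated test point; the trade-off is that predictions now depend on how well the observed fields identify "similar" clients.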
