From development environment to production

I have been working on a project as part of my master's degree, in partnership with a firm. Over the past few months I developed a predictive model that is essentially a document classification model. The biggest limitation of the research and the model is the lack of data available for training: I have a small data set of 300 documents, whereas the features number more than 15,000 terms (before feature selection).

  1. How do we identify or estimate the number of data points (documents) needed to be confident enough to move from the testing stage into a production environment? In other words, how do I estimate the data set size required for the performance metrics to be reliable enough to show the model is solid?

  2. Are there industry standards or metrics for deciding when a model has been tested sufficiently to qualify for production?

I cannot find literature about this topic.

Thank you

Topic model-selection software-development predictive-modeling machine-learning

Category Data Science


There isn't a well-established way of estimating the number of data points you'll need; it's much more an art than a science. As you gain experience, you'll learn some common-sense lessons (often in hindsight) along the way. For instance, you should never have more parameters than data points; if you're building random forests, you should not have more trees than data points; if you're doing deep learning, you should never have more neurons than data points (these are extreme examples). As a rule of thumb, avoid having more than ten times as many features as data points. 300 data points is very small, so you should probably limit yourself to linear models.
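
To make that last point concrete, here is a minimal sketch, assuming scikit-learn, of the kind of sparse, regularized linear model that tends to hold up with a few hundred documents and a large vocabulary. The toy documents, labels, and parameter values are placeholders, not something prescribed by this answer.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder corpus: substitute your own 300 labelled documents and classes.
    docs = [
        "invoice payment due amount",
        "invoice overdue payment reminder",
        "payment receipt invoice copy",
        "meeting agenda project schedule",
        "project meeting notes schedule",
        "schedule agenda meeting minutes",
    ]
    labels = [0, 0, 0, 1, 1, 1]

    model = make_pipeline(
        TfidfVectorizer(max_features=2000),  # cap the vocabulary to curb the feature count
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5),  # L1 zeroes out most weights
    )
    model.fit(docs, labels)

The L1 penalty acts as built-in feature selection, which is one way of keeping the effective number of parameters well below the number of documents.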

As Hobbes mentioned, cross-validation and/or holdout sets are the proper way to judge the readiness of your model, irrespective of the data-point-to-feature ratio. If the error on the training set is more than about 10% better than the error on the test/validation set, you're probably overfitting and your model is not fit for production. The 10% threshold is very much a rule of thumb.
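
A rough sketch of that train-versus-validation comparison, again assuming scikit-learn; the toy corpus and pipeline mirror the hypothetical one above and stand in for your own model:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline

    docs = ["invoice payment due", "invoice overdue reminder", "payment receipt copy",
            "meeting agenda schedule", "project meeting notes", "agenda meeting minutes"]
    labels = [0, 0, 0, 1, 1, 1]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_validate(model, docs, labels, cv=3,
                            scoring="accuracy", return_train_score=True)
    gap = scores["train_score"].mean() - scores["test_score"].mean()
    print(f"train/validation gap: {gap:.2f}")  # a gap above ~0.10 hints at overfitting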

Also worth mentioning: once you feel comfortable that your model is fit for production, you should never deploy it and then just walk away. If your model has value, you should do ongoing post-production monitoring. In the real world, your model's performance will inevitably degrade over time (regardless of how strong or accurate it was at the time it was built) as the environment it was trained for evolves. This is certainly true in the finance and insurance industries.

Production monitoring can also be used to pull the plug if your model performs far worse than anticipated. If you have serious doubts about the stability of your model, deploy it silently (don't allow the predictions to reach downstream systems) and measure how it would have done had you actually deployed it.
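
One possible way to implement this kind of silent deployment and post-production check is sketched below; it is only an illustration (the class, window size, and threshold are hypothetical), logging each prediction next to the label you eventually receive and tracking a rolling accuracy.

    from collections import deque

    class ShadowMonitor:
        """Track a 'silently' deployed model: record outcomes, never feed downstream."""

        def __init__(self, window=200, alert_threshold=0.70):
            self.outcomes = deque(maxlen=window)    # most recent correct/incorrect flags
            self.alert_threshold = alert_threshold  # accuracy below this triggers a review

        def record(self, prediction, truth):
            self.outcomes.append(prediction == truth)

        def rolling_accuracy(self):
            return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

        def needs_review(self):
            acc = self.rolling_accuracy()
            return acc is not None and acc < self.alert_threshold

The same rolling check works after a real deployment, where it becomes the degradation monitor described above.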

A last piece of advice: when you make predictions on new documents, you will almost always come across words that were not in your training set. You should find a way to build this awareness into your model. For example, you could use only keywords as predictors, or you could create indicator variables for new words. Either way, you should not expect the vocabulary to stay the same.
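
As a hedged illustration of the indicator-variable idea (not necessarily how the original model should do it), you could freeze the training vocabulary and report the share of unseen tokens as an extra feature, for example with scikit-learn's CountVectorizer; the documents below are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer

    train_docs = ["invoice payment due", "meeting agenda schedule"]  # placeholder corpus
    vectorizer = CountVectorizer().fit(train_docs)
    vocab = set(vectorizer.get_feature_names_out())
    analyze = vectorizer.build_analyzer()

    def featurize(doc):
        counts = vectorizer.transform([doc])          # counts for known terms only
        tokens = analyze(doc)
        oov_share = sum(t not in vocab for t in tokens) / max(len(tokens), 1)
        return counts, oov_share                      # oov_share can be appended as a feature

    _, oov = featurize("invoice payment blockchain")  # "blockchain" was never seen in training
    print(oov)                                        # one third of the tokens are out of vocabulary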


I have run into this problem before, and while I'm still relatively new to the field, I have some thoughts.

In answer to your first question: if cross-validating your model yields consistent results, you possibly have enough data; if new data throws your model off considerably, you likely do not. There are a number of metrics that can measure the robustness of your model; I have mostly used AUC.
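
For instance, looking at how much the AUC moves from fold to fold can make that consistency check concrete. This is a sketch assuming scikit-learn and binary labels; the toy corpus is a placeholder for your own data.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    docs = ["invoice payment due", "invoice overdue reminder", "payment receipt copy",
            "meeting agenda schedule", "project meeting notes", "agenda meeting minutes"]
    labels = [0, 0, 0, 1, 1, 1]

    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    aucs = cross_val_score(pipeline, docs, labels, cv=3, scoring="roc_auc")
    print("AUC per fold:", aucs, "spread:", np.ptp(aucs))
    # A wide fold-to-fold spread suggests the data set is still too small to trust the metric.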

Here is an interesting post about more data vs. better algorithms: http://www.kdnuggets.com/2015/06/machine-learning-more-data-better-algorithms.html
