sklearn - StandardScaler - Use in Production

I transformed my input data using StandardScaler as given here:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Code looks like this:

import pandas as pd
from sklearn.preprocessing import StandardScaler

X = df_encoded.drop(columns=['HeartDisease'])  # axis is redundant when columns= is given
y = df_encoded['HeartDisease']
col = X.columns
sc = StandardScaler()
x_standardized_array = sc.fit_transform(X)
x_df = pd.DataFrame(x_standardized_array, columns=col)

Now, my problem is that I need to deploy my final solution as a production service with a frontend. The StandardScaler is a problem because the service cannot keep the old data set around to re-fit before processing new data. I can get access to the mean and scale arrays, but I see no way to initialize a StandardScaler with an existing mean and scale array.

How can I deploy a StandardScaler in production, initializing it with previously computed settings, when the original data is unavailable or impractical to use?



What about serializing your solution:

>>> from joblib import dump, load  # For serialization. Pre-installed by sklearn.
>>> from sklearn.preprocessing import StandardScaler
>>> import numpy as np
>>> X = np.random.uniform(size=(100, 5))  # Your data prior to deployment.
>>> standard_scaler = StandardScaler().fit(X)
>>> dump(standard_scaler, 'my-standard-scaler.pkl')  # Save the solution.
>>> # Deployment...
>>> same_standard_scaler = load('my-standard-scaler.pkl')  # Load the original solution.
>>> # Sanity checks
>>> np.array_equal(standard_scaler.mean_, same_standard_scaler.mean_)
True
>>> np.array_equal(standard_scaler.scale_, same_standard_scaler.scale_)
True

There are two scenarios:

  1. Your training data has an entirely different distribution from production.

In this case, be cautious: you have a sampling bias. This is bad because your model learns from the training data and will not be able to cope with the differently distributed production data. Here it is best to rethink your problem and your data collection process.

  2. You expect the data distribution in production to drift gradually away from the training data.

This is a common issue called data drift. One solution is to monitor the change in distribution and re-train the model on new incoming data when necessary.
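A minimal drift check can be sketched as follows. The statistics and the threshold here are illustrative assumptions, not part of the original question: the idea is to save the per-feature mean and standard deviation at training time, then flag any feature whose production-batch mean moves more than a few training standard deviations away.

```python
import numpy as np

# Assumed statistics saved at training time (illustrative values).
train_mean = np.array([0.0, 10.0])
train_std = np.array([1.0, 2.0])

def drifted_features(batch, threshold=3.0):
    """Return indices of features whose batch mean deviates from the
    training mean by more than `threshold` training standard deviations."""
    batch_mean = batch.mean(axis=0)
    z = np.abs(batch_mean - train_mean) / train_std
    return np.flatnonzero(z > threshold)

# Feature 1 has shifted from ~10 to ~24.5, so it is flagged.
batch = np.array([[0.1, 25.0], [-0.2, 24.0]])
print(drifted_features(batch))  # -> [1]
```

Real deployments typically use a proper two-sample test (e.g. Kolmogorov-Smirnov) rather than a mean comparison, but the monitoring loop has the same shape.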

Finally, if for whatever reason you really want to hard-code a mean and standard deviation, note that set_params only covers constructor parameters such as with_mean, not the learned statistics; instead, assign the fitted attributes mean_ and scale_ directly, or do the subtraction and division manually.
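A sketch of that hard-coding approach, with made-up statistics standing in for the values you recorded from training (this relies on transform only reading the mean_ and scale_ attributes under default settings, which holds for current scikit-learn but is not a documented contract):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical statistics recorded from the original training data.
known_mean = np.array([0.5, 2.0, -1.0])
known_scale = np.array([0.1, 3.0, 0.5])

sc = StandardScaler()
sc.mean_ = known_mean            # fitted attribute, normally set by fit()
sc.scale_ = known_scale          # fitted attribute, normally set by fit()
sc.n_features_in_ = 3            # lets transform() validate the input width

X_new = np.array([[0.6, 5.0, -1.5]])
# transform() now computes (X_new - known_mean) / known_scale.
print(sc.transform(X_new))
```

Serializing the fitted scaler with joblib, as shown above, is the more robust option, since it keeps all fitted state consistent across scikit-learn versions you control.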
