Reverse scaling synthetic KDE data
Using Python 3.9, scikit-learn 0.24.2 and numpy 1.20.3, I am using a Kernel Density Estimation (KDE) model as a generative model. The goal is to generate new data from a given input dataset. The steps to achieve this are:
- Scale the input data to the range [-1, 1] using MinMaxScaler
- Train a KDE model on the scaled input data
- Use the trained KDE model to generate new synthetic (scaled) samples
- Use the trained scaler from step 1 to transform the generated samples back to the original scale
The accompanying code for this is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Generate random data-
x = np.random.normal(loc = -2.3, scale = 5.7, size = (2000, 1))
x.shape
# (2000, 1)
x.min(), x.max()
# (-22.290805843994956, 20.51752418364843)
# Visualize current distribution-
n, bins, patches = plt.hist(x.flatten(), bins = int(np.ceil(np.sqrt(x.size))))
plt.show()
# Initialize and train a Min-Max scaler-
mm_scaler = MinMaxScaler(feature_range = (-1, 1))
x_scaled = mm_scaler.fit_transform(x)
# Sanity check-
x_scaled.min(), x_scaled.max()
# (-1.0, 1.0)
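# Verify that inverse_transform undoes the scaling exactly, so any later
# range mismatch must come from the KDE sampling itself-
np.allclose(mm_scaler.inverse_transform(x_scaled), x)
# True (up to floating-point tolerance)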
# Define a range of bandwidth values to hyper-parameter tune-
bandwidth = np.arange(0.01, 3, .01)
bandwidth.min(), bandwidth.max()
# (0.01, 2.9899999999999998)
# Define a KDE instance-
kde_model = KernelDensity(kernel = 'gaussian')
# Define GridSearchCV object-
grid = GridSearchCV(
    estimator = kde_model,
    param_grid = {'bandwidth': bandwidth}
)
# Perform hyper-parameter tuning with GridSearchCV on scaled training data-
grid.fit(x_scaled)
# The best model can be retrieved via the 'best_estimator_' attribute of the GridSearchCV object-
grid.best_estimator_
# KernelDensity(bandwidth=0.09)
# Return parameter value that maximizes the log-likelihood of data-
grid.best_params_
# {'bandwidth': 0.09}
grid.best_score_
# -44.610465074723415
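# Note: GridSearchCV uses 5-fold CV by default, and KernelDensity's score()
# returns the total log-likelihood of the held-out data, so 'best_score_' is
# the mean held-out log-likelihood across the five folds.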
# Get a 'best' model from above-
kde_best = grid.best_estimator_
kde_best.bandwidth, kde_best.kernel
# (0.09, 'gaussian')
# Sample/generate new samples from KDE model-
x_sampled = kde_best.sample(n_samples = 500)
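# Note: sample() draws stochastically; pass 'random_state' for reproducible
# draws, e.g. kde_best.sample(n_samples = 500, random_state = 42)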
# Reverse scaling to get back original data-
x_sampled_orig = mm_scaler.inverse_transform(x_sampled)
x_sampled_orig.shape
# (500, 1)
x.min(), x.max()
# (-22.290805843994956, 20.51752418364843)
x_sampled_orig.min(), x_sampled_orig.max()
# (-20.845276754891053, 15.763606405371979)
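To compare the two distributions beyond their min/max values, a quick histogram overlay (using only the arrays computed above):

# Overlay original and synthetic distributions-
plt.hist(x.flatten(), bins = 50, alpha = 0.5, density = True, label = 'original')
plt.hist(x_sampled_orig.flatten(), bins = 50, alpha = 0.5, density = True, label = 'synthetic')
plt.legend()
plt.show()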
The ranges of 'x' and 'x_sampled_orig' differ for two reasons:
- The 'mm_scaler' was fit on 'x', whose empirical distribution differs from that of the generated samples
- Sampling from the trained KDE model is stochastic, so each draw of synthetic samples will have different min and max values
If there are other reasons, please let me know.
My question is: how can I make the min and max of the synthetic samples 'x_sampled_orig' as close as possible to those of the original data 'x'?
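One workaround I have been considering (just a sketch, and I am not sure it is statistically sound, since it slightly stretches the sampled distribution) is to rescale the generated samples so they span exactly [-1, 1] before inverting. The 'sample_scaler' below is my own addition, not part of the pipeline above:

# Force the scaled samples to span exactly [-1, 1], then invert with the
# original scaler, so min/max match 'x' exactly (up to float error)-
sample_scaler = MinMaxScaler(feature_range = (-1, 1))
x_sampled_rescaled = sample_scaler.fit_transform(x_sampled)
x_sampled_matched = mm_scaler.inverse_transform(x_sampled_rescaled)
x_sampled_matched.min(), x_sampled_matched.max()
# should equal x.min(), x.max()

Is this a valid approach, or is there a better way to make the ranges match?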
Topic density-estimation generative-models python-3.x
Category Data Science