Reverse scaling synthetic KDE data
Using Python 3.9, scikit-learn 0.24.2 and numpy 1.20.3, I am using a Kernel Density Estimation (KDE) model as a generative model. The goal is to generate new data from a given input dataset. The steps to achieve this are:
- Scale the input data to the range [-1, 1] using MinMaxScaler
- Train a KDE model on the scaled input data
- Use the trained KDE model to generate new synthetic (scaled) samples
- Use the trained scaler from step 1 to transform the generated samples back to the original scale
The accompanying code for this is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Generate random data-
x = np.random.normal(loc = -2.3, scale = 5.7, size = (2000, 1))
x.shape
# (2000, 1)
x.min(), x.max()
# (-22.290805843994956, 20.51752418364843)
# Visualize current distribution-
n, bins, patches = plt.hist(x.flatten(), bins = int(np.ceil(np.sqrt(x.size))))
plt.show()
# Initialize and train a Min-Max scaler-
mm_scaler = MinMaxScaler(feature_range = (-1, 1))
x_scaled = mm_scaler.fit_transform(x)
# Sanity check-
x_scaled.min(), x_scaled.max()
# (-1.0, 1.0)
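# Verify that inverse_transform undoes the scaling exactly, so any later
# range mismatch must come from the KDE sampling itself-
np.allclose(mm_scaler.inverse_transform(x_scaled), x)
# True (up to floating-point tolerance)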
# Define a range of bandwidth values to hyper-parameter tune-
bandwidth = np.arange(0.01, 3, .01)
bandwidth.min(), bandwidth.max()
# (0.01, 2.9899999999999998)
# Define a KDE instance-
kde_model = KernelDensity(kernel = 'gaussian')
# Define GridSearchCV object-
grid = GridSearchCV(
    estimator = kde_model,
    param_grid = {'bandwidth': bandwidth}
)
# Perform hyper-parameter tuning with GridSearchCV on scaled training data-
grid.fit(x_scaled)
# The best model can be retrieved via the 'best_estimator_' attribute of the GridSearchCV object-
grid.best_estimator_
# KernelDensity(bandwidth=0.09)
# Return parameter value that maximizes the log-likelihood of data-
grid.best_params_
# {'bandwidth': 0.09}
grid.best_score_
# -44.610465074723415
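# Note: GridSearchCV uses 5-fold CV by default, and KernelDensity's score()
# returns the total log-likelihood of the held-out data, so 'best_score_' is
# the mean held-out log-likelihood across the five folds.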
# Get a 'best' model from above-
kde_best = grid.best_estimator_
kde_best.bandwidth, kde_best.kernel
# (0.09, 'gaussian')
# Sample/generate new samples from KDE model-
x_sampled = kde_best.sample(n_samples = 500)
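# Note: sample() draws stochastically; pass 'random_state' for reproducible
# draws, e.g. kde_best.sample(n_samples = 500, random_state = 42)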
# Reverse scaling to get back original data-
x_sampled_orig = mm_scaler.inverse_transform(x_sampled)
x_sampled_orig.shape
# (500, 1)
x.min(), x.max()
# (-22.290805843994956, 20.51752418364843)
x_sampled_orig.min(), x_sampled_orig.max()
# (-20.845276754891053, 15.763606405371979)
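To compare the two distributions beyond their min/max values, a quick histogram overlay (using only the arrays computed above):

# Overlay original and synthetic distributions-
plt.hist(x.flatten(), bins = 50, alpha = 0.5, density = True, label = 'original')
plt.hist(x_sampled_orig.flatten(), bins = 50, alpha = 0.5, density = True, label = 'synthetic')
plt.legend()
plt.show()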
The ranges of 'x' and 'x_sampled_orig' differ for two reasons:
- The 'mm_scaler' was fit on 'x', whose empirical distribution differs from that of the generated samples
- Sampling from the trained KDE model is stochastic, so each draw of synthetic samples will have different min and max values
If there are other reasons, please let me know.
My question is: how can I make the min and max of the synthetic samples 'x_sampled_orig' as close as possible to those of the original data 'x'?
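One workaround I have been considering (just a sketch, and I am not sure it is statistically sound, since it slightly stretches the sampled distribution) is to rescale the generated samples so they span exactly [-1, 1] before inverting. The 'sample_scaler' below is my own addition, not part of the pipeline above:

# Force the scaled samples to span exactly [-1, 1], then invert with the
# original scaler, so min/max match 'x' exactly (up to float error)-
sample_scaler = MinMaxScaler(feature_range = (-1, 1))
x_sampled_rescaled = sample_scaler.fit_transform(x_sampled)
x_sampled_matched = mm_scaler.inverse_transform(x_sampled_rescaled)
x_sampled_matched.min(), x_sampled_matched.max()
# should equal x.min(), x.max()

Is this a valid approach, or is there a better way to make the ranges match?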
Topic density-estimation generative-models python-3.x
Category Data Science