Comparing distributions Python

I have a large dataset (random variable) 'x' whose values approximate a Gaussian distribution. From 'x', a much smaller random variable 'y' is sampled without replacement. I want to compare their distributions using histograms. The code in Python 3.9 is as follows:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a Gaussian distribution-
x = np.random.normal(loc = 0, scale = 2.0, size = 20000000)

# Sample from 'x' without replacement-
y = np.random.choice(a = x, size = 400000, replace = False)

x.size, y.size
# (20000000, 400000)


# Compare the distributions using 'histplot()' in seaborn with different bin sizes for x & y-
sns.histplot(data = x, bins = int(np.ceil(np.sqrt(x.size))), label = 'x')
sns.histplot(data = y, bins = int(np.ceil(np.sqrt(y.size))), label = 'y')
plt.xlabel('values')
plt.legend(loc = 'best')
plt.title('Comparing Distributions')
plt.show()

This produces the output:

# Compare the distributions using 'histplot()' in seaborn with the same bin size for x & y-
sns.histplot(data = x, bins = int(np.ceil(np.sqrt(x.size))), label = 'x')
sns.histplot(data = y, bins = int(np.ceil(np.sqrt(x.size))), label = 'y')
plt.xlabel('values')
plt.legend(loc = 'best')
plt.title('Comparing Distributions')
plt.show()

This produces the output:

In my opinion, the second plot is wrong because each histogram should be computed and visualized with its own bin size for the given data.

To further analyze the two distributions using a histogram-

n_x, bins_x, _ = plt.hist(x, bins = int(np.ceil(np.sqrt(x.size))))
n_y, bins_y, _ = plt.hist(y, bins = int(np.ceil(np.sqrt(y.size))))

# number of values in all bins-
n_x.size, n_y.size
# (4473, 633)

# bin size-
bins_x.size, bins_y.size
# (4474, 634)

# bin-width-
bw_x = bins_x[1] - bins_x[0]
bw_y = bins_y[1] - bins_y[0]

bw_x, bw_y
# (0.004882625722377298, 0.02781399915135907)

Since 'y' is much smaller than 'x', its bin-width (0.0278) is much larger than that of 'x' (0.0049). Hence, this produces a different histogram and visualization. Since 'y' is sampled from 'x', a two-sample Kolmogorov-Smirnov test doesn't make sense here.
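For reference, one way to remove the bin-width discrepancy is to compute both histograms over shared bin edges with density normalization; a minimal sketch (using smaller sizes than above, for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=2.0, size=200_000)
y = rng.choice(x, size=40_000, replace=False)

# Shared bin edges derived from 'x' put both histograms on the same grid
edges = np.histogram_bin_edges(x, bins=100)

# density=True divides counts by (sample size * bin width), so both
# histograms integrate to 1 regardless of sample size
d_x, _ = np.histogram(x, bins=edges, density=True)
d_y, _ = np.histogram(y, bins=edges, density=True)

# Each density integrates to ~1 over the shared edges
print(np.sum(d_x * np.diff(edges)), np.sum(d_y * np.diff(edges)))
```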

What's the appropriate way to compare these two distributions?



If you want to compare their shapes, you need to do two things:

  • account for the size of each set
  • account for the number of bins

The more data you have, the taller the bars will be. The more bins you have, the shorter the bars will be (because you're dividing the same quantity of data across more bins).

This is what I came up with:

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a Gaussian distribution-
x = np.random.normal(loc = 0, scale = 2.0, size = 200000)

# Sample from 'x' without replacement-
y = np.random.choice(a = x, size = 40000, replace = False)

# binning
n_x, bins_x = np.histogram(x, bins = int(np.ceil(np.sqrt(x.size))))
n_y, bins_y = np.histogram(y, bins = int(np.ceil(np.sqrt(y.size))))

# normalizing for both sample size and bin count
n_x = n_x / len(x) / len(bins_y)
n_y = n_y / len(y) / len(bins_x)

# plotting
plt.plot(bins_x[:-1], n_x)
plt.plot(bins_y[:-1], n_y)
plt.show()

which renders this:

[graph: the two normalized histograms overlaid]

I'm not sure whether the y-axis values, in this case, are of any practical utility.

Edit:

It strikes me that scaling the subset up to match the original set might make more sense in many use cases (for instance, when trying to efficiently plot a subset of data). This would do the trick:

# Scaling the subsample
n_y = n_y / len(y) / len(bins_x) * len(x) * len(bins_y)

Then you wouldn't have to scale n_x at all.

[graph: more sensible scaling]


It is unclear what you mean by compare here. Typically, the question of comparing distributions takes the form "are these two samples drawn from the same underlying distribution?" In this case, y is definitely drawn from x.

There exist a few distance metrics that are useful for describing the dissimilarity (or similarity) between two histograms, such as earth mover's distance or Euclidean distance. The quadratic-form distance could also be used here, but requires a bit more linear algebra. https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/rubner-jcviu-00.pdf
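As an illustration, SciPy exposes the 1-D earth mover's distance as `scipy.stats.wasserstein_distance`, which works directly on the raw samples so no binning decisions are needed; a minimal sketch on data like the question's (smaller sizes for speed):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=2.0, size=200_000)
y = rng.choice(x, size=40_000, replace=False)

# 1-D Wasserstein (earth mover's) distance between the two samples
d = wasserstein_distance(x, y)
print(d)  # near 0, since y is a subsample of x
```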

Along that same similarity/dissimilarity line of thinking, simply computing the intersection or the union of the two histograms could be what you are looking for.
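A minimal sketch of histogram intersection, assuming both histograms are computed over shared bin edges and normalized so each sums to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=2.0, size=200_000)
y = rng.choice(x, size=40_000, replace=False)

# Normalized histograms over shared edges (each sums to 1)
edges = np.histogram_bin_edges(x, bins=100)
p = np.histogram(x, bins=edges)[0] / x.size
q = np.histogram(y, bins=edges)[0] / y.size

# Intersection: 1.0 for identical histograms, 0.0 for disjoint ones
intersection = np.minimum(p, q).sum()
print(intersection)  # close to 1 here
```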
