Finding similarity between two datasets

I have two datasets. One is the actual percentage of the white population in each county of an American state, and the other is the simulated percentage of the white population in the counties of that state.

Bits about my simulation:

It is a random simulation done on a map of California with two types of agents, white and minority. Their total numbers are based on the real white-to-minority ratio in California. For example, if California is 70% white and 30% minority, then out of (say) 100 total agents, 70 would be white and 30 minority. First the map is randomly populated with both agent types, and then around 100 iterations of the simulation are performed. On every iteration an agent moves based on certain conditions. Data is taken after the 100th iteration, and it includes what percentage of each county in California is white and minority.
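Roughly, the setup looks like the minimal sketch below. The grid size, empty-cell fraction and the "move if fewer than 30% of occupied neighbours share my type" rule are only illustrative placeholders, not my exact conditions.

import random

SIZE = 30            # grid side length (illustrative)
EMPTY_FRAC = 0.1     # fraction of empty cells (illustrative)
WHITE_SHARE = 0.7    # white share of the agent population
THRESHOLD = 0.3      # placeholder "move" condition (illustrative)
ITERATIONS = 100

def init_grid():
    # randomly populate the map with white ("W") and minority ("M") agents
    cells = []
    for _ in range(SIZE * SIZE):
        if random.random() < EMPTY_FRAC:
            cells.append(None)
        else:
            cells.append("W" if random.random() < WHITE_SHARE else "M")
    return [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def unhappy(grid, r, c):
    # an agent wants to move if too few occupied neighbours share its type
    me = grid[r][c]
    neighbours = [grid[(r + dr) % SIZE][(c + dc) % SIZE]
                  for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    occupied = [n for n in neighbours if n is not None]
    return occupied and sum(n == me for n in occupied) / len(occupied) < THRESHOLD

def step(grid):
    # move every unhappy agent to a randomly chosen empty cell
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE) if grid[r][c] is None]
    for r in range(SIZE):
        for c in range(SIZE):
            if grid[r][c] is not None and unhappy(grid, r, c) and empties:
                er, ec = empties.pop(random.randrange(len(empties)))
                grid[er][ec], grid[r][c] = grid[r][c], None
                empties.append((r, c))

grid = init_grid()
for _ in range(ITERATIONS):
    step(grid)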

So below is the data from the state of California:

california_actual_white = [0.52, 0.72, 0.9, 0.86, 0.91, 0.91, 0.67, 0.79, 0.89, 0.77, 0.89, 0.84, 0.9, 0.81, 0.82, 0.81, 0.87, 0.82, 0.71, 0.86, 0.86, 0.9, 0.86, 0.82, 0.89, 0.91, 0.82, 0.84, 0.93, 0.72, 0.85, 0.91, 0.8, 0.64, 0.88, 0.77, 0.76, 0.54, 0.67, 0.89, 0.61, 0.85, 0.55, 0.87, 0.88, 0.94, 0.87, 0.61, 0.87, 0.83, 0.73, 0.9, 0.88, 0.88, 0.9, 0.84, 0.75, 0.79]

california_simulated_white = [0.48, 0.54, 0.6, 0.62, 0.66, 0.69, 0.71, 0.71, 0.71, 0.72, 0.74, 0.75, 0.77, 0.78, 0.79, 0.79, 0.8, 0.8, 0.8, 0.81, 0.81, 0.82, 0.82, 0.82, 0.83, 0.84, 0.85, 0.85, 0.87, 0.87, 0.87, 0.88, 0.91, 0.92, 0.93, 0.93, 0.94, 0.94, 0.94, 0.94, 0.95, 0.95, 0.97, 0.97, 0.98, 0.98, 0.98, 0.98, 0.99, 0.99, 0.99, 0.99, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

How would I find a metric of similarity between these two datasets?

I found out that these three options can be used to find similarity, and all of them have implementations in Python (a minimal sketch of how to compute each is shown after the list):

1) Earth mover's distance

2) Kullback–Leibler divergence

3) Cosine Similarity
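As a minimal sketch (assuming SciPy is available), here is how I would compute each of the three on the two arrays above:

from scipy.stats import wasserstein_distance, entropy
from scipy.spatial.distance import cosine

# Earth Mover's (Wasserstein) distance between the two sets of values
emd = wasserstein_distance(california_actual_white, california_simulated_white)

# Kullback-Leibler divergence; entropy() normalises both arrays to sum to 1
kl = entropy(california_actual_white, california_simulated_white)

# cosine() returns a distance, so the similarity is 1 minus it
cos_sim = 1 - cosine(california_actual_white, california_simulated_white)

print(emd, kl, cos_sim)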

But I have some doubts about using these methods:

1) With Kullback-Leibler divergence and Cosine Similarity, the value changes if I reshuffle the arrays and compute the metrics again, but with Earth Mover's Distance that's not the case: it gives the same value for the two datasets regardless of the reshuffling/position of the data points, which made me lean towards this metric (a small check of this behaviour is shown after this list).

2) My second doubt is that K-L divergence and Earth Mover's Distance can only be used with two probability distributions, and I am not sure whether the two datasets above are probability distributions or not.
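Here is the small check of the reshuffling behaviour mentioned in doubt 1): shuffling one of the arrays leaves the Earth Mover's distance unchanged but changes the cosine and KL values, because the latter two depend on which entry is paired with which.

import random
from scipy.stats import wasserstein_distance, entropy
from scipy.spatial.distance import cosine

shuffled = california_simulated_white[:]
random.shuffle(shuffled)

# identical: wasserstein_distance ignores the order of the values
print(wasserstein_distance(california_actual_white, california_simulated_white),
      wasserstein_distance(california_actual_white, shuffled))

# these change, because the element-wise pairing has changed
print(cosine(california_actual_white, california_simulated_white),
      cosine(california_actual_white, shuffled))
print(entropy(california_actual_white, california_simulated_white),
      entropy(california_actual_white, shuffled))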

So my final two questions are:

1) Can the above datasets be considered probability distributions? If yes, why?

2) If the answer to 1) is yes, what is the best method to determine the similarity? And if it is no, what is the best method then?

So far I have been leaning towards Earth Mover's Distance, for the reason stated in the doubts section.



I would say that they are probability distributions. You can interpret them as the probability that a randomly drawn person from a given county belongs to the white majority. However, they are not probability vectors, as that would require them to sum to 1. Rather, each element of the array defines its own distribution, e.g. a random person from county 1 is white with probability 0.52 and a member of a minority with probability 1-0.52=0.48. If you want to compare this distribution with your simulation, you have to make the comparison element-wise, e.g. compare the 0.52/0.48 actual distribution with the 0.48/0.52 simulated distribution. This is also why reshuffling leads to different results: suddenly you are comparing the actual distribution from county 1 with, say, the simulated distribution from county 5, which does not make much sense. So if you shuffle, you need to shuffle both arrays in the same way.

As you can interpret the distributions of the individual counties as independent of each other, you can compute DKL (or cosine similarity, or Earth Mover's Distance) for each county and then sum the results.

I'm not sure which method is best; I would probably go with DKL out of habit. You could compute it like this:

dkl = 0.52*log(0.52/0.48) + (1-0.52)*log((1-0.52)/(1-0.48)) + ...
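As a minimal sketch, here is that per-county sum written out in Python. The epsilon clipping is my own workaround, needed because your simulated array contains exact 1.0 values, for which 1 - q would be zero inside the log.

import numpy as np

p = np.clip(np.array(california_actual_white), 1e-9, 1 - 1e-9)     # actual
q = np.clip(np.array(california_simulated_white), 1e-9, 1 - 1e-9)  # simulated

# per-county KL divergence between the (white, minority) distributions,
# summed over all counties
dkl = np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))
print(dkl)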

I hope this helps!
