Aggregating standard deviations

Imagine I have a collection of data, let's say the travel time for a road segment. On this collection I want to calculate the mean and the standard deviation. Nothing hard so far.

Now imagine that instead of having my collection of values for one road segment, I have multiple collections of values that correspond to the multiple subsegments that compose the road segment.

For each of these collections, I know the average and the standard deviation. From that, I want to aggregate these multiple averages and standard deviations in order to get the average and standard deviation for the whole road segment.

For example, let's suppose I have the following dataset:

           subSegmentA , subSegmentB , subSegmentC , subSegmentD
values              20            45            25            70
                    30            55            10            60
                    10            10            10            80
                    15            50            30            75
                    15            40            15            75
                    20            40            20            80
                    30            45            20            65
                    10            40            25            70

average          18.75        40.625        19.375        71.875
stddev      7.90569415   13.47948176   7.288689869   7.039429766

expected_global_average : 150.625
expected_global_stddev  : 18.40758772

For the average there is no problem, a simple sum does the job, but I have trouble with the global_stddev.

I tried multiple solutions from here, without success.

Edit: After further research, it seems mathematically impossible to calculate the standard deviation of a set based only on the standard deviations and averages of its subsets.

So I am trying to calculate a new metric that would approximate this global standard deviation.

To do so, I can use, in addition to the avg/stddev per subsegment, the length ratio of each subsegment to the whole road.



Do I understand correctly that you do actually have the data for all segments? Or do you only have the mean and standard deviation of each subset? In this case, it is possible to compute an estimated standard deviation over the entire collection of subsets!

As explained in the second answer to that question on CrossValidated (and as you noted), computing the mean over your entire segment is as simple as taking the mean of the means, i.e. the simple average of all your $\mu$-values.
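For instance, in R (the language used in the answer further down), with the four subsegment means from the question's table, that is simply:

mus <- c(18.75, 40.625, 19.375, 71.875)   # subsegment means from the table
mean(mus)                                 # mean of the means: 37.65625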

If the former is true (you have all the data), then it is possible to compute the variance (i.e. the standard deviation squared) of the combined data in a similar way to how you compute the mean.

Think about what it means to compute the sample variance. In words:

First, compute the mean of the sample. Next, sum up the squared differences between every single data point and that mean. Finally, divide this sum by the number of samples minus 1.

The formula is as follows:

$$\bar{\sigma}^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{\mu})^2$$

where $\bar{\mu}$ is the sample mean (over one of your subsets) and $\bar{\sigma}^2$ is the sample variance (over one of your subsets).
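Continuing in R, the formula can be written out directly for one subset and checked against the built-in var() and sd() functions; the vector below is subSegmentA from the question:

x  <- c(20, 30, 10, 15, 15, 20, 30, 10)   # subSegmentA values
N  <- length(x)
mu <- mean(x)                             # sample mean of this subset

sum((x - mu)^2) / (N - 1)                 # sample variance, as in the formula
var(x)                                    # built-in equivalent (~62.5)
sd(x)                                     # its square root (~7.906)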

Have a look at this answer on Math Overflow.

To scale up the variance estimate over multiple subsets, however, you must account for the fact that each subset's own mean differs from the overall mean. The way to do this is to use the mean of the global population, $\mu_{global}$ (over all your subsets), instead of the per-subset means. Imagine two of your subSegments are $A$ and $B$; then the formula to compute the aggregated variance would be:

$$\sigma_{A+B}^2 = \frac{1}{N - 1} \left( \sum_{i=1}^{N_A} (A_i - \mu_{global})^2 + \sum_{i=1}^{N_B} (B_i - \mu_{global})^2 \right)$$

where $N = N_A + N_B$ is the total number of samples and $\mu_{global}$ is what you already compute, the average of the means; here it would be:

$$\mu_{global} = \frac{1}{2} (\mu_A + \mu_B)$$

To get the standard deviation, just take the square root of the $\sigma_{A+B}^2$ value above.

To relate this back to the description of the variance computation above, all we are changing now is the mean value that we are subtracting!
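Here is that computation in R for subSegmentA and subSegmentB from the question (a minimal sketch; since both subsets contain the same number of samples, the average of the two means equals the global mean, so the result matches the variance of the pooled raw values):

A <- c(20, 30, 10, 15, 15, 20, 30, 10)    # subSegmentA
B <- c(45, 55, 10, 50, 40, 40, 45, 40)    # subSegmentB

mu_global <- (mean(A) + mean(B)) / 2      # average of the means
N <- length(A) + length(B)                # total number of samples

var_AB <- (sum((A - mu_global)^2) + sum((B - mu_global)^2)) / (N - 1)

var_AB            # aggregated variance
sqrt(var_AB)      # aggregated standard deviation
var(c(A, B))      # same value, computed from the pooled raw data

If the subsets had different numbers of samples, $\mu_{global}$ would need to be the size-weighted average of the means for these two computations to agree.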


If you are curious as to why we divide by $N - 1$ rather than $N$ when averaging the squared deviations here: it helps ensure the estimate we compute is not biased. It is called Bessel's correction. One downside is that it increases the mean squared error of the estimate, which becomes clear when you think about a tiny subset of data, where that $-1$ makes a large impact.
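A tiny simulation (purely illustrative, not part of the original answer) makes that bias visible: draw many small samples from a distribution with a known variance and compare dividing by $N$ with dividing by $N - 1$:

set.seed(42)                       # arbitrary seed, only for reproducibility
true_var <- 25                     # variance of the simulated population (sd = 5)
reps <- 10000
n    <- 3                          # tiny samples, where the -1 matters most

biased   <- numeric(reps)
unbiased <- numeric(reps)
for (i in 1:reps) {
  x <- rnorm(n, mean = 0, sd = sqrt(true_var))
  biased[i]   <- sum((x - mean(x))^2) / n        # divide by N
  unbiased[i] <- sum((x - mean(x))^2) / (n - 1)  # divide by N - 1 (Bessel)
}

mean(biased)     # clearly below 25 on average
mean(unbiased)   # close to 25 on average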


It is not perfect, but you can try to re-create synthetic data based on the means and standard deviations. In R, you can use the rnorm function to draw values from a normal distribution with a given mean and sd. The following is one way to do it. Hope it helps! P.S. I just chose n = 1000 to illustrate how it can be done; you can try different numbers.

# Observed values for each subsegment (from the question)
a <- c(20, 30, 10, 15, 15, 20, 30, 10)
b <- c(45, 55, 10, 50, 40, 40, 45, 40)
c <- c(25, 10, 10, 30, 15, 20, 20, 25)
d <- c(70, 60, 80, 75, 80, 60, 65, 70)

# Number of synthetic travel times to simulate per subsegment
n <- 1000

# Draw each subsegment from a normal distribution with its observed mean
# and sd, then add them element-wise to get simulated whole-segment times
e <- rnorm(n, mean(a), sd(a)) +
  rnorm(n, mean(b), sd(b)) +
  rnorm(n, mean(c), sd(c)) +
  rnorm(n, mean(d), sd(d))

mean(e)   # estimated mean travel time for the whole segment
sd(e)     # estimated standard deviation for the whole segment

