Re-sampling of a Histogram's Bins

Question

mazecreator

January 2, 2020, 00:49

I would like to be able to resample a histogram's bins without having access to the raw data. To be clear, by resample I mean changing the number of bins while still providing a good estimate of the original probabilities of those bins.

I can think of many ways to do this, but I am having trouble figuring out which method best preserves the probabilities in the resulting histogram. The easy case is when the input histogram X has x bins and the desired histogram Y has y bins with x = y: this is a simple one-to-one copy of the original bins. The problem arises when y is smaller or larger than x.

For example, if x = 10 bins and y = 20 bins, you could simply duplicate each of X's bins so that Y = { x1, x1, x2, x2, x3, x3, ..., x10, x10 }. This seems naive, though: the second copy of each bin should arguably also be influenced by the next bin's value, for example y2 = (x1 + x2)/2.

If x = 20 bins and y = 7 bins, it does not seem fair to sample a value by linear interpolation between neighboring data points, since there may be 3 or 4 points on either side of the sample that should contribute to the probability of the resampled bin.

I would also like to handle the case where the histogram is bounded at the ends. For example, when measuring water temperature, values below freezing or above boiling are not expected in standard cases. I would like to be able to treat the probability beyond one or both extreme bins as 0.

Is there a standard algorithm which can be coded in C++/C# or something in pseudo code that I can convert to code for the above re-sampling / re-sizing?

Topic histogram probability

Category Data Science

Edmund · Accepted Answer · January 2, 2020, 00:49

You may use Interpolation on the bins and heights of the HistogramList to produce a smooth PDF for use in ProbabilityDistribution. With a smooth distribution you need not worry about increasing or decreasing bin sizes.

For example, with data

SeedRandom[9123]
data = RandomVariate[WeibullDistribution[2, 2], 10^6];

has Histogram

hist = Histogram[data, Automatic, "PDF", PlotRange -> Full]

You have the histogram bins and heights so I'll just use HistogramList to collect them and display with ListStepPlot.

hlData = HistogramList[data, Automatic, "PDF"];
lstep = ListStepPlot[Transpose@{First@hlData, Append[Last@hlData, 0]}, Mesh -> Full]

A smooth InterpolatingFunction can be constructed from the bins and heights with Interpolation.

ifData = Interpolation[Transpose@{First@hlData, Append[Last@hlData, 0]}];
ifPlot = Plot[ifData[x], {x, Sequence @@ First@ifData["Domain"]}, PlotStyle -> Purple]

It can be seen that it matches the bins and heights.

Show[hist, ifPlot, lstep]
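
Since the question asks for something that can be coded in C++, here is a minimal sketch of the same idea outside Mathematica. It assumes linear interpolation between knots (Interpolation uses a higher-order scheme by default, so values between knots will differ somewhat). The name interpolatePdf and the edges/knots layout are my own choices for illustration, not part of any library. As in the code above, knots holds the bin heights with a 0 appended so that it lines up one-to-one with the bin edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluates the piecewise-linear curve through the points (edges[i], knots[i]).
// ASSUMPTION: linear interpolation stands in for Mathematica's Interpolation.
// edges must be strictly increasing; knots = bin heights with 0 appended.
double interpolatePdf(const std::vector<double>& edges,
                      const std::vector<double>& knots,
                      double x) {
    assert(edges.size() >= 2 && edges.size() == knots.size());

    // Treat the density as zero outside the histogram's domain.
    if (x < edges.front() || x > edges.back()) return 0.0;

    // Find i such that edges[i] <= x <= edges[i + 1].
    auto it = std::upper_bound(edges.begin(), edges.end(), x);
    std::size_t i = (it == edges.end())
                        ? edges.size() - 2
                        : static_cast<std::size_t>(it - edges.begin()) - 1;

    // Blend the two neighboring knot values.
    const double t = (x - edges[i]) / (edges[i + 1] - edges[i]);
    return (1.0 - t) * knots[i] + t * knots[i + 1];
}
```

For edges {0, 1, 2} and knots {1, 0.5, 0}, interpolatePdf(edges, knots, 0.5) evaluates to 0.75, halfway between the first two knot values.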

We want to use this function as the PDF of a ProbabilityDistribution so that NProbability can be calculated and RandomVariates can be generated.

The function needs to be non-negative over its domain.

NMinimize[ifData[x], {x} ∈ Interval @@ ifData["Domain"]]

{0., {x -> 7.4}}

The minimum value is zero, at x = 7.4.

It must also integrate to 1 over its domain.

NIntegrate[ifData[x], {x} ∈ Interval @@ ifData["Domain"]] // N

0.996821

The integral is just shy of 1 but we can ask ProbabilityDistribution to "Normalize" the PDF.

dist = ProbabilityDistribution[ifData[x], {x, Sequence @@ First@ifData["Domain"]}, 
   Method -> "Normalize"];
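
For a C++ port, the normalization step might look like the sketch below. It assumes the PDF is the piecewise-linear curve through the (edge, knot) pairs, for which the trapezoid rule gives the exact area. That is a simplification of the higher-order InterpolatingFunction used here. trapezoidArea and normalizeKnots are hypothetical helper names, and knots is assumed to be the bin heights with a 0 appended, matching the edges one-to-one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Area under the piecewise-linear curve through (edges[i], knots[i]).
// ASSUMPTION: linear segments, so the trapezoid rule is exact.
double trapezoidArea(const std::vector<double>& edges,
                     const std::vector<double>& knots) {
    assert(edges.size() >= 2 && edges.size() == knots.size());
    double area = 0.0;
    for (std::size_t i = 0; i + 1 < edges.size(); ++i) {
        area += 0.5 * (knots[i] + knots[i + 1]) * (edges[i + 1] - edges[i]);
    }
    return area;
}

// Returns a copy of knots rescaled so that the curve integrates to 1.
std::vector<double> normalizeKnots(const std::vector<double>& edges,
                                   std::vector<double> knots) {
    const double area = trapezoidArea(edges, knots);
    assert(area > 0.0);  // a PDF needs positive total mass
    for (double& k : knots) k /= area;
    return knots;
}
```

Dividing every knot by the total area plays the same role as Method -> "Normalize": the shape of the curve is unchanged and only its scale is adjusted.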

This distribution can be used in RandomVariate, Probability and other functions of the Random Variables guide.

Probabilities can be calculated

NProbability[0.9 < x < 2.2, x \[Distributed] dist]

0.514336

Pseudo-random numbers can be generated

RandomVariate[dist, 5]

{0.494373, 1.16545, 2.94366, 4.06116, 1.72519}
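
In C++, the standard library's std::piecewise_linear_distribution can play a similar role for drawing random values. The sketch below is only a piecewise-linear stand-in for dist, not an equivalent of the interpolated distribution, so the samples will not reproduce the numbers above. sampleFromKnots is a hypothetical wrapper name, and knots is assumed to be the bin heights with a 0 appended so that it matches the bin edges one-to-one.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draws `count` samples from the piecewise-linear density through
// (edges[i], knots[i]). The standard distribution normalizes the weights
// itself, so knots need not integrate to 1 beforehand.
// ASSUMPTION: linear segments approximate the interpolated PDF.
std::vector<double> sampleFromKnots(const std::vector<double>& edges,
                                    const std::vector<double>& knots,
                                    std::size_t count,
                                    unsigned seed) {
    assert(edges.size() >= 2 && edges.size() == knots.size());
    std::mt19937 gen(seed);  // fixed seed for repeatable runs
    std::piecewise_linear_distribution<double> dist(
        edges.begin(), edges.end(), knots.begin());

    std::vector<double> samples;
    samples.reserve(count);
    for (std::size_t i = 0; i < count; ++i) samples.push_back(dist(gen));
    return samples;
}
```

All samples fall inside the range spanned by edges, which matches the requirement in the question that the probability beyond the extreme bins be 0.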

Properties like the SurvivalFunction, CDF, and others can be calculated.

Plot[SurvivalFunction[dist, x], {x, Sequence @@ First@ifData["Domain"]}, 
 PlotStyle -> Purple]

Update

Widening the bins results in a loss of information, so I would not recommend it. However, if you must, Subdivide the InterpolatingFunction "Domain" into the number of bins you require.

downsamples =
  Function[numbins,
    With[{bins = Subdivide[##, numbins] & @@ First@ifData["Domain"]},
     {
      bins,
      NProbability[#1 < x < #2, x \[Distributed] dist] & @@@ 
       Partition[bins, 2, 1]
      }]
    ] /@ Range[36, 12, -6];
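
The same re-binning step can be sketched in C++, as requested in the question. This version assumes a piecewise-linear PDF through the (edge, knot) pairs rather than the higher-order interpolation used here, and it works for both fewer and more bins than the original. cdfAt, Rebinned and rebin are hypothetical names, and knots is assumed to be the bin heights with a 0 appended.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Area under the piecewise-linear curve through (edges[i], knots[i]),
// from edges.front() up to x (x is clamped to the domain).
double cdfAt(const std::vector<double>& edges,
             const std::vector<double>& knots, double x) {
    assert(edges.size() >= 2 && edges.size() == knots.size());
    x = std::min(std::max(x, edges.front()), edges.back());
    double area = 0.0;
    for (std::size_t i = 0; i + 1 < edges.size() && edges[i] < x; ++i) {
        const double hi = std::min(x, edges[i + 1]);
        const double t = (hi - edges[i]) / (edges[i + 1] - edges[i]);
        const double yHi = knots[i] + t * (knots[i + 1] - knots[i]);
        area += 0.5 * (knots[i] + yHi) * (hi - edges[i]);  // trapezoid piece
    }
    return area;
}

struct Rebinned {
    std::vector<double> edges;          // numBins + 1 equally spaced edges
    std::vector<double> probabilities;  // numBins values summing to 1
};

// Splits the domain into numBins equal bins and assigns each the probability
// mass of the curve over that bin, divided by the total mass.
Rebinned rebin(const std::vector<double>& edges,
               const std::vector<double>& knots, std::size_t numBins) {
    assert(numBins >= 1);
    const double lo = edges.front();
    const double hi = edges.back();
    const double total = cdfAt(edges, knots, hi);
    assert(total > 0.0);

    Rebinned out;
    for (std::size_t j = 0; j <= numBins; ++j) {
        out.edges.push_back(lo + (hi - lo) * static_cast<double>(j) /
                                     static_cast<double>(numBins));
    }
    for (std::size_t j = 0; j < numBins; ++j) {
        const double mass = cdfAt(edges, knots, out.edges[j + 1]) -
                            cdfAt(edges, knots, out.edges[j]);
        out.probabilities.push_back(mass / total);
    }
    return out;
}
```

Because each new bin's probability is a difference of cumulative areas, the values telescope and their sum stays at 1 regardless of the number of bins chosen.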

Addressing your question in the comment: the total probability in each of the 5 cases is still one.

Total[downsamples[[All, 2]], {2}]

{1., 1., 1., 1., 1.}

However, notice the differences in the PDF histograms.

ListStepPlot[
 Transpose[{First@#, Append[Last@#, 0]}] & /@ downsamples,
 Mesh -> Full,
 PlotRange -> All,
 PlotLegends -> StringTemplate["`` bins"] /@ Range[36, 12, -6]]

The initial histogram had 37 bins. The plot above shows how the information is lost as the bin width is widened (number of bins decreased). I would recommend working with dist instead.

Hope this helps.
