Is there a Softmax-like transformation with scale-invariance and linearity?

At the moment I'm using XGBoost with a custom objective function to generate probability predictions for something like an expert system. To do so I need to transform the raw XGBoost predictions into a probability distribution, where every value lies in the range from 0 to 1 and all values sum up to 1.

Naturally you start out with the Softmax transformation. But as it turns out, this function has some significant drawbacks for this kind of application: the raw predictions can vary widely, roughly within a range of -100 to +1000, which leads to several problems.

The first is that Softmax(1000) can't be computed directly, because e^1000 overflows on my machine. One can fix this by subtracting max(x) from x before exponentiating (as long as you apply it per vector), since the Softmax is invariant to shifting all entries by a constant.
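
For reference, here is a minimal NumPy sketch of that max-subtraction trick (the function name is my own, not from any library):

    import numpy as np

    def stable_softmax(x):
        x = np.asarray(x, dtype=float)
        z = x - np.max(x)   # shift so the largest entry becomes 0
        e = np.exp(z)       # all exponents are <= 0, so no overflow
        return e / e.sum()

    print(stable_softmax([14.0, 1000.0]))  # finite, although e^1000 would overflow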

The second one is much harder to deal with: due to the non-linearity of the Softmax you get quite extreme probability distributions for larger raw predictions. Here is an example:

softmax(14, 141) =
0.0000000000000000000000000000000000000000000000000000000699199
1.0000000000000000000000000000000000000000000000000000000000000

which basically renders the whole approach invalid, since the result is hardly a weighted average of distributions anymore, but rather a selection of just one.
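
This is easy to reproduce, for instance with SciPy's scipy.special.softmax (which applies the same max-subtraction trick internally):

    from scipy.special import softmax

    # a gap of 127 between the raw scores pushes virtually all mass to one entry
    print(softmax([14.0, 141.0]))  # -> [6.99199e-56, 1.0]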

One way to mitigate this issue is to work with extremely small values (0.001 to 0.1) for eta, but this does not feel satisfying, since it doesn't tackle the core issue at hand.

Another problem is that the Softmax is not scale-invariant, as you can see in this example:

softmax(1,2) = 0.2689414 0.7310586
softmax(10,20) = 0.00004539787 0.99995460213

Something like 0.3333333 0.6666667 as a result for both would work better in this case, since 1/2 = 10/20.
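
A quick check of the scale sensitivity, again with scipy.special.softmax:

    from scipy.special import softmax

    print(softmax([1.0, 2.0]))    # -> [0.26894142, 0.73105858]
    print(softmax([10.0, 20.0]))  # -> [4.53978687e-05, 9.99954602e-01]
    # multiplying the input by 10 changes the output drastically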


So I looked for a different transformation function which keeps the good Softmax properties (i.e. every value lies in the range from 0 to 1 and all values sum up to 1) but acts linearly on the data and is scale-invariant. However, I could not find such a function. An ordinary rescaling like x / sum(x) does not work, since it can produce negative values (e.g. rescaling(-2, -1, 0, 1) = 1.0 0.5 0.0 -0.5, because the sum is negative), and abs(rescaling()) does not add up to 1.
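
A quick NumPy check of why the plain rescaling breaks down on the example above:

    import numpy as np

    x = np.array([-2.0, -1.0, 0.0, 1.0])
    r = x / x.sum()         # the sum is -2, so the signs flip
    print(r)                # -> [ 1.   0.5  0.  -0.5]
    print(np.abs(r).sum())  # -> 2.0, i.e. abs() no longer sums to 1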

Another idea of mine was to standardize the data (to sd = 1, mean = 0) before running it through the Softmax, which seems to work for longer vectors, but feels a bit hacky. I'm a bit scared that this will lead to other problems I can't really pinpoint at the moment.
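
For illustration, here is a small sketch of this standardize-then-softmax idea (standardized_softmax is my own name, not an established function). It is scale-invariant, but note that for any two-element vector the standardized values are always -1 and 1, so every pair maps to the same fixed distribution, which may be part of why it only seems to work for longer vectors:

    import numpy as np
    from scipy.special import softmax

    def standardized_softmax(x):
        x = np.asarray(x, dtype=float)
        sd = x.std()
        if sd > 0:                    # a constant vector would divide by zero
            x = (x - x.mean()) / sd   # mean 0, sd 1
        return softmax(x)

    print(standardized_softmax([1, 2]))    # -> [0.11920292, 0.88079708]
    print(standardized_softmax([10, 20]))  # identical: standardizing removes scale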

Do you have an idea or a hint as to what I can do to transform the raw XGBoost predictions into a nicely behaved probability distribution?
