Training a model where each response in the observation data has a different known variance

I have a dataset where each response is the number of successes in N Bernoulli trials, with N and p (the probability of success) differing for each observation. The goal is to train a model to predict p given the predictors. However, observations with a small N will have a higher variance than those with a larger N.
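To make the variance claim concrete, writing $k_i$ for the number of successes and $N_i$ for the number of trials of observation $i$ (notation introduced here only for illustration), the standard binomial result is:

$$\hat{p}_i = \frac{k_i}{N_i}, \qquad \operatorname{Var}(\hat{p}_i) = \frac{p_i (1 - p_i)}{N_i}$$

So for a fixed $p_i$ the variance of the estimate shrinks in proportion to $1/N_i$.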

Consider the following scenario as an illustration: assume coins with different pictures on them have different biases, and that the bias depends on the picture on the coin. I have a large number of coins, each with a different picture and a different bias p. I want to create a model that can predict the bias of a coin given only the picture on it. I flip each coin a different number of times and record the number of successes and the total number of flips. So my dataset consists of each picture and its estimate p = successes/flips.

So my question is how I should handle this when training my model. It seems more weight should be given to observations with a larger sample size (number of flips). I don't think it makes sense to include the number of flips as a predictor variable, because the point is to build a model that predicts p using only the picture on the coin. Instead, the difference in response variance across observations should be taken into account when training the model.
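To show the kind of weighting I have in mind, here is a minimal sketch in plain NumPy. It assumes each picture has already been turned into a numeric feature vector (the feature values below are made up), and it uses a simple logistic model as a stand-in for the real one. The function names `fit_weighted_logistic` and `predict_p` are hypothetical. Each observation's contribution to the loss is scaled by its number of flips, which is the same as minimizing the binomial negative log-likelihood.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_weighted_logistic(X, successes, flips, lr=0.5, n_iter=5000):
    """Fit a logistic model to observed proportions, weighting each row by its flips.

    Minimizes the binomial negative log-likelihood
        -sum_i [k_i * log(p_i) + (N_i - k_i) * log(1 - p_i)]
    by plain gradient descent. Returns the coefficients, intercept first.
    """
    X = np.asarray(X, dtype=float)
    successes = np.asarray(successes, dtype=float)
    flips = np.asarray(flips, dtype=float)

    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
    p_hat = successes / flips                      # observed proportion per coin
    coef = np.zeros(Xb.shape[1])
    total_flips = flips.sum()

    for _ in range(n_iter):
        p = _sigmoid(Xb @ coef)
        # Gradient of the binomial NLL: each residual is weighted by N_i.
        grad = Xb.T @ (flips * (p - p_hat)) / total_flips
        coef -= lr * grad
    return coef


def predict_p(coef, X):
    """Predict the bias p from features only; flips are not needed here."""
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return _sigmoid(Xb @ coef)


# Hypothetical data: one numeric feature per picture, made up for illustration.
X_demo = [[0.1], [0.4], [0.8], [0.9]]
successes_demo = [1, 30, 7, 450]
flips_demo = [3, 100, 10, 500]

coef_demo = fit_weighted_logistic(X_demo, successes_demo, flips_demo)
print(predict_p(coef_demo, X_demo))
```

The number of flips enters only as a weight during training, not as a predictor, so prediction still needs only the picture features. Keras `Model.fit` and the XGBoost scikit-learn wrapper's `fit` both take a `sample_weight` argument, so I assume the same idea could be tried there by passing the flip counts as weights.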

I am using several types of models, but mainly working with Keras and XGBoost.



I think I understand the question now. Still using the coin example, as I said above, the number of trials for a given coin only affects the confidence in the estimated probability/bias for that one coin. So it seems you are asking how to incorporate that "confidence" into the response variable, if at all. In other words, you are asking whether your model should reflect the uncertainty about the true value of $p$ for each coin, given the number of coin flips you performed.

I don't think assigning different weights to observations is appropriate in this situation because, again, the number of flips for one coin does not have anything to do with other coins.

I am not sure if this will satisfy your needs, but there is something called interval regression, which models a response/dependent variable defined as an interval between a lower and an upper bound. It is a type of regression for censored data (problems where the true value of the response is not known exactly) and is typically used for modeling variables such as income ranges or survival times. In your case, you could compute a confidence interval for the true value of $p$ for each coin, using the $p$ estimated from your trials and the number of trials specific to that coin. You would then fit the regression with two response variables: the lower limit and the upper limit of the confidence interval.
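As a rough sketch of how those interval targets could be computed, the snippet below uses the Wilson score interval. That choice is my assumption; any other binomial confidence interval could be substituted. The function name `wilson_interval` and the example counts are made up for illustration.

```python
import numpy as np


def wilson_interval(successes, flips, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    Returns (lower, upper) arrays, one entry per coin.
    """
    successes = np.asarray(successes, dtype=float)
    flips = np.asarray(flips, dtype=float)

    p_hat = successes / flips
    denom = 1.0 + z**2 / flips
    center = (p_hat + z**2 / (2.0 * flips)) / denom
    half_width = (
        z * np.sqrt(p_hat * (1.0 - p_hat) / flips + z**2 / (4.0 * flips**2)) / denom
    )
    return center - half_width, center + half_width


# Two hypothetical coins with the same observed proportion (0.5)
# but very different numbers of flips.
lower, upper = wilson_interval([5, 500], [10, 1000])
print(lower, upper)
```

The coin with fewer flips gets a wider interval, so the lower and upper limits carry the per-coin uncertainty into the response that the interval regression is trained on.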

Based on my quick search, I am not finding a lot of Python support for this type of model, except:

Maximum Margin Interval Trees - Decision trees for interval regression

Drouin, A., Hocking, T.D. & Laviolette, F. (2017). Maximum Margin Interval Trees. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

https://arxiv.org/abs/1710.04234

https://aldro61.github.io/mmit/
