Confidence/prediction intervals: parametric methods vs. non-parametric (bootstrap) methods

For finding confidence and/or prediction intervals in, let's say, a regression problem, I know of two main options:

  1. Checking whether the distribution of the estimates/predictions is normal and, if it is, applying well-known Gaussian-based methods to find those intervals
  2. Applying non-parametric methods such as bootstrapping, so we do not need to assume, check, or care whether the distribution is normal
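To make the two options concrete, here is a minimal sketch for the simplest case, an interval for a mean. It is only an illustration under stated assumptions: the sample is made up, the names `normal_ci`, `percentile_bootstrap_ci` and `quantile` are hypothetical helpers, the parametric interval uses the plain normal approximation (mean ± z · s/√n), and the bootstrap interval is the basic percentile variant.

```python
import random
import statistics
from statistics import NormalDist


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def normal_ci(data, level=0.95):
    """Option 1: normal-approximation interval for the mean."""
    n = len(data)
    mean = statistics.fmean(data)
    sem = statistics.stdev(data) / n ** 0.5  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean + z * sem


def percentile_bootstrap_ci(data, stat=statistics.fmean, n_boot=2000,
                            level=0.95, seed=0):
    """Option 2: percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = 1 - level
    return quantile(estimates, alpha / 2), quantile(estimates, 1 - alpha / 2)


# Made-up, deliberately skewed sample (an assumption for illustration only).
_rng = random.Random(42)
sample = [_rng.expovariate(1.0) for _ in range(50)]
print("normal approximation:", normal_ci(sample))
print("percentile bootstrap:", percentile_bootstrap_ci(sample))
```

Note that the bootstrap version accepts any statistic through `stat` (for example `statistics.median`), whereas the normal-approximation formula above only applies to the mean.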

With this in mind, I would basically always go for the second one because:

  • it is meant to be generic, as it does not assume any kind of distribution
  • it feels more experimental, as you can freely run as many iterations as you want (well, as long as it is computationally feasible)

The only drawback I could see is the computational cost, but it could be parallelized...

Can anyone offer another point of view or some advice?

Topic bootstraping non-parametric machine-learning

Category Data Science


The simplest type of bootstrapped confidence intervals, percentile intervals (described in simple terms here), are by no means the only way of creating bootstrapped CIs. You should think of bootstrapped confidence intervals as a family of techniques: there are some examples listed here, along with some good information about the assumptions underlying the use of bootstrapped statistics.

How many original observations you need for bootstrapped confidence intervals to converge varies. As with the parametric approach, they will converge to the wrong value to the extent that your sample is not representative of the population you are estimating. A rule of thumb is to use parametric approaches if the sample size is n < 30, although in principle the bootstrap can sometimes work with smaller samples (see this thread). If you have 30 observations or more, standard practice is to draw more than 1000 bootstrap samples, but the important thing is that you see the estimates converging.
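One way to check that the estimates are converging is to recompute the interval from a growing number of bootstrap resamples and watch whether the endpoints settle. The following is a rough sketch, not a definitive recipe: the data are made up, and `bootstrap_estimates`, `percentile_interval` and `ci_trace` are hypothetical helper names.

```python
import random
import statistics


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def bootstrap_estimates(data, stat, n_boot, seed=0):
    """Statistic recomputed on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_boot)]


def percentile_interval(estimates, level=0.95):
    """Percentile interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    alpha = 1 - level
    return quantile(ordered, alpha / 2), quantile(ordered, 1 - alpha / 2)


def ci_trace(data, stat=statistics.fmean,
             checkpoints=(100, 500, 1000, 5000), level=0.95, seed=0):
    """Interval computed from the first b resamples, for each checkpoint b."""
    estimates = bootstrap_estimates(data, stat, max(checkpoints), seed)
    return {b: percentile_interval(estimates[:b], level) for b in checkpoints}


# Made-up sample of 40 observations (an assumption for illustration only).
_rng = random.Random(7)
sample = [_rng.gauss(10.0, 2.0) for _ in range(40)]
for b, (lo, hi) in ci_trace(sample).items():
    print(f"B={b:>5}: [{lo:.3f}, {hi:.3f}]")
```

If the endpoints are still moving noticeably between the last checkpoints, that suggests drawing more resamples. This only checks Monte Carlo convergence; it says nothing about whether the original sample is representative.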

If you want to use bootstrap CIs you should be really explicit about the approach you have used whenever you present your analysis, because different ways of producing them can lead to different intervals.
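As a small illustration of why this matters, the sketch below computes two members of the family from the same resamples: the percentile interval and the basic (sometimes called reverse percentile) interval, which reflects the percentile endpoints around the original estimate. The data and the helper name `percentile_and_basic_ci` are assumptions made for the example.

```python
import random
import statistics


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def percentile_and_basic_ci(data, stat=statistics.fmean, n_boot=2000,
                            level=0.95, seed=0):
    """Return (percentile interval, basic interval) from the same resamples."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)  # estimate on the original sample
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = 1 - level
    q_lo = quantile(estimates, alpha / 2)
    q_hi = quantile(estimates, 1 - alpha / 2)
    percentile = (q_lo, q_hi)
    # Basic interval: reflect the bootstrap quantiles around theta_hat.
    basic = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)
    return percentile, basic


# Made-up skewed sample (an assumption for illustration only).
_rng = random.Random(3)
sample = [_rng.expovariate(0.5) for _ in range(30)]
pct, basic = percentile_and_basic_ci(sample)
print("percentile:", pct)
print("basic:     ", basic)
```

On skewed data the two intervals generally do not coincide, so a reader who is only told "bootstrap CI" cannot tell which one was reported.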
