Is R2 score a reasonable regression measure on huge datasets?

I'm running a regression model on a pretty large data set and getting a fairly woeful $R^2$ score of ~0.2 (see plot below), even though the plot suggests the model is generally pointing in the right direction.

My question is: when you have over a million data points, how high can you realistically expect the $R^2$ to go on real-world data with a decent amount of noise?

What prompts my scepticism of such traditional measures is articles such as this one, which discuss how the sheer quantity of data can degrade statistical tests.

Let me know what you think, and please share any regression examples that use the $R^2$ score as a quality metric.



The coefficient of determination $r^2$ is defined in terms of variance: it is the proportion of variance in the dependent variable that is explained by the independent variable. Variance is a property of normally distributed data. Hence, the coefficient of determination can only be used when you assume that both the dependent and independent variables are normally distributed.
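
To make the "proportion of variance explained" idea concrete: for a fitted model, $r^2 = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$. Below is a minimal sketch on synthetic data (the numbers are arbitrary, and numpy/scikit-learn are assumed purely for illustration) showing the manual calculation agreeing with scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship (all numbers here are made up).
x = rng.uniform(0, 10, size=1_000)
y = 2.0 * x + rng.normal(0, 5, size=1_000)

# Ordinary least-squares line and its predictions.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# r^2 = 1 - SS_res / SS_tot: the share of variance in y captured by the fit.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)    # manual calculation
print(r2_score(y, y_hat))     # the same number from scikit-learn
```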

Just like other properties of normally distributed data, the estimate of $r^2$ improves as the amount of data increases. With very little data a coincidental correlation might show up, but with large amounts of data this is very unlikely.
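
A quick simulation makes this tangible: draw $x$ and $y$ independently (so the true $r^2$ is zero) and compare small samples against a huge one. This is just a synthetic sketch with arbitrarily chosen sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# x and y are drawn independently, so the "true" r^2 is 0.
def max_r2(n, repeats):
    best = 0.0
    for _ in range(repeats):
        x = rng.normal(size=n)
        y = rng.normal(size=n)
        r = np.corrcoef(x, y)[0, 1]   # for a simple linear fit, r^2 is just r squared
        best = max(best, r ** 2)
    return best

# Tiny samples regularly produce a sizeable coincidental r^2;
# at a million points the estimate sits right at the true value of 0.
print(max_r2(n=10, repeats=100))
print(max_r2(n=1_000_000, repeats=5))
```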

Back to your example. Your data is clearly not normally distributed: it is right-skewed and has large outliers. For this reason it is not advisable to use $r^2$. Imagine, for example, that in the lower-left corner (where the majority of the data sits) you observed a negative trend, while the overall trend is positive. The regression line would be much the same and the $r^2$ would be in the same range. This is known as Simpson's paradox.
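
Here is a small synthetic sketch of that scenario (made-up numbers, numpy assumed): the bulk of the points trends downward, yet the pooled regression line slopes upward and still reports an $r^2$ close to yours:

```python
import numpy as np

rng = np.random.default_rng(2)

# A dense blob near the origin whose internal trend is *negative*,
# plus a sparser cloud of larger values up and to the right.
x_blob = rng.normal(0, 1, 90_000)
y_blob = -0.5 * x_blob + rng.normal(0, 2, 90_000)
x_tail = rng.normal(6, 3, 10_000)
y_tail = rng.normal(6, 3, 10_000)

x = np.concatenate([x_blob, x_tail])
y = np.concatenate([y_blob, y_tail])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(slope)  # positive, even though the bulk of the data trends downward
print(r2)     # roughly 0.2, despite the fit misrepresenting most of the points
```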

In short, if your data is normally distributed you can use $r^2$ for any size of dataset. If it is not normally distributed, you cannot use $r^2$.


I think Paul's answer is a really good one. The one additional point I'd make about the $R^2$ score is that you should only compare different $R^2$ scores between models estimated on the same set of data. Conceptually, it does not make any sense to compare $R^2$ scores between models derived from different data because $R^2$ is itself just a measure of variance in the outcome explained by the model. Different sets of data will have different amounts of variance explainable.

This is a chief reason why defining a "good" $R^2$ score is hard. The $R^2$ score depends on your model's fit, of course, but also on the data itself (which in turn depends on how it was collected and on the domain it is drawn from).
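
A toy illustration of that ceiling (synthetic data, arbitrary parameters): the same underlying relationship with the same noise level, fit by the same kind of model, scores very differently depending on how much explainable variance the data happens to contain:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_and_score(x_spread):
    # Same true relationship and the same noise level in both datasets;
    # only the spread of x, and hence the explainable variance, differs.
    x = rng.uniform(0, x_spread, 50_000)
    y = 2.0 * x + rng.normal(0, 5, 50_000)
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return round(slope, 2), round(r2, 2)

print(fit_and_score(x_spread=20))  # wide-ranging x: high R^2
print(fit_and_score(x_spread=2))   # narrow x: low R^2, essentially the same fitted slope
```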


There is no general answer to what $R^2$ score you should expect, and no general answer to whether a model with a given $R^2$ score is a "good" model. There are many cases where (1) this kind of $R^2$ score is not unreasonable and (2) the model is still useful. Looking at this data, $R^2 = 0.243$ feels about right, with most of the data concentrated in an almost disk-like blob near the origin.

The article you link to deals with statistical confidence, which is a very different issue from $R^2$. I would guess that, given the amount of data, the relationship will be highly significant (a vanishingly small p-value) even though you consider the $R^2$ score low. The article also touches on p-hacking, which is an issue if you consider many possible inputs and competing models. With just one $x$, as is the case here, and if this is the only model you have built, that concern does not really apply.
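
To see how far apart $R^2$ and statistical significance can sit, here is a synthetic sketch (a weak but genuine signal, parameters chosen arbitrarily, using scipy's `linregress`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# A weak but real linear signal in a lot of noise, with a million points.
n = 1_000_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)   # true R^2 = 0.25 / (0.25 + 1) = 0.2

result = stats.linregress(x, y)
print("R^2:", result.rvalue ** 2)  # around 0.2, much like in the question
print("p-value:", result.pvalue)   # essentially zero: the slope is unmistakably non-zero
```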
