Why is my training accuracy decreasing with higher degrees of polynomial features?

I am new to machine learning and started solving the Titanic survival problem on Kaggle.

While solving the problem with logistic regression, I trained several models with polynomial features of degree $2, 3, 4, 5, 6$. Theoretically, the accuracy on the training set should increase with the degree; however, it started decreasing past degree $2$, as the graph below shows.
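Roughly, the setup that produced that graph looks like the sketch below; the `make_classification` data is a synthetic stand-in, since the actual Titanic feature pipeline is omitted here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data; replace with your own Titanic feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for degree in [2, 3, 4, 5, 6]:
    model = make_pipeline(
        PolynomialFeatures(degree),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X, y)
    print(degree, model.score(X, y))  # accuracy on the training set
```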

Tags: classifier, logistic-regression, accuracy, scikit-learn

Category: Data Science


Higher polynomial degrees correspond to more parameters, and a model with more parameters will typically fit the training data better, since it can reach a higher likelihood (and the goal is to maximize the log-likelihood of the parameters). Yes, it will overfit, but overfitting should still mean higher accuracy on the training data. So why would more parameters stop fitting the training data?

One explanation is the Bayesian Occam's razor effect: models with more parameters do not necessarily have higher marginal likelihood. Think about it as follows: as the number of parameters grows, the model has to spread its probability mass over an ever larger solution space, and therefore ever more thinly, so it tends to be flat (i.e., counter-intuitively, it underfits). This is referred to as the conservation of probability mass principle. For more on this, see section 5.3.1 of Kevin P. Murphy's "Machine Learning: A Probabilistic Perspective".
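Concretely (following Murphy's treatment), the marginal likelihood averages the data fit over the prior rather than maximizing it:

$$ p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid m)\, d\theta $$

A model $m$ with more parameters must spread the prior $p(\theta \mid m)$ over a larger parameter space, so each setting of $\theta$ receives less mass, and the integral can shrink even though the best-fitting $\theta$ improves.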

Another suspect is the curse of dimensionality: the solution space grows exponentially as you add more parameters. So although it might look as if the model suddenly underfits the training data, the real issue may be that the space the optimizer has to search has grown too large once the degree passes a certain threshold.


Have you tried normalization, or does your algorithm not need it?

  • if $x, y < 1.0$, then $x^2, y^2, xy, \ldots$ become too small
  • if $x, y > 1.0$, then $x^2, y^2, xy, \ldots$ become too large

Many machine learning algorithms need the features to be normalized so that they are all on the same scale: $$ x \leftarrow \frac{x - x_{\text{mean}}}{x_{\text{std}}}, \qquad x^2 \leftarrow \frac{x^2 - (x^2)_{\text{mean}}}{(x^2)_{\text{std}}}, \qquad \ldots $$

If you don't normalize them, training may be very slow or may not converge at all.
You can use sklearn.preprocessing.StandardScaler for this.
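A minimal sketch of that, with synthetic stand-in data in place of your own features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Scaling *after* the polynomial expansion puts x, x^2, xy, ... on the
# same scale, which helps the solver converge.
model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print(model.score(X, y))  # training-set accuracy
```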


I disagree with the assertion that "theoretically the accuracy on the training set should increase with degree". The goal of polynomial regression is not to try new polynomials at random; the goal is to use a polynomial that fits your data better because the relationship is not linear.

Let's think about the end result of linear regression: it is usually something like $y = mx + b$.

If you show that to a data scientist, they're going to tell you it's linear regression. Show it to a math student and they will tell you it's the formula for a straight line. Either way, it's just a formula for a graph. But note that it describes a straight line, and not all data is linear. So, knowing that you are just coming up with a formula, you should think about polynomial regression in the same way: what graph am I trying to draw?

If a scatter plot shows a correlation but the relationship is clearly curved (roughly quadratic, say), then you should use a polynomial of the corresponding shape; the same goes for the other variations. There is no logical reason to use a polynomial whose graph will not closely align with the correlation in your data.
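As a hypothetical illustration of that eyeballing step, here is a sketch that compares a straight-line fit with a quadratic fit on clearly curved (made-up) data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 2 * x**2 + rng.normal(scale=1.0, size=x.size)  # clearly curved, not linear

plt.scatter(x, y, s=10, alpha=0.5, label="data")

# Fit a straight line and a quadratic, and compare them by eye.
for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)
    plt.plot(x, np.polyval(coeffs, x), label=f"degree {degree}")

plt.legend()
plt.show()
```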
