Trouble understanding regression line learned by SGDRegressor

I am working on a demonstration notebook to better understand online (incremental) learning. According to the sklearn documentation, the number of regression models that support online learning via the partial_fit() method is fairly limited: only SGDRegressor and PassiveAggressiveRegressor are available. Additionally, XGBoost supports similar functionality via its xgb_model argument. For now, I chose SGDRegressor to experiment with.
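As a quick illustration of that shared interface, here is a minimal sketch with toy data (the data and values are placeholders of mine, not part of my actual experiment); both estimators accept the same partial_fit(X, y) call:

import numpy as np
from sklearn import linear_model

# toy batch: y = 2x + 1, just to exercise the API
X_batch = np.arange(10, dtype=float).reshape(-1, 1)
y_batch = 2 * X_batch.ravel() + 1

# both online-capable regressors expose the same incremental interface
sgd = linear_model.SGDRegressor()
sgd.partial_fit(X_batch, y_batch)

pa = linear_model.PassiveAggressiveRegressor()
pa.partial_fit(X_batch, y_batch)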

I created a sample dataset (dataset generation code below). The dataset looks like this:

Even though this dataset is clearly not a good candidate for a linear regression model like SGDRegressor, my point with this snippet is merely to demonstrate how the learned parameters (coef_, intercept_) and regression line change as more and more data points are seen by the model.

My approach:

  • collecting the first 100 data points after sorting the data
  • training an initial model on those first 100 observations and retrieving the learned parameters
  • plotting the learned regression line
  • iterate: take N new observations, call partial_fit(), retrieve the updated parameters, and plot the updated regression line (see the sketch after this list)
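Sketched in code, that loop could look roughly like this (my assumptions: df is the sorted dataframe with columns X and y built in the full code below, and the chunk size n is a placeholder):

def incremental_updates(df, n=100):
    """Fit SGDRegressor chunk by chunk, reporting params after each update."""
    model = linear_model.SGDRegressor()
    for start in range(0, len(df), n):
        # fit on the next n observations only
        X_chunk = df.X[start:start + n].to_numpy().reshape(-1, 1)
        model.partial_fit(X_chunk, df.y[start:start + n])
        print(f"seen: {start + n}, coef: {model.coef_[0]:.2f}, "
              f"intercept: {model.intercept_[0]:.2f}")
        # plot the updated regression line over the data seen so far
        seen = df.X[:start + n].to_numpy().reshape(-1, 1)
        plt.plot(seen, model.predict(seen), '-')
    return model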

The problem is that the learned parameters and the regression line don't seem correct at all after training on the first 100 observations. I tried tinkering with the max_iter and eta0 arguments of SGDRegressor(), as I thought SGD simply couldn't converge to the optimal solution because the learning rate was too low and/or the maximum number of iterations was too small. However, this didn't seem to help.
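For reference, the kind of variation I mean looks like this (the specific values here are illustrative placeholders, not the exact ones I tried):

# illustrative hyperparameter tweaks; the values are placeholders
model = linear_model.SGDRegressor(max_iter=10000, eta0=0.1)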

Here are my plots:

My full code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

random_state = 1

# generating first section
x1, y1 = datasets.make_regression(n_samples=1000, n_features=1, noise=20, random_state=random_state)
x1 = np.interp(x1, (x1.min(), x1.max()), (0, 20))
y1 = np.interp(y1, (y1.min(), y1.max()), (100, 300))

# generating second section
x2, y2 = datasets.make_regression(n_samples=1000, n_features=1, noise=20, random_state=random_state)
x2 = np.interp(x2, (x2.min(), x2.max()), (15, 25))
y2 = np.interp(y2, (y2.min(), y2.max()), (275, 550))

# generating third section
x3, y3 = datasets.make_regression(n_samples=1000, n_features=1, noise=20, random_state=random_state)
x3 = np.interp(x3, (x3.min(), x3.max()), (24, 50))
y3 = np.interp(y3, (y3.min(), y3.max()), (500, 600))

# combining three sections into X and y
X = np.concatenate([x1, x2, x3])
y = np.concatenate([y1, y2, y3])

# plotting the combined dataset
plt.figure(figsize=(15,5))
plt.plot(X, y, '.')
plt.show()

# organizing and sorting data in dataframe
df = pd.DataFrame([])
df['X'] = X.flatten()
df['y'] = y.flatten()
df = df.sort_values(by='X')
df = df.reset_index(drop=True)

# train model on first 100 observations
model = linear_model.SGDRegressor()
model.partial_fit(df.X[:100].to_numpy().reshape(-1, 1), df.y[:100])
print(f"model coef: {model.coef_[0]:.2f}, intercept: {model.intercept_[0]:.2f}")
regression_line = model.predict(df.X[:100].to_numpy().reshape(-1, 1))
plt.figure(figsize=(15, 5))
plt.plot(X, y, '.')
plt.plot(df.X[:100], regression_line, linestyle='-', color='r')
plt.title("SGDRegressor on first 100 observations with default arguments")

What am I misunderstanding or overlooking here?

Topic: linear-regression, online-learning, python, machine-learning

Category: Data Science


A single call to partial_fit is very unlikely to get you a good fit, as it performs only one iteration of stochastic gradient descent. As stated in the docs:

Internally, this method uses max_iter = 1. Therefore, it is not guaranteed that a minimum of the cost function is reached after calling it once. Matters such as objective convergence and early stopping should be handled by the user.

source
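In other words, to approximate what a plain fit() call does internally, you have to supply the loop yourself, for example by making repeated passes over the same window. A minimal sketch (the epoch count is an arbitrary assumption of mine):

# each partial_fit call is a single epoch (max_iter = 1 internally),
# so repeated calls over the same window approximate a full fit()
model = linear_model.SGDRegressor()
X_window = df.X[:100].to_numpy().reshape(-1, 1)
y_window = df.y[:100]
for epoch in range(50):  # arbitrary number of passes
    model.partial_fit(X_window, y_window)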

I'm not very familiar with online learning and partial fits, but it seems you need to call partial_fit repeatedly in a loop if you want this to work. After playing around for a bit, I found that this simple modification already greatly improves the results:

# train the model incrementally on growing slices of the data
model = linear_model.SGDRegressor()
amount = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300]
for a in amount:
    model.partial_fit(df.X[:a].to_numpy().reshape(-1, 1), df.y[:a])
    print(f"amount: {a}, model coef: {model.coef_[0]:.2f}, intercept: {model.intercept_[0]:.2f}")
regression_line = model.predict(df.X[:800].to_numpy().reshape(-1, 1))
plt.figure(figsize=(15, 15))
plt.plot(X, y, '.')
plt.plot(df.X[:800], regression_line, linestyle='-', color='r')
plt.title("SGDRegressor trained incrementally with default arguments")
plt.show()

In the output you can see that the intercept is increasing while the coefficient is decreasing with each call, which is what we would expect as the model converges toward a good fit.
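Since the docs leave objective convergence to the user, another option is to keep looping until the parameters stop moving. A hedged sketch (the tolerance and epoch cap here are arbitrary assumptions, not tuned values):

# user-side convergence check, as the docs suggest: loop until the
# coefficient stops changing meaningfully (tolerance/cap are arbitrary)
model = linear_model.SGDRegressor()
X_window = df.X[:100].to_numpy().reshape(-1, 1)
y_window = df.y[:100]
prev_coef = None
for epoch in range(1000):
    model.partial_fit(X_window, y_window)
    if prev_coef is not None and abs(model.coef_[0] - prev_coef) < 1e-3:
        print(f"converged after {epoch + 1} epochs")
        break
    prev_coef = model.coef_[0]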
I hope this is enough to get your project moving again!
