Help with MLP convergence
I posted this question on AI SE and was advised to ask here for guidance. I've been stuck for a couple of days trying to figure out how the standard MLP works and why my code doesn't converge at all on XOR
(it doesn't break either; it just produces some numbers). To keep things short and straightforward (you can find more details in the link above), I'm stuck coding backpropagation with a simple architecture ($1$ hidden layer) in a small library written in C#;
the source code can be found here.
To update the output layer's weights, I followed this equation:
$$ \frac{\partial E^2}{\partial w_{kj}}=-2\cdot(y_k-\hat{y_k})\cdot \underbrace{\frac{\partial f_k^o}{\partial net_k^o}}_{f'_k(net_k^o)} \cdot \underbrace{\frac{\partial net_k^o}{\partial w_{kj}}}_{f_j(net_j^h)} $$
The output's biases:
$$\frac{\partial E^2}{\partial b_k} = -2 \cdot (y_k-\hat{y_k}) \cdot \frac{\partial f_k^o}{\partial net_k^o}\cdot1$$
And to update the hidden layer's weights:
$$ \frac{\partial E^2}{\partial w_{ji}}=-2\cdot(y_k-\hat{y_k}) \cdot \underbrace{\frac{\partial f_k^o}{\partial net_k^o}}_{f'_k(net_k^o)} \cdot \underbrace{\frac{\partial net_k^o}{\partial f_j^h}}_{(w_k^o)^T} \cdot \underbrace{\frac{\partial f_j^h}{\partial net_j^h}}_{f'_j(net_j^h)} \cdot \underbrace{\frac{\partial net_j^h}{\partial w_{ji}}}_{x_i} $$
And the hidden layer's biases:
$$ \frac{\partial E^2}{\partial b_j^h}=-2\cdot(y_k-\hat{y_k}) \cdot \underbrace{\frac{\partial f_k^o}{\partial net_k^o}}_{f'_k(net_k^o)} \cdot \underbrace{\frac{\partial net_k^o}{\partial f_j^h}}_{(w_k^o)^T} \cdot \underbrace{\frac{\partial f_j^h}{\partial net_j^h}}_{f'_j(net_j^h)} \cdot \underbrace{\frac{\partial net_j^h}{\partial b_j^h}}_{1} $$
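Stacked into matrix form (with $\odot$ denoting the Hadamard product), the four gradients above read:
$$ \delta^o = -2\,(y-\hat{y}) \odot f'(net^o), \qquad \delta^h = \left((W^o)^T \delta^o\right) \odot f'(net^h) $$
$$ \frac{\partial E^2}{\partial W^o} = \delta^o \,(f^h)^T, \qquad \frac{\partial E^2}{\partial b^o} = \delta^o, \qquad \frac{\partial E^2}{\partial W^h} = \delta^h\, x^T, \qquad \frac{\partial E^2}{\partial b^h} = \delta^h $$
This is the form I try to mirror with matrix operations in the code below.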
The dataset that I'm using to train (the first two columns are the XOR inputs, the last column is the expected output):
double[,] dataset = { {1, 1, 0}, {1, 0, 1}, {0, 0, 0}, {0, 1, 1} };
The actual backpropagation:
public List<double> Backpropagation(double[,] dataset, double eta = 0.01, double threshold = 1e-5)
{
    List<double> ret = new List<double>();
    double rows = dataset.GetLength(0);
    const int number_iter = 500000;
    int counter = 0;
    double squaredError = 2 * threshold;
    while (counter < number_iter && squaredError > threshold)
    {
        squaredError = 0;
        counter++;
        for (int i = 0; i < rows; i++)
        {
            double[] xp = new double[i_n];
            for (int j = 0; j < xp.Length; j++)
                xp[j] = dataset[i, j];
            double[] yp = new double[dataset.GetLength(1) - i_n]; // number of outputs expected
            for (int j = 0; j < yp.Length; j++)
                yp[j] = dataset[i, i_n + j];
            Matrix Yp = new Matrix(yp); // column based matrix
            var ff = Feedforward(xp);
            Matrix[] net = ff.Item1;
            Matrix[] fnet = ff.Item2;
            Matrix i_m = ff.Item3;
            Matrix Op = fnet[fnet.Length - 1];
            Matrix error = Op - Yp;
            squaredError += error.SquaredSum();
            // Backpropagation
            // % hadamard product, * matrix multiplication
            Matrix error_o = (-2 * error) % Matrix.Map(fnet[fnet.Length - 1], da_f);
            Matrix error_h = (Matrix.T(w[w.Length - 1]) * error_o) % Matrix.Map(fnet[fnet.Length - 2], da_f);
            Matrix gradient_wo = error_o * Matrix.T(fnet[fnet.Length - 2]);
            Matrix gradient_bo = error_o;
            Matrix gradient_wh = error_h * Matrix.T(i_m);
            Matrix gradient_bh = error_h;
            w[1] = w[1] - (eta * gradient_wo);
            b[1] = b[1] - (eta * gradient_bo);
            w[0] = w[0] - (eta * gradient_wh);
            b[0] = b[0] - (eta * gradient_bh);
        }
        squaredError /= rows;
        ret.Add(squaredError);
    }
    return ret;
}
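For completeness, this is roughly how I invoke the training and inspect the returned MSE curve. The class name MLP and its constructor arguments are just placeholders here, not the actual API of the library:
using System;
using System.Collections.Generic;

// Sketch only: "MLP" and its constructor signature are placeholders for the real class.
double[,] dataset = { {1, 1, 0}, {1, 0, 1}, {0, 0, 0}, {0, 1, 1} };
var mlp = new MLP(inputs: 2, hidden: 2, outputs: 1);        // assumed constructor: input/hidden/output sizes
List<double> mse = mlp.Backpropagation(dataset, eta: 0.1);  // returns the per-epoch mean squared error
Console.WriteLine($"epochs: {mse.Count}, final MSE: {mse[mse.Count - 1]}");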
I'm using $2$ criteria to stop the algorithm:
- number of iterations
- mean squared error below a given threshold
The Matrix.Map function just loops through the entries of the matrix and applies a given function da_f to each entry (in this case, the sigmoid's derivative).
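To be explicit about what I mean by the sigmoid's derivative, here is a sketch (the names are illustrative, not the exact code of the library). There are two equivalent ways to write it, depending on whether the function receives the pre-activation $net$ or the activation $a = f(net)$; the second form is the one consistent with applying Map to fnet, as the code above does:
static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

// Derivative expressed in terms of the pre-activation net:
static double DSigmoidFromNet(double net) { double a = Sigmoid(net); return a * (1.0 - a); }

// Derivative expressed in terms of the activation a = Sigmoid(net);
// this is the form that matches passing fnet (rather than net) into Map:
static double DSigmoidFromActivation(double a) => a * (1.0 - a);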
Running the algorithm, the squaredError variable got stuck every time around $0.5$ with $2$ hidden neurons (it doesn't actually reach $0.5$, but it gets pretty close). I tried increasing the number of hidden neurons to something like $9$, but nothing changed: it kept converging to $0.5$, even when tuning $\eta$ from $0.01$ to $0.2$. With $2$ neurons in the hidden layer the squaredError variable produced the following curve:
Using Feedforward on the XOR inputs with $2$ hidden neurons after the backpropagation process produces the following:
1 1 0.953206137420423
1 0 0.889913390479977
0 0 0.846650110272423
0 1 0.932942164537037
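Those numbers come from a loop roughly like the following, reusing the trained instance from the sketch above and reading the last activation from the Feedforward tuple (the element access output[0, 0] assumes the Matrix class exposes an indexer, which may differ in the actual library):
foreach (var input in new[] { new double[] {1, 1}, new double[] {1, 0}, new double[] {0, 0}, new double[] {0, 1} })
{
    var ff = mlp.Feedforward(input);                  // same tuple layout as in Backpropagation
    Matrix output = ff.Item2[ff.Item2.Length - 1];    // last activation = network output
    Console.WriteLine($"{input[0]} {input[1]} {output[0, 0]}");  // assumed indexer on Matrix
}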
What am I doing wrong to produce this kind of behaviour? Any hint is much appreciated!
Thanks in advance.
Topic mlp convergence implementation algorithms machine-learning
Category Data Science