Help with MLP convergence

I posted this question on AI SE and was advised to ask here for guidance. I've been stuck for a couple of days trying to figure out how the standard MLP works and why my code doesn't converge at all on XOR (it doesn't break either; it just produces some numbers). To keep things short and straightforward (you can find more details in the link above), I'm stuck coding backpropagation for a simple architecture ($1$ hidden layer) in a small library written in C#; the source code can be found here.

To update the output layer's weights, I use the following equation:

$$ \frac{\partial E^2}{\partial w_{kj}}=-2\cdot(y_k-\hat{y_k})\cdot \underbrace{\frac{\partial f_k^o}{\partial net_k^o}}_{f'_k(net_k^o)} \cdot \underbrace{\frac{\partial net_k^o}{\partial w_{kj}}}_{f_j(net_j^h)} $$

For the output layer's biases:

$$\frac{\partial E^2}{\partial b_k} = -2 \cdot (y_k-\hat{y_k}) \cdot \frac{\partial f_k^o}{\partial net_k^o}\cdot1$$
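
Stacked over all output units $k$, these two are what I try to implement below in matrix form ($\odot$ is the Hadamard product, the % operator in the code):

$$ \delta^o = -2\,(y-\hat{y}) \odot f'(net^o), \qquad \frac{\partial E^2}{\partial W^o} = \delta^o \left(f^h(net^h)\right)^T, \qquad \frac{\partial E^2}{\partial b^o} = \delta^o $$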

And to update the hidden layer's weights:

$$ \frac{\partial E^2}{\partial w_{ji}}=-2\cdot(y_k-\hat{y_k}) \cdot \underbrace{\frac{\partial f_k^o}{\partial net_k^o}}_{f'_k(net_k^o)} \cdot \underbrace{\frac{\partial net_k^o}{\partial f_j^h}}_{(w_k^o)^T} \cdot \underbrace{\frac{\partial f_j^h}{\partial net_j^h}}_{f'_j(net_j^h)} \cdot \underbrace{\frac{\partial net_j^h}{\partial w_{ji}}}_{x_i} $$

And the hidden layer's biases:

$$ \frac{\partial E^2}{\partial b_j^h}=-2\cdot(y_k-\hat{y_k}) \cdot \underbrace{\frac{\partial f_k^o}{\partial net_k^o}}_{f'_k(net_k^o)} \cdot \underbrace{\frac{\partial net_k^o}{\partial f_j^h}}_{(w_k^o)^T} \cdot \underbrace{\frac{\partial f_j^h}{\partial net_j^h}}_{f'_j(net_j^h)} \cdot \underbrace{\frac{\partial net_j^h}{\partial b_j^h}}_{1} $$
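
And the hidden layer's updates in the same matrix form:

$$ \delta^h = \left((W^o)^T \delta^o\right) \odot f'(net^h), \qquad \frac{\partial E^2}{\partial W^h} = \delta^h\, x^T, \qquad \frac{\partial E^2}{\partial b^h} = \delta^h $$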

The dataset that I'm using for training (the first two columns are the inputs, the last one is the expected XOR output):

double[,] dataset = { {1, 1, 0}, {1, 0, 1}, {0, 0, 0}, {0, 1, 1} };

The actual backpropagation:

public List<double> Backpropagation(double[,] dataset, double eta = 0.01, double threshold = 1e-5)
{
    List<double> ret = new List<double>();

    double rows = dataset.GetLength(0);

    const int number_iter = 500000;
    int counter = 0;

    double squaredError = 2 * threshold;

    while (counter < number_iter && squaredError > threshold)
    {
        squaredError = 0;
        counter++;

        for (int i = 0; i < rows; i++)
        {
            double[] xp = new double[i_n];

            for (int j = 0; j < xp.Length; j++)
                xp[j] = dataset[i, j];

            double[] yp = new double[dataset.GetLength(1) - i_n]; // number of outputs expected

            for (int j = 0; j < yp.Length; j++)
                yp[j] = dataset[i, i_n + j];

            Matrix Yp = new Matrix(yp); // column based matrix

            var ff = Feedforward(xp);

            Matrix[] net = ff.Item1;
            Matrix[] fnet = ff.Item2;
            Matrix i_m = ff.Item3;

            Matrix Op = fnet[fnet.Length - 1];

            Matrix error = Op - Yp;

            squaredError += error.SquaredSum();

            // Backpropagation

            // % hadamard product, * matrix multiplication

            Matrix error_o = (-2 * error) % Matrix.Map(fnet[fnet.Length - 1], da_f);

            Matrix error_h = (Matrix.T(w[w.Length - 1]) * error_o) % Matrix.Map(fnet[fnet.Length - 2], da_f);

            Matrix gradient_wo = error_o * Matrix.T(fnet[fnet.Length - 2]);
            Matrix gradient_bo = error_o;

            Matrix gradient_wh = error_h * Matrix.T(i_m);
            Matrix gradient_bh = error_h;

            w[1] = w[1] - (eta * gradient_wo);
            b[1] = b[1] - (eta * gradient_bo);

            w[0] = w[0] - (eta * gradient_wh);
            b[0] = b[0] - (eta * gradient_bh);
        }

        squaredError /= rows;

        ret.Add(squaredError);
    }

    return ret;
}
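
For context, the fields the method relies on look roughly like this (a sketch, declarations only, simplified from the actual library):

// Fields referenced by Backpropagation above (names taken from the snippet)
int i_n;      // number of input neurons (2 for XOR)
Matrix[] w;   // w[0]: input-to-hidden weights, w[1]: hidden-to-output weights
Matrix[] b;   // b[0]: hidden-layer biases,     b[1]: output-layer biases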

I'm using $2$ criteria to stop the algorithm:

  1. number of iterations
  2. mean squared error below a given threshold

The Matrix.Map function just loops through the entries of the matrix and applies a given function da_f to each entry (in this case, the sigmoid's derivative).
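
Roughly, Matrix.Map and da_f look like this (a sketch with assumed member names (Rows, Cols and the indexer), not the exact library code; da_f is written in terms of the already-activated value, since it gets applied to fnet):

// Sigmoid derivative expressed via the activation a = sigmoid(net),
// which is why it is applied to fnet (the activated values) and not to net.
static double da_f(double a) => a * (1.0 - a);

// Apply a scalar function element-wise and return a new matrix.
public static Matrix Map(Matrix m, Func<double, double> f)
{
    Matrix result = new Matrix(m.Rows, m.Cols);
    for (int r = 0; r < m.Rows; r++)
        for (int c = 0; c < m.Cols; c++)
            result[r, c] = f(m[r, c]);
    return result;
}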

Running the algorithm, the squaredError variable gets stuck around $0.5$ every time with $2$ hidden neurons (it doesn't actually reach $0.5$, but gets pretty close). I tried increasing the number of hidden neurons to something like $9$, but nothing changed: it still converges to $0.5$, even when tuning $\eta$ from $0.01$ to $0.2$. With $2$ neurons in the hidden layer, squaredError produced the following curve:

Running Feedforward on the XOR inputs with $2$ hidden neurons after the backpropagation process produces the following:

1 1 0.953206137420423
1 0 0.889913390479977
0 0 0.846650110272423
0 1 0.932942164537037

What am I doing wrong to produce this kind of behaviour? Any hint is much appreciated!

Thanks in advance.

Topic mlp convergence implementation algorithms machine-learning

Category Data Science
