Layer notation for feed-forward neural networks

Apologies in advance, for I have a fairly rudimentary question about the notation used when studying Feed-Forward Neural Networks. Here is a nice schematic taken from this blog-post. Here $x_i = f_i(W_i \cdot x_{i-1})$, where $f_i$ is the activation function. Let us denote the number of nodes in the $i^{\text{th}}$ layer by $n_i$, and let each example in the training set be $d$-dimensional (i.e., have $d$ features).

Which of the following do the nodes in the above graph represent?

  1. Each one of the $d$ features in every example in the training set. In this case, $n_0 = d$ and $x_0$ is $(d \times 1)$.
  2. Each example of the training set, which is $d$-dimensional. In this case, $n_0$ is the number of examples and $x_0$ is $(d \times n_0)$.

In both cases, the weight matrix $W_i$ is $(n_i \times n_{i-1})$.
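
To make the dimensions concrete, here is a minimal NumPy sketch of the forward pass under each reading (the layer sizes and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 4, 5, 3                   # under (1): n0 = d; hidden/output sizes arbitrary
f = np.tanh                           # some activation function f_i

W1 = rng.standard_normal((n1, d))     # W_i is (n_i x n_{i-1})
W2 = rng.standard_normal((n2, n1))

# Reading (1): x0 is one d-dimensional example, shape (d, 1)
x0 = rng.standard_normal((d, 1))
x1 = f(W1 @ x0)                       # shape (n1, 1)
x2 = f(W2 @ x1)                       # shape (n2, 1)

# Reading (2): x0 stacks n0 examples as columns, shape (d, n0)
n0 = 10
X0 = rng.standard_normal((d, n0))
X1 = f(W1 @ X0)                       # shape (n1, n0)
X2 = f(W2 @ X1)                       # shape (n2, n0)

print(x2.shape, X2.shape)             # (3, 1) (3, 10)
```

Both variants run without shape errors, which is part of why I am unsure which convention the diagram intends.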

On the one hand, most references, like this blog-post, claim it is (1), while on the other, I can also find a few references, such as this video, which seem to claim it is (2). Which one of them carries the right interpretation?

Although it seems like the backpropagation algorithm can be executed in both representations, I'm quite sure it only makes sense in one of them. Any help will be greatly appreciated.

Topic: backpropagation, notation, neural-network

Category: Data Science


Well, the image you sent is not nicely labeled.

The first layer in the image is $x_0$, which is the input consisting of $d$ dimensions; it is actually a single sample from the training set. Here its components are $x_{01}, x_{02}, x_{03}, x_{04}$ (the green nodes on the left; hence $d = 4$). The next layer, called $x_1$, is the first hidden layer; subsequently, $x_2$ is the second hidden layer and $x_3$ is the output of this feed-forward network.

By this definition, $x_0$ is the input with $d$ dimensions $x_{01}, x_{02}, x_{03}, x_{04}$, and to calculate each node in the succeeding hidden layer, called $x_1$ here, we do the following:

Consider the topmost node in hidden layer $x_1$ as the node whose value we want to compute; call it $x_{11}$. First we compute a linear combination of weights and inputs, and then we apply some activation function $\sigma$ to it (see the code sketch after the notes below): $$x_{11} = \sigma(x_{01} \cdot w_{11} + x_{02} \cdot w_{12} + x_{03} \cdot w_{13} + x_{04} \cdot w_{14})$$

  • Also, an offset (bias) term might be added to this expression.
  • Note that hidden layers can be of any size.
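
To make this concrete, here is a minimal sketch of the computation for $x_{11}$ (the input values, the weights, the bias, and the choice of logistic $\sigma$ are illustrative assumptions):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))          # e.g. the logistic sigmoid

x0 = np.array([0.5, -1.2, 0.3, 0.8])         # input features x_01 .. x_04 (d = 4)
w1 = np.array([0.1, 0.4, -0.7, 0.2])         # weights w_11 .. w_14 into node x_11
b1 = 0.05                                    # the optional offset (bias) term

# x_11 = sigma(x_01*w_11 + x_02*w_12 + x_03*w_13 + x_04*w_14 + b_1)
x11 = sigma(x0 @ w1 + b1)
print(x11)
```

Stacking the weight rows for every node of the layer gives the vectorized form $x_1 = \sigma(W_1 x_0 + b_1)$ from the question.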
  1. Each one of the $d$ features in every example in the training set. In this case, $n_0 = d$ and $x_0$ is $(d \times 1)$.

$n_0 = d$ and $x_0$ is $(d \times 1)$ is right, and in the first layer, yes, each node depicts a single one of the $d$ features of the input. But this does not hold for the hidden layers.

  2. Each example of the training set, which is $d$-dimensional. In this case, $n_0$ is the number of examples and $x_0$ is $(d \times n_0)$.

No. As I mentioned, this architecture depicts the process for a single training sample, so each node is not a sample of the training set. You defined $n_0$ as the number of nodes in the first layer, which is the input layer, so $n_0$ here equals $d$, and the input is $x_0 = (x_{01}, x_{02}, \dots, x_{0d})$, the subscript $0$ indicating that this is the input layer for a single sample of the training set.

In the backpropagation process, we have the same architecture. By calculating the gradient at each node, we update each weight. This process is repeated many times to find the best weights; there are various approaches to updating them, such as batch, mini-batch, and stochastic updates.
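
As a rough sketch of what one such update looks like, here is full-batch gradient descent on a single sigmoid layer with a squared-error loss (the toy data, target, learning rate, and loss are assumptions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.standard_normal((4, 20))                       # d = 4 features, 20 samples as columns
y = (X.sum(axis=0, keepdims=True) > 0).astype(float)   # a toy target, shape (1, 20)

W = 0.1 * rng.standard_normal((1, 4))                  # weights of a single output node
b = 0.0
lr = 0.5                                               # learning rate

for step in range(200):
    p = sigma(W @ X + b)                  # forward pass
    # Backward pass for L = 0.5 * mean((p - y)^2):
    grad_z = (p - y) * p * (1.0 - p)      # chain rule through the sigmoid
    grad_W = grad_z @ X.T / X.shape[1]    # gradient w.r.t. the weights
    grad_b = grad_z.mean()                # gradient w.r.t. the bias
    W -= lr * grad_W                      # full-batch gradient-descent update
    b -= lr * grad_b

print(float(np.mean((sigma(W @ X + b) > 0.5) == (y > 0.5))))  # training accuracy
```

Mini-batch and stochastic variants differ only in how many columns of `X` enter each update.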
