Siamese Network - Sigmoid function to compute similarity score
I am referring to the siamese neural networks introduced in this paper by G. Koch et al.
The siamese net computes two embeddings and then calculates the L1 distance between them, which is by definition non-negative, i.e. a value in [0, +inf). The sigmoid activation function is then applied to this non-negative input, so the output would lie in [0.5, 1), right?
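Here is a minimal PyTorch sketch of the computation as I understand it (the embedding network and the input dimensions are placeholders I made up, not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    """Twin embeddings -> L1 distance -> sigmoid similarity score."""
    def __init__(self, embedding_net):
        super().__init__()
        self.embedding_net = embedding_net  # shared weights for both inputs

    def forward(self, x1, x2):
        h1 = self.embedding_net(x1)            # embedding of first image
        h2 = self.embedding_net(x2)            # embedding of second image
        l1 = torch.abs(h1 - h2)                # element-wise |h1 - h2|, all >= 0
        return torch.sigmoid(l1.sum(dim=1))    # sigmoid of a non-negative value -> in [0.5, 1)

# Toy usage with a random placeholder embedding network
embedding_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())
model = SiameseHead(embedding_net)
x1, x2 = torch.randn(4, 1, 28, 28), torch.randn(4, 1, 28, 28)
print(model(x1, x2))  # every score lands in [0.5, 1)
```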
So, if two images are from the same class, the L1 distance should be close to 0 and the sigmoid output close to 0.5, yet the label given to the pair is 1 (same class). If two images are from different classes, the L1 distance should be large and the sigmoid output close to 1, yet the label given to the pair is 0 (different classes).
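To put concrete numbers on the mismatch (10 is just an arbitrary "large" distance I picked):

$$\sigma(0) = 0.5 \quad \text{(identical embeddings, but target label } 1\text{)}$$
$$\sigma(10) \approx 0.99995 \quad \text{(distant embeddings, but target label } 0\text{)}$$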
How does using a sigmoid function to compute the similarity score (0 = dissimilar, 1 = similar) in a siamese neural network make sense here?
Topic siamese-networks neural-network
Category Data Science