Understanding Contrastive Divergence

Question

Understanding Contrastive Divergence

Finn Williams

2019年1月25日 01:40

I’m trying to understand, and eventually build a Restricted Boltzmann Machine. I understand that the update rule - that is the algorithm used to change the weights - is something called “contrastive divergence”. I looked this up on Wikipedia and found these steps:

Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
Compute the outer product of v and h and call this the positive gradient.
From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this. (Gibbs sampling step)
Compute the outer product of v' and h' and call this the negative gradient.
...

I don’t understand step 3 and I’m struggling to grasp the concept of Gibbs sampling. Would someone explain this simply to me? I have covered neural networks if that helps you.

Topic unsupervised-learning rbm neural-network python machine-learning

Category Data Science

christianb93 · Accepted Answer · 2019年1月25日 01:40

Gibbs sampling is an example for the more general Markov chain Monte Carlo methods to sample from distribution in a high-dimensional space.

To explain this, I will first have to introduce the term state space. Recall that a Boltzmann machine is built out of binary units, i.e. every unit can be in one of two states - say 0 and 1. The overall state of the network is then specified by the state for every unit, i.e. the states of the network can be described as points in the space $\{0,1\}^N$, where N is the number of units in the network. This point is called the state space.

Now, on that state space, we can define a probability distribution. The details are not so important, but what you essentially do is that you define energy for every state and turn that into a probability distribution using a Boltzmann distribution. Thus there will be states that are likely and other states that are less likely.

A Gibbs sampler is now a procedure to produce a sample, i.e. a sequence $X_n$ of states such that, roughly speaking, the distribution of these states across the state space reflects the probability distribution. Thus you want most of the $X_n$ to be in regions of the state space with high probability (and low energy), and few of them to be in regions with low probability (and high energy).

To do this, a naive Gibbs sampling approach would proceed as follows. You start with some state $X_0$. To find the state $X_1$, you would pick some unit and calculate the conditional probability for that unit to be in state 1 ("on") conditional on the current value of all other units. Call this number p. You would then set the unit to 1 with probability p and pick the next unit to repeat this to get from $X_1$ to $X_2$ and so forth.

In the special case of a restricted Boltzmann machine, this can be greatly simplified. Instead of going through, say, first all hidden units and then all visible units and update them like this one by one, you can, in fact, update all hidden units in one step and all visible units in one step, because any two hidden units and any two visible units are independent. Thus, for a full cycle through all units of the state space, you would:

calculate the probability for all hidden units to be 1, given the value of the visible units
set the hidden units to 1 with this probability
calculate the probability for the visible units to be 1, again conditional on the value of the hidden units, and
set the visible units to 1 with this probability

This constitutes one full Gibbs sampling step and is your step 1 + the first part of 3 (the second part is then needed for the further calculation and not part of the sampling). The reason why we do this in the CD algorithm is that we actually want to approximate an expectation value and use a sampler for this.

This is a complex topic and hard to summarize in a few sentences. If you want to learn more about the mathematics behind this (Markov chains) and on the application to RBMs (contrastive divergence and persistent contrastive divergence), you might find this and this document helpful - these are some notes that I put together while learning about this.

Understanding Contrastive Divergence

About