Pre-train using sigmoid and train using ReLU?

When using RBMs to pre-train a deep net, as in this RBM example, the activation function is a sigmoid, which makes the math much easier.

What are the implications of switching to ReLU for the training phase after the initial weights have been learned with sigmoid activation functions?

I suppose that using tanh in one phase (pre-training or training) and sigmoid or ReLU in the other would cause serious problems, but since ReLU and sigmoid behave similarly for small input values, would switching still render the pre-training phase useless?

The question can be stated more generally: how much knowledge can be transferred from a neural network that uses sigmoid activation functions to one with an identical structure that uses ReLU activations?
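To make the setup concrete, here is a minimal sketch of what I have in mind (NumPy, with made-up layer sizes and random placeholder data, so only the weight-transfer idea matters, not the numbers): a Bernoulli RBM pre-trained with CD-1 and sigmoid units, whose weights are then reused to initialize a layer that applies ReLU instead.

```python
# Minimal sketch (NumPy, hypothetical sizes): pre-train a Bernoulli RBM with
# CD-1 using sigmoid units, then reuse its weights in a ReLU layer.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 784, 256          # e.g. MNIST-sized input (assumption)
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)

def cd1_step(v0, lr=0.01):
    """One contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM."""
    # Positive phase: hidden probabilities given the data
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0_sample = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one Gibbs step down to the visibles and back up
    v1_prob = sigmoid(h0_sample @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Approximate gradients and in-place parameter updates
    batch = v0.shape[0]
    dW = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    W[...] += lr * dW
    b_vis[...] += lr * (v0 - v1_prob).mean(axis=0)
    b_hid[...] += lr * (h0_prob - h1_prob).mean(axis=0)

# Pre-train on random binary "data" just to exercise the update (placeholder)
data = (rng.random((64, n_visible)) < 0.5).astype(float)
for _ in range(10):
    cd1_step(data)

# Training phase: reuse the learned W and hidden bias, but apply ReLU instead.
def relu_layer(x):
    return np.maximum(0.0, x @ W + b_hid)

features = relu_layer(data)
```

In this sketch the hidden activations computed with ReLU differ in scale and sparsity from the sigmoid probabilities the RBM was trained to produce, which is exactly the mismatch I am asking about.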



Since an RBM has only one layer of weights, why bother changing sigmoid to ReLU in a one-layer net? Vanishing gradients are very unlikely to occur in such a shallow network.

You can also train a Gaussian-Bernoulli or Gaussian-Gaussian RBM (more here), which has an identity activation function for the visible units. That is closer to ReLU than a sigmoid, and it is also better justified if your data are real-valued rather than binary. However, these types of networks are a bit less stable to train because of the unconstrained activation.
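A rough sketch of one Gibbs step in the Gaussian-Bernoulli case (assuming unit-variance, standardized visibles; the layer sizes are arbitrary) shows where the identity activation enters and why the unbounded reconstruction can make training less stable:

```python
# Sketch of a Gaussian-Bernoulli RBM Gibbs step with sigma = 1 (assumption):
# the visible reconstruction is linear (identity activation), which is closer
# in spirit to ReLU than a squashed sigmoid reconstruction.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 100, 64
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)

def gibbs_step(v):
    """One Gibbs step for a Gaussian-Bernoulli RBM with unit variance."""
    h_prob = sigmoid(v @ W + b_hid)                        # hidden units stay Bernoulli
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_mean = h_sample @ W.T + b_vis                        # identity activation, unbounded
    v_sample = v_mean + rng.standard_normal(v_mean.shape)  # Gaussian noise, sigma = 1
    return h_prob, v_sample

v0 = rng.standard_normal((8, n_visible))   # standardized real-valued "data" (placeholder)
h_prob, v1 = gibbs_step(v0)
```

Because v_mean is unbounded, learning rates usually need to be smaller and the data standardized, which is the instability mentioned above.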
