Pre-train using sigmoid and train using ReLU?
When using RBMs to pre-train a deep net, as in this RBM example, the activation function is sigmoid, which makes the math much easier.
What are the implications of switching to ReLU for the training phase after the initial weights have been learned with sigmoid activations?
I suppose that using tanh in one phase (pre-train or train) and sigmoid or ReLU in the other would cause serious problems, but since ReLU and sigmoid behave similarly for small values, would the mismatch still render the pre-train phase useless?
More generally: how much knowledge can be transferred from a neural network using sigmoid activations to one identical in structure but using ReLU activations?
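To make the question concrete, here is a minimal numpy sketch of what I mean by "switching activations" (the weights here are random stand-ins for RBM-learned ones): the pre-activations are identical in both nets, and only the nonlinearity applied to them changes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)

# Stand-ins for weights learned by RBM pre-training (random here).
W = rng.normal(scale=0.1, size=(4, 3))  # 4 visible units -> 3 hidden units
b = np.zeros(3)

x = rng.normal(size=(1, 4))  # one input vector
z = x @ W + b                # pre-activation: identical in both nets

h_sigmoid = sigmoid(z)  # what the sigmoid pre-trained net computes
h_relu = relu(z)        # what the fine-tuned ReLU net computes, same weights

print(h_sigmoid)
print(h_relu)
```

So the transferred "knowledge" lives entirely in `z = xW + b`; the question is how much of it survives when `sigmoid(z)` is replaced by `relu(z)` downstream.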
Topic rbm
Category Data Science