Training the parameters of a Restricted Boltzmann machine

Why are the parameters of a Restricted Boltzmann machine trained for a fixed number of iterations (epochs) in many papers instead of choosing the ones corresponding to a stationary point of the likelihood?

Denote the observable data by $x$, the hidden data by $h$, the energy function by $E$, and the normalizing constant by $Z$. The probability of $x$ is: \begin{equation} P(x) = \sum_h P(x,h) = \sum_h \frac{e^{-E(x,h)}}{Z}. \end{equation} The goal is to maximize the probability of $x$ conditional on the parameters of the model, $\theta$. Suppose one has access to a sample of $N$ observations of $x$ with typical element $x_i$. As an estimator, one could find the roots of the derivative of the average sample log-likelihood, \begin{equation} \left\lbrace \hat{\theta} \in \hat{\Theta} : N^{-1} \sum_{x_i} \frac{\partial \log p(x_i)} {\partial \theta} = 0 \right\rbrace, \end{equation} and choose the one $\theta^\star \in \hat{\Theta}$ maximizing the empirical likelihood. There exist many ways to approximate the derivative of the log-likelihood to facilitate (maybe even permit) its computation; for example, Contrastive Divergence and Persistent Contrastive Divergence are often used.

I wonder whether it makes sense to estimate the parameters $\theta$ recursively until convergence while continuing to approximate the derivative of the log-likelihood. One could update the parameters after seeing each data point $x_i$ as: \begin{equation} \theta_{i+1} = \theta_{i} + \eta_i \left. \frac{\partial \log p(x_i)}{\partial \theta} \right|_{\theta = \theta_i}. \end{equation}

The practice I learned from Hinton et al. (2006) and from Tieleman (2008) is different, though: both papers fix the number of iterations a priori. Could somebody kindly explain why recursively updating the parameters until convergence is not a good idea? In particular, I am interested in whether there is a theoretical flaw in my reasoning or whether computational constraints dictate sticking to a fixed number of iterations. I am grateful for any help!
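To make the question concrete, here is a minimal sketch of what I have in mind (my own illustration, not code from either paper): per-example updates using one-step Contrastive Divergence as the gradient approximation for a binary RBM, stopping when the updates become small rather than after a fixed number of epochs. The names (`W`, `b`, `c`, `eta`, `tol`) and the stopping rule are assumptions made only for illustration.

```python
# Sketch: stochastic CD-1 updates for a binary RBM, stopping when the
# parameter change falls below a tolerance instead of after a fixed number
# of epochs. All names and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_gradients(v0, W, b, c):
    """One-step Contrastive Divergence estimate of d log p(v0) / d theta."""
    ph0 = sigmoid(v0 @ W + c)                 # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample hidden units
    pv1 = sigmoid(h0 @ W.T + b)               # P(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0  # "negative" reconstruction sample
    ph1 = sigmoid(v1 @ W + c)
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    db = v0 - v1
    dc = ph0 - ph1
    return dW, db, dc

def train_until_converged(data, n_hidden, eta=0.05, tol=1e-4, max_epochs=1000):
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for epoch in range(max_epochs):
        max_step = 0.0
        for v0 in data:                       # update after each data point x_i
            dW, db, dc = cd1_gradients(v0, W, b, c)
            W += eta * dW                     # gradient ascent on log-likelihood
            b += eta * db
            c += eta * dc
            max_step = max(max_step, eta * np.abs(dW).max())
        if max_step < tol:                    # stop when updates become negligible
            break
    return W, b, c
```

Note that the stopping test here relies on the noisy CD-1 estimate of the gradient, so the "convergence" it detects is itself only approximate.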

Topic rbm neural-network optimization

Category Data Science


I believe the problem is that the log-likelihood is not directly computable, because evaluating the normalizing constant $Z$ has a cost exponential in the number of units. There exist different proxies for the true log-likelihood, for instance the pseudo-log-likelihood (PLL), and in principle you can train the RBM until the PLL stops changing by much.

However, there is a lot of randomness involved in the training process, so the PLL will most likely be quite noisy (even the true log-likelihood would be, I believe).
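For completeness, here is a rough sketch (assumed notation, not taken from any of the cited papers) of the stochastic pseudo-log-likelihood estimate for a binary RBM that one could monitor for such a stopping criterion; the helper names and the one-flip-per-example approximation are my own assumptions.

```python
# Sketch: stochastic pseudo-log-likelihood (PLL) estimate for a binary RBM,
# usable as a noisy convergence monitor during training.
import numpy as np

rng = np.random.default_rng(0)

def free_energy(v, W, b, c):
    """F(v) = -b.v - sum_j log(1 + exp(c_j + v.W_j)) for a binary RBM."""
    return -v @ b - np.sum(np.logaddexp(0.0, v @ W + c), axis=1)

def pseudo_log_likelihood(data, W, b, c):
    """Stochastic PLL: flip one randomly chosen visible unit per example."""
    n_examples, n_visible = data.shape
    idx = rng.integers(0, n_visible, size=n_examples)
    flipped = data.copy()
    flipped[np.arange(n_examples), idx] = 1.0 - flipped[np.arange(n_examples), idx]
    fe_data = free_energy(data, W, b, c)
    fe_flip = free_energy(flipped, W, b, c)
    # log P(v_i | v_{-i}) = log sigmoid(F(flipped) - F(v)); scale by n_visible
    # because only one of the n_visible conditionals is sampled per example.
    return np.mean(n_visible * -np.logaddexp(0.0, -(fe_flip - fe_data)))
```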
