Importance of random initialisation vs. number of hidden units

A question crossed my mind not long ago. I am running language-model experiments with an RNN (always the same network topology: 50 hidden units and 10M direct connections that emulate an n-gram model), trained on different fractions (10, 25, 50, 75, 100%) of a 9M-word corpus.
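For reference, the experiment is essentially the loop below. The function names, signatures, and placeholder data are just illustrative stand-ins for my actual toolkit calls, not real code:

```python
def train_rnnlm(tokens, hidden_units=50, direct_connections=10_000_000):
    # Stand-in for the actual toolkit call that trains the RNN LM
    # with direct (n-gram-like) connections on the given tokens.
    return {"hidden_units": hidden_units, "n_tokens": len(tokens)}

def perplexity(model, tokens):
    # Stand-in for evaluating the trained model on a held-out set.
    return 0.0

corpus = ["the"] * 9_000_000   # placeholder for the real ~9M-word corpus
held_out = ["the"] * 100_000   # placeholder held-out set

for frac in (0.10, 0.25, 0.50, 0.75, 1.00):
    subset = corpus[: int(len(corpus) * frac)]
    model = train_rnnlm(subset)
    print(f"{frac:.0%}: perplexity = {perplexity(model, held_out):.1f}")
```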

I noticed that while perplexity generally decreases as the training data become more abundant, sometimes it does not.

Most recent example (perplexity at 10, 25, 50, 75, and 100% of the corpus): 143, 118, 109, 106, 112.

My first thought was network initialisation, so I started testing with a smaller corpus and 20 hidden units (for practical reasons: even with 10% of the corpus, training can take up to 30 hours, which is problematic for me). After 50 runs with different random initialisations, all nets converged to perplexities within 3% of each other.
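For what it's worth, the seed test looked roughly like this; `train_and_eval` is a stand-in for my actual training call, and the simulated numbers are only there to make the sketch self-contained:

```python
import numpy as np

def train_and_eval(seed):
    # Stand-in for the real run: seed the RNG, train the RNN LM,
    # and return the validation perplexity. Here we just simulate a result.
    rng = np.random.default_rng(seed)
    return 150.0 + rng.normal(scale=2.0)

# Run several random initialisations and measure the spread of final perplexities.
perplexities = np.array([train_and_eval(seed) for seed in range(50)])
spread = (perplexities.max() - perplexities.min()) / perplexities.mean()
print(f"min={perplexities.min():.1f}  max={perplexities.max():.1f}  "
      f"relative spread={spread:.1%}")
```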

But then I wondered: maybe the importance of this initialisation grows with the number of hidden units? After all, the more hidden units, the more parameters there are to tune.
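To make that concrete, here is a rough count of the randomly initialised weights in a simple Elman-style RNN LM (ignoring the direct n-gram connections, and assuming a placeholder vocabulary size `V`):

```python
def rnn_param_count(vocab_size, hidden_units):
    # Input->hidden, hidden->hidden (recurrent), hidden->output weight matrices.
    return (vocab_size * hidden_units
            + hidden_units ** 2
            + hidden_units * vocab_size)

V = 100_000  # placeholder vocabulary size, not my actual vocabulary
for H in (20, 50):
    print(f"H={H}: ~{rnn_param_count(V, H):,} randomly initialised weights")
```

So going from 20 to 50 hidden units multiplies the number of randomly initialised parameters by roughly 2.5, which is why I am not sure my 20-unit test settles the question for the 50-unit setup.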

Also, maybe my stopping criterion is too sensitive (it stops when the change in perplexity between two iterations falls below a certain threshold). Do you think it would make a difference to let training run one or two more iterations after the criterion is met, to check whether the stall was just a local thing? A sketch of what I mean follows.
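Something along these lines, a patience-style variant of my current criterion; `train_one_epoch` and `valid_perplexity` are placeholders for whatever training and evaluation routines are actually used:

```python
def train_with_patience(train_one_epoch, valid_perplexity,
                        min_delta=1.0, patience=2, max_epochs=100):
    # Instead of stopping the first time the improvement drops below
    # min_delta, keep training for up to `patience` extra epochs and stop
    # only if none of them improves perplexity by at least min_delta.
    best = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        ppl = valid_perplexity()
        if best - ppl >= min_delta:
            best = ppl
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                break
    return best
```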

Topic: rnn, neural-network, language-model

Category: Data Science
