Optimal input setup for character-level text classification RNN
I want to classify 500-character text samples by whether they look like natural language, using a character-level RNN. I'm unsure of the best way to feed the input to the RNN. Here are two approaches I've thought of:
- Provide the whole 500 characters (one per time step) to the RNN, and predict a binary class, $\{0,1\}$.
- Provide shorter overlapping segments (e.g. 10 characters) and predict the next (e.g. 11th) character. Convert this to classification by taking the test input and calculating the joint probability of the observed characters from the predicted next-character distributions.
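To make the second approach concrete, here is a minimal sketch of how the overlapping segments could be cut from a sample. The function name `make_windows` and the parameter `context_len` are just names I made up for illustration, and a stride of one character is an assumption.

```python
def make_windows(text, context_len=10):
    """Return (context, next_char) pairs from overlapping windows.

    Each context is `context_len` consecutive characters and the target
    is the character that immediately follows it (stride of 1 assumed).
    """
    return [
        (text[i:i + context_len], text[i + context_len])
        for i in range(len(text) - context_len)
    ]


# A 500-character sample yields 490 (context, next_char) training pairs.
sample = "a" * 500
pairs = make_windows(sample, context_len=10)
print(len(pairs))
```

With this setup every sample contributes 490 next-character predictions, and those are the terms that go into the joint probability.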
The first approach seems sub-optimal, as I don't believe the 1st character will have any effect on the prediction made at the 500th character. The second approach gives me vanishingly small likelihoods when I calculate the joint probability.
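As an illustration of how small the numbers get in the second approach, here is a rough sketch. The per-character probability of 0.2 is a made-up value, not output from my model, and `sequence_log_likelihood` is a hypothetical helper. It multiplies 490 probabilities directly and, for comparison, sums their logarithms, which is the same quantity on a log scale.

```python
import math


def sequence_log_likelihood(observed_probs):
    """Sum of log-probabilities the model assigned to the observed characters."""
    return sum(math.log(p) for p in observed_probs)


# Assumption: the model gives each of the 490 observed characters probability 0.2.
observed_probs = [0.2] * 490

joint = math.prod(observed_probs)                   # underflows to 0.0 in float64
log_lik = sequence_log_likelihood(observed_probs)   # stays representable
per_char = log_lik / len(observed_probs)            # average log-probability per character

print(joint, log_lik, per_char)
```

The direct product is too small for a double to represent, while the log-scale sum and its per-character average are still usable numbers; whether a threshold on that average is a sensible classifier is part of what I'm asking.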
I'm aiming for a more nuanced language model akin to n-gram frequency counting. I'm using simple RNNs for now but intend to swap to either LSTM or GRU.
Topic text-classification rnn neural-network language-model nlp
Category Data Science