Optimal input setup for character-level text classification RNN

I want to use a character-level RNN to classify 500-character text samples according to whether they look like natural language. I'm unsure of the best way to feed the input to the RNN. Here are two approaches I've thought of:

  1. Provide the whole 500 characters (one per time step) to the RNN, and predict a binary class, $\{0,1\}$.
  2. Provide shorter overlapping segments (e.g. 10 characters) and predict the next (e.g. 11th) character. Convert this to classification by taking the test input and calculating the joint probability of the observed characters from the predicted next-character distributions.
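Here's roughly what I have in mind for approach 1 (a minimal Keras-style sketch; the vocabulary size and layer widths are placeholders, not tuned values):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 100   # number of distinct characters -- an assumption, adjust to the data
SEQ_LEN = 500      # one character per time step

model = tf.keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, 32),
    layers.SimpleRNN(64),                    # could be swapped for LSTM/GRU later
    layers.Dense(1, activation="sigmoid"),   # single "looks like language" probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```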

The first approach seems sub-optimal, as I don't believe the 1st character has any real effect on the prediction made after the 500th character. The second approach gives vanishingly small likelihoods when I calculate the joint probability.
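The scoring I have in mind for approach 2 looks roughly like the sketch below; `char_lm` stands for a trained next-character model with a Keras-style `predict`, the window size is a placeholder, and the log-sum is only there to avoid numerical underflow:

```python
import numpy as np

def sequence_log_likelihood(char_lm, encoded_text, window=10):
    """Sum log P(next char | previous `window` chars) over one sample.

    `char_lm.predict` is assumed to behave like a Keras model's predict:
    it maps a (1, window) array of character ids to a (1, vocab) array of
    next-character probabilities (hypothetical API).
    """
    total = 0.0
    for i in range(window, len(encoded_text)):
        context = np.array([encoded_text[i - window:i]])
        probs = char_lm.predict(context, verbose=0)[0]
        total += np.log(probs[encoded_text[i]] + 1e-12)  # avoid log(0)
    return total  # compare against a threshold, or normalise by sample length
```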

I'm aiming for a language model more nuanced than plain n-gram frequency counting. I'm using simple RNNs for now but intend to swap to either LSTM or GRU.

Topic: text-classification, rnn, neural-network, language-model, nlp

Category: Data Science


For character-level NLP, one-dimensional convolutions are often used to shrink the long character sequence down to a manageable number of hidden states before the recurrent layer.

For instance, ELMo first tokenizes the input into words and then uses a CNN with max-pooling to obtain character-based word embeddings that are used as input to an LSTM.
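A rough sketch of that kind of character-CNN word encoder (Keras; the sizes are illustrative, not ELMo's actual hyper-parameters):

```python
import tensorflow as tf
from tensorflow.keras import layers

CHAR_VOCAB = 100    # assumed character vocabulary size
MAX_WORD_LEN = 20   # characters per word, padded/truncated

# Maps one word (a sequence of character ids) to a fixed-size vector.
char_cnn = tf.keras.Sequential([
    layers.Input(shape=(MAX_WORD_LEN,)),
    layers.Embedding(CHAR_VOCAB, 16),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),   # max-pool over the word's characters
])

# Applied per word, the outputs form the word-level sequence that feeds the LSTM.
words_in = layers.Input(shape=(None, MAX_WORD_LEN))
word_vecs = layers.TimeDistributed(char_cnn)(words_in)
sentence_states = layers.LSTM(128)(word_vecs)
```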

In machine translation, there is an approach that shrinks the character-level input into pseudo-word hidden states: a one-dimensional CNN is run over the character input (here without segmenting it into words), followed by max-pooling with stride 5 (roughly the average word length) and several highway layers. These pseudo-word states are then used as input to an LSTM.
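Sketched in Keras, that reduction might look like the following; the highway layer is written out by hand since Keras has no built-in one, and all sizes are illustrative rather than taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

CHAR_VOCAB = 100   # assumed character vocabulary size
SEQ_LEN = 500

class Highway(layers.Layer):
    """Highway layer: gate * transform(x) + (1 - gate) * x."""
    def __init__(self, units):
        super().__init__()
        self.transform = layers.Dense(units, activation="relu")
        self.gate = layers.Dense(units, activation="sigmoid")

    def call(self, x):
        g = self.gate(x)
        return g * self.transform(x) + (1.0 - g) * x

chars_in = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(CHAR_VOCAB, 32)(chars_in)
x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
# Max-pooling with stride 5 compresses ~5 characters into one "pseudo-word" state,
# leaving 100 time steps for the recurrent layer instead of 500.
x = layers.MaxPooling1D(pool_size=5, strides=5)(x)
x = Highway(128)(x)
x = Highway(128)(x)
out = layers.LSTM(128)(x)   # a Dense(1, activation="sigmoid") head could follow for classification
model = tf.keras.Model(chars_in, out)
```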
