N-grams for RNNs

Given a word $w_{n}$, a statistical model such as a Markov chain over n-grams predicts the subsequent word $w_{n+1}$. The prediction is by no means random.
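For example, a count-based bigram version of this idea might look like the following sketch (the toy corpus and names are purely illustrative):

from collections import Counter, defaultdict

# Hypothetical toy corpus, just for illustration
corpus = ["hello my name is anna", "hello my name is john", "hello my friend"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w, w_next in zip(words, words[1:]):
        bigram_counts[w][w_next] += 1

def predict_next(word):
    # argmax over P(w_{n+1} | w_n) estimated from counts
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("my"))  # -> "name"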

How is this translated into a neural model? I have tried tokenizing and sequencing my sentences, below is how they are prepared to be passed to the model:

import numpy as np

train_x = np.zeros([len(sequences), max_seq_len], dtype=np.int32)
for i, sequence in enumerate(sequences[:-1]):  # using all words except last
    for t, word in enumerate(sequence.split()):
        train_x[i, t] = word2idx(word)  # storing in word vectors

The sequences look like this:

Given the sentence "Hello my name is":
Hello 
Hello my
Hello my name
Hello my name is

When I pass these sequences as input to an RNN with an LSTM layer, the next-word predictions I get (given a word) are essentially random.

Topic: lstm, ngrams, rnn, nlp

Category: Data Science


You don't need to do the n-gram construction you're showing for an RNN. The point of neural language modeling with an RNN/LSTM is to avoid the Markov assumption you state. To use an RNN, you just feed the whole sentence as-is to the RNN as a sequence, and as the target you feed the same sequence with each word shifted one position to the right.
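As a rough sketch of that input/target setup in PyTorch (the toy vocabulary, layer sizes, and single example sentence are just illustrative, not your data):

import torch
import torch.nn as nn

vocab = {"<pad>": 0, "hello": 1, "my": 2, "name": 3, "is": 4, "anna": 5}
sentence = ["hello", "my", "name", "is", "anna"]
ids = [vocab[w] for w in sentence]

# Input is the whole sentence except the last token; the target is the same
# sequence shifted one position to the right.
inputs = torch.tensor([ids[:-1]])   # shape (batch=1, seq_len=4)
targets = torch.tensor([ids[1:]])   # shape (batch=1, seq_len=4)

embed = nn.Embedding(len(vocab), 16)
lstm = nn.LSTM(16, 32, batch_first=True)
proj = nn.Linear(32, len(vocab))

hidden_states, _ = lstm(embed(inputs))   # (1, 4, 32)
logits = proj(hidden_states)             # (1, 4, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)),
                                   targets.reshape(-1))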

You can look at this repo for an example of an RNN language model: https://github.com/pytorch/examples/tree/master/word_language_model

It may be better to use LSTMs, which can capture long-range dependencies (longer than a Markov model would allow!). I suspect you're getting random predictions because the repetition in your short prefix sequences just adds a lot of noise for the neural net.


A neural language model tries to predict the conditional probability $P(w_{i+1} \mid w_1, \dots, w_i)$. It approximates this probability as $P(w_{i+1} \mid s(w_1, \dots, w_i))$, where $s$ is a state function. After an LSTM has looked at all the words $w_1, \dots, w_i$, its state has been updated, so it now contains useful information about all previous words. You also have an error in your code: you should take all words of a sentence except the last, but you've taken all sentences except the last (sequences[:-1]).
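A sketch of what the corrected preparation could look like, reusing sequences, max_seq_len and word2idx from your code (the extra train_y holding the shifted next-word targets is an assumption about what you need downstream):

import numpy as np

train_x = np.zeros([len(sequences), max_seq_len], dtype=np.int32)
train_y = np.zeros([len(sequences), max_seq_len], dtype=np.int32)
for i, sequence in enumerate(sequences):        # iterate over *all* sequences
    words = sequence.split()
    for t, word in enumerate(words[:-1]):       # all words of the sentence except the last
        train_x[i, t] = word2idx(word)          # input token at step t
        train_y[i, t] = word2idx(words[t + 1])  # target: the next word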

In language modeling a normal sentence $w_1, \dots, w_n$ is usually augmented with 2 special tokens: <bos> (beginning of sequence) and <eos> (end of sequence). So your example "Hello my name is" should transform into "<bos> Hello my name is <eos>". Now your source tokens are all tokens except the last, i.e. "<bos> Hello my name is", and the targets you want to predict are all tokens except the first, i.e. "Hello my name is <eos>". You feed tokens into your LSTM one at a time and try to predict the next token.
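A minimal sketch of that augmentation and source/target split (the spellings <bos> and <eos> are just a convention; any two reserved strings work):

sentence = "Hello my name is"
tokens = ["<bos>"] + sentence.split() + ["<eos>"]

source = tokens[:-1]  # ['<bos>', 'Hello', 'my', 'name', 'is']
target = tokens[1:]   # ['Hello', 'my', 'name', 'is', '<eos>']

# At step t the LSTM reads source[t] and is trained to predict target[t].
for src, tgt in zip(source, target):
    print(src, "->", tgt)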
