Word-level text generation with word embeddings – outputting a word vector instead of a probability distribution

I am currently researching the topic of text generation for my university project. I decided (of course) to go with an RNN that takes a sequence of tokens as input and is trained to predict the next token given that sequence. I have been reading through a number of tutorials, and there is one thing I am wondering about. The sources I have read, regardless of how they encode the X sequences (one-hot or word embeddings), encode the y target tokens as one-hot vectors so that the network output can be interpreted as a probability distribution over all the possible tokens. This way the task is actually framed as a multi-class classification problem (e.g. as in https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/).
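(To illustrate that classification framing with a toy example of my own, not taken from the tutorial: the target for each training sequence is a one-hot vector over the whole vocabulary.)

import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
target_word = "mat"                          # the true next word for some sequence

# one-hot target: a probability distribution with all mass on the true next word
y = np.zeros(len(vocab))
y[vocab.index(target_word)] = 1.0
print(y)   # [0. 0. 0. 0. 1.]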

I am indeed planning to encode my X sequences as sequences of vectors by mapping each token to a pre-trained word vector, but I was actually thinking of designing the network to output a vector of the same dimension as the input vectors and mapping that output vector to a specific word by looking up the most similar vector among the known pre-trained vectors. As a side note, this would frame it as a regression problem, wouldn't it (since we are trying to predict a vector of numbers matching the vector of the target word)?

My question is: is the method described in the paragraph above (outputting a word vector instead of a probability distribution) known under some term or name? I doubt that nobody has thought of it before me (unless it doesn't make sense and I am unaware of why), but a quick Google search for descriptions of the method hasn't turned up anything useful, and I'd like to learn more.

Tags: text-generation, rnn, word-embeddings, nlp

Category: Data Science


I had the same question, and I solved my problem. I am also working on a "text generation" project (although with cards from the card game Magic: The Gathering, so my tokens are cards rather than words). I also couldn't find other examples of an RNN outputting word vectors.

Disclaimer: I am not yet confident with RNNs, neural networks, and TensorFlow/Keras in general. I'm providing my answer because there is none posted yet, I had the same problem, and I now have a solution that works for me. I have included as much information as I could. TLDR at the end.

Here's what I did:

I pre-vectorized my vocabulary (using Gensim Word2Vec). I then split my corpus of vectors into training and testing sets. The model I used is from this link: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/
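Concretely, the vectorization and split step could look something like this (a minimal sketch with made-up card names, not my exact preprocessing; note that the Word2Vec vector_size argument is called size in Gensim versions before 4.0):

import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split

# toy corpus: each inner list is one "document" (for me, one deck of 60 card names)
corpus_tokens = [
    ["island", "counterspell", "opt", "brainstorm"] * 15,
    ["mountain", "shock", "lava_spike", "fireball"] * 15,
]

seq_length = 59       # use 59 tokens to predict the 60th
vector_size = 64      # length of each word/card vector

# learn a 64-dimensional vector for every token in the vocabulary
w2v = Word2Vec(sentences=corpus_tokens, vector_size=vector_size, min_count=1)

# build (sequence of vectors) -> (next vector) training pairs
X, y = [], []
for tokens in corpus_tokens:
    vectors = [w2v.wv[token] for token in tokens]
    for i in range(len(vectors) - seq_length):
        X.append(vectors[i:i + seq_length])
        y.append(vectors[i + seq_length])

X, y = np.array(X), np.array(y)    # X: (samples, 59, 64), y: (samples, 64)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)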

Here is the model from the link:

# taken from: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# vocab_size and seq_length come from the tutorial's data preparation
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))

with three changes:

  1. I removed the first "Embedding" layer, because my "words" are already vectorized. The first layer is now the "LSTM" layer.
  2. I changed the number of units in the last layer to the size of my word vectors. My vectors are of length 64.
  3. I removed the "softmax" activation from the last layer. I don't want a categorical prediction; I want 64 numbers (that together form a length-64 vector).

Here is my model adapted from the link:

import tensorflow as tf

model = tf.keras.Sequential()
# without the Embedding layer, the inputs are already sequences of shape (seq_length, 64)
model.add(tf.keras.layers.LSTM(100, return_sequences=True))
model.add(tf.keras.layers.LSTM(100))
model.add(tf.keras.layers.Dense(100, activation='relu'))
# one output unit per word-vector dimension, no softmax
model.add(tf.keras.layers.Dense(64))

Finally, in the link, the model is compiled with the 'categorical_crossentropy' loss function, which I assume is appropriate for categorical predictions:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

However, here we are looking at the error between two vectors. I think a good way to measure that is their cosine similarity, which measures the (cosine of the) angle between the vectors (https://en.wikipedia.org/wiki/Cosine_similarity).
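As a quick illustration (my own toy example, not from the tutorial), the cosine similarity of two vectors is the dot product of their normalized versions: 1 means same direction, 0 means orthogonal, -1 means opposite.

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # 0.0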

Luckily, TensorFlow has a cosine similarity loss function. (Note that Keras defines the 'cosine_similarity' loss as the negative of the cosine similarity, so minimizing the loss maximizes the similarity between predicted and target vectors.) Here's my compile statement:

model.compile(loss='cosine_similarity', optimizer='adam', metrics=['accuracy'])

This model can now be trained to output vectors similar to a target vector. To decode the output back to a word, simply find the known word vector closest to the model's output. For me, that is word_vector_model.most_similar(prediction_vector) (the Gensim Word2Vec model has a "most_similar()" method that also accepts raw vectors).
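Putting it together, generating and decoding one prediction could look like this (a sketch reusing the placeholder names x_test and w2v from the earlier snippet):

# predict the next word/card vector for the first test sequence
prediction = model.predict(x_test[:1])     # shape: (1, 64)

# look up the closest known vectors in the Word2Vec vocabulary
# (in Gensim 4, most_similar lives on model.wv and accepts raw 1-D vectors)
candidates = w2v.wv.most_similar(positive=[prediction[0]], topn=5)
print(candidates)   # list of (token, cosine similarity) pairs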

With those parameters, I was able to successfully train my model to predict the last "card vector" from lists of 60 cards. The model is currently slightly overfitted because of my small dataset and the large number of epochs, but I take the overfitting as at least a sign that the model can learn to predict vectors correctly.

TLDR -------------------------------------------------------------

I tweaked the model from this link: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/. Here's my take:

import tensorflow as tf

# define model
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(100, return_sequences=True))
model.add(tf.keras.layers.LSTM(100))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(64))

# compile model
model.compile(loss='cosine_similarity', optimizer='adam', metrics=['accuracy'])

# fit model
model.fit(x_train, y_train, batch_size=120, epochs=100)
model.summary()

# save the model to file
model.save('lstm_models/m5.h5')

Notice that my last layer does not use a softmax activation: it predicts numbers (that form a vector), not categories. Also notice that the last layer has num_units = len(word_vector) (64 for me).

Please let me know if that works for you, and if not, what you did, as it might help me with my project :)
