Sequence embedding using an embedding layer: how does the network architecture influence it?

I want to obtain a dense vector representation of protein sequences so that I can meaningfully represent them in an embedding space. The sequences can be considered as strings of letters; in particular, there are 21 unique symbols, which are the amino acids (for example: MNTQILVFIACVLIEAKGDKICL).

My approach is to use a sequence embedding that is learned as part of a deep learning model (built in Python with Keras), namely a supervised classifier network that I train to classify sequences according to the host species they belong to. The steps I follow are:

  1. Tokenization. The only way I can tokenize these sequences of amino acid symbols is to treat single characters as tokens. I also found a Kaggle example, Deep Protein Sequence Family Classification in Python, that classifies different proteins and uses single amino acids as tokens.
  2. Embedding. Borrowing words from the answer to the question How the embedding layer is trained in Keras Embedding layer:

An embedding layer is a trainable layer that contains 1 embedding matrix, which is two dimensional, in one axis the number of unique values the categorical input can take (for example 26 in the case of lower case alphabet) and on the other axis the dimensionality of your embedding space. The role of the embedding layer is to map a category into a dense space in a way that is useful for the task at hand, at least in a supervised task. This usually means there is some semantic value in the embedding vectors and categories that are close in this space will be close in meaning for the task.

Moreover, useful words from the blog titled as Neural Network Embeddings Explained:

The main issue with one-hot encoding is that the transformation does not rely on any supervision. We can greatly improve embeddings by learning them using a neural network on a supervised task. The embeddings form the parameters — weights — of the network which are adjusted to minimize loss on the task. The resulting embedded vectors are representations of categories where similar categories — relative to the task — are closer to one another.

Putting all these pieces together in my case: I choose single amino acids as tokens, so I have 21 unique symbols (the number of amino acids), and I choose 10 as the dimension of the embedding space, so my embedding matrix is 21 x 10. This means that, once the neural network is trained, I can extract the weights of the embedding layer: 21 vectors (one for each amino acid), each of dimension 10.
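
To make this concrete, here is a minimal sketch of the character-level tokenization and the embedding layer described above (the ordering of the amino-acid alphabet, the index 0 reserved for padding, and the padded sequence length of 100 are illustrative assumptions, not fixed by the question):

import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

amino_acids = "ACDEFGHIKLMNPQRSTVWYX"                             # 21 symbols (example alphabet)
char_to_index = {aa: i + 1 for i, aa in enumerate(amino_acids)}   # index 0 reserved for padding

# character-level tokenization of one sequence
seq = "MNTQILVFIACVLIEAKGDKICL"
tokens = [char_to_index[aa] for aa in seq]

# an embedding layer with one 10-dimensional vector per token
model = Sequential([Embedding(input_dim=len(amino_acids) + 1, output_dim=10)])
model.build(input_shape=(None, 100))                    # 100 = padded sequence length (assumption)
embedding_matrix = model.layers[0].get_weights()[0]     # shape (22, 10); after training,
                                                        # row i is the embedding of token i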

As the second extract explains, the elements of these vectors are values adjusted to minimize the loss on the task. I could see each amino acid as a letter in a word, where I want to classify the words according to something (like positive or negative comment), or as a word in a sentence, where the words are the tokens and I want to classify the sentences according to something (like positive or negative comment).

Since I want a representation of the whole sequence, I have to find a way to combine the embeddings of the single amino acids, and the only feasible way I have found is to average the 10-dimensional vectors, so that each sequence is represented by one 10-dimensional vector: the average of the embeddings of all its amino acids. This certainly highlights whether some symbols are over-represented in one sequence with respect to another. Furthermore, since each symbol is associated with a vector whose values are adjusted to minimize the loss on the task, averaging should preserve the meaning of each single amino acid embedding and give a globally meaningful embedding of the whole sequence. This, in fact, seems to work for simple sentence embeddings; see the answer to my earlier question How to obtain vector representation of phrases using the embedding layer and do PCA with it, in which each word of a sentence was a token and I was looking for vectors embedding the whole sentences. Similarly, here I can see each single amino acid symbol as a word and the whole sequence as a sentence.
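
As a sketch of what I mean by averaging, assuming the hypothetical embedding_matrix and char_to_index from the snippet above:

import numpy as np

def sequence_vector(seq, embedding_matrix, char_to_index):
    # look up the 10-dimensional embedding of each amino acid and average over positions
    vectors = np.array([embedding_matrix[char_to_index[aa]] for aa in seq])
    return vectors.mean(axis=0)     # one 10-dimensional vector for the whole sequence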

Hence this method should embed two pieces of information: the frequencies of the letters in each sequence and the class each sequence belongs to (in this case the host species).

But I would also like it to consider the global structure of the letter positions (not only the relative frequencies), because with this method sequences like MNTQILVFIACVLIEAKGDKICL (sequence 1) and AKGDKICLMNTQILVFIACVLIE (sequence 2) are represented by the same 10-dimensional vector.
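
For instance, with the hypothetical sequence_vector function sketched above, the two permuted sequences receive exactly the same representation:

v1 = sequence_vector("MNTQILVFIACVLIEAKGDKICL", embedding_matrix, char_to_index)
v2 = sequence_vector("AKGDKICLMNTQILVFIACVLIE", embedding_matrix, char_to_index)
np.allclose(v1, v2)    # True: averaging discards the order of the symbols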

Thus, from here, the third point:

  3. Choice of neural network architecture. Does the choice of neural network architecture influence the embedding of each single amino acid symbol and, consequently, the embedding of the entire sequence? For example, if I use an LSTM network, which should memorize the global structure of the sequences, would it improve the dense vector representation (both in general and in this particular case)? I would expect yes, since, as the extracts above explain, the embedding layer (i.e. its weights) is trained like every other layer, so backpropagation updates these weights too. In other words, if the LSTM network is able to recognize the importance of the position of each character for the task (for example, if it can learn that a letter in position 1 means the sequence belongs to human, while the same letter in position 2 means it belongs to dog), then the embedding weights should be updated accordingly, unlike with a simple feed-forward model that cannot take this positional information into account when dealing with sequences. (A sketch contrasting the two kinds of architecture is shown after the next paragraph.)

I understand that if I average the embeddings of the single amino acids, I would still have the problem that sequences like MNTQILVFIACVLIEAKGDKICL (sequence 1) and AKGDKICLMNTQILVFIACVLIE (sequence 2) end up with the same dense vector representation. But even in this case, where I average the embeddings of all the symbols, does a better choice of network architecture give a better result in some way?
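
To make the comparison concrete, here is a rough sketch of the two kinds of architecture (num_species is a placeholder for the number of host-species classes, and the vocabulary size of 22 follows the earlier assumption of 21 amino acids plus a padding token). In both models the embedding weights receive gradients from the classification loss, but only in the LSTM model does the loss, and therefore the gradient, depend on the order of the symbols:

from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, LSTM, Dense
from tensorflow.keras.models import Sequential

num_species = 5                                   # placeholder: number of host-species classes

# order-insensitive baseline: the network itself averages the per-token embeddings
avg_model = Sequential([
    Embedding(input_dim=22, output_dim=10),
    GlobalAveragePooling1D(),                     # mean over positions, ignores symbol order
    Dense(num_species, activation='softmax'),
])

# order-sensitive alternative: an LSTM reads the embedded sequence position by position
lstm_model = Sequential([
    Embedding(input_dim=22, output_dim=10),
    LSTM(units=20),                               # hidden state depends on symbol order
    Dense(num_species, activation='softmax'),
])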

In a few words, what I really would like to know is:

Would using an LSTM improve the protein sequence embeddings (built with the method explained above, i.e. averaging the embeddings of the amino acids present in the sequence)? Why or why not? Or is it correct to say that networks like LSTMs are necessary simply because we are dealing with sequence data?

I think the amino acid embeddings obtained with the embedding layer would be more meaningful if this embedding layer were part of an LSTM neural network, as happens in text analysis: if we train an LSTM network to classify sentences according to their sentiment, the network is able to capture the underlying structure of the inputs (our sentences) and, through optimization, places the relations between the words of our vocabulary into a higher-dimensional space (say 10, as in my example). I would expect the same to happen in my case, which is a sequence of symbols instead of words.

But I would like a confirmation with some more explanation, perhaps describing in more detail how LSTMs work and how the embedding layer in an LSTM network is updated (if that is necessary to understand the answer).

I apologize for the long question and I thank you in advance.

Topic: embeddings, bioinformatics, sequence, deep-learning, nlp

Category: Data Science


My understanding is that you want to classify protein sequences. For example, the sequence HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE belongs to a dog.

You might be overthinking the problem. It might be useful to start building.

One option is to start with a simple model that works. The code would be something like:

from tensorflow.keras.layers import Embedding, LSTM, Dense, Activation
from tensorflow.keras.models import Sequential

model = Sequential()
# max_features = size of the token vocabulary, maxlen = length of the padded sequences
model.add(Embedding(input_dim=max_features, output_dim=10, input_length=maxlen))
model.add(LSTM(units=20))
model.add(Dense(units=46))            # set the number of units to the number of classes
model.add(Activation('softmax'))

From there, make the model more complex and see whether it helps better capture the signals in the data while still allowing meaningful interpretations.
