Understanding how Long Short-Term Memory works for the classification of sequences of symbols

I want to use an LSTM neural network to classify protein sequences according to their host species. For example, I have these sequences of letters (a toy example, just to understand):

  • MNTQILVFIACVLIEAKGDKICL belongs to human
  • AKGDKICLMNTQILVFIACVLIE belongs to human
  • MNTQAKGDKICLILVFIACVLIE belongs to dog

The sequences differ only in the position of the subsequence AKGDKICL, and my network should learn to recognize this. Is an LSTM network able to do this?

I am trying to focus on the meaning of Long Short-Term Memory. From An Introduction To Recurrent Neural Networks And The Math That Powers Them:

A recurrent neural network (RNN) is a special type of an artificial neural network adapted to work for time series data or data that involves sequences. Ordinary feed forward neural networks are only meant for data points, which are independent of each other. However, if we have data in a sequence such that one data point depends upon the previous data point, we need to modify the neural network to incorporate the dependencies between these data points. RNNs have the concept of ‘memory’ that helps them store the states or information of previous inputs to generate the next output of the sequence.

Moreover, from Recurrent Neural Networks by Example in Python:

a recurrent neural network (RNN) processes sequences — whether daily stock prices, sentences, or sensor measurements — one element at a time while retaining a memory (called a state) of what has come previously in the sequence. Recurrent means the output at the current time step becomes the input to the next time step. At each element of the sequence, the model considers not just the current input, but what it remembers about the preceding elements. This memory allows the network to learn long-term dependencies in a sequence which means it can take the entire context into account when making a prediction, whether that be the next word in a sentence, a sentiment classification, or the next temperature measurement. A RNN is designed to mimic the human way of processing sequences: we consider the entire sentence when forming a response instead of words by themselves.

So, unlike simple RNNs, which are affected by memory decay, an LSTM can store information over an extended period (long-term memory).

So, in my simple exercise, this translates into the following: if I consider each letter as a token, using the Keras tokenizer, I will obtain for each sequence an array of integers like:

[7, 8, 9, 10, 1, 2, 3, 11, 1, 4, 5, 3, 2, 1, 12, 4, 6, 13, 14, 6, 1, 5, 2]
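For instance, with the Keras Tokenizer in character-level mode (a minimal sketch; the exact integer ids depend on letter frequencies in the fitted texts):

    from tensorflow.keras.preprocessing.text import Tokenizer

    sequences = [
        "MNTQILVFIACVLIEAKGDKICL",  # human
        "AKGDKICLMNTQILVFIACVLIE",  # human
        "MNTQAKGDKICLILVFIACVLIE",  # dog
    ]

    # char_level=True makes every amino-acid letter its own token
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(sequences)
    encoded = tokenizer.texts_to_sequences(sequences)
    print(encoded[0])  # a list of 23 integers, one per letter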

Once I have translated these sequences of symbols into integer vectors, can I feed them to the LSTM network, which is then able to capture the order of these integers (symbols), keep it in memory, and take it into account when classifying the sequence?

For example, if I give it several sequences like the ones reported above, will it be able to recognize that if the subsequence AKGDKICL is positioned at the beginning or at the end of the sequence, it belongs to human, while if it is in the middle, it belongs to dog? Is this the meaning of Long Short-Term Memory? And is this achieved if I choose each symbol of the sequence as a token?
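To make my question concrete, this is the kind of model I have in mind (a minimal sketch; the layer sizes are arbitrary and the 0/1 label encoding is my own assumption):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # 'encoded' comes from the tokenization step above
    X = pad_sequences(encoded, maxlen=23)
    y = np.array([1, 1, 0])  # 1 = human, 0 = dog

    model = Sequential([
        Embedding(input_dim=21, output_dim=8),  # 20 amino acids + 1 padding id
        LSTM(32),                               # reads the sequence one letter at a time
        Dense(1, activation="sigmoid"),         # probability of 'human'
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10)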



Solving any problem starts with asking the right question and then looking for an appropriate solution to it.

From my understanding, the problem you are trying to solve is a sequence classification problem, and yes, an RNN/LSTM is one of the approaches you can follow. I am no expert in protein sequencing, so I am just trying to provide a direction to look into.

There are two important factors you will need to consider when applying an RNN/LSTM:

  1. Tokenization, i.e. how you want to split your protein sequence. One way would be to break the sequence into individual characters, which in my opinion will create unnecessary noise for the algorithm to learn patterns from. It is more a domain-knowledge-based choice that matters here. From your example it seems that you have prior knowledge of the relevant subsequences, and that should be the way to create tokens and limit the vocabulary for this problem (see the sketch after this list).

  2. Embeddings, i.e. the vector representations of the protein sequence after tokenization. If it were the English language, you could have used pre-trained word embeddings like GloVe, but that is not the case here. I would recommend searching for existing protein-subsequence-based embeddings to create vector representations for your use case. If none exist, you can go ahead with one of the general frameworks for token vectorization, such as training an embedding layer from scratch.
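To illustrate the first point, here is a minimal sketch of motif-based tokenization; the motif list is a hypothetical stand-in for your actual domain knowledge:

    import re

    # Hypothetical prior knowledge: subsequences known to matter
    known_motifs = ["AKGDKICL"]

    def motif_tokenize(seq, motifs):
        """Split a sequence into known motifs and the residue stretches between them."""
        pattern = "(" + "|".join(map(re.escape, motifs)) + ")"
        return [part for part in re.split(pattern, seq) if part]

    print(motif_tokenize("MNTQAKGDKICLILVFIACVLIE", known_motifs))
    # ['MNTQ', 'AKGDKICL', 'ILVFIACVLIE']

The resulting tokens can then be mapped to integer ids and passed through a trainable embedding layer, just as with character-level tokens, but over a vocabulary of biologically meaningful units.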
