Sequence-to-sequence learning applied to a list of numbers

I am looking to apply ML methods to genetic data. My goal is to predict which rare (generally de novo) mutations a person has, based on which non-rare (generally inherited) mutations they carry.

I have worked on this mutation data before, and stored the mutation data as one-hot vectors: a person X can have mutation Y zero times, once on chromatid A, once on chromatid B, or once on each chromatid. This is represented as {'0|0', '0|1', '1|0', '1|1'}.
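
For concreteness, here is a minimal sketch of that encoding in plain Python. The function names, the ordering of the four categories, and the example genotypes are made up for illustration; they are not taken from any particular library.

```python
# The four phased genotype categories, in a fixed (assumed) order.
GENOTYPES = ['0|0', '0|1', '1|0', '1|1']
GENOTYPE_INDEX = {g: i for i, g in enumerate(GENOTYPES)}


def one_hot_genotype(genotype):
    """Return a length-4 one-hot list for one phased genotype string."""
    vec = [0] * len(GENOTYPES)
    vec[GENOTYPE_INDEX[genotype]] = 1
    return vec


def encode_person(genotypes):
    """Encode one person's genotypes (one per common mutation site)."""
    return [one_hot_genotype(g) for g in genotypes]


# Hypothetical person with three common mutation sites.
person_x = ['0|0', '1|0', '1|1']
encoded = encode_person(person_x)
print(encoded)
```

Each person then becomes a matrix with one row per common mutation site and four columns, which is the shape of source data I would feed to a model.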

The target data to predict would be a list of positions in the genome, which are large numbers. This list is of variable size, as not everybody has the same number of rare mutations.
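
One way such targets might be prepared, sketched under the assumption that each distinct position is treated as a categorical token (like a character or word) rather than as a numeric value. The reserved token ids, the helper names, and the positions are all hypothetical.

```python
# Reserved token ids (an assumed convention, similar to text seq2seq setups).
PAD, START, END = 0, 1, 2


def build_vocab(all_targets):
    """Map every distinct genome position to a small integer token id."""
    positions = sorted({p for target in all_targets for p in target})
    return {pos: i + 3 for i, pos in enumerate(positions)}


def encode_targets(all_targets, vocab):
    """Turn variable-length position lists into equal-length token lists."""
    max_len = max(len(t) for t in all_targets) + 2  # room for START and END
    encoded = []
    for target in all_targets:
        tokens = [START] + [vocab[p] for p in sorted(target)] + [END]
        tokens += [PAD] * (max_len - len(tokens))  # pad the short sequences
        encoded.append(tokens)
    return encoded


# Made-up rare-mutation positions for three people.
targets = [[1_204_553, 88_310_002], [45_901], [1_204_553, 45_901, 230_118_776]]
vocab = build_vocab(targets)
padded = encode_targets(targets, vocab)
print(padded)
```

With this representation the large position values never reach the model directly; it only sees token ids, and the variable list length is handled by the END and PAD tokens. A position that never appears in the training data would have no token, which is a real limitation for de novo mutations.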

I found this blog post which explains sequence-to-sequence learning, which looks close to what I would like to do. However, my source data is very different from what they use, and I'm not sure whether having a list of numbers as the target would work as well as having a list of characters.

Should I try to adapt their code to my problem, or is there a better model architecture that I should use? And if I do adapt it to my case, which major modifications should I start with? (I am fairly new to ML, and so far my applications have all been quite simple.)
