Containing multicomponent data in rows or columns

I have been working with DNA sequences and compiled a table with features from those sequences. I have a column called Trimer, which contains strings. For some DNA sequences there is one trimer of interest so that column contains one 3 character string (i.e. "ATG"). For other rows in the table that trimer column has 2 or 3 trimers of interest so the Trimer column has multiple strings in it (i.e. "ATT, CTG, GAT"). All trimers from one sequence should be thought to have equal weight and importance.

I know I cannot analyze my data in this format. I was wondering whether to split the Trimer column into 3 columns so if a sequence only has one trimer of interest the cells in the other two columns will remain blank. When doing my analysis my worry is that these columns will be seen as different features and will be weighted differently.

I was also thinking to make multiple row entries for the same DNA sequence. But the independent variable is influenced by the combination of trimers.

Any advice on how to change my table or create dummy variables is much appreciated. Thank you!

Topic preprocessing data-formats data-cleaning

Category Data Science


Note: This more of a comment to help your question being answered I will delete it if needed. It is not clear to me which analysis you want to do with this dataset thus it is hard to advice on the structure and format of the dataset. In my experience when dealing with DNA strings and their associated features, R and in particular the Bioconductor package are the best. These packages contain already data structure, functions and operations purely design to deal with bioinformatics tasks.The GenomicAlignements could be a good start. Hope this helps!

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.