Containing multicomponent data in rows or columns
I have been working with DNA sequences and have compiled a table of features from those sequences. One column, Trimer, contains strings. For some DNA sequences there is one trimer of interest, so the column contains a single three-character string (e.g. "ATG"). For other rows there are 2 or 3 trimers of interest, so the Trimer cell holds several strings (e.g. "ATT, CTG, GAT"). All trimers from one sequence should be treated as having equal weight and importance.
I know I cannot analyze my data in this format. I was wondering whether to split the Trimer column into 3 columns so if a sequence only has one trimer of interest the cells in the other two columns will remain blank. When doing my analysis my worry is that these columns will be seen as different features and will be weighted differently.
I also considered making multiple row entries for the same DNA sequence, one per trimer. But the dependent variable is influenced by the combination of trimers, so splitting them across rows would lose that information.
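To make the options concrete, here is a minimal sketch in plain Python. The rows are made up, and `seq_id`, `Trimer_1` to `Trimer_3`, and the `has_` prefix are hypothetical names I chose for illustration; only the Trimer column comes from my actual table. The third layout is my guess at what dummy variables might look like here (one 0/1 column per distinct trimer), not something I know to be the right approach.

```python
# Made-up example rows; only the "Trimer" column name comes from the real table.
rows = [
    {"seq_id": "s1", "Trimer": "ATG"},
    {"seq_id": "s2", "Trimer": "ATT, CTG, GAT"},
    {"seq_id": "s3", "Trimer": "CTG, ATG"},
]


def parse_trimers(cell):
    """Split a comma-separated Trimer cell into a list of clean strings."""
    return [t.strip() for t in cell.split(",") if t.strip()]


# Layout 1: three positional columns, padded with None when fewer than 3 trimers.
wide = []
for r in rows:
    trimers = parse_trimers(r["Trimer"])
    padded = trimers + [None] * (3 - len(trimers))
    wide.append({
        "seq_id": r["seq_id"],
        "Trimer_1": padded[0],
        "Trimer_2": padded[1],
        "Trimer_3": padded[2],
    })

# Layout 2: one row per (sequence, trimer) pair.
long_rows = [
    {"seq_id": r["seq_id"], "Trimer": t}
    for r in rows
    for t in parse_trimers(r["Trimer"])
]

# Layout 3: one 0/1 indicator column per distinct trimer, independent of order.
vocab = sorted({t for r in rows for t in parse_trimers(r["Trimer"])})
multi_hot = []
for r in rows:
    present = set(parse_trimers(r["Trimer"]))
    indicators = {f"has_{t}": int(t in present) for t in vocab}
    multi_hot.append({"seq_id": r["seq_id"], **indicators})

for row in multi_hot:
    print(row)
```

In layout 1, "CTG" lands in `Trimer_2` for s2 but in `Trimer_1` for s3, which is the positional weighting I am worried about. Layout 3 keeps one row per sequence and treats every trimer the same regardless of the order it was listed in.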
Any advice on how to change my table or create dummy variables is much appreciated. Thank you!
Topic preprocessing data-formats data-cleaning
Category Data Science