Oversampling on sequence (text) data
Has anyone been able to perform synthetic oversampling on sequential data? From what I've read and understood, the oversampling/undersampling techniques in common use are only applicable to structured, tabular data.
But if I have sequential data like this:
Sequence             Label
[1,2,3,5,0,0,0,0]    3
[4,5,2,3,5,0,0,0]    5
[3,4,0,0,0,0,0,0]    7
where each sequence consists of integer tokens followed by zero padding, how do I apply SMOTE or any other synthetic oversampling technique? I don't want to randomly replicate examples, since the copies add no new information and make the model prone to overfitting.
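To illustrate why I think SMOTE doesn't carry over directly, here is a minimal sketch of the interpolation step applied to the raw token IDs above. The helper name `interpolate` and the mixing weight `lam` are my own, and I pair the first two rows only to show the arithmetic (as I understand it, real SMOTE interpolates between nearest neighbours within the same minority class):

```python
# Toy data from the table above: integer token IDs, right-padded with 0.
sequences = [
    [1, 2, 3, 5, 0, 0, 0, 0],
    [4, 5, 2, 3, 5, 0, 0, 0],
    [3, 4, 0, 0, 0, 0, 0, 0],
]
labels = [3, 5, 7]


def interpolate(a, b, lam):
    """SMOTE-style synthetic point: a + lam * (b - a), element-wise.

    Hypothetical helper for illustration only; it is not part of
    imbalanced-learn or scikit-learn.
    """
    return [x + lam * (y - x) for x, y in zip(a, b)]


# Interpolate halfway between the first two sequences.
synthetic = interpolate(sequences[0], sequences[1], 0.5)
print(synthetic)
# → [2.5, 3.5, 2.5, 4.0, 2.5, 0.0, 0.0, 0.0]
```

The result contains fractional values such as 2.5, which are not valid token IDs, and the fifth position mixes a real token with padding. Rounding would not help either, since token IDs are arbitrary vocabulary indices rather than quantities with a meaningful order.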
Could someone give me suggestions as to how I can go about implementing this in Python?
Topic imbalanced-learn class-imbalance scikit-learn python machine-learning
Category Data Science