Splitting sentiment analysis training data into x-train and y-train for an RNN?
Suppose I have a dataset of user comments from multiple websites in which each row holds two comments: one considered more 'negative' and one more 'positive', as indicated by whether the comment sits in the 'negative' or the 'positive' column.
If I preprocess and vectorize the data, how should I split it into x-train and y-train for an RNN trained with categorical crossentropy? My first thought was to make each x-train datapoint a tuple of two comments, one voted more negative than the other, and each y-train datapoint a tuple (a, b) in which exactly one of a and b is 1 and the other is 0, marking which comment is the more negative one (or the more positive one, depending on how I choose to frame the comparison).
However, if I rename 'positive' to 'less negative', the second column of my dataset is 'less negative' and the third column is 'negative'. The natural arrangement is therefore x-train = ([less negative comment], [more negative comment]), but then y-train would always be (0, 1). I could get around this by scrambling the order of the comments within each pair so that the label is not predictable from position, but is this 'cheating'? That is, how can merely scrambling the order in the x-train data change the model's accuracy?
Reading this back, I also realize that y-train can simply be 0 or 1, the index of the 'more negative' comment. Regardless, is there a better way to go about this? Is there a more standard way to tackle it?
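To make the scrambling idea concrete, here is a minimal sketch of what I have in mind. It only illustrates the pairing and labeling step, not the vectorization or the model. The function name make_pairs, the example comments, and the fixed seed are all made up for illustration and are not from my actual dataset.

```python
import numpy as np

def make_pairs(less_negative, more_negative, seed=0):
    """Randomly order each comment pair and record where the more negative one landed.

    less_negative, more_negative: equal-length lists of comments (hypothetical names).
    Returns (x_train, y_train), where y_train[i] is the index (0 or 1) of the
    more negative comment inside x_train[i].
    """
    rng = np.random.default_rng(seed)
    n = len(less_negative)
    # swap[i] is True when the more negative comment is placed in slot 0
    swap = rng.random(n) < 0.5
    x_train = [
        (neg, pos) if s else (pos, neg)
        for pos, neg, s in zip(less_negative, more_negative, swap)
    ]
    # Label is the position of the more negative comment within the pair
    y_train = np.where(swap, 0, 1)
    return x_train, y_train

# Made-up example rows standing in for the two dataset columns
less_negative = ["thanks, this helped", "not my favorite", "could be better"]
more_negative = ["this is garbage", "I hate this", "worst thing ever"]

x_train, y_train = make_pairs(less_negative, more_negative)

# One-hot (a, b) version of the labels, as needed for categorical crossentropy
y_onehot = np.eye(2)[y_train]
```

With the integer labels in y_train, a single sigmoid output with binary crossentropy would presumably work just as well as the two-column one-hot version, since there are only two classes.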
Topic methodology rnn sentiment-analysis
Category Data Science