Tokenizer returning incorrect values and losing a lot of data

(Cross-posted from the main Stack Overflow site.) This is a weird situation, so I hope I can explain it correctly. My partner and I are working on an ML project where we build a model that predicts whether a Reddit comment is sarcastic or not. (Data set for reference.) We created our model from the training data CSV (all seems good) and now want to evaluate it on the testing data CSV. To do so, we split the testing data after importing it as JSON on Colab. The initial import came out with an odd shape (8, 500k+), so we transposed it with pandas to get the shape our model expects (500k+, 8).
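For reference, the import and transpose step amounts to something like the sketch below (the file name is a placeholder for the actual path in our notebook, and the read_json options may differ from what we actually ran):

import pandas as pd

test_df = pd.read_json("test_data.json")  # placeholder path; this came in with shape (8, 500000+)
Test_transposed = test_df.T               # transpose so each row is one comment: (500000+, 8)
print(Test_transposed.shape)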

Here is the issue: when we tokenize the values of the testing data (which printed out correctly before), we somehow end up with only 8 values out of 500k+! Here is our code:

# Imports (we are using the Keras preprocessing Tokenizer / pad_sequences from TensorFlow)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

training_sentences = Test_transposed[:115958]  # our transposed dataframe of test data, shape (115958, 8)
testing_sentences = Test_transposed[115958:]   # shape (463832, 8)
training_labels = labels[:115958]
testing_labels = labels[115958:]

# Use a ~80/20 train/test split based on our prior model
sarcasm_train = Test_transposed[:115958]
sarcasm_test = Test_transposed[115958:]

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)  # vocab_size = 10000 in this case; we have lowered and raised it with no change

tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
# this output is a list of length 8, with the values [[2], [3], [4], [5], [6], ..., [10]]
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
# ndarray with shape (8, 50)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
# this output is a list of length 8, with the values [[2], [3], [4], [5], [6], ..., [10]]
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
# ndarray with shape (8, 50)

So somehow 1) a lot of data is being lost, because we only get 8 outputs back, and 2) the training and testing results end up the same, because training_sequences and testing_sequences are literally identical despite coming from different input slices. Any help would be appreciated, thank you.
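In case it helps with diagnosis, this is the kind of sanity check we can run around the tokenizer calls (nothing beyond standard pandas and Python; the expected numbers are the shapes quoted above):

print(type(training_sentences))   # a pandas DataFrame sliced from Test_transposed
print(training_sentences.shape)   # (115958, 8)
print(testing_sentences.shape)    # (463832, 8)
print(len(training_sequences))    # 8, not 115958 (this is where the data disappears)
print(len(testing_sequences))     # also 8, and element for element the same as training_sequences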

Edit: We did not have this issue when we initially tokenized the training set; it only happens with this set. The output of model.predict(testing_padded) is something like [[0.467..], [0.467..], [0.467..], [0.467..], [0.467..]...x8], where all the values are nearly identical. This is a definite contrast to model.predict(training_df) on the training data, which gives specific and varied values (and a lot more than just 8 of them!).
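For completeness, the prediction step itself is nothing special; based on the shapes above it is effectively:

preds = model.predict(testing_padded)  # testing_padded has shape (8, 50)
print(len(preds))                      # only 8 predictions come back
print(preds)                           # roughly [[0.467...], [0.467...], ..., [0.467...]]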
