Sudden jumps in accuracy with logistic regression and bag of words: "glm.fit: algorithm did not converge"
I am building a bag-of-words model on the Toxic Comment Classification Challenge. The challenge is closed, but the dataset is a very nice one to learn with.
I use R, tf-idf, tm, and logistic regression.
I see a strange pattern in the accuracy results, linked to the warning "glm.fit: algorithm did not converge". I tried the solution proposed in other answers and multiplied maxit by 4, but it did not help.
Glimpse of the functions used
Sub-sampling
Original distribution is 200K non-toxic (0) and 20K toxic (1)
library(dplyr)  # provides bind_rows
set.seed(42)
df_toxic = df[df$toxic == 1,]
df_ok = df[df$toxic == 0,]
df_ok_sampled = df_ok[sample(nrow(df_ok), nrow(df_toxic)), ]
df_sub = bind_rows(df_ok_sampled,df_toxic)
Bag of words creation
# Words
control_list_words = list(
tokenize = words,
language = "en",
bounds = list(global = c(100, Inf)),
weighting = weightTfIdf,
tolower = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
stemming = TRUE
)
dtm_words = DocumentTermMatrix(corpus, control=control_list_words)
# nGrams
control_list_ngrams = list(
tokenize = nGramsTokenizer,
language = "en",
bounds = list(global = c(1000, Inf)),
weighting = weightTfIdf,
tolower = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
# Stop words are kept for n-grams, since structures like "are a" or "such a" are meaningful in toxic comments
stopwords = FALSE,
stemming = TRUE
)
dtm_ngrams = DocumentTermMatrix(corpus, control=control_list_ngrams)
# Merge the two
X = cbind(m_words,m_ngrams)
Remove correlations
highlyCor = findCorrelation(cor(bow), cutoff = cutoff, exact = TRUE)
pruned_bow = bow[,-as.vector(highlyCor)]
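One detail worth checking when sweeping the cutoff: findCorrelation returns integer(0) when no pair of columns exceeds the cutoff, and negative indexing with an empty vector drops every column instead of none. A minimal base R sketch of a guard, on a toy matrix (the toy data and the name prune_columns are illustrative assumptions, not part of the original pipeline):

```r
# Toy stand-in for the bag-of-words matrix (assumption: 5 documents, 4 terms).
set.seed(42)
bow <- matrix(rnorm(20), nrow = 5,
              dimnames = list(NULL, paste0("term", 1:4)))

# Stand-in for the result of findCorrelation() when nothing exceeds the cutoff.
highlyCor <- integer(0)

# Naive pruning: -integer(0) is still integer(0), so ALL columns are dropped.
naive <- bow[, -highlyCor, drop = FALSE]

# Guarded pruning (hypothetical helper): only drop columns when there are any to drop.
prune_columns <- function(m, idx) {
  if (length(idx) > 0) m[, -idx, drop = FALSE] else m
}
pruned <- prune_columns(bow, highlyCor)

ncol(naive)   # 0 columns left
ncol(pruned)  # all 4 columns kept
```

With a non-empty index vector both versions behave identically, so the guard only matters at cutoffs where nothing is flagged.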
Logistic regression
f <- glm(df_toxic ~ ., data=df_train, maxit = 100, family = 'binomial')
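To make the glm warnings easier to inspect, here is a minimal base R sketch that captures them and checks the convergence flag. It runs on toy, perfectly separable data (an assumption chosen because separation is a well-known way to trigger glm.fit warnings; it is not the Toxic Comments data, and whether separation is what happens in the real model is not established here):

```r
# Toy data with perfect separation: y is 1 exactly when x > 0.
x <- c(-5:-1, 1:5)
toy <- data.frame(x = x, y = as.integer(x > 0))

# Collect warning messages instead of letting them print.
fit_warnings <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial,
      control = glm.control(maxit = 100)),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

fit_warnings        # messages raised by glm.fit during the fit
fit$converged       # did IRLS meet its convergence criterion?
fit$iter            # number of IRLS iterations actually used
range(fitted(fit))  # fitted probabilities pushed towards 0 and 1
coef(fit)           # a very large slope is a typical sign of separation
```

Running the same checks on the real fit (fit$converged, fit$iter, and the range of fitted probabilities) would show whether the iteration limit is really what stops the algorithm.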
Correlation cutoff vs accuracy
Shape of the confusion matrix
In the high-dimension but low-accuracy intervals: unbalanced
          Reference
Prediction    0    1
         0 2530 3253
         1  243 5598
In the high accuracy intervals: balanced
          Reference
Prediction    0    1
         0 4883  900
         1  641 5200
In the low-dimension, low-accuracy intervals: unbalanced the other way
          Reference
Prediction    0    1
         0 5272  511
         1 3239 2602
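For reference, the accuracy implied by each of the three matrices above can be recomputed directly from the counts. A minimal base R sketch; the counts are copied from the tables, while the names (cms, accuracy) are illustrative. Accuracy only uses the diagonal, so it does not depend on which axis holds the predictions:

```r
# Build a 2x2 confusion matrix from counts given row by row, as printed above.
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
}

cms <- list(
  high_dim_low_acc = make_cm(c(2530, 3253,  243, 5598)),
  high_acc         = make_cm(c(4883,  900,  641, 5200)),
  low_dim_low_acc  = make_cm(c(5272,  511, 3239, 2602))
)

# Accuracy = correctly classified (diagonal) / all observations.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc <- sapply(cms, accuracy)
round(acc, 3)
# high_dim_low_acc: 0.699, high_acc: 0.867, low_dim_low_acc: 0.677
```

All three matrices cover the same 11624 observations, so the accuracies are directly comparable.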
???
Do you know what exactly the "glm.fit: algorithm did not converge" warning means, and why raising maxit to 100 did not help?
Thanks