Sudden jumps in accuracy with logistic regression and bag of words: "glm.fit: algorithm did not converge"

I am working on a bag-of-words model for the Toxic Comment Classification Challenge. The challenge is closed, but the dataset is very good for learning.

I use R, tf-idf, tm, and logistic regression.

I see a strange pattern in the accuracy results, linked to the warning: glm.fit: algorithm did not converge. I tried the solution proposed in other answers and multiplied maxit by 4 (from the default of 25 to 100), but it did not help.

Glimpse of the functions used

Sub-sampling

The original distribution is 200K non-toxic (0) and 20K toxic (1) comments, so I down-sample the non-toxic class to the size of the toxic class.

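library(dplyr)  # provides bind_rows

set.seed(42)

# Split by class
df_toxic = df[df$toxic == 1,]
df_ok = df[df$toxic == 0,]

# Down-sample the non-toxic rows to the number of toxic rows
df_ok_sampled = df_ok[sample(nrow(df_ok), nrow(df_toxic)), ]

df_sub = bind_rows(df_ok_sampled,df_toxic)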
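To check that this step really yields balanced classes, here is a minimal sketch on a toy data frame. The toy `df`, its `id` column and the 1000/100 class sizes are assumptions made for illustration, not the real data, and base `rbind` is used in place of `dplyr::bind_rows` so that the snippet has no dependencies.

```r
set.seed(42)

# Toy stand-in for the real data: 1000 non-toxic (0) and 100 toxic (1) rows
df <- data.frame(id = 1:1100, toxic = c(rep(0, 1000), rep(1, 100)))

df_toxic <- df[df$toxic == 1, ]
df_ok    <- df[df$toxic == 0, ]

# Draw as many non-toxic rows as there are toxic rows (without replacement)
df_ok_sampled <- df_ok[sample(nrow(df_ok), nrow(df_toxic)), ]

# rbind is used here instead of dplyr::bind_rows
df_sub <- rbind(df_ok_sampled, df_toxic)

# Both classes should now have the same number of rows
class_counts <- table(df_sub$toxic)
print(class_counts)
```

If the two counts differ on the real data, the imbalance seen in the confusion matrices could come from the sampling step rather than from the model.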

Bag of words creation


# Words 

control_list_words = list(
    tokenize = words,
    language = "en",
    bounds = list(global = c(100, Inf)),
    weighting = weightTfIdf,
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    stemming = TRUE
)

dtm_words = DocumentTermMatrix(corpus, control=control_list_words) 
  
# nGrams

control_list_ngrams = list(
    tokenize = nGramsTokenizer,
    language = "en",
    bounds = list(global = c(1000, Inf)),
    weighting = weightTfIdf,
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    # We don't remove stop-words for nGrams, as structures like "are a" or "such a" are meaningful for toxic comments
    stopwords = FALSE,
    stemming = TRUE
)
  
dtm_ngrams = DocumentTermMatrix(corpus, control=control_list_ngrams)
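The definition of nGramsTokenizer is not shown above. As a rough idea of what a tokenizer of this kind does, here is a base-R sketch that produces bigrams. `bigram_tokens` is a hypothetical stand-in written for illustration, and the real nGramsTokenizer may differ (for example in the n-gram length or in how it splits words).

```r
# Hypothetical bigram tokenizer: lower-case, split on non-letters,
# then paste each word together with the word that follows it.
bigram_tokens <- function(x) {
  toks <- unlist(strsplit(tolower(as.character(x)), "[^a-z]+"))
  toks <- toks[nzchar(toks)]            # drop empty strings from leading separators
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}

print(bigram_tokens("You are a fool!"))
```

With stop-words kept, a pair such as "are a" survives as a term, which is the reason given in the control list for setting stopwords = FALSE.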

# Merge the two
X = cbind(m_words,m_ngrams)

Remove correlations

highlyCor = findCorrelation(cor(bow), cutoff = cutoff, exact = TRUE)

pruned_bow = bow[,-as.vector(highlyCor)]
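One thing that may be worth checking in this step is what happens when no column is flagged for removal. As far as I understand, findCorrelation can return an empty vector in that case, and negative indexing with an empty vector drops every column instead of none. The sketch below shows this with base R only. The toy `bow` matrix and the `drop_columns` helper are assumptions made for illustration.

```r
# Toy 3 x 4 matrix standing in for the bag-of-words matrix (illustrative only)
bow <- matrix(1:12, nrow = 3, dimnames = list(NULL, c("a", "b", "c", "d")))

# Hypothetical case: nothing was flagged for removal
highlyCor <- integer(0)

# Naive negative indexing: -integer(0) is still integer(0),
# so this selects zero columns rather than all of them
naive <- bow[, -as.vector(highlyCor)]

# Guarded version: only use negative indexing when there is something to remove
drop_columns <- function(m, idx) {
  if (length(idx) > 0) m[, -idx, drop = FALSE] else m
}

guarded <- drop_columns(bow, highlyCor)   # keeps all 4 columns
pruned  <- drop_columns(bow, c(2L, 4L))   # removes columns "b" and "d"

print(dim(naive))
print(dim(guarded))
print(colnames(pruned))
```

Whether this situation occurs for any of the cutoffs used here is not established. It is only a cheap check to rule out.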

Logistic regression

f <- glm(df_toxic ~ ., data=df_train, maxit = 100, family = 'binomial')
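To see what the fit reports beyond the warning text, the convergence flag and iteration count stored in the fitted object can be inspected. The sketch below is a minimal example under stated assumptions: `fit_with_warnings` is a hypothetical helper, and both toy data sets are invented. The second data set is perfectly separated, which is one known way to make a logistic regression misbehave. Whether separation is what happens with the bag-of-words features is an assumption, not something established here.

```r
# Hypothetical helper: fit a logistic regression, collect warnings instead of
# printing them, and report the convergence flag and number of IRLS iterations.
fit_with_warnings <- function(formula, data, maxit = 25) {
  msgs <- character(0)
  fit <- withCallingHandlers(
    glm(formula, data = data, family = binomial(),
        control = glm.control(maxit = maxit)),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(converged = fit$converged, iterations = fit$iter,
       warnings = msgs, coefficients = coef(fit))
}

set.seed(1)

# Well-behaved toy data: the classes overlap
n  <- 500
x1 <- rnorm(n)
d_ok <- data.frame(x = x1, y = rbinom(n, 1, plogis(0.5 * x1)))
ok <- fit_with_warnings(y ~ x, d_ok)

# Perfectly separated toy data: every x < 0 is class 0, every x > 0 is class 1
d_sep <- data.frame(x = c(-4, -3, -2, -1, 1, 2, 3, 4),
                    y = c(0, 0, 0, 0, 1, 1, 1, 1))
sep <- fit_with_warnings(y ~ x, d_sep, maxit = 100)

print(ok[c("converged", "iterations", "warnings")])
print(sep)
```

On separated data the likelihood has no finite maximum, so the coefficients keep growing as more iterations are allowed. That would be consistent with a larger maxit not helping, if separation is indeed the cause.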

Correlation cutoff vs accuracy (plot not preserved)

Shape of the confusion matrix

In the high-dimension yet low-accuracy intervals: unbalanced

           Reference
Prediction    0    1
         0 2530 3253
         1  243 5598

In the high-accuracy intervals: balanced

           Reference
Prediction    0    1
         0 4883  900
         1  641 5200

In the low-dimension, low-accuracy intervals: unbalanced the other way

           Reference
Prediction    0    1
         0 5272  511
         1 3239 2602
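For reference, the accuracy implied by each of the three matrices above can be recomputed directly from the counts. The sketch below uses base R only. The names `high_dim`, `balanced` and `low_dim` are labels chosen here for the three cases, and `make_cm` is a small illustrative helper.

```r
# Build a 2 x 2 confusion matrix from counts given row by row,
# using the same layout as the tables above (rows: Prediction, columns: Reference)
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
}

cms <- list(
  high_dim = make_cm(c(2530, 3253,  243, 5598)),
  balanced = make_cm(c(4883,  900,  641, 5200)),
  low_dim  = make_cm(c(5272,  511, 3239, 2602))
)

# Accuracy = correctly classified (diagonal) / all observations
accuracy <- sapply(cms, function(cm) sum(diag(cm)) / sum(cm))

# Marginal totals, to see which side of the table stays constant
row_totals <- sapply(cms, rowSums)
col_totals <- sapply(cms, colSums)

print(round(accuracy, 3))  # high_dim 0.699, balanced 0.867, low_dim 0.677
print(row_totals)
print(col_totals)
```

The row totals are the same in all three matrices (5783 and 5841), while the column totals change. That may be worth checking against how the tables were built.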

???

Do you know what exactly this glm.fit: algorithm did not converge warning means, and why raising maxit to 100 did not help?

Thanks
