Sudden jumps in accuracy with logistic regression and bag of words: "algorithm did not converge"

I am building a bag-of-words model on the Toxic Comment Classification Challenge. The challenge is closed, but the dataset is a good one to learn from.

I use R, tf-idf, tm, and logistic regression.

I see a strange pattern in the accuracy results, linked with the warning "algorithm did not converge". I tried the solution proposed in other answers and multiplied maxit by 4 (from the default 25 to 100), but it did not help.

Glimpse of the functions used


Original distribution is 200K non-toxic (0) and 20K toxic (1)


df_toxic = df[df$toxic == 1,]
df_ok = df[df$toxic == 0,]

df_ok_sampled = df_ok[sample(nrow(df_ok), nrow(df_toxic)), ]

df_sub = bind_rows(df_ok_sampled,df_toxic)
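
A minimal sketch of the same downsampling step with a fixed seed and a balance check, in case it helps reproduce the results. The toy `df` below is hypothetical (200 non-toxic and 20 toxic rows instead of 200K and 20K), and `rbind` stands in for `bind_rows` so that only base R is needed.

```r
# Hypothetical toy data with the same 10:1 imbalance as the real dataset
set.seed(42)  # fixed seed so the sampled rows are reproducible
df <- data.frame(id = 1:220, toxic = c(rep(0, 200), rep(1, 20)))

df_toxic <- df[df$toxic == 1, ]
df_ok <- df[df$toxic == 0, ]

# Downsample the majority class to the size of the minority class
df_ok_sampled <- df_ok[sample(nrow(df_ok), nrow(df_toxic)), ]

df_sub <- rbind(df_ok_sampled, df_toxic)

# Both classes should now have the same number of rows
class_counts <- table(df_sub$toxic)
print(class_counts)
```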

Bag of words creation

# Words 

control_list_words = list(
    tokenize = words,
    bounds = list(global = c(100, Inf)),
    weighting = weightTfIdf,
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    stemming = TRUE
)

dtm_words = DocumentTermMatrix(corpus, control=control_list_words)

# nGrams

control_list_ngrams = list(
    tokenize = nGramsTokenizer,
    bounds = list(global = c(1000, Inf)),
    weighting = weightTfIdf,
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    # We don't remove stop words for nGrams, as structures like "are a" or "such a" are meaningful for toxic comments
    stopwords = FALSE,
    stemming = TRUE
)

dtm_ngrams = DocumentTermMatrix(corpus, control=control_list_ngrams)

# Merge the two
X = cbind(m_words,m_ngrams)

Remove correlations

highlyCor = findCorrelation(cor(bow), cutoff = cutoff, exact = TRUE)

pruned_bow = bow[,-as.vector(highlyCor)]
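
One thing worth checking at this step: when no pair of columns exceeds the cutoff, `findCorrelation` returns `integer(0)`, and negative indexing with an empty vector drops every column instead of none. The sketch below shows this in base R with a hypothetical 2 x 3 matrix standing in for `bow`; `keep` is a name introduced only for illustration.

```r
# Hypothetical stand-in for the bag-of-words matrix
bow <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# What findCorrelation returns when nothing exceeds the cutoff
highlyCor <- integer(0)

# -integer(0) is still integer(0), so this selects zero columns
n_cols_negative_index <- ncol(bow[, -highlyCor, drop = FALSE])

# Safer: compute the columns to keep explicitly
keep <- setdiff(seq_len(ncol(bow)), highlyCor)
pruned_bow <- bow[, keep, drop = FALSE]

print(n_cols_negative_index)
print(ncol(pruned_bow))
```

If some cutoffs in the sweep produce an empty `highlyCor`, the model for those cutoffs would be fitted on a matrix with no features, which may or may not be related to the jumps.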

Logistic regression

f <- glm(df_toxic ~ ., data=df_train, maxit = 100, family = 'binomial')
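
A minimal sketch of how the convergence status can be inspected per fit, rather than only reading the console warnings. The `toy` data frame and the name `warnings_seen` are hypothetical; `converged` and `iter` are components of the object returned by `glm`, and `maxit` is passed here through `glm.control`.

```r
# Hypothetical toy data with overlapping classes (not perfectly separable)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# Collect warning messages instead of letting them scroll past
warnings_seen <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial,
      control = glm.control(maxit = 100)),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(fit$converged)   # TRUE if the IWLS iterations met the tolerance
print(fit$iter)        # number of iterations actually used
print(warnings_seen)   # any warnings raised during fitting
```

Recording `fit$converged` and `fit$iter` for each correlation cutoff would show whether the accuracy jumps line up exactly with the fits that did not converge.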

Correlation cutoff vs accuracy (plot)

Shape of the confusion matrix

In the high-dimension, low-accuracy intervals: unbalanced

Prediction    0    1
         0 2530 3253
         1  243 5598

In the high-accuracy intervals: balanced

Prediction    0    1
         0 4883  900
         1  641 5200

In the low-dimension, low-accuracy intervals: unbalanced the other way

Prediction    0    1
         0 5272  511
         1 3239 2602
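
For reference, the accuracy implied by each of the three confusion matrices above can be recomputed as the sum of the diagonal over the total. This is a small base R sketch; the matrix names and the `accuracy` helper are introduced here for illustration only, and the counts are copied row by row from the tables.

```r
# Counts copied from the three confusion matrices, row by row
cm_high_dim <- matrix(c(2530, 3253,  243, 5598), nrow = 2, byrow = TRUE)
cm_balanced <- matrix(c(4883,  900,  641, 5200), nrow = 2, byrow = TRUE)
cm_low_dim  <- matrix(c(5272,  511, 3239, 2602), nrow = 2, byrow = TRUE)

# Accuracy = correctly classified (diagonal) / all observations
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc <- c(
  high_dim = accuracy(cm_high_dim),
  balanced = accuracy(cm_balanced),
  low_dim  = accuracy(cm_low_dim)
)
print(round(acc, 3))
```

This gives roughly 0.699, 0.867 and 0.677. The row totals (5783 and 5841) are identical in all three tables, so only the split within each row changes between intervals.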


What exactly does "algorithm did not converge" mean, and why did raising maxit to 100 not help?



