Binary Logistic Regression in R on the Titanic dataset

I am new to R and to machine learning. I am trying to fit a binary logistic regression on a training set taken from the Titanic dataset that ships with R. The outcome variable is Survived, which takes the values Yes and No. I split the dataset into two sets, training (40%) and test (60%). My code is below.

#Binary Logistic Regression
#Import dataset, Titanic
data(Titanic)
#Load the data into example as a data.frame
example <- as.data.frame(Titanic)
#Add a new column, Country, to record where they were born
example['Country'] <- NA
#Declare a vector of unique countries
countryunique <- array(c("Africa","USA","Japan","Australia","Sweden","UK","France"))
#Declare an empty vector
new_country <- c()
#Loop over the column, Country
for(loopitem in example$Country)
{
    #Randomly select one element of the array, countryunique
    loopitem <- sample(countryunique, 1)
    #Append the new value to the vector
    new_country <- c(new_country,loopitem)
}
#Overwrite the Country column with the new data
example$Country <- new_country

#Convert the columns to factors, and Freq to numeric
example$Class <- as.factor(example$Class)
example$Sex <- as.factor(example$Sex)
example$Age <- as.factor(example$Age)
example$Survived <- as.factor(example$Survived)
example$Country <- as.factor(example$Country)
example$Freq <- as.numeric(example$Freq)

#Split the dataset into training and test sets.
set.seed(20)
sample_size <- floor(0.6 * nrow(example))
test_index <- sample(seq_len(nrow(example)), size = sample_size)
#Put 60 percent of the rows into test
test <- example[test_index,]
#Put the remaining 40 percent of the rows into training
training <- example[-test_index, ]

#Logistic regression modelling
mod.lg <- glm(Survived~., family=binomial(), data=training)
#Provide the summary of the model
summary(mod.lg)

The summary of the model is shown below.

Call:
glm(formula = Survived ~ ., family = binomial(), data = training)

Deviance Residuals: 
            1             4             5             7            10            12            15
-0.0000040454 -0.0000024660 -0.0000104674 -0.0000024921 -0.0000107568 -0.0000000211 -0.0000000211
           16            21            22            23            26            30
-0.0000053423  0.0000107568  0.0000041004  0.0000005560  0.0000103920  0.0000024086

Coefficients: (1 not defined because of singularities)
                 Estimate   Std. Error z value Pr(>|z|)
(Intercept)           48.8492  822876.6829   0.000        1
Class2nd             -43.8783 1592352.1656   0.000        1
Class3rd             -39.2030  351691.5041   0.000        1
ClassCrew            -75.3682  822888.6960   0.000        1
SexFemale            -24.5969  819055.3208   0.000        1
AgeAdult              76.0607  827305.0519   0.000        1
Freq                  -0.6793    1165.4986  -0.001        1
CountryAustralia     -74.3782  849754.8545   0.000        1
CountryFrance         24.4715  895175.9026   0.000        1
CountrySweden        -47.8800  115169.7337   0.000        1
CountryUK             53.9582 1576877.4347   0.000        1
CountryUSA                 NA           NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 17.3232395027816  on 12  degrees of freedom
Residual deviance:  0.0000000005291  on  2  degrees of freedom
AIC: 22

Number of Fisher Scoring iterations: 25

I want to know whether I am on the right track in implementing binary logistic regression on the Titanic dataset. I noticed that the third column (z value) of the model summary contains many 0.000 entries. How can I fix this, and how should I interpret the summary of the model?

Thank you.


The z values (third column) are close to zero because every standard error is several orders of magnitude larger than its estimate, so the estimated coefficients are extremely uncertain. The p-values of 1 in the last column say the same thing. Essentially, your model did not learn anything useful.
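As a small worked example of how the third and fourth columns follow from the first two, here is the arithmetic for the Class2nd row. The numbers are copied from the summary in the question; the variable names are only illustrative.

```r
# Values copied from the Class2nd row of the summary in the question
estimate  <- -43.8783
std_error <- 1592352.1656

# Wald z statistic: the estimate measured in units of its standard error
z_value <- estimate / std_error

# Two-sided p-value under the standard normal distribution
p_value <- 2 * pnorm(-abs(z_value))

print(round(z_value, 3))
print(round(p_value, 3))
```

Because the standard error is so much larger than the estimate, the z value rounds to zero and the p-value rounds to 1, which matches what the summary prints.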

When you look at the confidence intervals of the estimated coefficients, confint(mod.lg), you see that each coefficient could be negative or positive. This is what a high p-value indicates: the coefficients are not statistically different from zero.
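To make the width of those intervals concrete, here is a rough sketch that builds 95% Wald intervals (estimate plus or minus about 1.96 standard errors) by hand for three rows of the summary in the question. This is an approximation chosen for illustration; confint(mod.lg) may use a different method (profile likelihood), so its numbers need not match these exactly.

```r
# Estimates and standard errors copied from the summary in the question
coefs <- data.frame(
  term      = c("Class2nd", "SexFemale", "AgeAdult"),
  estimate  = c(-43.8783, -24.5969, 76.0607),
  std_error = c(1592352.1656, 819055.3208, 827305.0519)
)

# 95% Wald interval: estimate +/- z * standard error
z_crit <- qnorm(0.975)
coefs$lower <- coefs$estimate - z_crit * coefs$std_error
coefs$upper <- coefs$estimate + z_crit * coefs$std_error

# TRUE when the interval contains zero
coefs$covers_zero <- coefs$lower < 0 & coefs$upper > 0
print(coefs)
```

Each interval spans millions in both directions around zero, so the sign of the effect cannot be determined from this fit.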

Why is that?

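  1. You have very little training data. The training set has only 13 rows and the model estimates 11 coefficients, which leaves two residual degrees of freedom. That is far too few: the model fits the training rows almost exactly (the residual deviance is essentially zero) and the standard errors become enormous.
  2. You seem to assign "Country" randomly. There is no useful information in this random variable.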
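One related point may be worth checking. as.data.frame(Titanic) has one row per combination of Class, Sex, Age and Survived, and Freq is the number of passengers in that combination, so it is a count rather than a passenger attribute. The following is a minimal sketch, not the only possible fix, that keeps all rows, leaves out the random Country column, and passes Freq as case weights instead of using it as a predictor. The names titanic_df and mod_weighted are illustrative.

```r
data(Titanic)
titanic_df <- as.data.frame(Titanic)

# How many rows the table has, and how many passengers it represents
n_rows       <- nrow(titanic_df)
n_passengers <- sum(titanic_df$Freq)
print(c(rows = n_rows, passengers = n_passengers))

# Freq is a count, so it enters as a case weight rather than as a predictor
mod_weighted <- glm(Survived ~ Class + Sex + Age,
                    family = binomial(),
                    data = titanic_df,
                    weights = Freq)
summary(mod_weighted)
```

With the counts used as weights, every passenger contributes to the fit, so the standard errors should be far smaller than in the summary from the question.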

I suggest using other data with more observations. Logistic regression (with R labs) is explained very well in "An Introduction to Statistical Learning" (Ch. 4). It may help to read that chapter and work through the labs to get a sound understanding of logit models.
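If you want to stay with the built-in Titanic table while practising, one option is to expand it to one row per passenger before splitting. This is only a sketch under assumptions of mine: the names passengers, train_index, mod_lg and accuracy are illustrative, the split uses 60% of rows for training (the reverse of the split in the question), and 0.5 is an arbitrary classification threshold.

```r
data(Titanic)
titanic_df <- as.data.frame(Titanic)

# Repeat each row Freq times to get one row per passenger
passengers <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq),
                         c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

# Split into training (60%) and test (40%) sets
set.seed(20)
train_index <- sample(seq_len(nrow(passengers)),
                      size = floor(0.6 * nrow(passengers)))
training <- passengers[train_index, ]
test     <- passengers[-train_index, ]

mod_lg <- glm(Survived ~ ., family = binomial(), data = training)

# Predicted probability of Survived == "Yes" on the held-out rows
prob_yes  <- predict(mod_lg, newdata = test, type = "response")
predicted <- ifelse(prob_yes > 0.5, "Yes", "No")

# Confusion matrix and accuracy on the test set
print(table(predicted = predicted, actual = test$Survived))
accuracy <- mean(predicted == test$Survived)
print(accuracy)
```

The confusion matrix and accuracy give a first check of how well the fitted model generalises to rows it has not seen.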
