Is this random forest logically correct and correctly implemented with R and gbm?

For professional reasons I want to learn and understand random forests. I am unsure whether my understanding is correct or whether I am making logical errors.

I have a data set with 15 million entries and want to run a regression on a numerical target (time). The data structure is:

I have 7 categorical variables, 1 date and 4 numerical features. After data preparation I split the data into a training and a test data set.

Then I defined a gradient boosting machine model and searched for the right parameters by trial and error, research, and more trial and error. Is this approach correct so far?
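For reference, here is a minimal sketch of a more systematic alternative to pure trial and error: enumerate a small grid of candidate settings, fit one model per row, and keep the row with the lowest cross-validation error. The actual `gbm` call is left as a comment because it needs the package and the real data; the candidate values are only illustrative.

```r
# Enumerate candidate hyperparameter combinations (values are examples only).
paramGrid <- expand.grid(
  shrinkage         = c(0.01, 0.03, 0.1),  # learning rate
  interaction.depth = c(3, 5, 7),          # depth of each tree
  n.minobsinnode    = c(10, 15, 30)        # min. observations per terminal node
)

cvError <- numeric(nrow(paramGrid))
for (i in seq_len(nrow(paramGrid))) {
  p <- paramGrid[i, ]
  # model <- gbm(target ~ ., data = train, distribution = "gaussian",
  #              n.trees = 500, cv.folds = 5,
  #              shrinkage = p$shrinkage,
  #              interaction.depth = p$interaction.depth,
  #              n.minobsinnode = p$n.minobsinnode)
  # cvError[i] <- min(model$cv.error)   # best CV error for this setting
}
# best <- paramGrid[which.min(cvError), ]
```

With three values per parameter this grid has 27 rows; on 15 million rows you would probably tune on a subsample first.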

#train and test are the prepared data frames
#RMSE is an implementation of the root mean squared error function
#gradient boosting machine training model
gbmTraining <- gbm(target ~ ., data = train, distribution = "gaussian", verbose = TRUE,
                 n.trees = 500,          #number of trees
                 cv.folds = 5,           #cross-validation folds
                 n.cores = 2,            #number of processor cores
                 shrinkage = .03,        #learning rate
                 interaction.depth = 42, #depth of each tree
                 n.minobsinnode = 15)    #minimum observations per terminal node
print(gbmTraining)
png(file = "Results/RelativeInfluence.png")
summary.gbm(gbmTraining, plot = TRUE)
dev.off()

print("best number of trees: ")
png(file = "Results/Convergence.png")
best_iter <- gbm.perf(gbmTraining, method="cv", plot.it=TRUE)
dev.off()
print(best_iter)

#make train prediction
fitTrain <- predict(gbmTraining, train, n.trees = best_iter, type = "response")

#training error 
errorTrain <- RMSE(fitTrain, test$target)
errorTrain <- round(errorTrain, digits = 1)
errorDiffTrain <- fitTrain - test$target
print("meanTrainError(predicted, actual):")
print(mean(errorDiffTrain))

#write and save results
print("summary training")
print(head(data.frame("Actual" = train$target, "Predicted" = fitTrain)))
print(summary(data.frame("Actual" = train$target, "Predicted" = fitTrain)))
trainTable <- data.frame("Actual" = train$target, "Predicted" = fitTrain)

write.csv(trainTable, "Results/trainTable.csv", row.names = FALSE)

Here I want to know whether I use the correct arguments when I calculate the errors. Should I calculate the errors between the fitted values and the data from the test data set, or should I use the data from the training data set here?
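To make my question concrete, here is a small sketch of the RMSE helper (this is my assumption of what such a function looks like) and the two pairings I am deciding between:

```r
# Assumed implementation of the root mean squared error helper.
RMSE <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# Variant A: fitted training values against the training targets
# errorTrain <- RMSE(fitTrain, train$target)

# Variant B: fitted training values against the test targets (what I do above)
# errorTrain <- RMSE(fitTrain, test$target)

RMSE(c(1, 2, 3), c(1, 2, 5))  # toy check: sqrt(4/3)
```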

Once I have found good parameters and want to move on to validation, the implementation becomes shorter because I reuse the found parameters. I create a new model for the tests:

#gbm testing model 
gbmTesting <- gbm(target ~ ., data = test, distribution = "gaussian", n.trees = 500,
                 n.cores = 2, verbose = FALSE,
                 shrinkage = .03,
                 interaction.depth = 42,
                 n.minobsinnode = 15)

#make test prediction
fitTest <- predict(gbmTesting, newdata = test, n.trees = 500, type = "response")

#test error 
errorTest <- RMSE(fitTest, test$target)
errorTest <- round(errorTest, digits = 1)
errorDiffTest <- fitTest - test$target
print("meanTestError(predicted, actual):")
print(mean(errorDiffTest))

print("summary testing")
print(head(data.frame("Actual" = test$target, "Predicted" = fitTest)))
print(summary(data.frame("Actual" = test$target, "Predicted" = fitTest)))
testTable <- data.frame("Actual" = test$target, "Predicted" = fitTest)

write.csv(testTable, "Results/testTable.csv", row.names = FALSE)

Is this correct, or am I falling into a mental tripping hazard?

P.S. The Stack Exchange community is so huge that I am not sure whether this belongs on one of the other sites. If yes, which one?

Topic: gbm, cross-validation, random-forest, r

Category: Data Science


  1. Is the code correct? Yes, it looks correct to me.

  2. How to calculate the errors: compute the error metric for the training data and for the test data separately.

If the train error is high (RMSE in your case): high bias. Retrain the model with more trees, a lower learning rate, or more data (if possible).

If the train error is low but the test error is high: high variance, i.e. overfitting. Add regularization or increase the number of k-fold splits.

If both train and test error are low, your model is good to go.
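A minimal sketch of that check. Note that both errors come from the one model fitted on the training data; the test set is only predicted on, never trained on. `lm` stands in for the boosted model here so the sketch runs on base R alone, and the data are synthetic:

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- 3 * x + rnorm(n, sd = 0.2)
trainIdx <- 1:150
trainDf <- data.frame(x = x[trainIdx], y = y[trainIdx])
testDf  <- data.frame(x = x[-trainIdx], y = y[-trainIdx])

model <- lm(y ~ x, data = trainDf)           # stand-in for the gbm fit
rmse  <- function(p, a) sqrt(mean((p - a)^2))

errTrain <- rmse(predict(model, trainDf), trainDf$y)  # train predictions vs. train target
errTest  <- rmse(predict(model, testDf),  testDf$y)   # test predictions vs. test target
# Compare errTrain and errTest to diagnose bias vs. variance as above.
```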

Hope this answers your question.
