Is this random forest logically correct and correctly implemented with R and gbm?
For professional reasons I want to learn and understand random forests. I am unsure whether my understanding is correct or whether I am making logical errors.
I have a data set with 15 million entries and want to run a regression on a numerical target (time). The data structure is:
7 categorical variables, 1 date, and 4 numerical features. After data preparation I split the data into a training and a test data set.
Then I defined a gradient boosting machine model and searched for the right parameters by trial and error, research, and more trial and error. Is this approach correct so far?
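For reference, a minimal sketch of such a split in base R could look like the following; the data frame name `dat`, the column names, and the 80/20 ratio are illustrative assumptions, not taken from the original code:

```r
# Minimal sketch of a random train/test split in base R.
# `dat`, its columns, and the 80/20 ratio are illustrative assumptions.
set.seed(42)
dat <- data.frame(x1 = rnorm(1000), x2 = runif(1000))
dat$target <- 2 * dat$x1 + rnorm(1000)

idx   <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
train <- dat[idx, ]   # 800 rows for fitting
test  <- dat[-idx, ]  # 200 held-out rows for evaluation
```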
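Such a manual search can be made more systematic with a small parameter grid; the following is only a sketch with illustrative values (the `gbm` fitting call for each row is omitted):

```r
# Sketch of a systematic alternative to pure trial and error:
# enumerate a small grid of candidate hyperparameters, fit a model
# for each row (gbm call omitted here), and keep the setting with
# the lowest cross-validation error. Values are illustrative.
grid <- expand.grid(
  shrinkage         = c(0.01, 0.03, 0.1),
  interaction.depth = c(3, 6, 9),
  n.minobsinnode    = c(10, 15)
)
nrow(grid)  # 18 candidate settings to evaluate
```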
#train and test are the prepared data frames
#RMSE is an implementation of the root mean squared error function
#gradient boosting machine training model
gbmTraining <- gbm(train$target ~ ., data = train, distribution = "gaussian", verbose = TRUE,
                   n.trees = 500,          # number of trees
                   cv.folds = 5,           # cross-validation folds
                   n.cores = 2,            # number of processor cores
                   shrinkage = .03,        # learning rate
                   interaction.depth = 42, # maximum depth of each tree
                   n.minobsinnode = 15)    # minimum observations in a terminal node
print(gbmTraining)
png(file = "Results/RelativeInfluence.png")
summary.gbm(gbmTraining, plot = TRUE)
dev.off()
print("best number of trees: ")
png(file = "Results/Convergence.png")
best_iter <- gbm.perf(gbmTraining, method = "cv", plot.it = TRUE)
dev.off()
print(best_iter)
#make train prediction
fitTrain <- predict(gbmTraining, newdata = train, n.trees = best_iter, type = "response")
#training error
errorTrain <- RMSE(fitTrain, test$target)
errorTrain <- round(errorTrain, digits = 1)
errorDiffTrain <- fitTrain - test$target
print("meanTrainError(predicted, actual):")
print(mean(errorDiffTrain))
#write and save results
print("summary training")
print(head(data.frame("Actual" = train$target, "Predicted" = fitTrain)))
print(summary(data.frame("Actual" = train$target, "Predicted" = fitTrain)))
trainTable <- data.frame("Actual" = train$target, "Predicted" = fitTrain)
write.csv(trainTable, "Results/trainTable.csv", row.names = FALSE)
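The `RMSE` helper referenced in the comments is not shown in the post; a minimal version (my sketch, not the asker's actual implementation) could look like this:

```r
# Sketch of the RMSE helper mentioned in the comments above;
# the asker's actual implementation is not shown in the post.
RMSE <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# quick check on a toy vector pair: errors are (0, 0, 2),
# so RMSE is sqrt(4/3)
RMSE(c(1, 2, 3), c(1, 2, 5))
```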
Here I want to know whether I am using the correct arguments when calculating the errors. Should I compute the error between the fitted values and the targets from the test data set, or should I use the targets from the training data set here?
Once I have found good parameters and move on to validation, the implementation becomes shorter because I reuse those parameters. I create a new model for the tests:
#gbm testing model
gbmTesting <- gbm(test$target ~ ., data = test, distribution = "gaussian", n.trees = 500,
n.cores=2, verbose = FALSE,
shrinkage = .03,
interaction.depth = 42,
n.minobsinnode = 15)
#make test prediction
fitTest <- predict(gbmTesting, newdata = test, n.trees = 500, type = "response")
#test error
errorTest <- RMSE(fitTest, test$target)
errorTest <- round(errorTest, digits = 1)
errorDiffTest <- fitTest - test$target
print("meanTestError(predicted, actual):")
print(mean(errorDiffTest))
print("summary testing")
print(head(data.frame("Actual" = test$target, "Predicted" = fitTest)))
print(summary(data.frame("Actual" = test$target, "Predicted" = fitTest)))
testTable <- data.frame("Actual" = test$target, "Predicted" = fitTest)
write.csv(testTable, "Results/testTable.csv", row.names = FALSE)
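A side note on the error summaries above: the mean of the residuals measures bias and can be close to zero even when individual errors are large, so it is worth reporting alongside RMSE rather than instead of it. A small illustration with toy numbers of my own:

```r
# Toy illustration: mean residual vs. RMSE.
predicted <- c(10, 30)
actual    <- c(20, 20)
residuals <- predicted - actual      # -10 and +10

mean(residuals)                      # 0: the errors cancel out
sqrt(mean(residuals^2))              # 10: RMSE reveals the true magnitude
```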
Is this correct, or am I falling into a mental tripping hazard?
P.S. The Stack Exchange community is so huge that I am not sure whether this question belongs on one of the other sites. If yes, which one?
Topic: gbm, cross-validation, random-forest, r
Category: Data Science