First, you are not dumb. We are all learning here.
Second, I have looked at your code and I can see the problem more clearly now.
The code below works, but it doesn't give anything useful (nor does it lead to any speedup). I'll try to explain why.
library(iterators)
library(foreach)
library(doParallel)
library(mfe)
data = iris
# foreach
split = detectCores()
eachStart = 25
# set up iterators
iters = iter(rep(eachStart, split))
# set up cluster and register the parallel backend
cl = makeCluster(split)
registerDoParallel(cl)
result = foreach(nstart=iters, .packages = c("iterators", "mfe")) %dopar% {
  metafeatures(Species ~ ., iris, groups=c("general", "statistical", "model.based"))
}
stopCluster(cl)
This will return a list whose length equals the number of cores on your machine, but every element of that list will be exactly the same. You are getting no speedup (in fact, your code will run slower this way) because all you are doing is running the exact same piece of code n = (number of cores on your machine) times, namely this:
metafeatures(Species ~ ., iris, groups=c("general", "statistical", "model.based"))
You aren't iterating over anything in your foreach loop, so there is nothing to gain from running it in parallel. The whole idea of running things in parallel (in this context, anyway) is to run the exact same piece of code (i.e. a function) on each worker, but with different inputs to the function's arguments in each call. In the context of a for loop, you can think of a single iteration of the loop being sent to a single worker (one of the split workers you created) on your computer. The benefit, of course, is that if you have many cores, different iterations of your for loop can be computed at the exact same time, whereas on a single-core machine you would have to work through the iterations one at a time.

It is important to note that this all implies that each iteration of your for loop is independent of all the other iterations and that the order in which you compute the iterations does not matter (i.e., your algorithm is NOT sequential).
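For contrast, here is a sketch of what a useful version of your iterator could look like. It splits a k-means run with many random starts across the workers: each worker actually uses the nstart value handed to it by the iterator, and the partial fits are combined by keeping the best one. The k-means call, centers = 3 and the inline combine function are just for illustration (I suspect the eachStart/nstart names in your code came from an example like this in the foreach/doParallel documentation), not something specific to your problem.
library(iterators)
library(foreach)
library(doParallel)

split = detectCores()
eachStart = 25
iters = iter(rep(eachStart, split))

cl = makeCluster(split)
registerDoParallel(cl)

# each worker does eachStart random starts of k-means; keep the fit with the lowest total within-cluster sum of squares
best = foreach(nstart = iters,
               .combine = function(a, b) if (a$tot.withinss < b$tot.withinss) a else b) %dopar% {
  kmeans(iris[, 1:4], centers = 3, nstart = nstart)
}
stopCluster(cl)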
Here is one example that I can think of where you could consider running in parallel: good old k-fold cross-validation. Each resulting training set receives the exact same pre-processing; all you are changing is which folds make up the training and test sets.
This could be written like this:
library(foreach)
library(doParallel)
library(caret)
library(e1071)
library(tidyverse)

data = iris

cl = makeCluster(detectCores())
registerDoParallel(cl)

#Generate three folds from our dataset
folds = createFolds(y = data$Species, k = 3)

#Train a support vector machine in parallel using foreach. Each iteration returns the predictions from the fitted model as a tibble.
results = foreach(i = 1:length(folds), .packages = c("caret", "e1071", "tidyverse")) %dopar% {
  training = data[-folds[[i]], ]
  test = data[folds[[i]],]

  #Pre-processing, if desired

  #Train the support vector machine. Parameters are just for example, ripped from a webpage that just so happened to be using the exact same dataset.
  model = svm(Species ~ ., data = training,
              type = "C-classification", kernel = "radial",
              gamma = 0.1, cost = 10)

  tibble(Row = folds[[i]], Predicted.Class = predict(model, newdata = test))
}

stopCluster(cl)

#Bind all the data frames in the list "results" and restore the original row order
processedResults = bind_rows(results) %>%
  arrange(Row)
Notice how we are iterating over different training/test splits (with these lines):
training = data[-folds[[i]], ]
test = data[folds[[i]],]
but the function itself remains the same. Each worker receives a different train/test split in each iteration of the loop. In your code, you aren't changing the arguments to your function between iterations, so there is no point in running it in parallel.
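As an aside (not part of your question), once the parallel loop has finished you can score the out-of-fold predictions against the true labels back on the master process, for example:
#Overall out-of-fold accuracy; Row holds the original row indices, so we can look up the true classes
processedResults %>%
  mutate(True.Class = data$Species[Row]) %>%
  summarise(Accuracy = mean(Predicted.Class == True.Class))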
Now, if I understand mfe correctly, it is generating a set of summary statistics (meta-features) from your dataset. Maybe some of those statistics could be computed in parallel, but I highly doubt you would gain much speed because of the overhead associated with running things in parallel (and hence, why the package author presumably did not implement such a feature).
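If you still wanted to experiment, the most natural unit to hand out to workers is probably the groups argument itself: one group of meta-features per worker, combined into a list afterwards. This is only a sketch, it assumes metafeatures() is happy to receive a single group name (it already accepts a vector of them), and on a dataset the size of iris I would expect the parallel overhead to wipe out any gain.
library(foreach)
library(doParallel)
library(mfe)

groups = c("general", "statistical", "model.based")

cl = makeCluster(length(groups))
registerDoParallel(cl)

# one group of meta-features per worker
byGroup = foreach(g = groups, .packages = "mfe") %dopar% {
  metafeatures(Species ~ ., iris, groups = g)
}
names(byGroup) = groups

stopCluster(cl)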
You mentioned random forest and how you were able to run it in parallel easily. This is because a random forest is easy to parallelize (it is "embarrassingly parallel"): each individual tree can be fit on its own worker, because each tree is fit independently of all the other trees in the forest, on its own bootstrapped training set. The order in which the trees are fit doesn't matter either, because we just aggregate all of the trees' predictions (by voting or averaging) to get the final prediction.
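That "fit the trees on different workers, then combine the sub-forests" idea is exactly what the usual foreach recipe for the randomForest package does. A sketch, assuming 4 workers and 500 trees in total:
library(foreach)
library(doParallel)
library(randomForest)

cl = makeCluster(4)
registerDoParallel(cl)

# grow 125 trees on each of the 4 workers, then merge the sub-forests into a single model with randomForest::combine
rf = foreach(ntree = rep(125, 4), .combine = randomForest::combine,
             .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}
stopCluster(cl)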
My two cents: running in parallel is only useful if the algorithm itself can be run in parallel. You can't take advantage of all of your cores if the algorithm cannot be "divided" and then "combined". And even when it can, if the algorithm is trivial to run on a single thread, the computational overhead associated with running things in parallel generally makes the parallel version worse.