Understanding the feature_parallel distributed learning algorithm in LightGBMClassifier

I want to understand the feature_parallel algorithm in LightGBMClassifier. The LightGBM documentation describes how it is done traditionally and how LightGBM aims to improve on it.

The two approaches are as follows (verbatim from the linked documentation):

Traditional Feature_parallel:

Feature parallel aims to parallelize the “Find Best Split” in the decision tree. The procedure of traditional feature parallel is:

  1. Partition data vertically (different machines have different feature set).
  2. Workers find local best split point {feature, threshold} on local feature set.
  3. Communicate local best splits with each other and get the best one.
  4. Worker with best split to perform split, then send the split result of data to other workers.
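
To check my reading of those four steps, here is a toy sketch of how I picture them (pure numpy, the "workers" are just entries in a Python list, and variance reduction stands in for the real gain formula; this is not LightGBM code):

```python
# Toy simulation of traditional feature parallel. The three "workers" are
# simulated by loop iterations and "communication" is plain variable passing.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 6))                    # full dataset: 1000 rows, 6 features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(float)    # toy binary target

def split_gain(y, left_mask):
    """Variance reduction of a split (stand-in for the real gradient-based gain)."""
    n, n_left = len(y), int(left_mask.sum())
    if n_left == 0 or n_left == n:
        return -np.inf
    weighted_child = (n_left * y[left_mask].var() + (n - n_left) * y[~left_mask].var()) / n
    return y.var() - weighted_child

def local_best_split(X_local, y, feature_ids):
    """Step 2: a worker scans only the feature columns it owns."""
    best = (-np.inf, None, None)                   # (gain, feature id, threshold)
    for fid, col in zip(feature_ids, X_local.T):
        for thr in np.quantile(col, np.linspace(0.1, 0.9, 9)):
            gain = split_gain(y, col <= thr)
            if gain > best[0]:
                best = (gain, fid, thr)
    return best

# Step 1: vertical partition, worker k holds only the columns in owned[k].
owned = [[0, 1], [2, 3], [4, 5]]
local_bests = [local_best_split(X[:, cols], y, cols) for cols in owned]

# Step 3: workers exchange their local bests and keep the global winner.
best_gain, best_feature, best_threshold = max(local_bests)

# Step 4: only the winning worker holds that column, so it must compute the
# row partition and send it (about one bit per row) to every other worker.
left_mask = X[:, best_feature] <= best_threshold
print(f"global best: feature {best_feature} <= {best_threshold:.3f} "
      f"(gain {best_gain:.4f}), {left_mask.sum()} rows go left")
```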

Feature_parallel in LightGBM (my questions are in parentheses below):

Since feature parallel cannot speed up well when #data is large, we make a little change: instead of partitioning data vertically, every worker holds the full data. Thus, LightGBM doesn’t need to communicate for the split result of data since every worker knows how to split data.

The procedure of feature parallel in LightGBM:

  1. Workers find local best split point {feature, threshold} on the local feature set. (Why call it local? Since it is found on the entire data, isn't this the global best split?)

  2. Communicate local best splits with each other and get the best one. (But earlier it was said that LightGBM doesn’t need to communicate for the split result of data, since every worker knows how to split data.)

  3. Perform best split. (Did we not already have the best split?)

Moreover, if each worker retains the entire dataset, then it is capable of building the entire estimator by itself. What is actually being distributed here?

Tags: gradient-boosting-decision-trees, lightgbm, decision-trees

Category: Data Science


So if you look at how a decision tree works, it considers all the variables and evaluates candidate splits on each of them to find the best split. Finding the best split requires iterating over every variable (assuming the column- and row-sampling fractions are 1), which is a time-consuming process.

In LightGBM's feature-parallel mode, finding the best split for each variable and calculating each variable's gain is parallelised across workers. As each worker has the complete data, they don't communicate during this search; only once every feature's gain has been calculated do they exchange results. The data is then split on whichever feature gives the best gain, and the same process runs again for the next node.
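
Here is a rough sketch of that idea (plain numpy, with the workers simulated by a loop and variance reduction standing in for the real gradient-based gain; this is not LightGBM's actual implementation, but it shows why the search is still called "local", namely that each worker only scans its assigned feature subset, and why one small exchange of {gain, feature, threshold} is still needed):

```python
# Toy sketch of LightGBM-style feature parallel: every "worker" sees the full
# matrix X, but is responsible for scanning only its assigned feature subset.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 6))
y = (X[:, 1] - X[:, 4] > 0).astype(float)

def best_split_on(feature_ids, X, y):
    """One worker's search: all rows, but only the features it is assigned."""
    best = (-np.inf, None, None)                   # (gain, feature id, threshold)
    for fid in feature_ids:
        col = X[:, fid]
        for thr in np.quantile(col, np.linspace(0.1, 0.9, 9)):
            left = col <= thr
            n_left = int(left.sum())
            if n_left == 0 or n_left == len(y):
                continue
            gain = y.var() - (n_left * y[left].var() + (len(y) - n_left) * y[~left].var()) / len(y)
            if gain > best[0]:
                best = (gain, fid, thr)
    return best

assigned = [[0, 1], [2, 3], [4, 5]]                # worker k scans only these features

# "local" best = best over the worker's feature subset (computed on ALL rows)
local_bests = [best_split_on(feats, X, y) for feats in assigned]

# the only communication: exchange three numbers per worker, keep the winner
best_gain, best_feature, best_threshold = max(local_bests)

# every worker already holds the full data, so each applies the winning split
# itself; no O(#data) broadcast of a row partition as in the traditional scheme
left_mask = X[:, best_feature] <= best_threshold
print(f"winner: feature {best_feature} <= {best_threshold:.3f} (gain {best_gain:.4f})")
```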

However, this feature parallel algorithm still suffers from computation overhead for “split” when #data is large. So it will be better to use data parallel when #data is large.

That's how LightGBM performs much faster than standard GBM in most cases.

By contrast, in the traditional approach:

In the traditional approach, the worker that found the best split performs the split and then sends the resulting data partition to the other workers. The other workers then split their data according to the received result.

The shortcomings of traditional feature parallel:

It has computation overhead, since it cannot speed up the “split” step, whose time complexity is O(#data). Thus, feature parallel cannot speed up well when #data is large.

It needs communication of the split result, which costs about O(#data / 8) (one bit per data point).
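
For example, with 80 million rows that split bitmap is roughly 10 MB (80,000,000 / 8 bytes), and it has to be sent to the other workers for every single split of every tree.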

Also, read about data parallel for when you have big data.
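
If you want to actually turn this on, the parameter that selects the distributed algorithm is tree_learner. Below is a minimal sketch using the scikit-learn wrapper; the machine addresses, port and data are placeholders, and in a real setup every worker runs the same script while holding the same rows (which is what feature parallel assumes):

```python
# Sketch of enabling feature-parallel training via LGBMClassifier.
# The IPs and port below are placeholders: training only proceeds when all
# machines listed in `machines` are reachable and running the same script.
import numpy as np
from lightgbm import LGBMClassifier

X = np.random.rand(10_000, 20)
y = (X[:, 0] > 0.5).astype(int)

clf = LGBMClassifier(
    n_estimators=100,
    tree_learner="feature",        # "serial" (default), "feature", "data", "voting"
    num_machines=2,                # total number of workers in the cluster
    local_listen_port=12400,       # port this worker listens on
    machines="10.0.0.1:12400,10.0.0.2:12400",  # placeholder worker addresses
)
clf.fit(X, y)
```

With tree_learner="data" the rows are partitioned across workers instead, which is usually the better choice when #data is large.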
