
Support all LightGBM parallel tree learners in Dask #3834

Closed
StrikerRUS opened this issue Jan 24, 2021 · 13 comments

Comments

@StrikerRUS
Collaborator

According to this piece of code

allowed_tree_learners = {
    'data',
    'data_parallel',
    'feature',
    'feature_parallel',
    'voting',
    'voting_parallel'
}
if tree_learner is None:
    logger.warning('Parameter tree_learner not set. Using "data" as default')
    params['tree_learner'] = 'data'
elif tree_learner.lower() not in allowed_tree_learners:
    logger.warning('Parameter tree_learner set to %s, which is not allowed. Using "data" as default' % tree_learner)
    params['tree_learner'] = 'data'

the Dask module supports all available tree learner types.
Refer to https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#choose-appropriate-parallel-algorithm for more details about the differences among tree_learner parameter values.

However, our CI tests are run only with the data-parallel tree learner type:

tree_learner='data',

I believe that parametrizing all tests over the different tree learners will improve the tests and give us more confidence in the quality of the Dask module.

An example of how the tests can be parametrized:

data_output = ['array', 'scipy_csr_matrix', 'dataframe']

@pytest.mark.parametrize('output', data_output)
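For example, a tree_learner parametrization could be stacked on top of the existing output parametrization. A minimal sketch, with the test name and fixtures simplified; the real test signatures in test_dask.py also take things like the Dask client:

import pytest

data_output = ['array', 'scipy_csr_matrix', 'dataframe']
# hypothetical list mirroring allowed_tree_learners in lightgbm.dask
tree_learners = ['data', 'data_parallel', 'feature', 'feature_parallel', 'voting', 'voting_parallel']

@pytest.mark.parametrize('output', data_output)
@pytest.mark.parametrize('tree_learner', tree_learners)
def test_classifier(output, tree_learner):
    # build Dask collections for the requested output type, then fit
    # DaskLGBMClassifier(tree_learner=tree_learner, ...) and check predictions
    ...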

jameslamb changed the title from "Run tests for different parallel tree learners in Dask tests" to "Support all LightGBM parallel tree learners in Dask" on Jan 24, 2021
@jameslamb
Collaborator

@StrikerRUS I disagree with calling this a "good first issue" and believe it should be a feature request. Using a different tree learner isn't as simple as just changing the tree_learner parameter. You have to be sure that the way your data (either a Dask DataFrame or a Dask Array) is partitioned matches the method you choose.

For a dataset with n observations and k columns:

  • feature parallel = partitions have shape (n, k_i). All rows, but a subset of features.
  • data parallel = partitions have shape (n_i, k). All features, but a subset of rows.

Implementing this will require some non-trivial changes to the tests and the Dask module.
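
To make the partitioning difference concrete, here is a minimal sketch with dask.array (the chunk sizes are arbitrary and chosen only for illustration):

import dask.array as da
import numpy as np

n, k = 1_000, 20
X = np.random.rand(n, k)

# data parallel: each partition has shape (n_i, k), i.e. all features but a subset of rows
X_data_parallel = da.from_array(X, chunks=(250, k))

# feature parallel: each partition has shape (n, k_i), i.e. all rows but a subset of features
X_feature_parallel = da.from_array(X, chunks=(n, 5))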

@jameslamb
Collaborator

jameslamb commented Jan 24, 2021

I'm going to close this and add it to #2302. Thank you very much for writing this up, and for all of the other Dask issues you've written up! Getting all of this documented in the backlog takes a lot of work, but it's also critical for getting other contributors to come help.

Anyone who sees this is welcome to pick up this feature! Leave a comment and we can re-open the issue.

@StrikerRUS
Collaborator Author

Implementing this will require some non-trivial changes to the tests and the Dask module.

Huh, I suspected this but was too shy to say it out loud! 🙂

Given that only the data tree learner is actually supported right now, I strongly believe we should make this explicit and raise errors for these values:

'feature',
'feature_parallel',
'voting',
'voting_parallel'
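
A minimal sketch of what such an explicit check could look like (the exact message, and where it would live in lightgbm.dask, are only assumptions):

# stand-in for the params dict built inside lightgbm.dask
params = {'tree_learner': 'voting'}

unsupported_tree_learners = {'feature', 'feature_parallel', 'voting', 'voting_parallel'}

if params['tree_learner'].lower() in unsupported_tree_learners:
    raise ValueError(
        "tree_learner=%s is not currently supported in lightgbm.dask, use 'data' instead"
        % params['tree_learner']
    )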

@jameslamb
Collaborator

I strongly believe we should make it explicit and raise errors for these values

I've proposed a PR: #3848

@StrikerRUS
Collaborator Author

It seems that feature_parallel cannot be supported:

LightGBM/src/c_api.cpp

Lines 130 to 131 in 7880b79

if (config_.tree_learner == std::string("feature")) {
  Log::Fatal("Do not support feature parallel in c api");

@jameslamb
Collaborator

Oh wow!

For context, it looks like that was added in #986.

So does this mean feature-parallel learning is only supported in the CLI?

@StrikerRUS
Collaborator Author

So does this mean feature-parallel learning is only supported in the CLI?

I think so.

@jameslamb
Collaborator

OK, I can make a separate pull request to update the parameter documentation for that, then.

@jameslamb
Collaborator

Adding some information for anyone looking to add voting_parallel support.

You can see a description at https://lightgbm.readthedocs.io/en/latest/Features.html#optimization-in-network-communication, and many more details in https://proceedings.neurips.cc/paper/2016/file/10a5ab2db37feedfdeaab192ead4ac0e-Paper.pdf if you're curious.

From the paper:

Data-parallel: Training data are horizontally partitioned according to the samples and allocated to
different machines. Then the machines communicate with each other the local histograms of all
attributes (according to their own data samples) in order to obtain the global attribute distributions and
identify the best attribute and split point [12] [14]

In this paper, we proposed a new data-parallel algorithm for decision tree, called Parallel Voting
Decision Tree (PV-Tree), which can achieve much better balance between communication efficiency
and accuracy. The key difference between conventional data-parallel decision tree algorithm and
PV-Tree lies in that the former only trusts the globally aggregated histogram information, while the
latter leverages the local statistical information contained in each machine through a two-stage voting
process, thus can significantly reduce the communication cost. Specifically, PV-Tree contains the
following steps in each iteration. 1) Local voting. On each machine, we select the top-k attributes
based on its local data according to the informativeness scores (e.g., risk reduction for regression,
and information gain for classification). 2) Global voting. We determine global top-2k attributes
by a majority voting among the local candidates selected in the previous step. That is, we rank the
attributes according to the number of local machines who select them, and choose the top 2k attributes
from the ranked list. 3) Best attribute identification. We collect the full-grained histograms of the
globally top-2k attributes from local machines in order to compute their global distributions.

In theory, adding voting_parallel to lightgbm.dask might be as simple as adding it to the unit tests (basically pytest.mark.parametrize("tree_learner", distributed_training_algorithms) on test_classifier, test_regressor, and test_ranker in https://github.com/microsoft/LightGBM/blob/4580393f604d825c318c053891f2870e6a40347f/tests/python_package_test/test_dask.py) and then altering this check:

if params['tree_learner'] not in {'data', 'data_parallel'}:

Voting parallel uses horizontally-partitioned data just like data_parallel, so none of the data-distribution logic in lightgbm.dask should need to change.
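
A minimal sketch of what the relaxed check could look like (simplified; params here is just a stand-in for the parameters dict assembled inside lightgbm.dask):

# stand-in for the params dict built inside lightgbm.dask
params = {'tree_learner': 'voting'}

# voting parallel also trains on horizontally-partitioned data, so its aliases can be
# accepted alongside the data-parallel ones; anything else still falls back to 'data'
if params['tree_learner'] not in {'data', 'data_parallel', 'voting', 'voting_parallel'}:
    params['tree_learner'] = 'data'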

@jmoralez
Collaborator

So does this mean feature-parallel learning is only supported in the CLI?

I think so.

Should I remove this?

'feature',
'feature_parallel',

jameslamb reopened this on Mar 20, 2021
@jameslamb
Collaborator

jameslamb commented Mar 20, 2021

No, I'd prefer to leave that and rely on LightGBM to throw the error mentioned above.

If you try to do feature parallel and your data are vertically partitioned, it wouldn't make sense for LightGBM to try to perform data parallel learning (which requires horizontally partitioned data).

@jmoralez
Collaborator

Adding voting and voting_parallel here:

if params['tree_learner'] not in {'data', 'data_parallel'}:

just leaves feature_parallel to fall into the if and trigger this warning first (UserWarning: Support for tree_learner feature_parallel in lightgbm.dask is experimental and may break in a future release.) and then the error. Maybe we should remove the if and go straight to the error? The warning kind of suggests that there is support for feature_parallel in the Dask API, but there isn't.

@jameslamb
Collaborator

Oh, good point! OK, I would support removing that warning entirely at the same time you add support for voting parallel.
