
Support all LightGBM parallel tree learners in Dask #3834

Closed
StrikerRUS opened this issue Jan 24, 2021 · 13 comments

Comments

@StrikerRUS
Collaborator

According to this piece of code

allowed_tree_learners = {
    'data',
    'data_parallel',
    'feature',
    'feature_parallel',
    'voting',
    'voting_parallel'
}
if tree_learner is None:
    logger.warning('Parameter tree_learner not set. Using "data" as default')
    params['tree_learner'] = 'data'
elif tree_learner.lower() not in allowed_tree_learners:
    logger.warning('Parameter tree_learner set to %s, which is not allowed. Using "data" as default' % tree_learner)
    params['tree_learner'] = 'data'

the Dask module supports all available tree learner types.
Refer to https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#choose-appropriate-parallel-algorithm for more details about the differences among tree_learner parameter values.

However, our CI tests are run only with the data-parallel tree learner type:

tree_learner='data',

I believe that parametrizing all tests over the different tree learners will improve the tests and give us more confidence in the quality of the Dask module.

An example of how the tests can be parametrized:

data_output = ['array', 'scipy_csr_matrix', 'dataframe']

@pytest.mark.parametrize('output', data_output)
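For example, a tree_learner parametrization could be stacked on top of the existing output parametrization. A minimal sketch, with the test name and fixtures simplified; the real test signatures in test_dask.py also take things like the Dask client:

import pytest

data_output = ['array', 'scipy_csr_matrix', 'dataframe']
# hypothetical list mirroring allowed_tree_learners in lightgbm.dask
tree_learners = ['data', 'data_parallel', 'feature', 'feature_parallel', 'voting', 'voting_parallel']

@pytest.mark.parametrize('output', data_output)
@pytest.mark.parametrize('tree_learner', tree_learners)
def test_classifier(output, tree_learner):
    # build Dask collections for the requested output type, then fit
    # DaskLGBMClassifier(tree_learner=tree_learner, ...) and check predictions
    ...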

jameslamb changed the title from "Run tests for different parallel tree learners in Dask tests" to "Support all LightGBM parallel tree learners in Dask" on Jan 24, 2021
@jameslamb
Collaborator

@StrikerRUS I disagree with calling this a "good first issue" and believe it should be a feature request. Using a different tree learner isn't as simple as just changing the tree_learner parameter. You have to be sure that the way your data (either a Dask DataFrame or a Dask Array) is partitioned matches the method you choose.

For a dataset with n observations and k columns:

  • feature parallel = partitions have shape (n, k_i). All rows, but a subset of features.
  • data parallel = partitions have shape (n_i, k). All features, but a subset of rows.

Implementing this will require some non-trivial changes to the tests and the Dask module.
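
To make the partitioning difference concrete, here is a minimal sketch with dask.array (the chunk sizes are arbitrary and chosen only for illustration):

import dask.array as da
import numpy as np

n, k = 1_000, 20
X = np.random.rand(n, k)

# data parallel: each partition has shape (n_i, k), i.e. all features but a subset of rows
X_data_parallel = da.from_array(X, chunks=(250, k))

# feature parallel: each partition has shape (n, k_i), i.e. all rows but a subset of features
X_feature_parallel = da.from_array(X, chunks=(n, 5))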

@jameslamb
Collaborator

jameslamb commented Jan 24, 2021

I'm going to close this and add it to #2302. Thank you very much for writing this up, and for all of the other Dask issues you've written up! Getting all of this documented in the backlog takes a lot of work, but it's also critical for getting other contributors to come help.

Anyone who sees this is welcome to pick up this feature! Leave a comment and we can re-open the issue.

@StrikerRUS
Collaborator Author

Implementing this will require some non-trivial changes to the tests and the Dask module.

Huh, I suspected this but was too shy to say it out loud! 🙂

Given that only the data tree learner is actually supported right now, I strongly believe we should make this explicit and raise errors for these values:

'feature',
'feature_parallel',
'voting',
'voting_parallel'
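
A minimal sketch of what such an explicit check could look like (the exact message, and where it would live in lightgbm.dask, are only assumptions):

# stand-in for the params dict built inside lightgbm.dask
params = {'tree_learner': 'voting'}

unsupported_tree_learners = {'feature', 'feature_parallel', 'voting', 'voting_parallel'}

if params['tree_learner'].lower() in unsupported_tree_learners:
    raise ValueError(
        "tree_learner=%s is not currently supported in lightgbm.dask, use 'data' instead"
        % params['tree_learner']
    )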

@jameslamb
Collaborator

I strongly believe we should make it explicit and raise errors for these values

I've proposed a PR: #3848

@StrikerRUS
Collaborator Author

It seems that feature_parallel cannot be supported:

LightGBM/src/c_api.cpp

Lines 130 to 131 in 7880b79

if (config_.tree_learner == std::string("feature")) {
  Log::Fatal("Do not support feature parallel in c api");

@jameslamb
Collaborator

Oh wow!

For context, it looks like that was added in #986.

So does this mean feature-parallel learning is only supported in the CLI?

@StrikerRUS
Collaborator Author

So does this mean feature-parallel learning is only supported in the CLI?

I think so.

@jameslamb
Collaborator

OK, I can make a separate pull request to update the parameter documentation for that, then.

@jameslamb
Collaborator

Adding some information for anyone looking to add voting_parallel support.

You can see a description at https://lightgbm.readthedocs.io/en/latest/Features.html#optimization-in-network-communication, and many more details in https://proceedings.neurips.cc/paper/2016/file/10a5ab2db37feedfdeaab192ead4ac0e-Paper.pdf if you're curious.

From the paper:

Data-parallel: Training data are horizontally partitioned according to the samples and allocated to
different machines. Then the machines communicate with each other the local histograms of all
attributes (according to their own data samples) in order to obtain the global attribute distributions and
identify the best attribute and split point [12] [14]

In this paper, we proposed a new data-parallel algorithm for decision tree, called Parallel Voting
Decision Tree (PV-Tree), which can achieve much better balance between communication efficiency
and accuracy. The key difference between conventional data-parallel decision tree algorithm and
PV-Tree lies in that the former only trusts the globally aggregated histogram information, while the
latter leverages the local statistical information contained in each machine through a two-stage voting
process, thus can significantly reduce the communication cost. Specifically, PV-Tree contains the
following steps in each iteration. 1) Local voting. On each machine, we select the top-k attributes
based on its local data according to the informativeness scores (e.g., risk reduction for regression,
and information gain for classification). 2) Global voting. We determine global top-2k attributes
by a majority voting among the local candidates selected in the previous step. That is, we rank the
attributes according to the number of local machines who select them, and choose the top 2k attributes
from the ranked list. 3) Best attribute identification. We collect the full-grained histograms of the
globally top-2k attributes from local machines in order to compute their global distributions.

In theory, adding voting_parallel to lightgbm.dask might be as simple as adding it to the unit tests (basically pytest.mark.parametrize("tree_learner", distributed_training_algorithms) on test_classifier, test_regressor, and test_ranker in https://github.com/microsoft/LightGBM/blob/4580393f604d825c318c053891f2870e6a40347f/tests/python_package_test/test_dask.py) and then altering this check:

if params['tree_learner'] not in {'data', 'data_parallel'}:

Voting parallel uses horizontally-partitioned data just like data_parallel, so none of the data-distribution logic in lightgbm.dask should need to change.
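
A minimal sketch of what the relaxed check could look like (simplified; params here is just a stand-in for the parameters dict assembled inside lightgbm.dask):

# stand-in for the params dict built inside lightgbm.dask
params = {'tree_learner': 'voting'}

# voting parallel also trains on horizontally-partitioned data, so its aliases can be
# accepted alongside the data-parallel ones; anything else still falls back to 'data'
if params['tree_learner'] not in {'data', 'data_parallel', 'voting', 'voting_parallel'}:
    params['tree_learner'] = 'data'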

@jmoralez
Collaborator

So does this mean feature-parallel learning is only supported in the CLI?

I think so.

Should I remove this?

'feature',
'feature_parallel',

jameslamb reopened this on Mar 20, 2021
@jameslamb
Collaborator

jameslamb commented Mar 20, 2021

No, I'd prefer to leave that and rely on LightGBM to throw the error mentioned above.

If you try to do feature parallel and your data are vertically partitioned, it wouldn't make sense for LightGBM to try to perform data parallel learning (which requires horizontally partitioned data).

@jmoralez
Collaborator

Adding voting and voting_parallel here:

if params['tree_learner'] not in {'data', 'data_parallel'}:

just leaves feature_parallel to fall into the if and trigger this warning first (UserWarning: Support for tree_learner feature_parallel in lightgbm.dask is experimental and may break in a future release.) and then the error. Maybe we should remove the if and go straight to the error? The warning kind of suggests that there is support for feature_parallel in the Dask API, but there isn't.

@jameslamb
Collaborator

Oh, good point! OK, I would support removing that warning entirely at the same time you add support for voting parallel.
