
Support forced splits with data and voting parallel versions of LightGBM #4260

Closed
imatiach-msft opened this issue May 5, 2021 · 3 comments

Comments

@imatiach-msft
Contributor

Summary

I'm unable to add forced splits to the data and voting parallel versions of LightGBM in mmlspark; I see the error:

https://github.com/microsoft/LightGBM/blob/master/src/io/config.cpp#L318

Don't support forcedsplits in data tree learner
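For context, the check being hit is a parameter-conflict guard. A paraphrased sketch (not the exact source; the surrounding variable names are simplified) of what it does, based on the linked line and the error text:

// Paraphrased sketch of the guard near config.cpp#L318: combining a
// forced-splits file with the data or voting parallel tree learner is
// rejected with a fatal error before training starts.
if ((tree_learner == std::string("data") || tree_learner == std::string("voting")) &&
    !forcedsplits_filename.empty()) {
  Log::Fatal("Don't support forcedsplits in %s tree learner", tree_learner.c_str());
}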

Motivation

I would like to add this feature, but I'm not sure why I can't simply remove that check in the config and enable it. What is special about the data and voting parallel learners that would prevent this config from being specified on each node?

Description

The fix would be to remove that thrown exception. Also, it would be great if we could specify the forced splits directly as a string instead of via a file:

https://github.com/microsoft/LightGBM/blob/master/src/boosting/gbdt.cpp#L769
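For reference, the forced-splits file that this code loads is a JSON tree in which each node carries "feature" and "threshold" fields plus optional nested "left"/"right" sub-splits (as documented for forcedsplits_filename). A minimal sketch of the string-based idea, where forced_splits_json is a hypothetical in-memory value rather than an existing LightGBM parameter:

// Sketch only: today this JSON has to be written to a file and referenced via
// forcedsplits_filename; the request is to accept the same content in memory.
const char* forced_splits_json = R"({
  "feature": 0,
  "threshold": 0.5,
  "right": {
    "feature": 2,
    "threshold": 10.0
  }
})";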

@imatiach-msft
Contributor Author

imatiach-msft commented May 5, 2021

It looks like the ForceSplits function is also only implemented in the serial tree learner:

https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L450

It seems I would need to do something similar in the data parallel tree learner:

https://github.com/microsoft/LightGBM/blob/master/src/treelearner/data_parallel_tree_learner.cpp

and in the voting parallel learner as well.

@shiyu1994
Collaborator

For data and voting distributed training, we need to synchronize the histograms before GatherInfoForThreshold is called. Otherwise each machine would get a different leaf value and leaf gain, because its incomplete feature histograms sum over only partial data.

leaf_histogram_array[left_inner_feature_index].GatherInfoForThreshold(
    left_leaf_splits->sum_gradients(),
    left_leaf_splits->sum_hessians(),
    left_threshold,
    left_leaf_splits->num_data_in_leaf(),
    left_leaf_splits->weight(),
    &left_split);

So the logic for implementing ForceSplits in data and voting distributed training is not as straightforward as in feature distributed training.
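A rough sketch of the point above, in the spirit of the serial learner's ForceSplits: the types (FeatureHistogram, LeafSplits, SplitInfo) are LightGBM's, but SyncUpLeafHistogram is a hypothetical placeholder for the missing step, an allreduce (sum) of the leaf's histogram bins across machines; only after that reduction does GatherInfoForThreshold give every worker the same leaf value and gain.

// Sketch, not LightGBM's actual implementation.
void ForceSplitDistributed(FeatureHistogram* leaf_histogram_array,
                           int inner_feature_index, uint32_t threshold,
                           const LeafSplits* leaf_splits, SplitInfo* out_split) {
  // Hypothetical step: sum this leaf's gradient/hessian bins over the network
  // so the histogram covers the whole dataset, not just the local partition.
  SyncUpLeafHistogram(&leaf_histogram_array[inner_feature_index]);
  // With a global histogram, the same call the serial learner makes now
  // produces identical split info on every machine.
  leaf_histogram_array[inner_feature_index].GatherInfoForThreshold(
      leaf_splits->sum_gradients(), leaf_splits->sum_hessians(), threshold,
      leaf_splits->num_data_in_leaf(), leaf_splits->weight(), out_split);
}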

@StrikerRUS changed the title from "forced splits with data and voting parallel versions of lightgbm" to "Support forced splits with data and voting parallel versions of LightGBM" on Jun 9, 2021
@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
