
Add random uniform sentinels to avoid overfitting #4622

Closed

Shulito opened this issue Sep 23, 2021 · 4 comments


Shulito commented Sep 23, 2021

Preface

Sorry if this feature has already been added to the framework. I looked everywhere and it doesn't seem so, but it's hard to search for because "random + tree" leads to random forest 99.99% of the time.

Summary

Instead of using the traditional hyperparameters to control overfitting (like max_depth), add random uniform feature variables that act as sentinels to check whether splitting a node is going to lead to an overfitted tree.

Motivation and Description

Create N random (and therefore uncorrelated) uniform feature variables between 0 and 1 and add them to the dataset. If, while constructing one of the trees, one of these sentinel features is selected as the best feature to split a node on, ahead of the real features of the dataset, that node shouldn't be split, because the model found a spurious correlation that is better than any split on the real features. If this happens at the root, stop creating trees.
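A minimal sketch of how such sentinel features could be generated and appended to an existing dataset (the number of sentinels, the column names, and the use of pandas are purely illustrative; this is not an existing LightGBM feature):

```python
import numpy as np
import pandas as pd

def add_uniform_sentinels(X: pd.DataFrame, n_sentinels: int = 5, seed: int = 42) -> pd.DataFrame:
    """Append N uniform(0, 1) sentinel columns, uncorrelated with every real feature."""
    rng = np.random.default_rng(seed)
    sentinels = pd.DataFrame(
        rng.uniform(0.0, 1.0, size=(len(X), n_sentinels)),
        columns=[f"sentinel_{i}" for i in range(n_sentinels)],
        index=X.index,
    )
    return pd.concat([X, sentinels], axis=1)
```

The proposal is then: while growing a tree, if the best split of a node lands on one of the `sentinel_*` columns, that node is not split; if it happens at the root, no further trees are created.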

Alternatives

Alternatively, allow user-defined predicate callbacks (with access to the training environment) to run before a split happens and before a new tree is created, so that user-defined logic can stop node splitting or tree creation.
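A rough sketch of what such hooks might look like; these names and signatures are hypothetical and do not exist in LightGBM's current API:

```python
# Hypothetical predicate hooks (illustrative only, not part of LightGBM today).
# Returning False would veto the split / the new tree.

def before_split(tree_index: int, node_depth: int, best_gain: float, best_feature: str) -> bool:
    # Example policy: refuse a split whose best feature is a random sentinel.
    return not best_feature.startswith("sentinel_")

def before_new_tree(tree_index: int, best_root_feature: str) -> bool:
    # Example policy: stop boosting once the root split prefers a sentinel.
    return not best_root_feature.startswith("sentinel_")
```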

References

https://www.kdnuggets.com/2019/10/feature-selection-beyond-feature-importance.html (see the "Feature Importance + Random Features" section).

Shulito changed the title from "Adding random uniform sentinels to avoid overfitting" to "Add random uniform sentinels to avoid overfitting" on Sep 23, 2021.
shiyu1994 (Collaborator) commented

Hi @Shulito, thanks for using LightGBM. Using random uniform sentinel features to set an implicit threshold on the minimum split gain (an implicit min_gain_to_split) seems valuable to me. That is an interesting idea. Is it widely used in Kaggle competitions with other models?
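For context, a hedged sketch of how the explicit `min_gain_to_split` parameter is set today; the sentinel idea would derive such a threshold implicitly (the data and parameter values here are illustrative):

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

params = {
    "objective": "binary",
    "min_gain_to_split": 0.01,  # explicit minimum gain a split must reach
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)
```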


Shulito commented Sep 24, 2021

I don't know if it's widely used. Since this is not supported out of the box by frameworks, what people do is add these sentinel features manually to the dataset, train a model, check which features fall below the group of sentinels in the feature importance list (those below are discarded), and then continue to build the "real" model.
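A hedged sketch of that manual workflow using the standard LightGBM Python API (the dataset, the number of sentinels, and the thresholding rule are illustrative assumptions):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative data: a couple of informative columns plus noise.
X = pd.DataFrame(rng.normal(size=(2000, 8)), columns=[f"f{i}" for i in range(8)])
y = (X["f0"] - X["f1"] + 0.2 * rng.normal(size=len(X)) > 0).astype(int)

# 1. Add random uniform sentinel features to the dataset.
for i in range(5):
    X[f"sentinel_{i}"] = rng.uniform(0.0, 1.0, size=len(X))

# 2. Train a throwaway model that includes the sentinels.
probe = lgb.LGBMClassifier(n_estimators=200).fit(X, y)

# 3. Discard real features whose importance is below the strongest sentinel.
importances = pd.Series(probe.feature_importances_, index=X.columns)
threshold = importances.filter(like="sentinel_").max()
kept = [c for c in X.columns
        if not c.startswith("sentinel_") and importances[c] > threshold]

# 4. Build the "real" model on the surviving features only.
final_model = lgb.LGBMClassifier(n_estimators=200).fit(X[kept], y)
```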

shiyu1994 (Collaborator) commented

It seems that sentinel features are used to filter out some features after a model finishes training, according to the final feature importance. In that case, we don't have to provide direct support in LightGBM, since this is easy to implement and doesn't interfere with the training process.
As for stopping a node from splitting when the best split over the real features is no better than the best split over a random feature, this requires treating the random feature as a real feature during training and constructing histograms for it, which would add some training cost. I'm not sure whether it is worthwhile without more evidence of its effectiveness.
But I think we can have it in the Feature Requests and Voting Hub (#2302).

shiyu1994 (Collaborator) commented

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
