Add ability to set min_data_in_leaf as a share of n_samples #5194
Comments
Thanks very much for the idea and for writing this up @gorokhovnik ! I'm concerned that overloading the meaning of this parameter this way would introduce a maintenance burden and risk of bugs that isn't worth the benefit, given that it should be possible for users' code to compute such values itself.

Possible sample code:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=12345)

# compute absolute min_data_in_leaf values from shares of the sample count
min_data_in_leaf_values = [
    int(0.001 * X.shape[0]),
    int(0.01 * X.shape[0]),
    int(0.10 * X.shape[0]),
    int(0.15 * X.shape[0]),
    int(0.25 * X.shape[0]),
]

grid_search = GridSearchCV(
    estimator=lgb.LGBMRegressor(),
    param_grid={
        "n_estimators": [10],
        "learning_rate": [0.1],
        "min_data_in_leaf": min_data_in_leaf_values,
        "verbosity": [-1],
    },
    n_jobs=1,
    verbose=1,
)
grid_search.fit(X, y)

pd.DataFrame({
    "min_data_in_leaf": grid_search.cv_results_["param_min_data_in_leaf"].data,
    "mean_test_score": grid_search.cv_results_["mean_test_score"],
})
```

Unlike the […]. It could be changed in LightGBM's C++ code to be a […].
Alternatively, wrappers like the R and Python packages could accept values in the range […]. But I am only one voice. I'd love to hear your perspective, and the perspective of other maintainers here, on how LightGBM supporting this directly might be preferable to user code computing a range of `min_data_in_leaf` values itself.
Hello @jameslamb! I ran into this problem with the min_data_in_leaf parameter, but the feature would probably be useful for other parameters as well; scikit-learn provides such multi-type input options for a wide range of parameters. Handling this in the Python/R wrappers doesn't seem so bad, since the operation is not resource-intensive. The obvious disadvantage is the need to maintain those wrappers separately. That could perhaps be solved with a C++ entity like parameters_preprocessing, which both the R and Python APIs could use. Unfortunately, for lack of knowledge of LightGBM's architecture I can't answer the question about version support, but reading old-version files seems unproblematic, since the functionality would be extended without losing the old behavior. Nevertheless, thank you for your answer; I look forward to your colleagues' opinions.
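To make the parameters_preprocessing idea above concrete, here is a minimal sketch of a shared preprocessing step written in Python. Every name in it (`preprocess_params`, `PARAM_RESOLVERS`, `_resolve_fraction`) is hypothetical and not part of any LightGBM API; it only illustrates the idea of a single table-driven conversion that wrappers could share.

```python
def _resolve_fraction(value, n_samples):
    # Values strictly between 0 and 1 are interpreted as shares of
    # n_samples; anything else is treated as an absolute count.
    return int(value * n_samples) if 0 < value < 1 else int(value)

# Registry of parameters that may be given as fractions (illustrative).
PARAM_RESOLVERS = {
    "min_data_in_leaf": _resolve_fraction,
    # other fraction-capable parameters could be registered here
}

def preprocess_params(params, n_samples):
    # Return a copy of params with fraction-capable entries resolved
    # to absolute counts; unrelated parameters pass through untouched.
    resolved = dict(params)
    for name, resolver in PARAM_RESOLVERS.items():
        if name in resolved:
            resolved[name] = resolver(resolved[name], n_samples)
    return resolved

# min_data_in_leaf becomes an absolute count; learning_rate is untouched
print(preprocess_params({"min_data_in_leaf": 0.01, "learning_rate": 0.1}, 12345))
```

One design choice worth noting: keying the registry by parameter name keeps the conversion logic in one place, so the R and Python wrappers would not each need their own copy.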
Summary
The min_data_in_leaf parameter could be set either as an absolute number of elements per leaf or as a share of the number of samples in the data.
Motivation
The motivation for this feature is the ability to use a unified parameter value independent of the number of rows in the training dataset. This is especially important in AutoML research on hyperparameter search spaces.
Description
The idea is simple: if the min_data_in_leaf value is greater than 0.5, leave it as is and use it as an absolute count. Otherwise the value lies in (0, 0.5] and reflects the share of rows that each leaf must contain. The easiest way to implement the feature is to compute the absolute value during initialization whenever the parameter value is at most 0.5.
References
The implementation I have in mind is something like:

```python
self.min_data_in_leaf = int(min_data_in_leaf) if min_data_in_leaf > 0.5 else int(self.n_samples * min_data_in_leaf)
```
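As a self-contained sketch of that rule (the function name and the `n_samples` argument are illustrative, not part of LightGBM's API):

```python
def resolve_min_data_in_leaf(min_data_in_leaf, n_samples):
    """Interpret values above 0.5 as absolute counts, values in (0, 0.5] as shares."""
    if min_data_in_leaf > 0.5:
        return int(min_data_in_leaf)
    return int(n_samples * min_data_in_leaf)

print(resolve_min_data_in_leaf(20, 12345))    # absolute count: 20
print(resolve_min_data_in_leaf(0.01, 12345))  # share of 12345 rows: 123
```

Note the ambiguity this threshold trades away: integer values of 1 and fractional values in (0.5, 1) cannot both be expressed, which is why the cut is placed at 0.5 rather than 1.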
The implementation of a similar feature in scikit-learn:
https://github.com/scikit-learn/scikit-learn/blob/baf828ca126bcb2c0ad813226963621cafe38adb/sklearn/tree/_classes.py#L233
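For comparison, scikit-learn's `min_samples_leaf` distinguishes by type rather than by magnitude: integers are absolute counts, while floats are fractions of the sample count rounded up with `ceil`, per its documentation. A stdlib-only sketch of that rule (the helper name is illustrative, not scikit-learn code):

```python
from math import ceil
from numbers import Integral

def sklearn_style_min_samples_leaf(value, n_samples):
    # Integers are absolute counts; floats are fractions of n_samples,
    # rounded up, mirroring scikit-learn's documented behaviour.
    if isinstance(value, Integral):
        return int(value)
    return int(ceil(value * n_samples))

print(sklearn_style_min_samples_leaf(5, 12345))     # absolute count: 5
print(sklearn_style_min_samples_leaf(0.01, 12345))  # ceil(123.45) = 124
```

Type-based dispatch avoids the ambiguity of a magnitude threshold, but it does not translate directly to LightGBM's C++ core, where a parameter has a single declared type.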