
Add ability to set min_data_in_leaf as a share of n_samples #5194

Open
gorokhovnik opened this issue May 3, 2022 · 2 comments

gorokhovnik commented May 3, 2022

Summary

The min_data_in_leaf parameter should accept either an absolute number of elements per leaf or a share of the number of samples in the data.

Motivation

The motivation for this feature is the ability to use a unified parameter value independent of the number of rows in the training dataset. This is especially important in AutoML research on hyperparameter search spaces.

Description

The idea is simple: if the min_data_in_leaf parameter is greater than one, leave it as it is and use the absolute value of the parameter. Otherwise, the parameter value lies between 0 and 0.5 and reflects the share of rows that the leaf must contain. The easiest way to implement the feature is to calculate the absolute value during initialization if the parameter value is less than 0.5.

References

The way I see it, the implementation would be something like this:
self.min_data_in_leaf = int(min_data_in_leaf) if min_data_in_leaf > 0.5 else int(self.n_samples * min_data_in_leaf)

The implementation of a similar feature in scikit-learn:
https://github.com/scikit-learn/scikit-learn/blob/baf828ca126bcb2c0ad813226963621cafe38adb/sklearn/tree/_classes.py#L233
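Spelled out as a standalone helper, the proposed resolution could look something like the sketch below. The function name and the `max(1, ...)` floor are my assumptions for illustration, not part of any existing API; the 0.5 threshold matches the one-liner above.

```python
def resolve_min_data_in_leaf(min_data_in_leaf, n_samples):
    # Values above 0.5 are treated as absolute leaf sizes;
    # values in (0, 0.5) as a share of n_samples.
    if min_data_in_leaf > 0.5:
        return int(min_data_in_leaf)
    # Floor at 1 so a tiny share on a small dataset stays a valid count.
    return max(1, int(n_samples * min_data_in_leaf))
```

For example, `resolve_min_data_in_leaf(20, 10000)` keeps the absolute value 20, while `resolve_min_data_in_leaf(0.01, 10000)` resolves to 100.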

@jameslamb
Collaborator

Thanks very much for the idea and for writing this up @gorokhovnik !

I'm concerned that overloading the meaning of this parameter this way would introduce a maintenance burden and risk of bugs that isn't worth the benefit, given that it should be possible for users' code to choose possible min_data_in_leaf values based on the size of the data prior to training, like this:

Sample code:
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=12345)
min_data_in_leaf_values = [
    int(0.001 * X.shape[0]),
    int(0.01 * X.shape[0]),
    int(0.10 * X.shape[0]),
    int(0.15 * X.shape[0]),
    int(0.25 * X.shape[0]),
]

grid_search = GridSearchCV(
    estimator=lgb.LGBMRegressor(),
    param_grid={
        "n_estimators": [10],
        "learning_rate": [0.1],
        "min_data_in_leaf": min_data_in_leaf_values,
        "verbosity": [-1]
    },
    n_jobs=1,
    verbose=1,
)

grid_search.fit(X, y)
pd.DataFrame({
    "min_data_in_leaf": grid_search.cv_results_["param_min_data_in_leaf"].data,
    "mean_test_score": grid_search.cv_results_["mean_test_score"]
})

Unlike the scikit-learn code linked in this issue's description (which is written in Python), LightGBM is written in C++. Right now parameter min_data_in_leaf is an int, which means it can't accept values like 0.25.

It could be changed in LightGBM's C++ code to be a double instead, but that would have to be done very carefully, with consideration for at least the following concerns:

  • can model text files produced from previous versions of LightGBM, which contain values like min_data_in_leaf: 5, be read into newer versions of LightGBM successfully?
  • where and when should it be checked that min_data_in_leaf is either integer-valued or in the interval [0, 0.5)?
  • for consistency, should other dataset-size-sensitive parameters like min_data_in_bin, bin_construct_sample_cnt, and min_data_per_group support a similar mix of interpretations?
    • would doing that mean that binary Dataset files produced from previous versions of LightGBM couldn't be read by newer versions of LightGBM successfully?
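To make the second bullet concrete, one way such a check could work is a validation helper that accepts either interpretation. This is only a minimal sketch under the assumption that the parameter becomes a double; the function name and error message are hypothetical, not LightGBM code:

```python
def validate_min_data_in_leaf(value):
    """Accept non-negative integer-valued counts, or fractions in (0, 0.5)."""
    v = float(value)
    if v >= 0 and v.is_integer():
        return  # absolute count, e.g. 5 or 5.0
    if 0.0 < v < 0.5:
        return  # share of n_samples, e.g. 0.25
    raise ValueError(
        f"min_data_in_leaf must be integer-valued and >= 0, "
        f"or a fraction in (0, 0.5); got {value}"
    )
```

Values like 0.7 are rejected because they are neither integer-valued nor in the fractional range, which is exactly the ambiguity the bullet points at.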

Alternatively, wrappers like the R and Python package could accept values in the range [0, 0.5), then do the translation to the corresponding integer before passing that through to LightGBM's C++ code. But that would introduce other maintenance concerns.

But I am only one voice. Would love to hear your perspective and the perspective of other maintainers here on how LightGBM supporting this directly might be preferable to user code computing a range of min_data_in_leaf values based on the shape of the input data.

@gorokhovnik
Author

Hello @jameslamb!
Yes, I understand that typing in C++ is stricter than in Python and that it is possible to do such preprocessing manually. So this issue is more of a suggestion for making LightGBM a bit more user-friendly.

I faced the problem with the min_data_in_leaf parameter, but this feature would probably be useful for other parameters as well. Scikit-learn provides such multi-type input options for a wide range of parameters.

The Python/R wrapper approach does not seem so bad, since the operation is not very resource-intensive. The obvious disadvantage is the need to maintain those wrappers separately. Perhaps that could be solved with a C++ entity like parameters_preprocessing, which both the R and Python APIs could use.
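As a rough illustration of what such a shared preprocessing step might do (sketched in Python for brevity, even though the suggestion is to host it in C++; the function name and the set of fraction-capable parameters are assumptions, not LightGBM code):

```python
# Parameters that could plausibly accept a fractional form; illustrative only.
FRACTION_CAPABLE = {"min_data_in_leaf", "min_data_in_bin", "min_data_per_group"}

def preprocess_params(params, n_samples):
    """Translate fractional values in (0, 0.5) into absolute counts."""
    resolved = dict(params)
    for name in FRACTION_CAPABLE:
        value = resolved.get(name)
        if value is not None and 0 < value < 0.5:
            resolved[name] = max(1, int(n_samples * value))
    return resolved
```

Each wrapper would then only need to call this one routine before handing the parameters to the core library, instead of duplicating the translation logic.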

Unfortunately, because of my lack of knowledge about LightGBM's architecture, I can't answer the question about version support, but it seems there should be no problem reading files from older versions, since the functionality would be extended without losing the old behavior.

Nevertheless, thank you for your answer; I look forward to your colleagues' opinions.
