
Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task #3679

Open
Tracked by #5153
ch3rn0v opened this issue Dec 25, 2020 · 56 comments

@ch3rn0v
Contributor

ch3rn0v commented Dec 25, 2020

How are you using LightGBM?

  • Python package

Environment info

  • Operating System: Ubuntu 20.04.1 LTS
  • Python version: 3.8.5
  • GCC 7.3.0
  • LightGBM version or commit hash: 3.1.1

Steps to reproduce

  1. In Jupyter Lab's notebook, prepare train and validation datasets. (They are huge and private, so I can't share a reproducible example.)
  2. Train LightGBM on the data with different sets of features.
  3. Observe an exception looking like this:

Check failed: (best_split_info.right_count) > (0) at [...]
Sometimes it says left_count instead of right_count.
Other times it doesn't occur at all, depending on the features I use.

Other details

Apparently this is the start of the code that raises the exception: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L652.

I tried setting min_data_in_leaf to a value greater than zero. It helps sometimes, but not reliably. Same with feature_fraction. I also tried changing min_sum_hessian_in_leaf, to no avail. Also tried setting min_data_in_leaf and min_sum_hessian_in_leaf simultaneously, no difference.
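For reference, a minimal sketch of how these parameters can be combined in the Python API; all values here are hypothetical placeholders, not recommendations, and the training data is synthetic:

```python
# Parameters discussed above; the specific values are placeholders to tune.
params = {
    "objective": "regression",
    "metric": "l1",
    "min_data_in_leaf": 50,           # raised above the default of 20
    "min_sum_hessian_in_leaf": 1e-2,  # raised together with min_data_in_leaf
    "feature_fraction": 0.8,
    "verbose": -1,
}

try:
    import lightgbm as lgb
    import numpy as np

    # Synthetic stand-in for the private dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] + rng.normal(scale=0.1, size=1000)
    booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=10)
except ImportError:
    booster = None  # lightgbm/numpy not installed in this environment
```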

This (or a similar) issue is mentioned a few times here:

None of them suggests an approach that allowed me to avoid these exceptions. Would you please share any ideas on how to fix this, or at least explain why this issue happens at all? If I understand correctly, one could simply trim the split leading to this error and stop branching further. Please correct me if I'm wrong. Thank you.

@guolinke
Collaborator

guolinke commented Dec 25, 2020

You can use a larger min_data_per_leaf or min_hessian_per_leaf; merely non-zero may not be enough.
The regression objective should be safe in most cases, so I guess you may be using sample weights? If yes, it is better to avoid them.
And which objective function did you use?

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 25, 2020

you can use larger min_data_per_leaf or min_hessian_per_leaf

Sure, I tried a few values. If I increase these too much, though, the model ends up under-fitted.

I use "regression" as the value for the "objective" param. Metric is "l1".

you may use the sample weight?

Do you mean specifying different weights for different samples? If so, I do not use this in the example we are discussing here.

Thank you very much for the rapid reply!

@guolinke
Collaborator

If there is no sample weight and the task is regression, I think it may be due to another problem, not related to min_data and min_hessian.
Did it only happen in large-scale data? If yes, I think you can try deterministic=true.

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 25, 2020

Did it only happen in large-scale data?

Although I didn't conduct the same experiment with a smaller fraction of my dataset, I tried bagging_fraction before (perhaps with a different set of features), and if I remember correctly it did not result in the above exception. I will try it again, thank you.

you can try deterministic=true.

I appreciate the suggestion! I'm already using this param since your helpful advice in #3654 :)

@guolinke
Collaborator

Interesting, I guess there may be a bug.
Did you use missing value handling? By default, it is enabled if feature values contain NaN.
You can also try use_missing=false.
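As a sketch of this suggestion (synthetic data; assuming the standard Python API), missing-value handling can be disabled via the params dict:

```python
# Disable LightGBM's special handling of NaN values (a sketch).
params = {
    "objective": "regression",
    "use_missing": False,  # default is True; NaNs are then not treated specially
    "verbose": -1,
}

try:
    import lightgbm as lgb
    import numpy as np

    # Synthetic data with some injected missing values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X[rng.random(X.shape) < 0.05] = np.nan
    y = rng.normal(size=500)
    booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=5)
except ImportError:
    booster = None  # lightgbm/numpy not installed in this environment
```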

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 25, 2020

Did you use missing value handling? By default, it will enable if feature values contain NaN.
you can also try use_missing=false.

Yes, I do use missing value handling. Will try with use_missing=false now and report back.

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 25, 2020

I tried running the same script as before (so without setting min_data_in_leaf or min_sum_hessian_in_leaf), with the only change being the addition of "use_missing": False to the model's params. The same exception still occurs.

Any other suggestions on how this can be fixed are very welcome.

@guolinke
Collaborator

@ch3rn0v did you use categorical features?

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 25, 2020

@guolinke , nope, all features are numerical.

@guolinke
Collaborator

Is it possible to provide a reproducible example, with a subset of features (or even a subset of rows), so that we can debug it?

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 26, 2020

I'll start an internal discussion about this, but I doubt any particular data or even a piece of it will be shared.

In the meantime I ran a few other tests.

  1. Tried the same dataset, this time with bagging_fraction and bagging_freq. The exception still happens.
  2. Suppose I have a dataset D1 that works OK. When I add a feature F2, I get an exception. If I keep F2 but remove any single other feature from D1, the exception does not happen. So the cause is not adding F2 itself, but rather some interaction between the features.

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 26, 2020

Interestingly, the error still happens even with "max_depth" and "num_leaves" both set to zero. Perhaps it occurs during some preliminary data verification?

@shiyu1994 shiyu1994 self-assigned this Dec 26, 2020
@shiyu1994
Collaborator

A potential bug in histogram offset assignment may cause this error. I will create a PR for this.

@shiyu1994
Collaborator

@ch3rn0v Can you please try https://github.com/shiyu1994/LightGBM/tree/fix-3679 to see if the same error occurs?

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 28, 2020

Hello @shiyu1994, I appreciate your rapid response! Do I understand correctly that the only way to try this version is to follow these steps: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux ? And if so, will I be able to remove this temporary version later? Will it result in any conflict if another version is already installed? Thanks in advance.

@shiyu1994
Collaborator

shiyu1994 commented Dec 28, 2020

@ch3rn0v Yes, you have to install the Python package by building from the source files as described in the link.

If you are using the Python API, you can use virtualenv or conda to create a new Python environment, and install the Python package from the branch shiyu1994/fix-3679 in that environment.

You may also install the Python package from the branch directly. If you later want to restore a standard released version of LightGBM, just use pip to remove the branch package and reinstall the latest release.

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 28, 2020

I tried a few different ways to install this version in a new conda env. Alas, none of them worked.

For instance, pip install git+git://github.com/shiyu1994/LightGBM@fix-3679 results in:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-[...]/setup.py'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

And yes, I did conda install git pip before that. Searching for any similar errors didn't help much.

I also don't happen to have cmake and can't install it right now.

Would you please suggest any other steps I can take right now, or should I just obtain cmake?
Regardless, I'll post an update once I have any news.

@shiyu1994
Collaborator

shiyu1994 commented Dec 28, 2020

Can you please install cmake? When installing from source code, you have to build LightGBM before installing the Python package.

@shiyu1994
Collaborator

The steps to install the Python package from source code are:
git clone --recursive https://github.com/microsoft/LightGBM ; cd LightGBM
mkdir build ; cd build
cmake ..
make -j4
cd ../python-package
python setup.py install

@ch3rn0v
Contributor Author

ch3rn0v commented Dec 28, 2020

While I'd be able to test this locally, it only makes sense to run the experiment on a remote machine that has enough processing power, and I'm unable to install cmake there. While I could build it locally and scp the result to the server, it'd still require python setup.py install or similar. As far as I understand, the latter doesn't guarantee isolation within the current conda env, and that's something we can't risk. I'm afraid we'll have to wait until this fix is released in order to test it.

I can still run tests locally, but I don't have a dataset tiny enough that still reproduces the bug with the current (3.1.1) release. Apologies for not returning to you with more meaningful feedback.

@StrikerRUS
Collaborator

StrikerRUS commented Dec 28, 2020

The steps to install python package from source code is

The last step should be

python setup.py install --precompile

if you'd like the Python package installation to pick up the already compiled dynamic library.

@shiyu1994 Could you please transfer your changes from your fork to this repository? I believe you have enough rights to do this as a collaborator. Then we can trigger Azure Pipelines to build Python wheel file with your changes. And after that @ch3rn0v will be able to install patched version with simple pip install ... in isolated env without any other requirements.

@StrikerRUS
Collaborator

Another option would be to simply find the current LightGBM installation folder, rename the lib_lightgbm.so file to something like lib_lightgbm_backup.so, and download only the patched dynamic library file instead of the whole wheel, in case you cannot take the risk of a not fully isolated environment. This will work because, as far as I can see, the fix only includes changes on the cpp side and doesn't touch the Python wrapper.
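A small helper along these lines can locate the installed shared library before swapping it out. This is a sketch; the glob pattern is an assumption meant to also match the .dll/.dylib file names used on other platforms:

```python
import importlib.util
from pathlib import Path

def find_lightgbm_lib():
    """Return the path of the compiled lib_lightgbm library inside the
    installed lightgbm package, or None if it cannot be found."""
    spec = importlib.util.find_spec("lightgbm")
    if spec is None or spec.origin is None:
        return None  # lightgbm is not installed
    pkg_dir = Path(spec.origin).parent
    matches = sorted(pkg_dir.rglob("lib_lightgbm*"))
    return matches[0] if matches else None

lib_path = find_lightgbm_lib()
# e.g. lib_path.rename(lib_path.with_name("lib_lightgbm_backup.so"))
# before dropping the patched library into the same folder.
```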

@jungsooyun

The same issue occurred with the GPU version of LightGBM.

In my case, if I don't set both max_depth and num_leaves together, and use only num_leaves (leaving max_depth at its default), the error doesn't occur.

Hope this bug gets fixed soon.

@shiyu1994
Collaborator

#3694 has been opened to potentially fix these errors, but it only relates to the CPU version. We will need further investigation if the errors are not fully eliminated after this PR is merged.

@wonghang

wonghang commented Jan 17, 2021

@shiyu1994 Just for your information.

I also got a similar error, and after searching the web I found this issue.

I am using LightGBM 3.1.1 (the version installed via pip3 install lightgbm).
I ran it with missing_data=True, a regression task, least-squares error, no GPU, and with categorical features.

I got the following error at some point:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 651 .

I saw that #3694 had been merged, so I compiled the latest version from GitHub master, and it currently works.

My data is also private and cannot be shared. Sorry about that.

@pseudotensor

Hi @guolinke I hit the same problem:

 File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 794, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 251, in train
    booster.update(fobj=fobj)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2505, in update
    ctypes.byref(is_finished)))
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /workspace/LightGBM/src/treelearner/serial_tree_learner.cpp, line 653 .

This happens when trying to use mape on simple random data.

@shiyu1994
Collaborator

Hi @pseudotensor, are you using the released version of LightGBM or building from source?

@pseudotensor

pseudotensor commented Jan 26, 2021

Building from source like:

    rm -rf build ; mkdir -p build ; cd build && \
    cmake $(GPU_FLAG) $(CUDA_FLAG) -DCMAKE_INSTALL_PREFIX=$$PYTHONPREFIX -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT:PATHNAME=$$BOOSTPREFIX -DBoost_LIBRARY_DIRS:FILEPATH=$BOOSTPREFIX/lib -DOpenCL_LIBRARY=$$CUDA_HOME/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=$$CUDA_HOME/include/ -DBoost_USE_STATIC_LIBS=ON .. && \
    make -j 8 && \
	cd ../python-package && rm -rf dist && \
    $(PYTHON) setup.py bdist_wheel --precompile --gpu --cuda --hdfs

(the --gpu etc. options aren't really needed since --precompile is there)

Note that I only started seeing this problem when upgrading from 2.2.4 to master.

I'm trying to repro the event seen in our jenkins testing, but so far no luck.

@tandav

tandav commented Aug 6, 2021

For me updating 3.1.1 -> 3.2.1 fixes the issue (CPU, macbook pro 16", macos catalina)

@mshivers

mshivers commented Oct 5, 2021

I'm using 3.2.1, and still get this error, but it only seems to happen when bagging_fraction < 1.0.

@shiyu1994
Collaborator

@mshivers Thanks for using LightGBM. Are you using the CPU for training when encountering the error? It would be really appreciated if you could provide a reproducible example of the error.

@mshivers

mshivers commented Oct 8, 2021

Hi @shiyu1994, I'm using CPUs. I've managed to reproduce the error using just randomly generated data. I'm on a corporate network that restricts data upload; however, when I run the script below, it usually takes only a few minutes before it throws the error:

import lightgbm as lgb
import pandas as pd
import numpy as np

while 1:
    R, C = 100000, 10
    data = pd.DataFrame(np.random.randn(R, C))
    for i in range(1, C):
        data[i] += data[0] * np.random.randn()
    N = int(0.6 * len(data))
    train_data = data.loc[:N]
    test_data = data.loc[N:]

    train = lgb.Dataset(train_data.iloc[:, 1:], train_data.iloc[:, 0], free_raw_data=True)
    test = lgb.Dataset(test_data.iloc[:, 1:], test_data.iloc[:, 0], free_raw_data=True, reference=train)

    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'max_tree_output': 0.03,
        'max_bin': 20,
        'max_depth': 10,
        'num_leaves': 127,
        'seed': 8,
        'learning_rate': 0.01,
        'bagging_fraction': 0.5,
        'bagging_freq': 1,
        'min_data_in_leaf': 5,
        'verbose': -1,
        'min_split_gain': 0.1,
        'cegb_penalty_feature_coupled': 5 * np.ones(C - 1),
        'cegb_penalty_split': 0.0000002,
    }
    gbm = lgb.train(params, train, num_boost_round=5000, valid_sets=test)

@shiyu1994
Collaborator

@mshivers Thanks! Given that reproducible example, we should reopen this issue. I'll investigate it further in the next few days.

@shiyu1994 shiyu1994 reopened this Oct 8, 2021
@noobxinyu

noobxinyu commented Oct 11, 2021

Same bug with 3.2.1:

  • CPU
  • large-scale data
  • weighted dataset (no bug if the weights are removed)

Possibly this is related to an ill-conditioned problem.

@kabartay

kabartay commented Nov 2, 2021

What is the status of this bug?

@kabartay

kabartay commented Nov 2, 2021

Probably with some tiny data sizes, certain parameters can cause this error; there might not be enough data in the splits, etc.
If the parameters are set carefully, this might prevent the check-failed error.

@kabartay

kabartay commented Nov 2, 2021

Thank you @ZFTurbo, does the error happen only under GPU settings?

I've checked: it only fails with GPU, while running fine on CPU. The only thing I changed was 'device': 'gpu' => 'device': 'cpu'.

What can be behind such a difference?

@shiyu1994
Collaborator

@kabartay We are investigating this bug. Progress will be posted once we have findings. Thanks for your patience!

@chixujohnny

Please don't close this issue. I tried all the solutions mentioned in this issue, but none of them worked.

Interestingly, the error only happens when my GPU (an A100-40G) loads more than 17 GB of memory. For more details, please check #4946.

@ch3rn0v
Contributor Author

ch3rn0v commented Jan 23, 2022

Hello again. I happened to stumble upon this issue again. This time it's LightGBM v3.3.1 (OS: Ubuntu 20.04.3 LTS).
I don't specify the parameters because I tried different sets, and the error happened regardless of their values.

I use the data from this competition:
http://www.topcoder.com/challenges/74c9ea5d-62f5-4168-8f2e-f05d2694988a

At the moment the data can be accessed here: https://drive.google.com/drive/folders/1pJgHq-xo0LNCmVxEWmnGX4zMBv8t48VX (see data_training.zip).

I can't share the specific preprocessing steps and the features I compute, but at least the source data is public. One of the classes is extremely rare; perhaps this is part of the reason.

I can also say that another pipeline produces different preprocessing steps and computes different features, where the error doesn't occur.

When both sets of features are combined, the error does happen again.

UPD: This time the approach used in #3603 (increasing min_child_weight to a positive number) worked.
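As a sketch of that workaround (min_child_weight is a parameter alias for min_sum_hessian_in_leaf; the value below is a hypothetical placeholder, and the training data is synthetic):

```python
# Workaround from #3603: set min_child_weight to a positive number.
params = {
    "objective": "regression",
    "min_child_weight": 1e-3,  # alias for min_sum_hessian_in_leaf; must be > 0
    "verbose": -1,
}

try:
    import lightgbm as lgb
    import numpy as np

    # Synthetic stand-in data.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(800, 6))
    y = X.sum(axis=1) + rng.normal(scale=0.5, size=800)
    booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=10)
except ImportError:
    booster = None  # lightgbm/numpy not installed in this environment
```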

@jameslamb jameslamb mentioned this issue Apr 14, 2022
60 tasks
jameslamb pushed a commit that referenced this issue Jun 8, 2022
…ery iteration (fix partially #3679) (#5164)

* clear split info buffer in cegb_ before every iteration

* check nullable of cegb_ in serial_tree_learner.cpp

* add a test case for checking the split buffer in CEGB

* swith to Threading::For instead of raw OpenMP

* apply review suggestions

* apply review comments

* remove device cpu
@ahbon123

ahbon123 commented Jul 4, 2022

I set min_data_in_leaf to 6, which is smaller than the default value of 20; it works for a small dataset.

@wellswei

wellswei commented Jul 5, 2022

A similar error happens when I run the GPU build, while it works fine on CPU. I tried different environments and LightGBM versions. So confused.

@wellswei

wellswei commented Jul 6, 2022

UPD: Setting min_child_weight to 1 solved the problem, for both the left_count and right_count errors.
