[python-package] Segmentation fault with CUDA version in Python interface (core dumped) #6300

Open
leedchou opened this issue Feb 5, 2024 · 12 comments

leedchou commented Feb 5, 2024

Description

I installed lightgbm 4.3.0.0, CUDA version. After the data is loaded and transferred to the GPU, execution just stops with a segmentation fault. The log is below.
GPU memory is about 12 GB, while the data is about 6 GB.

[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Total Bins 2438
[LightGBM] [Info] Number of data points in the train set: 35322835, number of used features: 48
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
Segmentation fault (core dumped)

Reproducible example

# focal_loss_obj, gmean_score, lgb_train and lgb_eval are defined elsewhere (the objective is posted in a later comment)
import lightgbm as lgb

params = {
    'task': 'train',
    'objective': focal_loss_obj,
    'max_bin': 63,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda'
}

eval_result = {}
gbm = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
    feval=gmean_score
)

Environment info

LightGBM version or commit hash: 4.3.0.0
Command(s) you used to install LightGBM

pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm

Additional Comments

@jameslamb
Collaborator

Thanks for using LightGBM, and for the well-formatted report.

We'd be happy to help, but there are some things you can do to narrow down the issue further and reduce the effort that'll be required to find the root cause.

  • Can you please provide the code for focal_loss_obj() and gmean_score()?
  • If you use a LightGBM built-in loss function and metric, does LightGBM still segfault? If not, then the issue might be somewhere in your implementations of those functions (see the sketch just after this list).
  • Alternatively... if you can't share the dataset you're using, can you try with the exact same parameters, loss function, metrics, etc. but a public dataset, like those available from scikit-learn via sklearn.datasets? And report what happens?
  • Can you try removing parameters from params one-by-one and try to reduce it to the smallest set of non-default values that still produces the problem? For example, if you remove bagging_fraction and feature_fraction and still see a segfault, that's very helpful because it tells us the issue is not related to subsampling of rows and columns inside LightGBM.
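A minimal sketch of that built-in check (assuming the same lgb_train / lgb_eval Datasets from your report; the parameter values here are only illustrative):

import lightgbm as lgb

# swap the custom focal loss / g-mean metric for built-in equivalents
params_builtin = {
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 4,
    'max_bin': 63,
    'num_leaves': 255,
    'learning_rate': 0.05,
    'device': 'cuda',
}

gbm = lgb.train(
    params_builtin,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

If this runs to completion, the custom objective/metric path is the main suspect; if it still segfaults, the problem is more likely in the CUDA training code itself.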

@jameslamb
Collaborator

Also note that I've reformatted your original post slightly to make the difference between code, your own words, and text printed by code clearer. You can click ... -> Edit in GitHub to see what that looks like in raw markdown form.

If you're unsure how I did that, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.

@leedchou
Author

Thank you @jameslamb for the kind reply. I took New Year's leave and got back to work today. There is one important thing I forgot to post here: when I reduced the number of data points in the train set to a smaller one (e.g. 100,000), it worked. So maybe it is a data problem?
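For reference, the size check looks roughly like this (a sketch, not my exact code; X_train / y_train stand in for my actual feature matrix and labels):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
idx = rng.choice(len(X_train), size=100_000, replace=False)  # small random subset
small_train = lgb.Dataset(X_train[idx], label=y_train[idx])
lgb.train(params, small_train, num_boost_round=10)  # same params as above; no segfault at this size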

@jameslamb
Collaborator

Please provide the details I asked for at #6300 (comment) to help us eliminate possible causes.

@leedchou
Author

> Please provide the details I asked for at #6300 (comment) to help us eliminate possible causes.

Hi @jameslamb, I re-ran my code with nothing changed but the training data, which I replaced with the iris data from sklearn.datasets.load_iris. Surprisingly, it worked.
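A sketch of that substitution (not my exact code; iris has only 3 classes, so num_class and the per-class alpha weights have to be reduced to 3 as well):

from sklearn.datasets import load_iris
import lightgbm as lgb

X, y = load_iris(return_X_y=True)
iris_train = lgb.Dataset(X, label=y)
iris_eval = lgb.Dataset(X, label=y, reference=iris_train)

iris_params = {**params, 'num_class': 3}  # the focal loss / g-mean callables also need num_class=3
gbm = lgb.train(iris_params, train_set=iris_train,
                valid_sets=(iris_train, iris_eval), valid_names=('fit', 'eval'),
                num_boost_round=100)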

@shiyu1994
Collaborator

Seems related to the CUDA version of LightGBM. I will investigate this.

@shiyu1994 shiyu1994 changed the title [python-package] Segmentation fault (core dumped) [python-package] Segmentation fault with CUDA version in Python interface (core dumped) Feb 20, 2024
@shiyu1994
Collaborator

@leedchou Could you provide the implementation of focal_loss_obj?

@shiyu1994
Collaborator

In addition, if you could provide a minimal example for reproducing the error, that would be very helpful.

@leedchou
Author

Thank you @shiyu1994, I'd love to show you the implementation of focal_loss_obj and an example. It would be great if I could get your email address so I can send you an example by email.

@shiyu1994
Collaborator

@leedchou Thanks. You may send that to my personal email shiyu_k1994@qq.com. It would also be great if you could post the example here for clear and open discussion.

@jameslamb
Collaborator

> It would also be great if you could post the example here for clear and open discussion.

Please do this, @leedchou, so that everyone finding this discussion from search in the future can learn from it and so that others can contribute to helping.

@leedchou
Author

> It would also be great if you could post the example here for clear and open discussion.
>
> Please do this, @leedchou, so that everyone finding this discussion from search in the future can learn from it and so that others can contribute to helping.

OK, I'll post it here, @shiyu1994.

focal_loss_obj:

import numpy as np
from scipy import special

def focal_loss_lgb(y_pred, dtrain, alpha, gamma=2, num_class=4):
    # Multiclass focal-loss objective for lgb.train (per-class gradients and hessians).
    target = dtrain.get_label()
    grad = np.zeros((len(target), num_class), dtype=float)
    hess = np.zeros((len(target), num_class), dtype=float)

    y_true = np.eye(num_class)[target.astype('int')]  # one-hot labels
    y_pred = y_pred.reshape(len(target), num_class, order='F')
    softmax_p = special.softmax(y_pred, axis=-1)

    for c in range(num_class):
        pc = softmax_p[:, c]         # predicted probability of class c
        pt = softmax_p[y_true == 1]  # predicted probability of each sample's true class
        pos = y_true[:, c] == 1      # samples whose true class is c
        neg = y_true[:, c] == 0
        grad[pos, c] = (gamma * np.power(1 - pt[pos], gamma - 1) * pt[pos] * np.log(pt[pos]) - np.power(1 - pt[pos], gamma)) * (1 - pc[pos])
        grad[neg, c] = (gamma * np.power(1 - pt[neg], gamma - 1) * pt[neg] * np.log(pt[neg]) - np.power(1 - pt[neg], gamma)) * (0 - pc[neg])
        hess[pos, c] = (-4 * (1 - pt[pos]) * pt[pos] * np.log(pt[pos]) + np.power(1 - pt[pos], 2) * (2 * np.log(pt[pos]) + 5)) * pt[pos] * (1 - pt[pos])
        hess[neg, c] = pt[neg] * np.power(pc[neg], 2) * (-2 * pt[neg] * np.log(pt[neg]) + 2 * (1 - pt[neg]) * np.log(pt[neg]) + 4 * (1 - pt[neg])) - pc[neg] * (1 - pc[neg]) * (1 - pt[neg]) * (2 * pt[neg] * np.log(pt[neg]) - (1 - pt[neg]))

    # per-sample class weights
    alpha = np.array([alpha[i] for i in target.astype('int')])[:, np.newaxis]
    grad = alpha * grad
    hess = alpha * hess

    # flattened to num_data * num_class values, grouped by class ('F' order)
    return grad.flatten('F'), hess.flatten('F')

train example:

class_weights = [1, 1, 1, 1]
focal_loss_obj = lambda x, y: focal_loss_lgb(x, y, alpha=class_weights, gamma=2, num_class=4)
gmean_score = lambda x, y: gmean_metric(x, y, num_class=4)

params = {
    'objective': focal_loss_obj,
    'task': 'train',
    'max_bin': 255,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda',
    # 'gpu_platform_id': 0,
    # 'gpu_device_id': 0,
}
eval_result = {}
gbm = lgb.train(params,
                train_set=lgb_train,
                valid_sets=(lgb_train, lgb_eval),
                valid_names=('fit', 'eval'),
                num_boost_round=10000,
                callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
                feval=gmean_score)
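One general sanity check for a custom multiclass objective (not something confirmed as the cause here): LightGBM expects the returned gradient and hessian to each contain num_data * num_class values, so a wrapper like the hypothetical one below can rule out a length mismatch before the CUDA code path is even involved.

def checked_focal_loss(y_pred, dtrain, num_class=4):
    # hypothetical wrapper around focal_loss_obj that verifies output lengths
    grad, hess = focal_loss_obj(y_pred, dtrain)
    expected = len(dtrain.get_label()) * num_class
    assert grad.shape == (expected,), f"grad has {grad.shape[0]} values, expected {expected}"
    assert hess.shape == (expected,), f"hess has {hess.shape[0]} values, expected {expected}"
    return grad, hess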
