[python-package] Segmentation fault with CUDA version in Python interface (core dumped) #6300

Open
leedchou opened this issue Feb 5, 2024 · 12 comments

leedchou commented Feb 5, 2024

Description

I installed lightgbm 4.3.0.0, CUDA version. After the data is loaded and transferred to the GPU, execution just stops with a segmentation fault. The log is below.
GPU memory is about 12 GB, while the data is about 6 GB.

[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Total Bins 2438
[LightGBM] [Info] Number of data points in the train set: 35322835, number of used features: 48
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
Segmentation fault (core dumped)

Reproducible example

# focal_loss_obj, gmean_score, lgb_train and lgb_eval are defined elsewhere (the objective is posted in a later comment)
import lightgbm as lgb

params = {
    'task': 'train',
    'objective': focal_loss_obj,
    'max_bin': 63,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda'
}

eval_result = {}
gbm = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
    feval=gmean_score
)

Environment info

LightGBM version or commit hash: 4.3.0.0
Command(s) you used to install LightGBM

pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm

Additional Comments

@jameslamb
Collaborator

Thanks for using LightGBM, and for the well-formatted report.

We'd be happy to help, but there are some things you can do to narrow down the issue further and reduce the effort that'll be required to find the root cause.

  • Can you please provide the code for focal_loss_obj() and gmean_score()?
  • If you use a LightGBM built-in loss function and metric, does LightGBM still segfault? If not, then the issue might be somewhere in your implementations of those functions (see the sketch just after this list).
  • Alternatively... if you can't share the dataset you're using, can you try with the exact same parameters, loss function, metrics, etc. but a public dataset, like those available from scikit-learn via sklearn.datasets? And report what happens?
  • Can you try removing parameters from params one-by-one and try to reduce it to the smallest set of non-default values that still produces the problem? For example, if you remove bagging_fraction and feature_fraction and still see a segfault, that's very helpful because it tells us the issue is not related to subsampling of rows and columns inside LightGBM.
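A minimal sketch of that built-in check (assuming the same lgb_train / lgb_eval Datasets from your report; the parameter values here are only illustrative):

import lightgbm as lgb

# swap the custom focal loss / g-mean metric for built-in equivalents
params_builtin = {
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 4,
    'max_bin': 63,
    'num_leaves': 255,
    'learning_rate': 0.05,
    'device': 'cuda',
}

gbm = lgb.train(
    params_builtin,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

If this runs to completion, the custom objective/metric path is the main suspect; if it still segfaults, the problem is more likely in the CUDA training code itself.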

@jameslamb
Collaborator

Also note that I've reformatted your original post slightly to make the difference between code, your own words, and text printed by code clearer. You can click ... -> Edit in GitHub to see what that looks like in raw markdown form.

If you're unsure how I did that, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.

@leedchou
Author

Thank you @jameslamb for the kind reply. I took New Year's leave and got back to work today. There is one important thing I forgot to post here: when I reduced the number of data points in the train set to a smaller one (e.g. 100,000), it worked. So maybe it is a data problem?
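For reference, the size check looks roughly like this (a sketch, not my exact code; X_train / y_train stand in for my actual feature matrix and labels):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
idx = rng.choice(len(X_train), size=100_000, replace=False)  # small random subset
small_train = lgb.Dataset(X_train[idx], label=y_train[idx])
lgb.train(params, small_train, num_boost_round=10)  # same params as above; no segfault at this size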

@jameslamb
Collaborator

Please provide the details I asked for at #6300 (comment) to help us eliminate possible causes.

@leedchou
Author

> Please provide the details I asked for at #6300 (comment) to help us eliminate possible causes.

Hi @jameslamb, I re-ran my code with nothing changed but the training data, which I replaced with the iris data from sklearn.datasets.load_iris. Surprisingly, it worked.
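A sketch of that substitution (not my exact code; iris has only 3 classes, so num_class and the per-class alpha weights have to be reduced to 3 as well):

from sklearn.datasets import load_iris
import lightgbm as lgb

X, y = load_iris(return_X_y=True)
iris_train = lgb.Dataset(X, label=y)
iris_eval = lgb.Dataset(X, label=y, reference=iris_train)

iris_params = {**params, 'num_class': 3}  # the focal loss / g-mean callables also need num_class=3
gbm = lgb.train(iris_params, train_set=iris_train,
                valid_sets=(iris_train, iris_eval), valid_names=('fit', 'eval'),
                num_boost_round=100)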

@shiyu1994
Collaborator

Seems related to the CUDA version of LightGBM. I will investigate this.

@shiyu1994 shiyu1994 changed the title [python-package] Segmentation fault (core dumped) [python-package] Segmentation fault with CUDA version in Python interface (core dumped) Feb 20, 2024
@shiyu1994
Collaborator

@leedchou Could you provide the implementation of focal_loss_obj?

@shiyu1994
Collaborator

In addition, if you could provide a minimal example for reproducing the error, that would be very helpful.

@leedchou
Author

Thank you @shiyu1994, I'd love to show you the implementation of focal_loss_obj and an example. It would be great if I could get your email address so I can send you an example by email.

@shiyu1994
Collaborator

@leedchou Thanks. You may send that to my personal email shiyu_k1994@qq.com. It would also be great if you could post the example here for clear and open discussion.

@jameslamb
Collaborator

> It would also be great if you could post the example here for clear and open discussion.

Please do this, @leedchou, so that everyone finding this discussion from search in the future can learn from it and so that others can contribute to helping.

@leedchou
Author

> It would also be great if you could post the example here for clear and open discussion.
>
> Please do this, @leedchou, so that everyone finding this discussion from search in the future can learn from it and so that others can contribute to helping.

OK, I'll post it here, @shiyu1994.

focal_loss_obj:

import numpy as np
from scipy import special

def focal_loss_lgb(y_pred, dtrain, alpha, gamma=2, num_class=4):
    # Multiclass focal-loss objective for lgb.train (per-class gradients and hessians).
    target = dtrain.get_label()
    grad = np.zeros((len(target), num_class), dtype=float)
    hess = np.zeros((len(target), num_class), dtype=float)

    y_true = np.eye(num_class)[target.astype('int')]  # one-hot labels
    y_pred = y_pred.reshape(len(target), num_class, order='F')
    softmax_p = special.softmax(y_pred, axis=-1)

    for c in range(num_class):
        pc = softmax_p[:, c]         # predicted probability of class c
        pt = softmax_p[y_true == 1]  # predicted probability of each sample's true class
        pos = y_true[:, c] == 1      # samples whose true class is c
        neg = y_true[:, c] == 0
        grad[pos, c] = (gamma * np.power(1 - pt[pos], gamma - 1) * pt[pos] * np.log(pt[pos]) - np.power(1 - pt[pos], gamma)) * (1 - pc[pos])
        grad[neg, c] = (gamma * np.power(1 - pt[neg], gamma - 1) * pt[neg] * np.log(pt[neg]) - np.power(1 - pt[neg], gamma)) * (0 - pc[neg])
        hess[pos, c] = (-4 * (1 - pt[pos]) * pt[pos] * np.log(pt[pos]) + np.power(1 - pt[pos], 2) * (2 * np.log(pt[pos]) + 5)) * pt[pos] * (1 - pt[pos])
        hess[neg, c] = pt[neg] * np.power(pc[neg], 2) * (-2 * pt[neg] * np.log(pt[neg]) + 2 * (1 - pt[neg]) * np.log(pt[neg]) + 4 * (1 - pt[neg])) - pc[neg] * (1 - pc[neg]) * (1 - pt[neg]) * (2 * pt[neg] * np.log(pt[neg]) - (1 - pt[neg]))

    # per-sample class weights
    alpha = np.array([alpha[i] for i in target.astype('int')])[:, np.newaxis]
    grad = alpha * grad
    hess = alpha * hess

    # flattened to num_data * num_class values, grouped by class ('F' order)
    return grad.flatten('F'), hess.flatten('F')

train example:

class_weights = [1, 1, 1, 1]
focal_loss_obj = lambda x, y: focal_loss_lgb(x, y, alpha=class_weights, gamma=2, num_class=4)
gmean_score = lambda x, y: gmean_metric(x, y, num_class=4)

params = {
    'objective': focal_loss_obj,
    'task': 'train',
    'max_bin': 255,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda',
    # 'gpu_platform_id': 0,
    # 'gpu_device_id': 0,
}
eval_result = {}
gbm = lgb.train(params,
                train_set=lgb_train,
                valid_sets=(lgb_train, lgb_eval),
                valid_names=('fit', 'eval'),
                num_boost_round=10000,
                callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
                feval=gmean_score)
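One general sanity check for a custom multiclass objective (not something confirmed as the cause here): LightGBM expects the returned gradient and hessian to each contain num_data * num_class values, so a wrapper like the hypothetical one below can rule out a length mismatch before the CUDA code path is even involved.

def checked_focal_loss(y_pred, dtrain, num_class=4):
    # hypothetical wrapper around focal_loss_obj that verifies output lengths
    grad, hess = focal_loss_obj(y_pred, dtrain)
    expected = len(dtrain.get_label()) * num_class
    assert grad.shape == (expected,), f"grad has {grad.shape[0]} values, expected {expected}"
    assert hess.shape == (expected,), f"hess has {hess.shape[0]} values, expected {expected}"
    return grad, hess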
