
[Question] Large difference between builtin softmax and custom softmax objective #6219

Closed
qianyun210603 opened this issue Nov 29, 2023 · 8 comments

@qianyun210603

Description

I'm using the scikit-learn interface to solve a 3-class classification problem.
To verify its accuracy before customising it further for my own needs, I benchmarked the custom softmax objective function sklearn_multiclass_custom_objective (copied from tests/python_package_test/utils.py) against the built-in multiclass objective.
However, I see a large difference in the predicted results on the original training data. I want to figure out whether this is expected, and whether it is possible to align the predictions of the two.

Reproducible example

import pickle

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier, log_evaluation

# copied from tests/python_package_test/utils.py, softmax
def softmax1(x):
    row_wise_max = np.max(x, axis=1).reshape(-1, 1)
    exp_x = np.exp(x - row_wise_max)
    return exp_x / np.sum(exp_x, axis=1).reshape(-1, 1)

# copied from tests/python_package_test/utils.py, custom loss
def sklearn_multiclass_custom_objective(y_true, y_pred, weight=None):
    num_rows, num_class = y_pred.shape
    prob = softmax1(y_pred)
    grad_update = np.zeros_like(prob)
    grad_update[np.arange(num_rows), y_true.astype(np.int32)] = -1.0
    grad = prob + grad_update
    factor = num_class / (num_class - 1)
    hess = factor * prob * (1 - prob)
    if weight is not None:
        weight2d = weight.reshape(-1, 1)
        grad *= weight2d
        hess *= weight2d
    return grad, hess

if __name__ == "__main__":
    with open("test_data.bin", "rb") as f:
        df_x, df_y = pickle.load(f)


    X_train = df_x.loc['2018-01-01':'2018-12-28'].values
    y_train = df_y.loc[pd.IndexSlice['2018-01-01':'2018-12-28', :], "LABEL1"].values

    # params = {"lambda_l1": 208.6999, "lambda_l2": 508.9768, "learning_rate": 0.01, "num_leaves": 15, "num_threads": 20}
    params = {"n_estimators": 100, "learning_rate": 0.01, "num_leaves": 15}
    lgbt = LGBMClassifier(objective="multiclass", num_class=3, **params)

    lgbt.fit(X_train, y_train, callbacks=[log_evaluation(1)])
    res = lgbt.predict_proba(X_train)
    lgbt_custom = LGBMClassifier(objective=sklearn_multiclass_custom_objective, num_class=3, **params)

    lgbt_custom.fit(X_train, y_train, callbacks=[log_evaluation(1)])
    res_custom = softmax1(lgbt_custom.predict(X_train, raw_score=True))


    print("Built-in softmax prediction")
    print(res)

    print("Custom softmax prediction")
    print(res_custom)

The result is:

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004852 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36083
[LightGBM] [Info] Number of data points in the train set: 16732, number of used features: 167
[LightGBM] [Info] Start training from score -3.058122
[LightGBM] [Info] Start training from score -0.846581
[LightGBM] [Info] Start training from score -0.645986
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003220 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36083
[LightGBM] [Info] Number of data points in the train set: 16732, number of used features: 167
[LightGBM] [Info] Using self-defined objective function
Built-in softmax prediction
[[0.0306033  0.42807938 0.54131732]
 [0.02928444 0.53404836 0.4366672 ]
 [0.04774353 0.31160226 0.64065421]
 ...
 [0.02961229 0.50432063 0.46606708]
 [0.06464424 0.4681716  0.46718416]
 [0.02936818 0.46290391 0.50772791]]
Custom softmax prediction
[[0.14959751 0.37961963 0.47078286]
 [0.14552036 0.47568675 0.37879288]
 [0.1490515  0.285732   0.5652165 ]
 ...
 [0.13936158 0.46098811 0.39965031]
 [0.17287566 0.4127472  0.41437715]
 [0.14778185 0.41248568 0.43973247]]

Process finished with exit code 0

We can see that:

  1. the custom objective doesn't log the "Start training from score" lines the way the built-in one does;
  2. most importantly, there is a big difference between the results, e.g. the prediction for the first record is [0.0306033 0.42807938 0.54131732] vs. [0.14959751 0.37961963 0.47078286].

The dataset I used (test_data.bin) is shared in MS OneDrive (Link: https://1drv.ms/u/s!AnPL7Q5hAP8rlBAIOMZpK_Q5z3EL?e=T4cAev)

Environment info

LightGBM version or commit hash: 4.1.0

Command(s) you used to install LightGBM:

pip install -U lightgbm

lightgbm.__version__
Out[4]: '4.1.0'

Additional Comments

Could you kindly help me understand/investigate this discrepancy?

@jmoralez
Collaborator

Hey @qianyun210603, thanks for using LightGBM. The difference is due to the different init scores, you can find a detailed explanation in #5114 (comment).
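
As a rough sketch (reusing the names from the reproducible example above, and assuming the labels are encoded as 0, 1, 2), passing the label log-priors as init scores and adding them back at prediction time could look like this:

import numpy as np

# per-class log-priors, as described in the linked comment
_, counts = np.unique(y_train, return_counts=True)
init_score_row = np.log(counts / counts.sum())

# one row of init scores per training sample
init_scores = np.tile(init_score_row, (len(y_train), 1))

lgbt_custom = LGBMClassifier(objective=sklearn_multiclass_custom_objective, num_class=3, **params)
lgbt_custom.fit(X_train, y_train, init_score=init_scores)

# with a custom objective, raw scores do not include the init scores,
# so add them back before applying the softmax
res_custom = softmax1(lgbt_custom.predict(X_train, raw_score=True) + init_score_row)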

Please let us know if you have further doubts.

@qianyun210603
Author

@jmoralez Thanks a lot! Could you kindly point me to where I can find the logic for setting init scores in the built-in multiclass objective? Is it also the mean of the label, in this case 1/n_class? Thanks again.

@jmoralez
Collaborator

Sure. The averages of the labels are computed here:

void Init(const Metadata& metadata, data_size_t num_data) override {
  num_data_ = num_data;
  label_ = metadata.label();
  weights_ = metadata.weights();
  label_int_.resize(num_data_);
  class_init_probs_.resize(num_class_, 0.0);
  double sum_weight = 0.0;
  for (int i = 0; i < num_data_; ++i) {
    label_int_[i] = static_cast<int>(label_[i]);
    if (label_int_[i] < 0 || label_int_[i] >= num_class_) {
      Log::Fatal("Label must be in [0, %d), but found %d in label", num_class_, label_int_[i]);
    }
    if (weights_ == nullptr) {
      class_init_probs_[label_int_[i]] += 1.0;
    } else {
      class_init_probs_[label_int_[i]] += weights_[i];
      sum_weight += weights_[i];
    }
  }
  if (weights_ == nullptr) {
    sum_weight = num_data_;
  }
  if (Network::num_machines() > 1) {
    sum_weight = Network::GlobalSyncUpBySum(sum_weight);
    for (int i = 0; i < num_class_; ++i) {
      class_init_probs_[i] = Network::GlobalSyncUpBySum(class_init_probs_[i]);
    }
  }
  for (int i = 0; i < num_class_; ++i) {
    class_init_probs_[i] /= sum_weight;
  }
}

And then the init scores are the logs of the averages (since those are the raw scores):

double BoostFromScore(int class_id) const override {
  return std::log(std::max<double>(kEpsilon, class_init_probs_[class_id]));
}
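
For reference, a minimal NumPy sketch of the same computation (the helper name is mine, not part of LightGBM, and the epsilon constant stands in for LightGBM's kEpsilon):

import numpy as np

def multiclass_init_scores(y, num_class, weights=None):
    # (weighted) per-class frequencies, mirroring Init() above
    y = y.astype(np.int32)
    if weights is None:
        weights = np.ones(y.shape[0], dtype=np.float64)
    class_probs = np.bincount(y, weights=weights, minlength=num_class)
    class_probs /= weights.sum()
    # log of the averages, mirroring BoostFromScore() above
    eps = 1e-15  # assumed value, standing in for kEpsilon
    return np.log(np.maximum(class_probs, eps))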

@qianyun210603
Author

Many thanks!

@qianyun210603
Author

Hi @jmoralez, sorry to trouble you again.

Thanks to your previous instructions, I finally got the customised loss function aligned with the built-in one. However, another question arose, which I hope you can help with:
I noticed that I need to add the init_score back to the raw prediction from lgbt_custom.predict(X_train, raw_score=True) to get the correct probabilities, i.e., do softmax1(lgbt_custom.predict(X_train, raw_score=True) + init_score). I'm wondering: if I use predict without raw_score=True to predict the category directly, will I get the correct result? Is init_score taken into account in that case?
To my understanding, the final predicted category is the one with the maximum probability, i.e., argmax(exp(raw_score)). So if the raw score doesn't incorporate the init_score, that should also affect the final classification. Does this mean that when I use a custom loss + init_score, I'll need to retrieve the raw score using .predict(..., raw_score=True), then add back the init_score and take argmax(exp(.)) manually outside LightGBM?

Also, if my guess is correct, why doesn't LightGBM add back the init_score inside the predict function? It doesn't look like a difficult thing to do, or am I missing something?

Please kindly help me understand the questions above. Thanks a lot!

@qianyun210603 qianyun210603 reopened this Dec 1, 2023
@jmoralez
Collaborator

jmoralez commented Dec 1, 2023

Since you're using the scikit-learn API, if you use a custom objective the output of predict and predict(raw_score=True) is the same:

if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
    return result
else:
    class_index = np.argmax(result, axis=1)
    return self._le.inverse_transform(class_index)

Does this mean that when I use custom loss + init_score, I'll need to retrieve the raw_score using .predict(..., raw_score=True), then add back init_score and take the argmax(exp(.)) manually outside lightGBM?

Exactly

Also if my guess is really the case, what's the reason that lightGBM doesn't add back the init_score inside predict function?

I think it's because with the built-in objectives the boosting starts from a single number for each class, so that's the value at the root of the first tree, whereas when you provide a custom objective it starts from zero and then you can provide an init score for each sample. At inference time we may have a different number of samples, so we wouldn't know which value to add to each sample.
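
Concretely, the manual step could look like this minimal sketch (reusing softmax1 and the per-class init scores, init_score_row, from the snippets above; since softmax is monotone within each row, the argmax could equally be taken over raw + init_score_row directly):

import numpy as np

# with a custom objective, raw scores do NOT include the init scores
raw = lgbt_custom.predict(X_train, raw_score=True)

# add the per-class init scores back, then normalise with softmax
prob = softmax1(raw + init_score_row)

# the predicted class is the argmax of the probabilities
pred_class = np.argmax(prob, axis=1)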

@qianyun210603
Author

Thanks a lot! That's very clear.

@jameslamb
Collaborator

jameslamb commented May 2, 2024

Found myself writing up a Python implementation of calculating the multiclass init score (based on the code @jmoralez shared in #6219 (comment)), thought it would be useful to post the snippet here for others finding this from search.

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_blobs

# generate multiclass dataset with 5 classes
X, y = make_blobs(n_samples=1_000, centers=5, random_state=773)

# fit a small multiclass classification model
clf = lgb.LGBMClassifier(n_estimators=3, num_leaves=4, seed=708)
clf.fit(X, y)

# for the built-in multiclass objective, LightGBM begins
# boosting from the log of each class's frequency in the labels
_, counts = np.unique(y, return_counts=True)
init_score = np.log(counts / y.shape[0])
print(init_score)

That init_score value matches the one LightGBM prints at the beginning of boosting (make_blobs produces balanced classes, so each of the 5 classes has frequency 0.2, and log(0.2) ≈ -1.609438):

[LightGBM] [Info] Start training from score -1.609438
