
[Question] Large difference between builtin softmax and custom softmax objective #6219

Closed
qianyun210603 opened this issue Nov 29, 2023 · 8 comments

@qianyun210603

Description

I'm using the scikit-learn interface to solve a 3-class classification problem.
To verify its accuracy before customising it further for my own needs, I benchmarked the custom softmax objective function sklearn_multiclass_custom_objective (copied from tests/python_package_test/utils.py) against the built-in multiclass objective.
However, I see a large difference in the predicted results on the original training data. I want to figure out whether this is expected, and whether it is possible to align the predictions of the two.

Reproducible example

import pickle

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier, log_evaluation

# copied from tests/python_package_test/utils.py, softmax
def softmax1(x):
    row_wise_max = np.max(x, axis=1).reshape(-1, 1)
    exp_x = np.exp(x - row_wise_max)
    return exp_x / np.sum(exp_x, axis=1).reshape(-1, 1)

# copied from tests/python_package_test/utils.py, custom loss
def sklearn_multiclass_custom_objective(y_true, y_pred, weight=None):
    num_rows, num_class = y_pred.shape
    prob = softmax1(y_pred)
    grad_update = np.zeros_like(prob)
    grad_update[np.arange(num_rows), y_true.astype(np.int32)] = -1.0
    grad = prob + grad_update
    factor = num_class / (num_class - 1)
    hess = factor * prob * (1 - prob)
    if weight is not None:
        weight2d = weight.reshape(-1, 1)
        grad *= weight2d
        hess *= weight2d
    return grad, hess

if __name__ == "__main__":
    with open("test_data.bin", "rb") as f:
        df_x, df_y = pickle.load(f)


    X_train = df_x.loc['2018-01-01':'2018-12-28'].values
    y_train = df_y.loc[pd.IndexSlice['2018-01-01':'2018-12-28', :], "LABEL1"].values

    # params = {"lambda_l1": 208.6999, "lambda_l2": 508.9768, "learning_rate": 0.01, "num_leaves": 15, "num_threads": 20}
    params = {"n_estimators": 100, "learning_rate": 0.01, "num_leaves": 15}
    lgbt = LGBMClassifier(objective="multiclass", num_class=3, **params)

    lgbt.fit(X_train, y_train, callbacks=[log_evaluation(1)])
    res = lgbt.predict_proba(X_train)
    lgbt_custom = LGBMClassifier(objective=sklearn_multiclass_custom_objective, num_class=3, **params)

    lgbt_custom.fit(X_train, y_train, callbacks=[log_evaluation(1)])
    res_custom = softmax1(lgbt_custom.predict(X_train, raw_score=True))


    print("Built-in softmax prediction")
    print(res)

    print("Custom softmax prediction")
    print(res_custom)

The result is:

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004852 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36083
[LightGBM] [Info] Number of data points in the train set: 16732, number of used features: 167
[LightGBM] [Info] Start training from score -3.058122
[LightGBM] [Info] Start training from score -0.846581
[LightGBM] [Info] Start training from score -0.645986
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003220 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36083
[LightGBM] [Info] Number of data points in the train set: 16732, number of used features: 167
[LightGBM] [Info] Using self-defined objective function
Built-in softmax prediction
[[0.0306033  0.42807938 0.54131732]
 [0.02928444 0.53404836 0.4366672 ]
 [0.04774353 0.31160226 0.64065421]
 ...
 [0.02961229 0.50432063 0.46606708]
 [0.06464424 0.4681716  0.46718416]
 [0.02936818 0.46290391 0.50772791]]
Custom softmax prediction
[[0.14959751 0.37961963 0.47078286]
 [0.14552036 0.47568675 0.37879288]
 [0.1490515  0.285732   0.5652165 ]
 ...
 [0.13936158 0.46098811 0.39965031]
 [0.17287566 0.4127472  0.41437715]
 [0.14778185 0.41248568 0.43973247]]

Process finished with exit code 0

We can see that:

  1. the custom objective doesn't log the "Start training from score" lines the way the built-in one does;
  2. most importantly, there is a big difference between the results, e.g. the prediction for the first record is [0.0306033 0.42807938 0.54131732] vs. [0.14959751 0.37961963 0.47078286].

The dataset I used (test_data.bin) is shared in MS OneDrive (Link: https://1drv.ms/u/s!AnPL7Q5hAP8rlBAIOMZpK_Q5z3EL?e=T4cAev)

Environment info

LightGBM version or commit hash: 4.1.0

Command(s) you used to install LightGBM:

pip install -U lightgbm

lightgbm.__version__
Out[4]: '4.1.0'

Additional Comments

Could you kindly help me understand/investigate this discrepancy?

@jmoralez
Collaborator

Hey @qianyun210603, thanks for using LightGBM. The difference is due to the different init scores, you can find a detailed explanation in #5114 (comment).
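
As a rough sketch (reusing the names from the reproducible example above, and assuming the labels are encoded as 0, 1, 2), passing the label log-priors as init scores and adding them back at prediction time could look like this:

import numpy as np

# per-class log-priors, as described in the linked comment
_, counts = np.unique(y_train, return_counts=True)
init_score_row = np.log(counts / counts.sum())

# one row of init scores per training sample
init_scores = np.tile(init_score_row, (len(y_train), 1))

lgbt_custom = LGBMClassifier(objective=sklearn_multiclass_custom_objective, num_class=3, **params)
lgbt_custom.fit(X_train, y_train, init_score=init_scores)

# with a custom objective, raw scores do not include the init scores,
# so add them back before applying the softmax
res_custom = softmax1(lgbt_custom.predict(X_train, raw_score=True) + init_score_row)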

Please let us know if you have further doubts.

@qianyun210603
Author

@jmoralez Thanks a lot! Could you kindly point me to where I can find the logic for setting init scores in the built-in multiclass objective? Is it also the mean of the label, in this case 1/n_class? Thanks again.

@jmoralez
Collaborator

Sure. The averages of the labels are computed here:

void Init(const Metadata& metadata, data_size_t num_data) override {
  num_data_ = num_data;
  label_ = metadata.label();
  weights_ = metadata.weights();
  label_int_.resize(num_data_);
  class_init_probs_.resize(num_class_, 0.0);
  double sum_weight = 0.0;
  for (int i = 0; i < num_data_; ++i) {
    label_int_[i] = static_cast<int>(label_[i]);
    if (label_int_[i] < 0 || label_int_[i] >= num_class_) {
      Log::Fatal("Label must be in [0, %d), but found %d in label", num_class_, label_int_[i]);
    }
    if (weights_ == nullptr) {
      class_init_probs_[label_int_[i]] += 1.0;
    } else {
      class_init_probs_[label_int_[i]] += weights_[i];
      sum_weight += weights_[i];
    }
  }
  if (weights_ == nullptr) {
    sum_weight = num_data_;
  }
  if (Network::num_machines() > 1) {
    sum_weight = Network::GlobalSyncUpBySum(sum_weight);
    for (int i = 0; i < num_class_; ++i) {
      class_init_probs_[i] = Network::GlobalSyncUpBySum(class_init_probs_[i]);
    }
  }
  for (int i = 0; i < num_class_; ++i) {
    class_init_probs_[i] /= sum_weight;
  }
}

And then the init scores are the logs of the averages (since those are the raw scores):

double BoostFromScore(int class_id) const override {
  return std::log(std::max<double>(kEpsilon, class_init_probs_[class_id]));
}
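
For reference, a minimal NumPy sketch of the same computation (the helper name is mine, not part of LightGBM, and the epsilon constant stands in for LightGBM's kEpsilon):

import numpy as np

def multiclass_init_scores(y, num_class, weights=None):
    # (weighted) per-class frequencies, mirroring Init() above
    y = y.astype(np.int32)
    if weights is None:
        weights = np.ones(y.shape[0], dtype=np.float64)
    class_probs = np.bincount(y, weights=weights, minlength=num_class)
    class_probs /= weights.sum()
    # log of the averages, mirroring BoostFromScore() above
    eps = 1e-15  # assumed value, standing in for kEpsilon
    return np.log(np.maximum(class_probs, eps))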

@qianyun210603
Author

Many thanks!

@qianyun210603
Author

Hi @jmoralez, sorry to trouble you again.

Thanks to your previous instructions, I finally got the customised loss function aligned with the built-in one. However, another question arose, which I hope you can help with:
I noticed that I need to add the init_score back to the raw prediction from lgbt_custom.predict(X_train, raw_score=True) to get the correct probabilities, i.e., do softmax1(lgbt_custom.predict(X_train, raw_score=True) + init_score). I'm wondering: if I use predict without raw_score=True to predict the category directly, will I get the correct result? Is init_score taken into account in that case?
To my understanding, the final predicted category is the one with the maximum probability, i.e., argmax(exp(raw_score)). So if the raw score doesn't incorporate the init_score, that should also affect the final classification. Does this mean that when I use a custom loss + init_score, I'll need to retrieve the raw score using .predict(..., raw_score=True), then add back the init_score and take argmax(exp(.)) manually outside LightGBM?

Also, if my guess is correct, why doesn't LightGBM add back the init_score inside the predict function? It doesn't look like a difficult thing to do, or am I missing something?

Please kindly help me understand the questions above. Thanks a lot!

@qianyun210603 qianyun210603 reopened this Dec 1, 2023
@jmoralez
Collaborator

jmoralez commented Dec 1, 2023

Since you're using the scikit-learn API, if you use a custom objective the output of predict and predict(raw_score=True) is the same:

if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
    return result
else:
    class_index = np.argmax(result, axis=1)
    return self._le.inverse_transform(class_index)

Does this mean that when I use custom loss + init_score, I'll need to retrieve the raw_score using .predict(..., raw_score=True), then add back init_score and take the argmax(exp(.)) manually outside lightGBM?

Exactly

Also if my guess is really the case, what's the reason that lightGBM doesn't add back the init_score inside predict function?

I think it's because with the built-in objectives the boosting starts from a single number for each class, so that's the value at the root of the first tree, whereas when you provide a custom objective it starts from zero and then you can provide an init score for each sample. At inference time we may have a different number of samples, so we wouldn't know which value to add to each sample.
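
Concretely, the manual step could look like this minimal sketch (reusing softmax1 and the per-class init scores, init_score_row, from the snippets above; since softmax is monotone within each row, the argmax could equally be taken over raw + init_score_row directly):

import numpy as np

# with a custom objective, raw scores do NOT include the init scores
raw = lgbt_custom.predict(X_train, raw_score=True)

# add the per-class init scores back, then normalise with softmax
prob = softmax1(raw + init_score_row)

# the predicted class is the argmax of the probabilities
pred_class = np.argmax(prob, axis=1)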

@qianyun210603
Author

Thanks a lot! That's very clear.

@jameslamb
Collaborator

jameslamb commented May 2, 2024

Found myself writing up a Python implementation of calculating the multiclass init score (based on the code @jmoralez shared in #6219 (comment)), thought it would be useful to post the snippet here for others finding this from search.

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_blobs

# generate multiclass dataset with 5 classes
X, y = make_blobs(n_samples=1_000, centers=5, random_state=773)

# fit a small multiclass classification model
clf = lgb.LGBMClassifier(n_estimators=3, num_leaves=4, seed=708)
clf.fit(X, y)

# for the built-in multiclass objective, LightGBM begins
# boosting from the log of each class's frequency in the labels
_, counts = np.unique(y, return_counts=True)
init_score = np.log(counts / y.shape[0])
print(init_score)

That init_score value matches the one LightGBM prints at the beginning of boosting (make_blobs produces balanced classes, so each of the 5 classes has frequency 0.2, and log(0.2) ≈ -1.609438):

[LightGBM] [Info] Start training from score -1.609438
