
Feature fraction gives different column selections when using a custom objective #6053

Open
jrvmalik opened this issue Aug 20, 2023 · 1 comment

jrvmalik commented Aug 20, 2023

Description

I ran into this because column subsampling behaves differently with a built-in objective like MSE than with a custom objective. My guess was that either the seed reported in booster.txt is wrong, or that columns are subsampled twice per iteration when using a custom objective but only once with a built-in objective (perhaps via an extra call to 'ResetByTree').

Reproducible example

import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(1)
X = np.random.randn(300, 300)
y = np.random.randn(300)

data = lgb.Dataset(X, label=y, init_score=np.zeros_like(y))
params = {'max_depth': 1, 'learning_rate': 0.001, 'verbose': -1, 'objective': 'l2', 'feature_fraction': 1.0, 'seed': 0}

model = lgb.train(params, data, num_boost_round=1000)
model_dict = model.dump_model(num_iteration=3)['tree_info']

# custom L2 objective: gradient = predictions - labels, hessian = 1
def fobj(predictions, dataset):
    return predictions - dataset.get_label(), np.ones_like(predictions)

model_custom = lgb.train(params, data, num_boost_round=1000, fobj=fobj)
model_custom_dict = model_custom.dump_model(num_iteration=3)['tree_info']

print(model_custom_dict == model_dict)

This comparison only holds when feature_fraction is 1.0; it breaks when feature_fraction = 0.5, for instance.
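Purely for illustration (this helper is not part of the original report), listing the root split_feature of each stump makes the divergence visible directly; with max_depth=1 every tree contains a single split, so the column chosen at each iteration can be read off like this:

# Hypothetical inspection helper (not in the original report): with
# max_depth=1 each tree is a single stump, so the root node's
# 'split_feature' is the column chosen for that boosting iteration.
def split_features(tree_info):
    return [tree['tree_structure'].get('split_feature') for tree in tree_info]

print(split_features(model_dict))         # built-in 'l2' objective
print(split_features(model_custom_dict))  # custom objective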

I can hack around the seeds to make it work:

import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(1)
X = np.random.randn(300, 300)
y = np.random.randn(300)

data = lgb.Dataset(X, label=y, init_score=np.zeros_like(y))
params = {'max_depth': 1, 'learning_rate': 0.001, 'verbose': -1, 'objective': 'l2', 'feature_fraction': 0.5, 'seed': 0, 'feature_fraction_seed': 974891790}

model = lgb.train(params, data, num_boost_round=1000)
model_dict = model.dump_model(num_iteration=3)['tree_info']

def fobj(predictions, dataset):
    return predictions - dataset.get_label(), np.ones_like(predictions)

params['feature_fraction_seed'] = 2
model_custom = lgb.train(params, data, num_boost_round=1000, fobj=fobj)
model_custom_dict = model_custom.dump_model(num_iteration=3)['tree_info']

print(model_custom_dict == model_dict)
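Presumably those two feature_fraction_seed values were found by experimentation. Purely as a sketch (not part of the report), a helper like the one below could verify that a given pair of seeds makes the two runs grow identical first trees; it reuses data, params, and fobj from above and keeps the pre-4.0 fobj= keyword used in this example:

# Hypothetical verification helper (illustration only): check whether a pair
# of feature_fraction_seed values makes the built-in-objective run and the
# custom-objective run grow identical first trees.
def seeds_give_identical_trees(builtin_seed, custom_seed, rounds=3):
    builtin_params = {**params, 'feature_fraction_seed': builtin_seed}
    custom_params = {**params, 'feature_fraction_seed': custom_seed}
    builtin_model = lgb.train(builtin_params, data, num_boost_round=rounds)
    custom_model = lgb.train(custom_params, data, num_boost_round=rounds, fobj=fobj)
    return (builtin_model.dump_model(num_iteration=rounds)['tree_info']
            == custom_model.dump_model(num_iteration=rounds)['tree_info'])

print(seeds_give_identical_trees(974891790, 2))  # expected: True per this report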

Environment info

LightGBM version or commit hash: 8d01d648942a427f6bb4962dc3f4330e005fa495

Command(s) you used to install LightGBM

pip install lightgbm
jameslamb (Collaborator) commented

Thanks for using LightGBM and for your thorough report.

What is commit 8d01d648942a427f6bb4962dc3f4330e005fa495? I don't see that commit (8d01d64) in LightGBM's history.

I strongly suspect it is from a version of LightGBM prior to v4.0.0. The parameter fobj was removed from lgb.train() in #5052, so your example code yields the following error on the latest master (8203306).

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: train() got an unexpected keyword argument 'fobj'


I ran the code from your report today on that latest commit, with the following modifications:

  • passing fobj through params (the new encouraged pattern; see "[python-package] Where has the 'fobj' parameter gone in lightgbm.cv() in v4.0.0?" #6072 for an example)
  • passing "deterministic": True and "force_row_wise": True, to eliminate more possible sources of randomness
  • setting "verbose": 1
    • we are debugging here, so getting logs might be helpful
  • reducing num_boost_round to 3
    • it's not necessary to train for 1000 rounds if the issue is detectable after 3
Code:
import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(1)
X = np.random.randn(300, 300)
y = np.random.randn(300)

data = lgb.Dataset(
    X,
    label=y,
    init_score=np.zeros_like(y),
)
params = {
    'deterministic': True,
    'force_row_wise': True,
    'max_depth': 1,
    'learning_rate': 0.001,
    'verbose': 1,
    'objective': 'l2',
    'feature_fraction': 0.5,
    'seed': 123
}

model = lgb.train(params, data, num_boost_round=3)
model_dict = model.dump_model()


def fobj(predictions, dataset):
    return predictions - dataset.get_label(), np.ones_like(predictions)


model_custom = lgb.train(
    params={**params, "objective": fobj},
    train_set=data,
    num_boost_round=3
)
model_custom_dict = model_custom.dump_model()

assert model_custom_dict == model_dict

I saw that the first splits were identical

assert model_custom_dict["tree_info"][0] == model_dict["tree_info"][0]

But then they diverged, starting with the second split.

assert model_custom_dict["tree_info"][1] == model_dict["tree_info"][1]

I'm not sure exactly what's happening, but that narrows it down a bit.
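To narrow it down further, a small follow-up along these lines (a sketch, not part of the original comment) could report the first tree at which the two dumps diverge and the feature each model split on there:

# Hypothetical follow-up sketch: find the first tree where the two model
# dumps diverge and print the split feature chosen by each model.
builtin_trees = model_dict["tree_info"]
custom_trees = model_custom_dict["tree_info"]

for i, (builtin_tree, custom_tree) in enumerate(zip(builtin_trees, custom_trees)):
    if builtin_tree != custom_tree:
        print(f"first divergence at tree {i}")
        print("built-in objective split on feature:",
              builtin_tree["tree_structure"].get("split_feature"))
        print("custom objective split on feature:",
              custom_tree["tree_structure"].get("split_feature"))
        break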


I can hack around the seeds to make it work:

I did not understand this comment and example. Do you just mean that you experimented with different seed values until you found a combination of 2 different ones that led to identical models?

jameslamb added the bug label Sep 7, 2023