
Feature fraction gives different column selections when using a custom objective #6053

Open
jrvmalik opened this issue Aug 20, 2023 · 1 comment

jrvmalik commented Aug 20, 2023

Description

I ran into this because column subsampling behaves differently with a built-in objective like MSE than with a custom objective. My guess was that either the seed reported in booster.txt is wrong, or that columns are subsampled twice per iteration when using a custom objective but only once with a built-in objective (perhaps via an extra call to 'ResetByTree').

Reproducible example

import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(1)
X = np.random.randn(300, 300)
y = np.random.randn(300)

data = lgb.Dataset(X, label=y, init_score=np.zeros_like(y))
params = {'max_depth': 1, 'learning_rate': 0.001, 'verbose': -1, 'objective': 'l2', 'feature_fraction': 1.0, 'seed': 0}

model = lgb.train(params, data, num_boost_round=1000)
model_dict = model.dump_model(num_iteration=3)['tree_info']

# custom L2 objective: gradient = predictions - labels, hessian = 1
def fobj(predictions, dataset):
    return predictions - dataset.get_label(), np.ones_like(predictions)

model_custom = lgb.train(params, data, num_boost_round=1000, fobj=fobj)
model_custom_dict = model_custom.dump_model(num_iteration=3)['tree_info']

print(model_custom_dict == model_dict)

This comparison only holds when feature_fraction is 1.0; it breaks when feature_fraction = 0.5, for instance.
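Purely for illustration (this helper is not part of the original report), listing the root split_feature of each stump makes the divergence visible directly; with max_depth=1 every tree contains a single split, so the column chosen at each iteration can be read off like this:

# Hypothetical inspection helper (not in the original report): with
# max_depth=1 each tree is a single stump, so the root node's
# 'split_feature' is the column chosen for that boosting iteration.
def split_features(tree_info):
    return [tree['tree_structure'].get('split_feature') for tree in tree_info]

print(split_features(model_dict))         # built-in 'l2' objective
print(split_features(model_custom_dict))  # custom objective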

I can hack around the seeds to make it work:

import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(1)
X = np.random.randn(300, 300)
y = np.random.randn(300)

data = lgb.Dataset(X, label=y, init_score=np.zeros_like(y))
params = {'max_depth': 1, 'learning_rate': 0.001, 'verbose': -1, 'objective': 'l2', 'feature_fraction': 0.5, 'seed': 0, 'feature_fraction_seed': 974891790}

model = lgb.train(params, data, num_boost_round=1000)
model_dict = model.dump_model(num_iteration=3)['tree_info']

def fobj(predictions, dataset):
    return predictions - dataset.get_label(), np.ones_like(predictions)

params['feature_fraction_seed'] = 2
model_custom = lgb.train(params, data, num_boost_round=1000, fobj=fobj)
model_custom_dict = model_custom.dump_model(num_iteration=3)['tree_info']

print(model_custom_dict == model_dict)
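Presumably those two feature_fraction_seed values were found by experimentation. Purely as a sketch (not part of the report), a helper like the one below could verify that a given pair of seeds makes the two runs grow identical first trees; it reuses data, params, and fobj from above and keeps the pre-4.0 fobj= keyword used in this example:

# Hypothetical verification helper (illustration only): check whether a pair
# of feature_fraction_seed values makes the built-in-objective run and the
# custom-objective run grow identical first trees.
def seeds_give_identical_trees(builtin_seed, custom_seed, rounds=3):
    builtin_params = {**params, 'feature_fraction_seed': builtin_seed}
    custom_params = {**params, 'feature_fraction_seed': custom_seed}
    builtin_model = lgb.train(builtin_params, data, num_boost_round=rounds)
    custom_model = lgb.train(custom_params, data, num_boost_round=rounds, fobj=fobj)
    return (builtin_model.dump_model(num_iteration=rounds)['tree_info']
            == custom_model.dump_model(num_iteration=rounds)['tree_info'])

print(seeds_give_identical_trees(974891790, 2))  # expected: True per this report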

Environment info

LightGBM version or commit hash: 8d01d648942a427f6bb4962dc3f4330e005fa495

Command(s) you used to install LightGBM

pip install lightgbm
jameslamb (Collaborator) commented

Thanks for using LightGBM and for your thorough report.

What is commit 8d01d648942a427f6bb4962dc3f4330e005fa495? I don't see that commit (8d01d64) in LightGBM's history.

I strongly suspect it is from a version of LightGBM prior to v4.0.0. The parameter fobj was removed from lgb.train() in #5052, so your example code yields the following error on the latest master (8203306).

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: train() got an unexpected keyword argument 'fobj'


I ran the code from your report today on that latest commit, with the following modifications:

  • passing fobj through params (the new encouraged pattern; see "[python-package] Where has the 'fobj' parameter gone in lightgbm.cv() in v4.0.0?" #6072 for an example)
  • passing "deterministic": True and "force_row_wise": True, to eliminate more possible sources of randomness
  • setting "verbose": 1
    • we are debugging here, so getting logs might be helpful
  • reducing num_boost_round to 3
    • it's not necessary to train for 1000 rounds if the issue is detectable after 3
Code:
import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(1)
X = np.random.randn(300, 300)
y = np.random.randn(300)

data = lgb.Dataset(
    X,
    label=y,
    init_score=np.zeros_like(y),
)
params = {
    'deterministic': True,
    'force_row_wise': True,
    'max_depth': 1,
    'learning_rate': 0.001,
    'verbose': 1,
    'objective': 'l2',
    'feature_fraction': 0.5,
    'seed': 123
}

model = lgb.train(params, data, num_boost_round=3)
model_dict = model.dump_model()


def fobj(predictions, dataset):
    return predictions - dataset.get_label(), np.ones_like(predictions)


model_custom = lgb.train(
    params={**params, "objective": fobj},
    train_set=data,
    num_boost_round=3
)
model_custom_dict = model_custom.dump_model()

assert model_custom_dict == model_dict

I saw that the first splits were identical

assert model_custom_dict["tree_info"][0] == model_dict["tree_info"][0]

But then they diverged, starting with the second split.

assert model_custom_dict["tree_info"][1] == model_dict["tree_info"][1]

I'm not sure exactly what's happening, but that narrows it down a bit.
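To narrow it down further, a small follow-up along these lines (a sketch, not part of the original comment) could report the first tree at which the two dumps diverge and the feature each model split on there:

# Hypothetical follow-up sketch: find the first tree where the two model
# dumps diverge and print the split feature chosen by each model.
builtin_trees = model_dict["tree_info"]
custom_trees = model_custom_dict["tree_info"]

for i, (builtin_tree, custom_tree) in enumerate(zip(builtin_trees, custom_trees)):
    if builtin_tree != custom_tree:
        print(f"first divergence at tree {i}")
        print("built-in objective split on feature:",
              builtin_tree["tree_structure"].get("split_feature"))
        print("custom objective split on feature:",
              custom_tree["tree_structure"].get("split_feature"))
        break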


I can hack around the seeds to make it work:

I did not understand this comment and example. Do you just mean that you experimented with different seed values until you found a combination of 2 different ones that led to identical models?

jameslamb added the bug label Sep 7, 2023