
[dask] support init_score #3807

Closed

jameslamb opened this issue Jan 21, 2021 · 9 comments

Comments

@jameslamb (Collaborator)
Summary

LightGBM allows you to provide initial scores to boost from (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html?highlight=init_score#lightgbm.Dataset.set_init_score).

This option should be supported in the Dask interface. The Dask model classes should accept a Dask Array, Dask DataFrame, or Dask Series as input for init_score.

Motivation

This change would bring the Dask interface closer to full feature parity with the non-Dask interface, so that users who'd otherwise like to use Dask don't have to avoid it because init_score is missing.

References

Created from this conversation: #3708 (comment)

@jameslamb (Collaborator, Author)

Closing this in favor of tracking it in #2302 with other feature requests. Anyone is welcome to contribute this feature. Leave a comment below and it can be re-opened.

@jmoralez (Collaborator)

I see this is a method of lgb.Dataset; should I attempt to make a Dask equivalent? I guess this will have to be done eventually to avoid having lots of pieces scattered all over the place.

@jameslamb (Collaborator, Author)

Since you're working on this, I'll open the issue back up. Thanks for looking into it!

> I see this is a method of lgb.Dataset, should I attempt to make a Dask equivalent

I'm not sure what you mean by "method of Dataset", sorry. For a training dataset of shape [n, k], init_score is an array of shape [n] with initial scores for each sample.

init_score : array-like of shape = [n_samples] or None, optional (default=None)

Since the Dask interface only supports LightGBM's scikit-learn interface today, you don't need to do anything with the Dataset object to add this feature. I think it'll work to handle init_score exactly the way we handle sample_weight:

if sample_weight is not None:
    weight_parts = _split_to_parts(data=sample_weight, is_matrix=False)
    for i in range(len(parts)):
        parts[i]['weight'] = weight_parts[i]

  • it should be passed to .fit() as a Dask Array, Dask DataFrame, or Dask Series
  • it should have the same partitioning as X and y
  • it should be passed around in parts just like we currently do with sample_weight
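The proposed handling might be sketched like this (a simplified stand-in: plain lists play the role of Dask partitions, and `_split_to_parts` and the `parts` structure are illustrative approximations of the real helpers, not the actual implementation):

```python
def _split_to_parts(data, is_matrix=False):
    # Stand-in for the real helper quoted above: here the "collection"
    # is already a plain list of partitions.
    return list(data)

def _attach_init_score(parts, init_score):
    # Mirror the sample_weight handling: attach one init_score chunk
    # to each (data, label) part, keyed under 'init_score'.
    if init_score is not None:
        init_score_parts = _split_to_parts(data=init_score, is_matrix=False)
        for i in range(len(parts)):
            parts[i]['init_score'] = init_score_parts[i]
    return parts

# Two partitions of a toy dataset, partitioned the same way as init_score.
parts = [{'data': [[1.0], [2.0]], 'label': [0, 1]},
         {'data': [[3.0], [4.0]], 'label': [1, 0]}]
parts = _attach_init_score(parts, init_score=[[0.1, 0.2], [0.3, 0.4]])
```

The key point is that init_score must use the same partitioning as X and y, so chunk i of init_score always travels with chunk i of the data.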

@jameslamb jameslamb reopened this Feb 12, 2021
@jmoralez (Collaborator)

Oh I meant the link in the summary points to lgb.Dataset.set_init_score. Ok, will try your proposed approach.

@jmoralez (Collaborator)

I believe init_score is deprecated in the scikit-learn interface. I'm getting:

/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/lightgbm/basic.py:1224: UserWarning: init_score keyword has been found in `params` and will be ignored.
Please use init_score argument of the Dataset constructor to pass this parameter.

@jameslamb (Collaborator, Author)

It is not deprecated. init_score should be passed to fit(), not the estimator's constructor: https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L543

If you get that same warning even when doing that, please ignore it.

@jameslamb (Collaborator, Author)

Fixed by #3950.

@jmoralez (Collaborator)

jmoralez commented Mar 5, 2021

Hi @jameslamb. I'm including the multiclass-classification task, and the test for init_score fails because you have to specify an init_score for each class. However, you can't pass an (n_samples, n_classes) array as suggested here, because init_score is expected to be a 1-D collection; you have to reshape it to (n_samples * n_classes,), which will be hard in the Dask case because of the partitioning.

So I see two ways of fixing this:

  1. Handle the reshaping inside the _train_part function and pass an (n_samples, n_classes) collection as init_score to the fit method of the Dask models (which creates an inconsistency between the interfaces).
  2. Allow init_score to be of shape (n_samples, n_classes) everywhere in LightGBM, which currently fails here. This seems more intuitive but is also very invasive, since it involves modifying the basic module.

What do you think?
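To make the reshaping concrete, here is a small NumPy sketch of the flattening described above (the class ordering shown is illustrative, not confirmed against LightGBM's internals):

```python
import numpy as np

n_samples, n_classes = 6, 3
scores_2d = np.arange(n_samples * n_classes, dtype=float).reshape(n_samples, n_classes)

# Flatten to the 1-D (n_samples * n_classes,) collection that the basic
# module expects. Whether LightGBM wants row-major or column-major order
# is not confirmed here; column-major ('F') is shown as one possibility,
# grouping all samples' scores for class 0 first, then class 1, and so on.
scores_1d = scores_2d.ravel(order='F')
```

Doing this reshape per Dask partition is the hard part: each chunk would need to be flattened and re-concatenated in a way that preserves the global ordering.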

@jameslamb (Collaborator, Author)

> I'm including the multiclass-classification task and the test for init_score fails because you have to specify an init_score for each class. [...] So I see two ways of fixing this: [...] What do you think?

Please open a new issue describing the problem, with a reproducible example of what happens when you pass an init_score of shape (n_samples, n_classes), including the specific error message.
