
[dask] support init_score #3807

Closed

jameslamb opened this issue Jan 21, 2021 · 9 comments

Comments

@jameslamb (Collaborator)
Summary

LightGBM allows you to provide initial scores to boost from (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html?highlight=init_score#lightgbm.Dataset.set_init_score).

This option should be supported in the Dask interface. The Dask model classes should accept a Dask Array, Dask DataFrame, or Dask Series as input for init_score.

Motivation

This change would bring the Dask interface closer to full feature parity with the non-Dask interface, so that users who'd otherwise like to use Dask don't have to avoid it because init_score is missing.

References

Created from this conversation: #3708 (comment)

@jameslamb (Collaborator, Author)

Closing this in favor of tracking it in #2302 with other feature requests. Anyone is welcome to contribute this feature. Leave a comment below and it can be re-opened.

@jmoralez (Collaborator)

I see this is a method of lgb.Dataset; should I attempt to make a Dask equivalent? I guess this will have to be done eventually to avoid having lots of pieces scattered all over the place.

@jameslamb (Collaborator, Author)

Since you're working on this, I'll open the issue back up. Thanks for looking into it!

> I see this is a method of lgb.Dataset, should I attempt to make a Dask equivalent

I'm not sure what you mean by "method of Dataset", sorry. For a training dataset of shape [n, k], init_score is an array of shape [n] with initial scores for each sample.

init_score : array-like of shape = [n_samples] or None, optional (default=None)

Since the Dask interface only supports LightGBM's scikit-learn interface today, you don't need to do anything with the Dataset object to add this feature. I think it'll work to handle init_score exactly the way we handle sample_weight:

if sample_weight is not None:
    weight_parts = _split_to_parts(data=sample_weight, is_matrix=False)
    for i in range(len(parts)):
        parts[i]['weight'] = weight_parts[i]

  • it should be passed to .fit() as a Dask Array, Dask DataFrame, or Dask Series
  • it should have the same partitioning as X and y
  • it should be passed around in parts just like we currently do with sample_weight
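The proposed handling might be sketched like this (a simplified stand-in: plain lists play the role of Dask partitions, and `_split_to_parts` and the `parts` structure are illustrative approximations of the real helpers, not the actual implementation):

```python
def _split_to_parts(data, is_matrix=False):
    # Stand-in for the real helper quoted above: here the "collection"
    # is already a plain list of partitions.
    return list(data)

def _attach_init_score(parts, init_score):
    # Mirror the sample_weight handling: attach one init_score chunk
    # to each (data, label) part, keyed under 'init_score'.
    if init_score is not None:
        init_score_parts = _split_to_parts(data=init_score, is_matrix=False)
        for i in range(len(parts)):
            parts[i]['init_score'] = init_score_parts[i]
    return parts

# Two partitions of a toy dataset, partitioned the same way as init_score.
parts = [{'data': [[1.0], [2.0]], 'label': [0, 1]},
         {'data': [[3.0], [4.0]], 'label': [1, 0]}]
parts = _attach_init_score(parts, init_score=[[0.1, 0.2], [0.3, 0.4]])
```

The key point is that init_score must use the same partitioning as X and y, so chunk i of init_score always travels with chunk i of the data.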

@jameslamb jameslamb reopened this Feb 12, 2021
@jmoralez (Collaborator)

Oh I meant the link in the summary points to lgb.Dataset.set_init_score. Ok, will try your proposed approach.

@jmoralez (Collaborator)

I believe init_score is deprecated in the scikit-learn interface. I'm getting:

/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/lightgbm/basic.py:1224: UserWarning: init_score keyword has been found in `params` and will be ignored.
Please use init_score argument of the Dataset constructor to pass this parameter.

@jameslamb (Collaborator, Author)

It is not deprecated. init_score should be passed to fit(), not the estimator's constructor: https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L543

If you get that same warning even when doing that, please ignore it.

@jameslamb (Collaborator, Author)

Fixed by #3950.

@jmoralez (Collaborator)

jmoralez commented Mar 5, 2021

Hi @jameslamb. I'm including the multiclass-classification task, and the test for init_score fails because you have to specify an init_score for each class. However, you can't pass an (n_samples, n_classes) array as suggested here, because init_score is expected to be a 1-D collection; you have to reshape it to (n_samples * n_classes,), which will be hard in the Dask case because of the partitioning.

So I see two ways of fixing this:

  1. Handle the reshaping inside the _train_part function and pass an (n_samples, n_classes) collection as init_score to the fit method of the Dask models (which creates an inconsistency between the interfaces).
  2. Allow init_score to be of shape (n_samples, n_classes) everywhere in LightGBM, which currently fails here. This seems more intuitive but is also very invasive, since it involves modifying the basic module.

What do you think?
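To make the reshaping concrete, here is a small NumPy sketch of the flattening described above (the class ordering shown is illustrative, not confirmed against LightGBM's internals):

```python
import numpy as np

n_samples, n_classes = 6, 3
scores_2d = np.arange(n_samples * n_classes, dtype=float).reshape(n_samples, n_classes)

# Flatten to the 1-D (n_samples * n_classes,) collection that the basic
# module expects. Whether LightGBM wants row-major or column-major order
# is not confirmed here; column-major ('F') is shown as one possibility,
# grouping all samples' scores for class 0 first, then class 1, and so on.
scores_1d = scores_2d.ravel(order='F')
```

Doing this reshape per Dask partition is the hard part: each chunk would need to be flattened and re-concatenated in a way that preserves the global ordering.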

@jameslamb (Collaborator, Author)

> I'm including the multiclass-classification task and the test for init_score fails because you have to specify an init_score for each class. [...] So I see two ways of fixing this: [...] What do you think?

Please open a new issue describing the problem, with a reproducible example of what happens when you pass an init_score of shape (n_samples, n_classes), including the specific error message.
