Add option to keep cv predicted values #283

Closed · JackStat opened this issue Feb 5, 2017 · 15 comments

@JackStat (Contributor) commented Feb 5, 2017

Forgive me if I missed something, but I have reviewed the code and documentation and didn't see a way to keep the CV probabilities.

@guolinke (Collaborator)

@JackStat
I think a simple solution is to save all the CV models and then get the predictions from those models.
You're welcome to contribute this; I think it would be easy to implement.

@JackStat (Contributor, Author)

Absolutely. I looked through the code and thought that would be a good strategy as well, but I could not find an object that holds the holdout data.frame.

So it looks like this chunk creates the three boosters (assuming 3-fold CV):

# construct one booster per fold
bst_folds <- lapply(seq_along(folds), function(k) {
  dtest   <- slice(data, folds[[k]])          # held-out rows for fold k
  dtrain  <- slice(data, unlist(folds[-k]))   # all remaining rows
  booster <- Booster$new(params, dtrain)
  booster$add_valid(dtest, "valid")
  list(booster = booster)
})

Then you can run something with lapply to get predictions from each booster using bst_folds[[1]]$booster$predict. Now I just need to know where the CV data.frames are kept so I can apply the predictions to them; I dug into the objects and couldn't see them.

Any help would be appreciated, and I will open the pull request.
Thanks

@yanyachen (Contributor)

@guolinke I looked through the code and found that predicting from an lgb.Dataset isn't supported yet. Could you add support for that when you have time? Otherwise we cannot use all the CV models to predict on each fold.

Below is a simple function that generates CV predictions from the original dataset. @JackStat, you can use it for your problem, though I suspect you have already figured it out yourself.

library(foreach)  # provides foreach() and the %do% operator

LGB_CV_Predict <- function(lgb_cv, data, num_iteration = NULL, folds) {
  if (is.null(num_iteration)) {
    num_iteration <- lgb_cv$best_iter
  }
  # Predict each fold's held-out rows with the booster trained without them
  cv_pred_mat <- foreach(i = seq_along(lgb_cv$boosters), .combine = "rbind") %do% {
    lgb_tree <- lgb_cv$boosters[[i]][[1]]
    predict(lgb_tree,
            data[folds[[i]], , drop = FALSE],
            num_iteration = num_iteration,
            rawscore = FALSE, predleaf = FALSE, header = FALSE, reshape = TRUE)
  }
  # Restore the original row order of the training data
  if (ncol(cv_pred_mat) == 1) {
    as.double(cv_pred_mat)[order(unlist(folds))]
  } else {
    cv_pred_mat[order(unlist(folds)), , drop = FALSE]
  }
}

@guolinke (Collaborator) commented Aug 16, 2017

@yanyachen
Actually, we can already get predictions for the training and validation datasets by using these functions:
R: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Booster.R#L454-L495
Python: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1768-L1793

I think using them is enough to compute the CV prediction score.
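
For readers following along in Python, here is a minimal sketch of the same end result using only the public predict() on the raw feature matrix, rather than the internal helpers linked above. The function name cv_oof_predictions and its arguments are illustrative, not part of LightGBM:

import numpy as np

def cv_oof_predictions(boosters, folds, X):
    # `boosters` is a list of per-fold lightgbm.Booster objects and `folds`
    # the matching list of held-out row indices into the raw matrix `X`.
    oof = np.empty(X.shape[0], dtype=float)
    for booster, valid_idx in zip(boosters, folds):
        # Each row is predicted by the one booster that did not train on it.
        oof[valid_idx] = booster.predict(X[valid_idx])
    return oof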

@fulldecent (Contributor)

Is this related to #828?

@mayer79 (Contributor) commented Nov 10, 2017

lgb.cv would indeed be much more useful if it returned the final predictions. That would, for example, make stacking possible.

@programmersims commented Mar 3, 2019

Here is an R function that will do it if you pass in the object returned by lgb.cv:

get_lgbm_cv_preds <- function(cv) {
  # Reach into the R6 private fields to recover which rows each fold held out
  booster_private <- function(bst) bst$.__enclos_env__$private
  dataset_indices <- function(ds) ds$.__enclos_env__$private$used_indices

  first <- booster_private(cv$boosters[[1]]$booster)
  rows <- length(dataset_indices(first$valid_sets[[1]])) +
    length(dataset_indices(first$train_set))

  preds <- numeric(rows)
  for (i in seq_along(cv$boosters)) {
    p <- booster_private(cv$boosters[[i]]$booster)
    # inner_predict(2) returns the cached predictions for the validation set
    preds[dataset_indices(p$valid_sets[[1]])] <- p$inner_predict(2)
  }
  preds
}

@NamLQ commented Mar 25, 2019

Great job, @programmersims!

Does the function get the best CV prediction?

@programmersims commented Mar 25, 2019 via email

@NamLQ commented Mar 25, 2019

What a pity!

How can I keep just the best CV prediction, @programmersims?

@StrikerRUS (Collaborator)

Closed in favor of #2302; we decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

StrikerRUS mentioned this issue Mar 5, 2020
@momijiame (Contributor) commented Jun 11, 2020

My sincere thanks to @StrikerRUS for unlocking this issue.

Motivation and requirements

I know that people (including some Kagglers) want this feature, and I would like to implement it. There are two main reasons why people might want the prediction values of the trained models from the cv() function:

req1. to analyze the out-of-fold predictions for the training data in more detail
req2. to apply ensemble techniques (stacking, averaging, etc.) using the trained models from the cv() function

How to fix it

I agree with the plan @guolinke mentioned; in other words, add a simple way to get the trained models.

req1: the cv() function already accepts 'folds' (the data-split context), so users can make out-of-fold predictions with the trained models.
req2: users are free to apply any ensemble technique to the trained models.

Steps to fix it

I want to follow the scikit-learn way; in other words, the trained models are included in the dictionary returned by cv().
ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

I suggest the following steps (a minimal usage sketch follows below):

  1. Add an option named 'return_cvbooster' to the cv() function.
  • Add the trained '_CVBooster' object (cvfolds) to the dict of return values (results) under the key 'cvbooster'.
  • NOTE: I am not particular about the parameter names.
  2. Change the name of '_CVBooster' to 'CVBooster'.

I would like to have your opinion.
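
A minimal sketch of what the proposed interface would look like from the user's side, assuming the parameter and key names suggested above (return_cvbooster and 'cvbooster'); the data and parameters here are placeholders purely for illustration:

import lightgbm as lgb
import numpy as np

# Placeholder data for illustration only
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)
dtrain = lgb.Dataset(X, label=y)

results = lgb.cv({'objective': 'binary', 'verbose': -1}, dtrain,
                 num_boost_round=50, nfold=5, return_cvbooster=True)

cvbooster = results['cvbooster']   # the (renamed) CVBooster
print(len(cvbooster.boosters))     # one trained Booster per fold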

@StrikerRUS (Collaborator)

@momijiame Thank you very much for your detailed plan! It looks good to me! Looking forward to your PR.

@matsuken92 Maybe you have something in mind that could improve the proposed plan?

@matsuken92 (Contributor)

@StrikerRUS Okay, I will review this plan!

StrikerRUS added a commit that referenced this issue Aug 2, 2020
…283,#2105,#1445) (#3204)

* [python] add return_cvbooster flag to cv function and rename _CVBooster to make public (#283,#2105)

* [python] Reduce expected metric of unit testing

* [docs] add the CVBooster to the documentation

* [python] reflect the review comments

- Add some clarifications to the documentation
- Rename CVBooster.append to make private
- Decrease iteration rounds of testing to save CI time
- Use CVBooster as root member of lgb

* [python] add more checks in testing for cv

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [python] add docstring for instance attributes of CVBooster

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [python] fix docstring

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS (Collaborator)

Implemented in #3204.
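
For reference, a short usage sketch of the released feature (LightGBM >= 3.1): CVBooster forwards predict() to every fold's booster and returns one prediction array per fold, which can then be averaged for a simple ensemble (req2 above) or scattered back by fold indices for out-of-fold analysis (req1). The data below is a placeholder:

import lightgbm as lgb
import numpy as np

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)
X_new = np.random.rand(100, 10)

dtrain = lgb.Dataset(X, label=y)
results = lgb.cv({'objective': 'binary', 'verbose': -1}, dtrain,
                 num_boost_round=50, nfold=5, return_cvbooster=True)
cvbooster = results['cvbooster']

# predict() is forwarded to each fold's booster; a list of arrays comes back
fold_preds = cvbooster.predict(X_new)
averaged = np.mean(fold_preds, axis=0)  # simple averaging ensemble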
