
Target and Count Encoding for Categorical Features with Ensemble #4779

Open
shiyu1994 opened this issue Nov 8, 2021 · 4 comments

shiyu1994 (Collaborator) commented Nov 8, 2021

Summary

In #3234, we mentioned that the target encoding of categorical features is done in a cross-validation style: we randomly partition the training dataset into folds, and the encoded value of each row is computed from the target statistics of the other folds.

Through experiments, we found that a simple ensemble trick related to this dataset partitioning can boost the effectiveness of #3234.

In the ensemble trick, instead of partitioning the training dataset with a single random seed, we partition it with m different random seeds. Each seed produces a different set of encoded categorical feature values and hence a separate GBDT model. However, in each iteration all of these models share the same tree structure, namely the structure trained with seed 0.
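
To make the multi-seed encoding concrete, here is a minimal Python sketch of out-of-fold target encoding repeated with several seeds. It is only an illustration, not the implementation in #3234: the helper name, the smoothing toward the global mean via `prior_weight`, and the toy data are all assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(cat_col, target, n_folds=5, seed=0, prior_weight=1.0):
    """Illustrative sketch of out-of-fold target encoding for one column.

    Rows are randomly assigned to folds using `seed`; each row's encoding is
    computed from the target statistics of the *other* folds, smoothed toward
    the global mean with `prior_weight` pseudo-counts (the smoothing is an
    assumption of this sketch, not necessarily what #3234 does).
    """
    rng = np.random.RandomState(seed)
    fold_id = rng.randint(n_folds, size=len(cat_col))
    prior = target.mean()
    encoded = np.empty(len(cat_col), dtype=float)
    for k in range(n_folds):
        in_fold = fold_id == k
        stats = (pd.DataFrame({"cat": cat_col[~in_fold], "y": target[~in_fold]})
                 .groupby("cat")["y"]
                 .agg(["sum", "count"]))
        smoothed = (stats["sum"] + prior_weight * prior) / (stats["count"] + prior_weight)
        # Categories unseen outside the fold fall back to the global mean.
        encoded[in_fold] = cat_col[in_fold].map(smoothed).fillna(prior).to_numpy()
    return encoded


# Toy data; column names are arbitrary for this example.
df = pd.DataFrame({"cat": ["a", "b", "a", "c", "b", "a", "c", "b"],
                   "label": [1, 0, 1, 0, 1, 1, 0, 0]})
m = 4  # number of random seeds
encodings = [out_of_fold_target_encode(df["cat"], df["label"], seed=s) for s in range(m)]
```

With the m encodings in hand, the remaining piece is training m boosters that share the seed-0 tree structure; one possible way to express that with the refit API is sketched in the Description section below.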

This ensemble trick is used by CatBoost and is described in their paper (https://arxiv.org/pdf/1706.09516.pdf).

Motivation

The value of this trick is best illustrated by the experiment results (values are AUC on binary classification tasks):

| Dataset | New Encoding with Ensemble | New Encoding | Old Approach |
| --- | --- | --- | --- |
| Amazon | 0.862934±0.002636 | 0.858445±0.003533 | 0.854134±0.002725 |
| Appetency | 0.853366±0.002710 | 0.849800±0.002979 | 0.838189±0.004230 |
| Click | 0.741250±0.000291 | 0.740799±0.000269 | 0.720182±0.000323 |
| Internet | 0.960595±0.000515 | 0.959814±0.001270 | 0.959849±0.000314 |
| Upselling | 0.864214±0.001046 | 0.862846±0.001735 | 0.863376±0.001305 |
| AutoML B | 0.624295±0.002524 | 0.615565±0.006900 | 0.617608±0.002437 |

The experiment setting is similar to that in #3234 (comment), but with fewer rounds of hyperparameter tuning.

Description

A previous implementation of the ensemble trick can be found in https://github.com/shiyu1994/LightGBM/tree/ctr-multi-partition. However, since the GBDT models share the same tree structure, we think the same effect can also be achieved by leveraging the existing refit methods. So currently we have two choices:

  1. Continue with the branch https://github.com/shiyu1994/LightGBM/tree/ctr-multi-partition after Target and Count encodings for categorical features #3234 is merged.
  2. Support this feature by leveraging the refit APIs of the R and Python packages; the implementation can be done without touching the C++ code (see the sketch after the pros and cons below).

Pros and Cons of the above methods:

  1. Implementation on the C++ side can save some time in data construction, because some information can be shared across seeds; for example, the binning of numerical features is not affected by the seed. The drawback is that it would further enlarge the code base of the already complex data preprocessing procedure and incur a slightly heavier maintenance burden in the future.
  2. Implementation on the R and Python package side by leveraging the refit methods is simple and should be good for maintenance, but is less efficient.
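
For choice 2, here is a rough sketch (under stated assumptions, not a definitive implementation) of how the shared-structure ensemble could be built on top of the existing Python `refit` API. `enc_train_by_seed` and `enc_test_by_seed` are hypothetical dictionaries mapping each seed to the feature matrix encoded with that seed's partition (e.g. produced as in the encoding sketch above), and passing `decay_rate=0.0` so that the refitted leaf values fully replace the old ones is an assumption of this sketch.

```python
import numpy as np
import lightgbm as lgb


def train_shared_structure_ensemble(params, enc_train_by_seed, y):
    """Train one GBDT per seed; all members share the seed-0 tree structure.

    `enc_train_by_seed`: hypothetical dict {seed: training feature matrix with
    categorical columns target-encoded under that seed's fold partition}.
    """
    seeds = sorted(enc_train_by_seed)
    # Learn the tree structure once, on the data encoded with the first seed.
    base = lgb.train(params, lgb.Dataset(enc_train_by_seed[seeds[0]], label=y))
    models = {seeds[0]: base}
    # For the other seeds, keep that structure and re-fit only the leaf values
    # on the differently encoded data (decay_rate=0.0 discards the old values).
    for s in seeds[1:]:
        models[s] = base.refit(enc_train_by_seed[s], y, decay_rate=0.0)
    return models


def ensemble_predict(models, enc_test_by_seed):
    """Average the members' raw predictions; each scores its own encoding."""
    preds = [models[s].predict(enc_test_by_seed[s]) for s in sorted(models)]
    return np.mean(preds, axis=0)
```

This is essentially choice 2: all the work happens in the wrapper, at the cost of re-running data construction once per seed.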

References

CatBoost paper link: https://arxiv.org/pdf/1706.09516.pdf
Existing C++ implementation: #3234 (comment)

shiyu1994 (Collaborator, Author) commented:

Gently ping @jameslamb @StrikerRUS @guolinke @hzy46 @tongwu-msft @btrotta for discussion of the choice.

StrikerRUS (Collaborator) commented Nov 8, 2021

Thanks a lot for the detailed description!
I'm for choice #1. I believe that the maintenance burden of two separate implementations in the language wrappers could be even heavier than that of a unified C++ implementation. Also, we may end up with implementations that are not exactly the same and receive reports in the future that the same code in the Python and R packages doesn't produce identical results.

In addition, third-party libraries will benefit from a C++ implementation.

jameslamb (Collaborator) commented:

Thanks so much for the detailed write-up! The approach you're talking about seems useful.

I agree with @StrikerRUS; I favor #1. I think the language wrappers' main responsibilities should be:

  • translating input data and parameters from a particular language's data structures (e.g. Spark DataFrame in SynapseML, R list for parameters in the R package) into a format that can be understood by LightGBM's C++ library
  • integrating with other tools in the language's ecosystem (e.g. scikit-learn and Dask in Python)
  • providing an API into LightGBM that can be installed the way users install other libraries written in that language, so they don't need to understand how to build large C++ projects to use LightGBM

Core training and prediction logic should be pushed down into C++ as much as possible, to minimize duplicate implementations across the different wrappers.


But there's an important operational concern here too. @StrikerRUS and I do our best, but we aren't as experienced or confident in C++ as you, @guolinke, or @btrotta are. So I guess this choice is also somewhat about which maintainers will see an increased responsibility, as much as it is about how much added maintenance responsibility would result from this change.

shiyu1994 (Collaborator, Author) commented:

@StrikerRUS @jameslamb Thank you! We've decided to keep the ensemble trick on the C++ side.

tongwu-sh self-assigned this Nov 18, 2021