
Target and Count Encoding for Categorical Features with Ensemble #4779

Open
shiyu1994 opened this issue Nov 8, 2021 · 4 comments

shiyu1994 (Collaborator) commented Nov 8, 2021

Summary

In #3234, we mentioned that the target encoding of categorical features is done in a cross-validation style: we randomly partition the training dataset into folds, and the encoded value of each row is computed from the target statistics of the other folds.

Through experiments, we found that a simple ensemble trick related to this dataset partitioning can boost the effectiveness of #3234.

In the ensemble trick, instead of partitioning the training dataset with a single random seed, we partition it with m different random seeds. Each seed produces a different set of encoded categorical feature values and hence a separate GBDT model. However, in each iteration all of these models share the same tree structure, namely the structure trained with seed 0.
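
To make the multi-seed encoding concrete, here is a minimal Python sketch of out-of-fold target encoding repeated with several seeds. It is only an illustration, not the implementation in #3234: the helper name, the smoothing toward the global mean via `prior_weight`, and the toy data are all assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(cat_col, target, n_folds=5, seed=0, prior_weight=1.0):
    """Illustrative sketch of out-of-fold target encoding for one column.

    Rows are randomly assigned to folds using `seed`; each row's encoding is
    computed from the target statistics of the *other* folds, smoothed toward
    the global mean with `prior_weight` pseudo-counts (the smoothing is an
    assumption of this sketch, not necessarily what #3234 does).
    """
    rng = np.random.RandomState(seed)
    fold_id = rng.randint(n_folds, size=len(cat_col))
    prior = target.mean()
    encoded = np.empty(len(cat_col), dtype=float)
    for k in range(n_folds):
        in_fold = fold_id == k
        stats = (pd.DataFrame({"cat": cat_col[~in_fold], "y": target[~in_fold]})
                 .groupby("cat")["y"]
                 .agg(["sum", "count"]))
        smoothed = (stats["sum"] + prior_weight * prior) / (stats["count"] + prior_weight)
        # Categories unseen outside the fold fall back to the global mean.
        encoded[in_fold] = cat_col[in_fold].map(smoothed).fillna(prior).to_numpy()
    return encoded


# Toy data; column names are arbitrary for this example.
df = pd.DataFrame({"cat": ["a", "b", "a", "c", "b", "a", "c", "b"],
                   "label": [1, 0, 1, 0, 1, 1, 0, 0]})
m = 4  # number of random seeds
encodings = [out_of_fold_target_encode(df["cat"], df["label"], seed=s) for s in range(m)]
```

With the m encodings in hand, the remaining piece is training m boosters that share the seed-0 tree structure; one possible way to express that with the refit API is sketched in the Description section below.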

This ensemble trick is used by CatBoost and is described in their paper (https://arxiv.org/pdf/1706.09516.pdf).

Motivation

The value of this trick is best illustrated by the experiment results (values are AUC on binary classification tasks):

| Dataset | New Encoding with Ensemble | New Encoding | Old Approach |
| --- | --- | --- | --- |
| Amazon | 0.862934±0.002636 | 0.858445±0.003533 | 0.854134±0.002725 |
| Appetency | 0.853366±0.002710 | 0.849800±0.002979 | 0.838189±0.004230 |
| Click | 0.741250±0.000291 | 0.740799±0.000269 | 0.720182±0.000323 |
| Internet | 0.960595±0.000515 | 0.959814±0.001270 | 0.959849±0.000314 |
| Upselling | 0.864214±0.001046 | 0.862846±0.001735 | 0.863376±0.001305 |
| AutoML B | 0.624295±0.002524 | 0.615565±0.006900 | 0.617608±0.002437 |

The experiment setting is similar to that in #3234 (comment), but with fewer rounds of hyperparameter tuning.

Description

A previous implementation of the ensemble trick can be found in https://github.com/shiyu1994/LightGBM/tree/ctr-multi-partition. However, since the GBDT models share the same tree structure, we think the same effect can also be achieved by leveraging the existing refit methods. So currently we have two choices:

  1. Continue with the branch https://github.com/shiyu1994/LightGBM/tree/ctr-multi-partition after Target and Count encodings for categorical features #3234 is merged.
  2. Support this feature by leveraging the refit APIs of the R and Python packages; the implementation can be done without touching the C++ code (see the sketch after the pros and cons below).

Pros and Cons of the above methods:

  1. Implementation on the C++ side can save some time in data construction, because some information can be shared across seeds; for example, the binning of numerical features is not affected by the seed. The drawback is that it would further enlarge the code base of the already complex data preprocessing procedure and incur a slightly heavier maintenance burden in the future.
  2. Implementation on the R and Python package side by leveraging the refit methods is simple and should be good for maintenance, but is less efficient.
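
For choice 2, here is a rough sketch (under stated assumptions, not a definitive implementation) of how the shared-structure ensemble could be built on top of the existing Python `refit` API. `enc_train_by_seed` and `enc_test_by_seed` are hypothetical dictionaries mapping each seed to the feature matrix encoded with that seed's partition (e.g. produced as in the encoding sketch above), and passing `decay_rate=0.0` so that the refitted leaf values fully replace the old ones is an assumption of this sketch.

```python
import numpy as np
import lightgbm as lgb


def train_shared_structure_ensemble(params, enc_train_by_seed, y):
    """Train one GBDT per seed; all members share the seed-0 tree structure.

    `enc_train_by_seed`: hypothetical dict {seed: training feature matrix with
    categorical columns target-encoded under that seed's fold partition}.
    """
    seeds = sorted(enc_train_by_seed)
    # Learn the tree structure once, on the data encoded with the first seed.
    base = lgb.train(params, lgb.Dataset(enc_train_by_seed[seeds[0]], label=y))
    models = {seeds[0]: base}
    # For the other seeds, keep that structure and re-fit only the leaf values
    # on the differently encoded data (decay_rate=0.0 discards the old values).
    for s in seeds[1:]:
        models[s] = base.refit(enc_train_by_seed[s], y, decay_rate=0.0)
    return models


def ensemble_predict(models, enc_test_by_seed):
    """Average the members' raw predictions; each scores its own encoding."""
    preds = [models[s].predict(enc_test_by_seed[s]) for s in sorted(models)]
    return np.mean(preds, axis=0)
```

This is essentially choice 2: all the work happens in the wrapper, at the cost of re-running data construction once per seed.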

References

CatBoost paper link: https://arxiv.org/pdf/1706.09516.pdf
Existing C++ implementation: #3234 (comment)

shiyu1994 (Collaborator, Author) commented:

Gently ping @jameslamb @StrikerRUS @guolinke @hzy46 @tongwu-msft @btrotta for discussion of the choice.

StrikerRUS (Collaborator) commented Nov 8, 2021

Thanks a lot for the detailed description!
I'm for choice #1. I believe that the maintenance burden of two separate implementations in the language wrappers could be even heavier than that of a unified C++ implementation. Also, we may end up with implementations that are not exactly the same and receive reports in the future that the same code in the Python and R packages doesn't produce identical results.

In addition, third-party libraries will benefit from a C++ implementation.

jameslamb (Collaborator) commented:

Thanks so much for the detailed write-up! The approach you're talking about seems useful.

I agree with @StrikerRUS; I favor #1. I think the language wrappers' main responsibilities should be:

  • translating input data and parameters from a particular language's data structures (e.g. Spark DataFrame in SynapseML, R list for parameters in the R package) into a format that can be understood by LightGBM's C++ library
  • integrating with other tools in the language's ecosystem (e.g. scikit-learn and Dask in Python)
  • providing an API into LightGBM that can be installed the way users install other libraries written in that language, so they don't need to understand how to build large C++ projects to use LightGBM

Core training and prediction logic should be pushed down into C++ as much as possible, to minimize duplicate implementations across the different wrappers.


But there's an important operational concern here too. @StrikerRUS and I do our best, but we aren't as experienced or confident in C++ as you, @guolinke, or @btrotta are. So I guess this choice is also somewhat about which maintainers will see an increased responsibility, as much as it is about how much added maintenance responsibility would result from this change.

shiyu1994 (Collaborator, Author) commented:

@StrikerRUS @jameslamb Thank you! We've decided to keep the ensemble trick on the C++ side.

tongwu-sh self-assigned this Nov 18, 2021