Target and Count Encoding for Categorical Features with Ensemble #4779
Comments
Gently ping @jameslamb @StrikerRUS @guolinke @hzy46 @tongwu-msft @btrotta for discussion of the choice.
Thanks a lot for the detailed description! In addition, third-party libraries will benefit from a C++ implementation.
Thanks so much for the detailed write-up! The approach you're talking about seems useful. I agree with @StrikerRUS, and I favor pushing core training and prediction logic down into C++ as much as possible, to minimize the amount of duplicate implementations across different wrappers. But there's an important operational concern here too: @StrikerRUS and I do our best, but we aren't as experienced or confident in C++ as you or @guolinke or @btrotta are. So this choice is also somewhat about which maintainers will see an increased responsibility, as much as it is about how much added maintenance responsibility would result from this change.
@StrikerRUS @jameslamb Thank you! We've decided to keep the ensemble trick on the C++ side.
Summary
In #3234, we mentioned that the target encoding of categorical features is done in a cross-validation style: we randomly partition the training dataset into folds, and each row is encoded using target statistics computed from the folds it does not belong to.
Through experiments, we found that a simple ensemble trick built on this dataset partitioning can further boost the effectiveness of #3234.
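For concreteness, the cross-validation-style encoding can be sketched as follows. This is a minimal standalone illustration, not LightGBM's internal implementation; the function name, the random fold assignment, and the smoothing toward the global prior are all assumptions made for the example:

```python
import numpy as np
import pandas as pd

def target_encode_oof(cat: pd.Series, y: pd.Series, n_folds: int = 5,
                      seed: int = 0, smoothing: float = 10.0) -> np.ndarray:
    """Out-of-fold target encoding: each row is encoded with target
    statistics computed on the *other* folds only, to avoid leakage."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(cat))  # random fold per row
    prior = float(y.mean())
    encoded = np.full(len(cat), prior)
    for k in range(n_folds):
        out, rest = fold == k, fold != k
        stats = y[rest].groupby(cat[rest]).agg(["mean", "count"])
        # Shrink the per-category means toward the global prior so that
        # rare categories do not get extreme encoded values.
        enc = ((stats["mean"] * stats["count"] + prior * smoothing)
               / (stats["count"] + smoothing))
        encoded[out] = cat[out].map(enc).fillna(prior).to_numpy()
    return encoded
```

Running this with different seeds produces different fold partitions and hence different encoded values for the same feature, which is what the ensemble trick below exploits.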
In the ensemble trick, instead of partitioning the training dataset with only 1 random seed, we partition it with `m` different random seeds. Each random seed yields a different set of encoded categorical feature values, and each seed gets a separate GBDT model. However, all these models share the same tree structure in the same iteration, namely the tree structure trained with seed 0. This ensemble trick is used by CatBoost, and is described in their paper (https://arxiv.org/pdf/1706.09516.pdf).
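At prediction time, the trick then amounts to averaging the outputs of the `m` structure-sharing models. A one-line sketch, assuming a `boosters` list holding the `m` models and `X_test_enc` holding the test rows with categorical columns replaced by their encoded values (both names are placeholders, not LightGBM API):

```python
import numpy as np

# Each booster shares the seed-0 tree structures but carries leaf values
# fitted under its own seed's encoding; the ensemble output is their mean.
pred = np.mean([bst.predict(X_test_enc) for bst in boosters], axis=0)
```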
Motivation
The best illustration of the value of this trick is the experimental results, where values are AUC on binary classification tasks. The experiment setting is similar to #3234 (comment), but with fewer rounds of tuning in the hyperparameter search.
Description
A previous implementation of the ensemble trick can be found in https://github.com/shiyu1994/LightGBM/tree/ctr-multi-partition. However, since the GBDT models share the same tree structure, we think the trick can also be achieved by leveraging the existing `refit` methods. So currently we have two choices:

1. Keep the ensemble trick in the C++ side, based on the existing implementation above.
2. Implement the ensemble trick on top of the `refit` APIs of R and Python, so that the implementation can be done without touching C++ code (see the sketch below).

Pros and cons of the above methods: the C++ implementation is more efficient, but adds maintenance responsibility on the C++ side; using the `refit` methods is simple and should be good for maintenance, but comes with lower efficiency.
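As a rough sketch of what choice 2 could look like in the Python package, assuming a hypothetical `encode_train(seed)` helper that returns the training matrix encoded under a given seed's fold partition (for example via `target_encode_oof` above), plus labels `y` and encoded test rows `X_test_enc`:

```python
import lightgbm as lgb
import numpy as np

m = 5  # number of random seeds / ensemble members (assumed)

# Train the structure-defining model on the seed-0 encoding.
train0 = lgb.Dataset(encode_train(seed=0), label=y)
boosters = [lgb.train({"objective": "binary"}, train0, num_boost_round=100)]

# Reuse the seed-0 tree structures; refit only the leaf values on data
# encoded under the other seeds.
for s in range(1, m):
    boosters.append(boosters[0].refit(encode_train(seed=s), y, decay_rate=0.0))

# Average the structure-sharing models, as in the Summary above.
pred = np.mean([bst.predict(X_test_enc) for bst in boosters], axis=0)
```

Here `decay_rate=0.0` makes the refit leaf values fully replace the seed-0 ones; the default of 0.9 would blend old and new leaf outputs.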
References
CatBoost paper link: https://arxiv.org/pdf/1706.09516.pdf
Existing C++ implementation: #3234 (comment)