[Enhancement] Better Regularization for Categorical features #1934

Closed · guolinke opened this issue Jan 2, 2019 · 12 comments

guolinke (Collaborator) commented Jan 2, 2019

The current split-finding algorithm for categorical features often seems to result in over-fitting.
We need a better solution to reduce this over-fitting.

guolinke (Collaborator, Author) commented Jan 2, 2019

code of the current solution:
https://github.com/Microsoft/LightGBM/blob/fb28070e1daa500b087d3102145ae48988030195/src/treelearner/feature_histogram.hpp#L112-L234

The process:

  1. sort the bins in the feature histogram by sum_gradients / sum_hessians
  2. find the split points in the sorted histogram (a sketch combining this search with the regularizations below follows the next list)

Current Regularizations:

  1. convert to one-vs-rest when num_cats <= config->max_cat_to_onehot
  2. add cat_smooth to sum_hessians before sorting the histogram
  3. enumerate only max_cat_threshold candidate split points, to avoid searching too many times
  4. use min_data_per_group to avoid having too little data on either side of a split point
  5. an additional cat_l2 for L2 regularization
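
A rough sketch of how the two-step search and regularizations 2 to 5 fit together (hypothetical Python pseudocode with illustrative defaults, not the actual C++ in feature_histogram.hpp, which scans the sorted histogram from both directions and handles more details; the one-vs-rest case from regularization 1 is omitted):

```python
import numpy as np

def split_gain(sum_grad, sum_hess, lambda_l2, cat_l2):
    # Leaf-score gain with the extra cat_l2 added on top of the usual L2 term.
    return sum_grad ** 2 / (sum_hess + lambda_l2 + cat_l2)

def find_categorical_split(grad_per_bin, hess_per_bin, count_per_bin,
                           lambda_l2=0.0, cat_smooth=10.0, cat_l2=10.0,
                           max_cat_threshold=32, min_data_per_group=100):
    """Sketch only: sort bins by smoothed grad/hess, then scan prefixes."""
    grad = np.asarray(grad_per_bin, dtype=float)
    hess = np.asarray(hess_per_bin, dtype=float)
    count = np.asarray(count_per_bin, dtype=float)

    # Step 1: sort bins by sum_gradients / (sum_hessians + cat_smooth).
    order = np.argsort(grad / (hess + cat_smooth))

    total_grad, total_hess, total_count = grad.sum(), hess.sum(), count.sum()
    no_split = split_gain(total_grad, total_hess, lambda_l2, 0.0)

    best_gain, best_left = 0.0, None
    left_grad = left_hess = left_count = 0.0
    # Step 2: enumerate at most max_cat_threshold prefixes of the sorted order.
    for i, b in enumerate(order[:max_cat_threshold]):
        left_grad += grad[b]
        left_hess += hess[b]
        left_count += count[b]
        # Regularization 4: require enough data on both sides of the split.
        if min(left_count, total_count - left_count) < min_data_per_group:
            continue
        gain = (split_gain(left_grad, left_hess, lambda_l2, cat_l2)
                + split_gain(total_grad - left_grad, total_hess - left_hess,
                             lambda_l2, cat_l2)
                - no_split)
        if gain > best_gain:
            best_gain, best_left = gain, order[:i + 1].tolist()
    return best_gain, best_left  # categories routed to the left child
```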

New Regularizations (proposal):

  1. some randomization in the sorting
  2. don't use categories (bins) with too little data as split points (see the sketch below)
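
To make the two proposals concrete, here is one possible (assumed) reading as a sketch; the noise model, parameter names, and defaults are purely illustrative, not an agreed design:

```python
import numpy as np

def sorted_candidate_bins(grad, hess, count, cat_smooth=10.0,
                          noise_scale=0.1, min_cat_count=5, rng=None):
    """Hypothetical sketch of proposals 1 and 2."""
    rng = np.random.default_rng() if rng is None else rng
    grad, hess, count = (np.asarray(a, dtype=float) for a in (grad, hess, count))
    key = grad / (hess + cat_smooth)
    # Proposal 1: perturb the sort key so the ordering is no longer exactly
    # optimal for the training gradients.
    key = key + noise_scale * np.abs(key).mean() * rng.standard_normal(key.shape)
    # Proposal 2: keep only bins with enough data as candidate split points.
    usable = np.flatnonzero(count >= min_cat_count)
    return usable[np.argsort(key[usable])]
```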

Your ideas are very welcome here 😄.

guolinke (Collaborator, Author) commented Jan 2, 2019

ping @Laurae2 @StrikerRUS @henry0312

guolinke (Collaborator, Author) commented Jan 4, 2019

I remember that CatBoost's solutions are not tied to the GBDT algorithm itself, since they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

henry0312 (Contributor) commented

@guolinke

> I remember that CatBoost's solutions are not tied to the GBDT algorithm itself, since they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

That's right.

To be honest, I don't understand the current approach very well; however, I think fewer parameters are better, because more parameters lead to overfitting.
There is also a well-known solution for categorical features (I don't remember it right now).
I'll tell you later.

henry0312 (Contributor) commented Jan 9, 2019

As you know, CART is one of the most famous algorithms, and it is used in scikit-learn.
(A Step by Step CART Decision Tree Example may be useful for understanding it.)

> scikit-learn uses an optimised version of the CART algorithm; however, scikit-learn implementation does not support categorical variables for now.

(from https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart)

However, scikit-learn doesn't support categorical variables 😢

julioasotodv (Contributor) commented Feb 25, 2019

In some experiments with private data, the regularization hyperparameters for categorical variables (max_cat_threshold, cat_l2, cat_smooth and max_cat_to_onehot) can dramatically change the bias and/or variance of the final model when the dataset contains many categorical features. However, I have not found an obvious way to tune these hyperparameters other than exhaustive grid search.
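
For reference, all of these (plus min_data_per_group) are existing LightGBM parameters; a minimal sketch of an exhaustive grid search over them is below. The value ranges are purely illustrative, and X, y, cat_cols stand in for your own data:

```python
import itertools
import lightgbm as lgb

# Illustrative search space only; useful ranges are dataset-dependent.
search_space = {
    "max_cat_to_onehot": [4, 8, 16],       # small cardinalities go one-vs-rest
    "max_cat_threshold": [16, 32, 64],     # cap on enumerated categorical split points
    "cat_smooth": [1.0, 10.0, 100.0],      # smoothing added to the hessians before sorting
    "cat_l2": [1.0, 10.0, 100.0],          # extra L2 applied to categorical splits
    "min_data_per_group": [50, 100, 200],  # minimum data per categorical group
}

def grid(space):
    keys = list(space)
    for values in itertools.product(*space.values()):
        yield dict(zip(keys, values))

# train_set = lgb.Dataset(X, y, categorical_feature=cat_cols)
# for params in grid(search_space):
#     cv_result = lgb.cv({"objective": "binary", **params}, train_set, nfold=5)
```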

ekerazha commented Mar 20, 2019

@guolinke

> I remember that CatBoost's solutions are not tied to the GBDT algorithm itself, since they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

If you look at https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html, it says "Before each split is selected in the tree...", so it's not really a preprocessing step: the encoding is performed before each split, so it is something that needs to be implemented inside the GBDT/RF algorithm. Because the encoding is computed on different data splits, it reduces overfitting.
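
A greatly simplified sketch of that idea (ordered target statistics): each row is encoded using only the rows that precede it in a random permutation, so a row never sees its own label. Real CatBoost additionally uses several permutations and recomputes statistics as the tree grows, so treat the names and defaults below purely as an illustration:

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Simplified sketch of ordered target statistics (not CatBoost's actual code)."""
    rng = np.random.default_rng(seed)
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Encode row i from the statistics of earlier rows only, plus a prior.
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Example: print(ordered_target_encoding(["a", "b", "a", "a", "b"], [1, 0, 1, 0, 1]))
```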

guolinke unpinned this issue on Apr 10, 2019
btrotta (Collaborator) commented Jul 9, 2019

I think one of the reasons for overfitting on categorical features is that the categories are ordered by weight before finding splits. This means that the gain from a split on a categorical feature will in general be larger than for a numerical feature, so categorical features are disproportionately likely to be chosen for splits over numerical features.

For example, if we make only one split on some categorical feature (after this ordering has been done), the feature is already optimally divided into the top n and bottom num_bins - n weights. But if instead we treat the feature as numerical (using some random integer encoding of the categories), then in general the bin weights will not be monotonically increasing, so it will take several splits to achieve the same separation in weights. Therefore the gains will in general be larger from splitting on categorical rather than numerical features, so categorical features will more often be chosen as the best split by the algorithm.
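
A toy numerical illustration of this argument (synthetic data and a squared-error gain; the numbers only show the direction of the effect, and nothing here is LightGBM code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cats, n_per_cat = 20, 50
cat_effect = rng.normal(size=n_cats)               # per-category signal
cats = np.repeat(np.arange(n_cats), n_per_cat)
y = cat_effect[cats] + rng.normal(size=cats.size)  # noisy target

def best_single_split_gain(bin_order):
    """Variance-reduction gain of the best single split along a given bin order."""
    total, n = y.sum(), y.size
    best, left_sum, left_n = 0.0, 0.0, 0
    for b in bin_order[:-1]:
        in_bin = cats == b
        left_sum += y[in_bin].sum()
        left_n += in_bin.sum()
        gain = (left_sum ** 2 / left_n
                + (total - left_sum) ** 2 / (n - left_n)
                - total ** 2 / n)
        best = max(best, gain)
    return best

cat_means = np.array([y[cats == c].mean() for c in range(n_cats)])
sorted_order = np.argsort(cat_means)    # bins ordered by weight (categorical treatment)
random_order = rng.permutation(n_cats)  # arbitrary integer encoding (numerical treatment)
print(best_single_split_gain(sorted_order), best_single_split_gain(random_order))
# The best single split along the sorted order typically has a much larger gain.
```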

L2 regularization (lambda_l2, cat_l2, cat_smooth) helps somewhat with this problem, since it reduces the magnitude of the weights (especially for small categories) and the potential gain of each split. But it might also be useful to have an L1 regularization for categorical features (i.e. a cat_l1), which would directly penalize the number of splits on a categorical feature.

chaupmcs commented Jul 9, 2019

@btrotta I agree with the "overfitting categorical features" part, and I also think it's because we already optimize the ordering before splitting. But the example is not clear to me yet. Comparing "ordering categorical features" with a "random integer encoding of the categories" is somewhat unfair; it should be a comparison between "ordering categorical features" and some true numerical features. Numerical features themselves already carry a kind of ordering. For example, with an "age" feature, sorting by age makes sense and can be seen as a form of ordering, though it is not optimized the way the categorical ordering is.

btrotta (Collaborator) commented Jul 9, 2019

@chaupmcs Yes, that's a fair point. A true numerical feature would probably have bin weights with longer monotone intervals, and fewer turning points, compared to a random encoding of a categorical feature. But still, unless the numerical feature is completely monotone, the categorical feature offers easier gains from splitting.

StrikerRUS (Collaborator) commented

Closing in favor of #2302; we decided to keep all feature requests in one place.

Contributions implementing this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on it.
