Support ignoring some features during training on constructed dataset #4317

Closed
wangmn93 opened this issue May 24, 2021 · 3 comments

Comments

@wangmn93

wangmn93 commented May 24, 2021

I want to get a subset of features from a constructed dataset, because reconstructing the dataset is time-consuming. Is there a way to do this, or any suggestion on how to modify the source code to support it?

Another option would be to ignore some features on the constructed dataset, but ignoring features does not work once the dataset has been constructed.

@shiyu1994
Collaborator

@wangmn93 Thanks for using LightGBM. Currently LightGBM doesn't support extracting a subset of features from a constructed dataset. But I think there's a quick hack to do this.

In addition to ignore_column, which is used when constructing the dataset, we can add a new parameter ignore_column_training, which is dedicated to ignoring some features during training. We can parse ignore_column_training in the same way as ignore_column and store it in the config_ object. Then, in src/treelearner/serial_tree_learner.cpp, we can set is_feature_used according to ignore_column_training in the SerialTreeLearner::FindBestSplits method.

In this way, ignore_column_training is not involved in the dataset construction process. Instead, it only affects the training process, and can be safely changed by setting params in the lgb.train method.
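The parsing step described above can be sketched as follows. This is only an illustration in Python, not LightGBM source code: ignore_column_training does not exist in LightGBM, the helper name parse_ignored_columns is hypothetical, and the sketch assumes the value follows the documented ignore_column format (comma-separated indices such as 0,1,2, or name:c1,c2,c3 for column names). It does not model how LightGBM treats the label column when counting indices.

```python
def parse_ignored_columns(spec, feature_names=None):
    """Parse an ignore_column-style string into a set of real feature indices.

    Hypothetical helper: mirrors the "0,1,2" and "name:c1,c2" formats only.
    """
    spec = spec.strip()
    if not spec:
        return set()

    if spec.startswith("name:"):
        # Name-based form: resolve each name against the known feature names.
        if feature_names is None:
            raise ValueError("feature names are required for a 'name:' spec")
        lookup = {name: idx for idx, name in enumerate(feature_names)}
        names = [tok.strip() for tok in spec[len("name:"):].split(",") if tok.strip()]
        unknown = [name for name in names if name not in lookup]
        if unknown:
            raise ValueError(f"unknown feature names: {unknown}")
        return {lookup[name] for name in names}

    # Index-based form: comma-separated integers.
    return {int(tok) for tok in spec.split(",") if tok.strip()}


ignored_by_index = parse_ignored_columns("0,2")
ignored_by_name = parse_ignored_columns("name:b,c", feature_names=["a", "b", "c"])
```

The result is a set of real (input data) feature indices, which is what a training-time parameter would most naturally expose to the user.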

If you have any further problems with the implementation, or need any other help, please feel free to post here.

@shiyu1994
Collaborator

BTW, when setting is_feature_used, we should note that is_feature_used uses the so-called inner feature index of LightGBM, which is different from the real feature index in the input data. So we need to remap the real feature index specified by ignore_column_training to the inner feature index through the train_data_->InnerFeatureIndex method.
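A rough sketch of that remapping, written in Python rather than in LightGBM's C++ for brevity. Everything here is illustrative: build_is_feature_used is a hypothetical name, the list real_to_inner stands in for calls to train_data_->InnerFeatureIndex, and the sketch assumes that a feature dropped during dataset construction maps to a negative inner index.

```python
def build_is_feature_used(ignored_real, real_to_inner, num_inner_features):
    """Return a per-inner-feature mask with the ignored features switched off.

    ignored_real: real (input data) feature indices to skip during training.
    real_to_inner: stand-in for train_data_->InnerFeatureIndex(real_idx);
        assumed to hold a negative value for features that were dropped
        when the dataset was constructed.
    """
    is_feature_used = [True] * num_inner_features
    for real_idx in ignored_real:
        inner_idx = real_to_inner[real_idx]
        if inner_idx < 0:
            # Already excluded at construction time, so there is no mask slot.
            continue
        is_feature_used[inner_idx] = False
    return is_feature_used


# Five real features; features 1 and 3 were dropped at construction,
# so only three inner features remain.
real_to_inner = [0, -1, 1, -1, 2]
mask = build_is_feature_used({1, 4}, real_to_inner, num_inner_features=3)
```

Indexing the mask by the real index instead would disable the wrong feature whenever any earlier column was dropped at construction, which is why the remapping step matters.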

@StrikerRUS changed the title from "How to get subset of features from constructed dataset?" to "Support ignoring some features during training on constructed dataset" on Jun 9, 2021
@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
