
Subsampling rows with replacement #1038

Closed
mayer79 opened this issue Nov 4, 2017 · 9 comments

@mayer79
Contributor

mayer79 commented Nov 4, 2017

As far as I understand, LightGBM's random forest (rf) mode differs from a genuine rf in three key aspects:

  1. Column subsampling is done per tree instead of per split.
  2. Row subsampling is done without replacement instead of with replacement.
  3. No OOB predictions.

How realistic would it be to add a "bagging_with_replacement" option? If set to True, the rows would be subsampled with replacement, mimicking the idea of bagging. This might even be an interesting option for non-rf applications.

@StrikerRUS
Collaborator

Related issue #883.

@mayer79
Contributor Author

mayer79 commented Nov 6, 2017

bootstrap seems to be a suitable name. Row subsampling would then be performed whenever bagging_fraction < 1 or bootstrap = True.

@guolinke
Collaborator

guolinke commented Nov 6, 2017

It is not trivial to add this to the core algorithm.
However, a simple solution is to use weights: give weight 0 to rows that were not sampled, weight 1 to rows sampled once, and weight k to rows sampled k times...
It is easy to do this in the Python package, since you can change the weights on each iteration.
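
For illustration, a minimal sketch of that weighting idea (the helper name is made up): draw one bootstrap sample of row indices and convert it to per-row counts, which can then be passed as case weights.

```python
import numpy as np

def bootstrap_weights(n_rows, rng=None):
    """Per-row counts from one bootstrap draw of size n_rows.

    A row drawn k times gets weight k; rows never drawn get weight 0,
    which emulates sampling rows with replacement via case weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, n_rows, size=n_rows)             # sample with replacement
    return np.bincount(idx, minlength=n_rows).astype(float)
```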

@mayer79
Contributor Author

mayer79 commented Nov 7, 2017

Good hint. I was actually not aware that case weights could be updated during training. The Poisson distribution with mean 1 will provide an efficient and approximately correct weight distribution.
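
As a sketch of that shortcut (hypothetical helper): each row's bootstrap count is Binomial(n, 1/n), which is approximately Poisson(1) for large n, so the weights can be drawn independently per row instead of building an explicit bootstrap sample.

```python
import numpy as np

def poisson_weights(n_rows, rng=None):
    """Approximate bootstrap counts: Binomial(n, 1/n) ~ Poisson(1) for large n."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.poisson(lam=1.0, size=n_rows).astype(float)
```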

@mayer79 mayer79 closed this as completed Nov 7, 2017
@mayer79 mayer79 reopened this Mar 18, 2020
@mayer79
Contributor Author

mayer79 commented Mar 18, 2020

I am reopening this because:

  1. I am still interested in this feature in order to be able to emulate random forests. Together with the relatively new "colsample_bynode", it would get very close to a native random forest (see the parameter sketch below).

  2. Sampling with replacement should be computationally more efficient than without.
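
For context, a hedged sketch of how the existing options could be combined to approximate a random forest (parameter names as documented for recent LightGBM versions; the concrete values are only illustrative, and rows are still subsampled without replacement, which is exactly the gap this issue is about):

```python
import lightgbm as lgb

rf_like_params = {
    "boosting_type": "rf",             # random forest mode
    "bagging_freq": 1,                 # resample rows every iteration
    "bagging_fraction": 0.632,         # ~ expected share of unique rows in a bootstrap sample
    "feature_fraction_bynode": 0.3,    # per-split column subsampling ("colsample_bynode")
    "num_iterations": 500,
    "objective": "regression",
}
# booster = lgb.train(rf_like_params, train_set)
```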

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@rdbuf

rdbuf commented Jun 10, 2020

Hi @guolinke,

However, a simple solution is to use weights: give weight 0 to rows that were not sampled, weight 1 to rows sampled once, and weight k to rows sampled k times...
It is easy to do this in the Python package, since you can change the weights on each iteration.

I understand that this is an old and closed issue, but may I ask you to elaborate on this solution a little more? How can one change the sample weights for each tree in the random forest?

One solution could be to use callbacks, I suppose; is that the only way?

@guolinke
Collaborator

Hi @rdbuf, yeah, a callback is the most convenient way to do this, I think.
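
For illustration, one possible shape of such a callback (a sketch, not a tested recipe: it assumes that weights set on the constructed training Dataset via set_weight are picked up by the following boosting rounds, which may depend on the LightGBM version):

```python
import numpy as np
import lightgbm as lgb

def poisson_bagging_callback(train_set, seed=0):
    """Redraw approximate bootstrap weights before every boosting round."""
    rng = np.random.default_rng(seed)

    def _callback(env):
        n = train_set.num_data()
        # Poisson(1) counts approximate how often each row would appear
        # in a bootstrap sample of the same size.
        train_set.set_weight(rng.poisson(lam=1.0, size=n).astype(float))

    _callback.before_iteration = True   # run before, not after, each round
    _callback.order = 0
    return _callback

# Hypothetical usage:
# booster = lgb.train(params, train_set, num_boost_round=100,
#                     callbacks=[poisson_bagging_callback(train_set)])
```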

@rdbuf

rdbuf commented Jun 11, 2020

I see, thanks :)
