
Subsampling rows with replacement #1038

Closed
mayer79 opened this issue Nov 4, 2017 · 9 comments

@mayer79
Contributor

mayer79 commented Nov 4, 2017

As far as I understand, LightGBM's random forest (rf) mode differs from a genuine rf in three key aspects:

  1. Column subsampling is done per tree instead of per split.
  2. Row subsampling is done without replacement instead of with replacement.
  3. No OOB predictions.

How realistic would it be to add a "bagging_with_replacement" option? If set to True, the rows would be subsampled with replacement, mimicking the idea of bagging. This might even be an interesting option for non-rf applications.

@StrikerRUS
Collaborator

Related issue #883.

@mayer79
Contributor Author

mayer79 commented Nov 6, 2017

bootstrap seems to be a suitable name. Row subsampling would then be performed whenever bagging_fraction < 1 or bootstrap = True.

@guolinke
Collaborator

guolinke commented Nov 6, 2017

It is not trivial to add this to the core algorithm.
However, a simple solution is to use weights: give weight 0 to rows that were not sampled, weight 1 to rows sampled once, and weight k to rows sampled k times...
It is easy to do this in the Python package, since you can change the weights on each iteration.
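
For illustration, a minimal sketch of that weighting idea (the helper name is made up): draw one bootstrap sample of row indices and convert it to per-row counts, which can then be passed as case weights.

```python
import numpy as np

def bootstrap_weights(n_rows, rng=None):
    """Per-row counts from one bootstrap draw of size n_rows.

    A row drawn k times gets weight k; rows never drawn get weight 0,
    which emulates sampling rows with replacement via case weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, n_rows, size=n_rows)             # sample with replacement
    return np.bincount(idx, minlength=n_rows).astype(float)
```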

@mayer79
Contributor Author

mayer79 commented Nov 7, 2017

Good hint. I was actually not aware that case weights could be updated during training. The Poisson distribution with mean 1 will provide an efficient and approximately correct weight distribution.
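
As a sketch of that shortcut (hypothetical helper): each row's bootstrap count is Binomial(n, 1/n), which is approximately Poisson(1) for large n, so the weights can be drawn independently per row instead of building an explicit bootstrap sample.

```python
import numpy as np

def poisson_weights(n_rows, rng=None):
    """Approximate bootstrap counts: Binomial(n, 1/n) ~ Poisson(1) for large n."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.poisson(lam=1.0, size=n_rows).astype(float)
```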

@mayer79 mayer79 closed this as completed Nov 7, 2017
@mayer79 mayer79 reopened this Mar 18, 2020
@mayer79
Contributor Author

mayer79 commented Mar 18, 2020

I am reopening this because:

  1. I am still interested in this feature in order to be able to emulate random forests. Together with the relatively new "colsample_bynode", it would get very close to a native random forest (see the parameter sketch below).

  2. Sampling with replacement should be computationally more efficient than without.
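
For context, a hedged sketch of how the existing options could be combined to approximate a random forest (parameter names as documented for recent LightGBM versions; the concrete values are only illustrative, and rows are still subsampled without replacement, which is exactly the gap this issue is about):

```python
import lightgbm as lgb

rf_like_params = {
    "boosting_type": "rf",             # random forest mode
    "bagging_freq": 1,                 # resample rows every iteration
    "bagging_fraction": 0.632,         # ~ expected share of unique rows in a bootstrap sample
    "feature_fraction_bynode": 0.3,    # per-split column subsampling ("colsample_bynode")
    "num_iterations": 500,
    "objective": "regression",
}
# booster = lgb.train(rf_like_params, train_set)
```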

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@rdbuf

rdbuf commented Jun 10, 2020

Hi @guolinke,

However, a simple solution is to use weights: give weight 0 to rows that were not sampled, weight 1 to rows sampled once, and weight k to rows sampled k times...
It is easy to do this in the Python package, since you can change the weights on each iteration.

I understand that this is an old and closed issue, but may I ask you to elaborate on this solution a little more? How can one change the sample weights for each tree in the random forest?

One solution could be to use callbacks, I suppose; is that the only way?

@guolinke
Collaborator

Hi @rdbuf, yeah, a callback is the most convenient way to do this, I think.
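
For illustration, one possible shape of such a callback (a sketch, not a tested recipe: it assumes that weights set on the constructed training Dataset via set_weight are picked up by the following boosting rounds, which may depend on the LightGBM version):

```python
import numpy as np
import lightgbm as lgb

def poisson_bagging_callback(train_set, seed=0):
    """Redraw approximate bootstrap weights before every boosting round."""
    rng = np.random.default_rng(seed)

    def _callback(env):
        n = train_set.num_data()
        # Poisson(1) counts approximate how often each row would appear
        # in a bootstrap sample of the same size.
        train_set.set_weight(rng.poisson(lam=1.0, size=n).astype(float))

    _callback.before_iteration = True   # run before, not after, each round
    _callback.order = 0
    return _callback

# Hypothetical usage:
# booster = lgb.train(params, train_set, num_boost_round=100,
#                     callbacks=[poisson_bagging_callback(train_set)])
```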

@rdbuf

rdbuf commented Jun 11, 2020

I see, thanks :)
