[RFC] [doc] Add long-form documentation on sampling in LightGBM #5070

Open · jameslamb opened this issue Mar 13, 2022 · 1 comment

@jameslamb (Collaborator)
Summary

There are several points in the process of training a LightGBM model where less than the full training data is used.

I think it would be valuable to add a section called "Sampling" or similar at https://lightgbm.readthedocs.io/en/latest/Features.html, describing these concepts.

Motivation

There are many parameters available to control the different types of sampling, and the interactions between them are more complex than can be clearly expressed in the documentation in any individual parameter's docs at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I believe such documentation would significantly improve users' understanding of how LightGBM works, and help them to make informed decisions about values for LightGBM's parameters.

Description

My idea is to write several paragraphs like the following, mixing explanations of LightGBM's internal processes with the names of the specific parameters that control them.

LightGBM does not perform boosting directly on the raw values in input data. Instead, it performs some pre-processing such as binning continuous features into histograms, bundling sparse features together, and performing target encoding on categorical features.

This pre-processing creates an object called a Dataset. To improve the speed of Dataset construction, LightGBM samples the input data to determine characteristics like histogram bin boundaries. Use parameter bin_construct_sample_cnt (default=200000) to control how many observations are sampled during this process, and data_random_seed to make the process reproducible.
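A paragraph like that could be accompanied by a short snippet showing where those parameters go. A minimal sketch with the Python package (the data and parameter values here are just placeholders, not recommendations):

```python
import numpy as np
import lightgbm as lgb

# toy regression data, purely illustrative
rng = np.random.default_rng(seed=708)
X = rng.uniform(size=(10_000, 10))
y = rng.uniform(size=(10_000,))

dataset = lgb.Dataset(
    X,
    label=y,
    params={
        # number of rows sampled to determine histogram bin boundaries
        "bin_construct_sample_cnt": 200_000,
        # seed controlling that sampling, for reproducibility
        "data_random_seed": 708,
    },
)

# binning / bundling happens here, using the sample described above
dataset.construct()
```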

Other themes that I think should be covered

  • explaining that the sampling during dataset construction and sampling during boosting are different and separate from each other
  • explaining the difference between goss and bagging
    • why choose one over the other?
    • how do the relevant parameters affect the process? (e.g. bagging_fraction; see the sketch after this list)
  • sampling features (feature_fraction)
  • sampling splits to evaluate (extra_trees)
  • how sampling is a core part of distributed training
    • e.g. with tree_learner=data_parallel, the work of determining bin boundaries for features is split up over partitions of the data
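
For the boosting-time sampling parameters mentioned above, a sketch like the following could sit alongside the prose (again, the values are placeholders chosen only for illustration):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(seed=708)
X = rng.uniform(size=(10_000, 10))
y = rng.uniform(size=(10_000,))
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "regression",
    # row sampling during boosting ("bagging"): 80% of rows, re-drawn every iteration
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    # column sampling: 90% of features considered for each tree
    "feature_fraction": 0.9,
    # evaluate one randomly-chosen threshold per feature instead of all candidates
    "extra_trees": True,
    "seed": 708,
}
booster = lgb.train(params, train_set, num_boost_round=10)
```

If I recall correctly, switching to GOSS (boosting="goss") replaces bagging entirely, so bagging_fraction has no effect in that case; interactions like that are exactly what the proposed section should spell out.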

References

Created this based on the discussion in #4827.

@jameslamb jameslamb added the doc label Mar 13, 2022
@shiyu1994 (Collaborator)

@jameslamb Thanks for writing this up. I think after we merge #5091, we can add the section on sampling algorithms once and for all. WDYT?
