[RFC] [doc] Add long-form documentation on sampling in LightGBM #5070

Open · jameslamb opened this issue Mar 13, 2022 · 1 comment

@jameslamb (Collaborator)
Summary

There are several points in the process of training a LightGBM model where less than the full training data is used.

I think it would be valuable to add a section called "Sampling" or similar at https://lightgbm.readthedocs.io/en/latest/Features.html, describing these concepts.

Motivation

There are many parameters available to control the different types of sampling, and the interactions between them are more complex than can be clearly expressed in the documentation in any individual parameter's docs at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I believe such documentation would significantly improve users' understanding of how LightGBM works, and help them to make informed decisions about values for LightGBM's parameters.

Description

My idea is to write several paragraphs like the following, mixing explanations of LightGBM's internal processes with the names of the specific parameters that control them.

LightGBM does not perform boosting directly on the raw values in input data. Instead, it performs some pre-processing such as binning continuous features into histograms, bundling sparse features together, and performing target encoding on categorical features.

This pre-processing creates an object called a Dataset. To improve the speed of Dataset construction, LightGBM samples the input data to determine characteristics like histogram bin boundaries. Use parameter bin_construct_sample_cnt (default=200000) to control how many observations are sampled during this process, and data_random_seed to make the process reproducible.
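A paragraph like that could be accompanied by a short snippet showing where those parameters go. A minimal sketch with the Python package (the data and parameter values here are just placeholders, not recommendations):

```python
import numpy as np
import lightgbm as lgb

# toy regression data, purely illustrative
rng = np.random.default_rng(seed=708)
X = rng.uniform(size=(10_000, 10))
y = rng.uniform(size=(10_000,))

dataset = lgb.Dataset(
    X,
    label=y,
    params={
        # number of rows sampled to determine histogram bin boundaries
        "bin_construct_sample_cnt": 200_000,
        # seed controlling that sampling, for reproducibility
        "data_random_seed": 708,
    },
)

# binning / bundling happens here, using the sample described above
dataset.construct()
```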

Other themes that I think should be covered

  • explaining that the sampling during dataset construction and sampling during boosting are different and separate from each other
  • explaining the difference between goss and bagging
    • why choose one over the other?
    • how do the relevant parameters affect the process? (e.g. bagging_fraction; see the sketch after this list)
  • sampling features (feature_fraction)
  • sampling splits to evaluate (extra_trees)
  • how sampling is a core part of distributed training
    • e.g. with tree_learner=data_parallel, the work of determining bin boundaries for features is split up over partitions of the data
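
For the boosting-time sampling parameters mentioned above, a sketch like the following could sit alongside the prose (again, the values are placeholders chosen only for illustration):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(seed=708)
X = rng.uniform(size=(10_000, 10))
y = rng.uniform(size=(10_000,))
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "regression",
    # row sampling during boosting ("bagging"): 80% of rows, re-drawn every iteration
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    # column sampling: 90% of features considered for each tree
    "feature_fraction": 0.9,
    # evaluate one randomly-chosen threshold per feature instead of all candidates
    "extra_trees": True,
    "seed": 708,
}
booster = lgb.train(params, train_set, num_boost_round=10)
```

If I recall correctly, switching to GOSS (boosting="goss") replaces bagging entirely, so bagging_fraction has no effect in that case; interactions like that are exactly what the proposed section should spell out.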

References

Created this based on the discussion in #4827.

@jameslamb jameslamb added the doc label Mar 13, 2022
@shiyu1994 (Collaborator)

@jameslamb Thanks for writing this up. I think after we merge #5091, we can add the section on sampling algorithms once and for all. WDYT?
